Transformers memory leak? #129

Closed
fabn opened this issue Feb 21, 2015 · 7 comments

fabn commented Feb 21, 2015

I'm using your gem to clean up some very badly formatted markup. I'm running the cleaning process in a Sidekiq worker, and after a while the process's memory grows without limit. I (think I have) tracked the leak down to the Sanitize.fragment call. I made a simple script which should reproduce the issue:

require 'sanitize'

# raw_markup holds the badly formatted HTML sample (see the gist linked below).
200_000.times do |i|
  Sanitize.fragment(raw_markup, Sanitize::Config::WIKIPEDIA)
  if i % 10 == 0
    GC.start
    ending = GC.stat
    puts "#{i} Referenced objects count is: #{ending[:total_allocated_object] - ending[:total_freed_object]}"
    puts "#{i} Current used memory is #{`ps -o rss= -p #{Process.pid}`.to_i / 1024}MB"
    puts ObjectSpace.count_objects
    puts "\n"
  end
end

If I run this code on my local machine, by iteration 3260 used memory is 813MB and it keeps growing, while the other dumped stats don't change across iterations. If I use one of the predefined configurations this doesn't happen (1).

Here is my Sanitize config:

class Sanitize
  module Config
    # Very restrictive config for wikipedia markup sanitizer
    WIKIPEDIA = freeze_config(
        elements: %w[b em i strong u br ol ul li pre p h1 h2 h3 h4 h5 h6],
        remove_contents: %w[sup table],
        # Transformers section
        transformers: [
            # Remove wikipedia edit links
            lambda { |env|
              # Completely remove span.editsection nodes
              if env[:node_name] == 'span' && /editsection/ =~ env[:node][:class]
                env[:node].unlink
              end
            },
            # Remove wikipedia thumb divs i.e. div with some graphic and caption
            lambda { |env|
              if env[:node_name] == 'div' && /thumb/ =~ env[:node][:class]
                env[:node].unlink
              end
            },
        ]
    )
  end
end

Here is my current bundle (relevant gems):

Gems included by the bundle:
  * crass (1.0.1)
  * nokogiri (1.6.6.2)
  * nokogumbo (1.2.0)
  * sanitize (3.1.1)

Full code including the html sample used is available here: https://gist.github.com/fabn/304691f9bd81e1b814cf

(1): Actually, memory usage also grows with the RELAXED config, but very slowly (e.g. after 4000 iterations used memory is ~70MB, while the value reported at the first iteration was 25MB).

rgrove (Owner) commented Feb 21, 2015

Thanks for the report!

After some initial investigation, it seems like the leak occurs when certain nodes are unlinked, and it appears to be related to Sanitize's traverse method. When I swap that out with Nokogiri's (flawed) traverse method, the leak disappears.

There's nothing obvious in Sanitize's traverse method that seems like it could be leaking memory, but it's possible it could be triggering a leak in Nokogiri. I'll need to dig further.
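
For anyone who wants to poke at this outside Sanitize, here's a rough standalone sketch of the pattern being described (this is not Sanitize's actual traverse code; walk, the sample.html filename, and the RSS printout are just illustrative): a sibling-pointer, depth-first walk that unlinks the same editsection/thumb nodes the transformers above target, inside a memory-measuring loop like the one in the original report.

require 'nokogumbo'

# Depth-first walk in the spirit described above: capture the next sibling
# before descending, so the block can unlink the current node without
# breaking the iteration.
def walk(node, &block)
  block.call(node)
  child = node.child # first child, or nil
  while child
    nxt = child.next_sibling
    walk(child, &block)
    child = nxt
  end
end

raw_markup = File.read('sample.html') # the sample HTML from the gist

200_000.times do |i|
  doc = Nokogiri::HTML5.parse("<html><body>#{raw_markup}")
  walk(doc) do |node|
    # Same unlink pattern as the WIKIPEDIA transformers.
    node.unlink if node.element? && node['class'] =~ /editsection|thumb/
  end
  puts "#{i} RSS: #{`ps -o rss= -p #{Process.pid}`.to_i / 1024}MB" if i % 100 == 0
end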

rgrove added the bug label Feb 21, 2015
rgrove added a commit that referenced this issue Feb 22, 2015
This triggers a memory leak in Nokogiri.

Fixes #129.
rgrove (Owner) commented Feb 22, 2015

@fabn I think I've fixed this. Can you try out the dev-3.1.2 branch (or install this pre-release gem) and let me know if it fixes the leak for you?
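
(For anyone else following along, a minimal Gemfile sketch for trying the branch, assuming the usual github.com/rgrove/sanitize repo location:)

# Gemfile: pull Sanitize from the dev-3.1.2 branch instead of the released gem
gem 'sanitize', git: 'https://github.com/rgrove/sanitize.git', branch: 'dev-3.1.2'

Then run bundle update sanitize.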

fabn (Author) commented Feb 22, 2015

Thanks for your speedy support.

I tried the same code with the dev-3.1.2 branch and it's much better. To give you some numbers: at the iteration I reported earlier, memory usage went down from 813MB (in 3.1.1) to 167MB (in 3.1.2), so your change is a big improvement.

However, I still see memory growing over time, so there may be some leak in Nokogiri itself. I found, for instance, this report: sparklemotion/nokogiri#1063

rgrove (Owner) commented Feb 22, 2015

Looks like Nokogumbo may be the culprit this time. I can reproduce the leak with this loop, which uses Nokogumbo's HTML5 parser in the same way that Sanitize.fragment does:

require 'nokogumbo'

200_000.times do |i|
  # raw_markup: the same sample HTML from the gist
  doc = Nokogiri::HTML5.parse("<html><body>#{raw_markup}")
  frag = doc.fragment
  frag << doc.xpath('/html/body/node()')
end

However, this loop, which uses Nokogiri's libxml2 parser, doesn't create a leak:

require 'nokogumbo'

200_000.times do |i|
  doc = Nokogiri::HTML.parse("<html><body>#{raw_markup}")
  frag = doc.fragment
  frag << doc.xpath('/html/body/node()')
end

I'll take a look at Nokogumbo and see if I can spot the problem.

rgrove (Owner) commented Feb 22, 2015

@fabn My C-fu isn't strong enough to figure out where this is happening in Nokogumbo, but I've filed an upstream issue for this: rubys/nokogumbo#20

I'll go ahead and close this bug now since the original transformer-related leak is resolved. We can use the Nokogumbo bug to discuss the second leak. Thanks again for the reports and the helpful test cases!

rgrove closed this as completed Feb 22, 2015
fabn (Author) commented Feb 22, 2015

OK, thanks for your help. I'll follow rubys/nokogumbo#20.

Just one more question: will you release 3.1.2 now, or wait for that issue to be solved?

rgrove (Owner) commented Feb 22, 2015

Just released 3.1.2. Enjoy!
