XML SAX push parser leaks memory on JRuby, but not on MRI #998

Closed
Burgestrand opened this issue Nov 5, 2013 · 5 comments

Hey!

I’m reading streaming XML from a remote server in JRuby using the SAX push parser in Nokogiri. The stream starts with an opening <events> tag, followed by a continuous stream of <event>data</event> elements.

The following code leaks memory on JRuby (see the attached VisualVM heap screenshot below), but not on MRI.

@jnicklas and I tracked it down to two internal lists in nokogiri.internals.NokogiriHandler that get entries added for every new element and are never released:

lines.add(locator.getLineNumber());
columns.add(locator.getColumnNumber() - 1); // libxml counts from 0 while java does from 1

require "nokogiri"

# A SAX handler that deliberately retains no data; it only drives the push parser.
class Parser < Nokogiri::XML::SAX::Document
  def initialize
    @parser = Nokogiri::XML::SAX::PushParser.new(self)
  end

  def <<(data)
    @parser << data
  end

  # Intentionally empty callbacks: nothing from the parsed elements is kept.
  def start_element(name, attrs)
  end

  def end_element(name)
  end
end

puts "Open up your GC monitor. Press enter to start."
gets

parser = Parser.new

parser << "<events>"
event = "<event></event>"
loop { parser << event } # push the same small element forever; memory should stay flat

[VisualVM heap screenshot]

yokolet (Member) commented Nov 5, 2013

Please use parser.finish as in the example at http://nokogiri.org/Nokogiri/XML/SAX/PushParser.html.
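
For reference, a minimal sketch of what that looks like based on the linked docs (using a bare SAX::Document handler rather than the Parser wrapper above, since that wrapper does not expose finish):

require "nokogiri"

doc = Nokogiri::XML::SAX::Document.new
parser = Nokogiri::XML::SAX::PushParser.new(doc)
parser << "<events>"
parser << "<event></event>"
parser << "</events>"
parser.finish # signals end of document; only possible if the stream actually terminates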

knu (Member) commented Nov 6, 2013

@yokolet I think the point made here is that a streaming parser engine is not expected to consume memory proportional to the total amount of data processed.

@Burgestrand (Author)

@yokolet @knu is correct. I do not expect the streaming parser to use an unbounded amount of memory if the SAX handler is not retaining any data.

The data I am parsing never ends, so there is no natural place for me to call #finish. I would like it to run for days or weeks, happily parsing data without using more and more memory the longer it runs.

@quoideneuf (Contributor)

@Burgestrand In case this is of any use, here's an example of a workaround for the problem you're seeing (I think!):
https://github.com/archivesspace/archivesspace/blob/master/migrations/lib/xml_sax.rb#L88-90
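
For context, one workaround along those lines (a sketch only, not necessarily what the linked code does) is to rotate the PushParser at a safe element boundary every so often, so per-parse bookkeeping is released. The class name, the FLUSH_EVERY constant, and the assumption that incoming chunks align with <event> boundaries (as in the reproduction above) are all illustrative choices, not part of Nokogiri:

require "nokogiri"

# Sketch: periodically finish the current PushParser and start a fresh one,
# wrapping the never-ending stream in synthetic <events>...</events> documents
# so the parser's internal per-document state can be dropped.
class RollingParser < Nokogiri::XML::SAX::Document
  FLUSH_EVERY = 10_000 # events per parser instance; arbitrary

  def initialize
    @count = 0
    @rotate = false
    start_parser
  end

  def <<(data)
    @parser << data
    rotate_parser if @rotate # rotate only between chunks, never mid-parse
    self
  end

  def end_element(name)
    @count += 1 if name == "event"
    @rotate = true if @count >= FLUSH_EVERY
  end

  private

  def start_parser
    @parser = Nokogiri::XML::SAX::PushParser.new(self)
    @parser << "<events>"
  end

  def rotate_parser
    @parser << "</events>"
    @parser.finish
    @count = 0
    @rotate = false
    start_parser
  end
end

With this sketch the caller pushes only the <event>…</event> chunks; the wrapper supplies the enclosing <events> element itself, so each synthetic document can be finished cleanly before the next one starts.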

eljojo commented Nov 28, 2013

Hi, I'm seeing a very similar leak (on JRuby as well), but this time when using Nokogiri::XML::Reader (with which, AFAIK, it's impossible to call #finish).

My graphs show the same leak.
This should be enough to reproduce it.

require "nokogiri"

reader = Nokogiri::XML::Reader(File.open('hugefile.xml'))
reader.each { |node| node.name } # touch every node; nothing is retained

This is what the Eclipse Memory Analyzer shows:
[Eclipse Memory Analyzer screenshot, 2013-11-28]

I assume this line is responsible for the leak, but it's just a guess.
