Can't parse large XML #150

Closed
kou opened this issue Jun 17, 2024 · 0 comments · Fixed by #154

kou commented Jun 17, 2024

For example, we can't parse the English Wikipedia XML dump: https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

REXML::ParseException: #<RangeError: integer 2147501889 too big to convert to 'int'>
/tmp/local/lib/ruby/gems/3.4.0+0/gems/rexml-3.3.0/lib/rexml/source.rb:127:in 'StringScanner#pos='
/tmp/local/lib/ruby/gems/3.4.0+0/gems/rexml-3.3.0/lib/rexml/source.rb:127:in 'REXML::Source#position='
/tmp/local/lib/ruby/gems/3.4.0+0/gems/rexml-3.3.0/lib/rexml/parsers/baseparser.rb:447:in 'REXML::Parsers::BaseParser#pull_event'
/tmp/local/lib/ruby/gems/3.4.0+0/gems/rexml-3.3.0/lib/rexml/parsers/baseparser.rb:207:in 'REXML::Parsers::BaseParser#pull'
/tmp/local/lib/ruby/gems/3.4.0+0/gems/rexml-3.3.0/lib/rexml/parsers/streamparser.rb:20:in 'REXML::Parsers::StreamParser#parse'
/tmp/local/lib/ruby/gems/3.4.0+0/gems/red-datasets-0.1.7/lib/datasets/wikipedia.rb:51:in 'block in Datasets::Wikipedia#each'
/tmp/local/lib/ruby/gems/3.4.0+0/gems/red-datasets-0.1.7/lib/datasets/dataset.rb:78:in 'block (2 levels) in Datasets::Dataset#extract_bz2'
/tmp/local/lib/ruby/gems/3.4.0+0/gems/red-datasets-0.1.7/lib/datasets/dataset.rb:56:in 'IO.pipe'
/tmp/local/lib/ruby/gems/3.4.0+0/gems/red-datasets-0.1.7/lib/datasets/dataset.rb:56:in 'block in Datasets::Dataset#extract_bz2'
/tmp/local/lib/ruby/gems/3.4.0+0/gems/red-datasets-0.1.7/lib/datasets/dataset.rb:55:in 'IO.pipe'
/tmp/local/lib/ruby/gems/3.4.0+0/gems/red-datasets-0.1.7/lib/datasets/dataset.rb:55:in 'Datasets::Dataset#extract_bz2'
/tmp/local/lib/ruby/gems/3.4.0+0/gems/red-datasets-0.1.7/lib/datasets/wikipedia.rb:71:in 'Datasets::Wikipedia#open_data'
/tmp/local/lib/ruby/gems/3.4.0+0/gems/red-datasets-0.1.7/lib/datasets/wikipedia.rb:48:in 'Datasets::Wikipedia#each'
...
Exception parsing
Line: -1
Position: -1
Last 80 unconsumed characters:
/text>

(ruby -r datasets -e 'Datasets::Wikipedia.new.each {}' will reproduce this.)

To parse large XML, we need to drop already-parsed content from the StringScanner in REXML::Source so that its position never exceeds the `int` range.
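
The idea of dropping parsed content can be sketched roughly as follows. This is only an illustration under assumptions, not the change that landed in REXML: `TrimmingScanner`, `chunk_size`, `trim_threshold`, `dropped_bytes`, and `absolute_pos` are hypothetical names. The sketch reads the input in chunks, and once the scanner position passes a threshold it replaces the buffer with the unconsumed rest, so `StringScanner#pos` stays small however large the input is.

```ruby
require "strscan"
require "stringio"

# Hypothetical wrapper for illustration; this is not REXML::Source.
class TrimmingScanner
  attr_reader :dropped_bytes

  def initialize(io, chunk_size: 1024, trim_threshold: 4096)
    @io = io
    @chunk_size = chunk_size
    @trim_threshold = trim_threshold
    @scanner = StringScanner.new(+"")
    @dropped_bytes = 0
  end

  # Byte position inside the current (trimmed) buffer.
  def pos
    @scanner.pos
  end

  # Byte position in the whole input, kept as an arbitrary-size Integer.
  def absolute_pos
    @dropped_bytes + @scanner.pos
  end

  # Scan up to and including the next match, reading more input on demand.
  def scan_until(pattern)
    loop do
      matched = @scanner.scan_until(pattern)
      if matched
        trim
        return matched
      end
      chunk = @io.read(@chunk_size)
      return nil if chunk.nil? # end of input, no further match
      @scanner << chunk
    end
  end

  private

  # Drop already-consumed bytes once the position passes the threshold.
  def trim
    return if @scanner.pos < @trim_threshold
    @dropped_bytes += @scanner.pos
    @scanner.string = @scanner.rest # resets the scanner position to 0
  end
end

io = StringIO.new("<text>abcdefgh</text>" * 1000)
scanner = TrimmingScanner.new(io)
records = 0
records += 1 while scanner.scan_until(%r{</text>})
puts "records=#{records} absolute_pos=#{scanner.absolute_pos} pos=#{scanner.pos}"
```

Tracking the overall offset separately in a Ruby Integer keeps error reporting possible even though the scanner itself only ever sees a small buffer.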

naitoh added a commit to naitoh/rexml that referenced this issue Jun 19, 2024
naitoh added a commit to naitoh/rexml that referenced this issue Jun 20, 2024
naitoh added a commit to naitoh/rexml that referenced this issue Jun 22, 2024
…exceeds a certain size, have it removed.

See: ruby#150

---------

Co-authored-by: Sutou Kouhei <kou@clear-code.com>
kou closed this as completed in #154 Jun 22, 2024
kou added a commit that referenced this issue Jun 22, 2024
GitHub: fix GH-150

If the parsed XML is larger than `2 ** 31 - 1` bytes, we can't parse it
because `StringScanner`'s position is stored as an `int`. We can avoid
this restriction by dropping already-parsed content once it grows large.

Co-authored-by: Sutou Kouhei <kou@clear-code.com>
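
The limit named in the commit message can be checked against the position in the reported `RangeError`. The snippet below is plain arithmetic; the constant names are made up for illustration, and it assumes the position is a byte offset held in a 32-bit signed `int`.

```ruby
# Largest value a 32-bit signed int can hold.
INT_MAX = 2**31 - 1

# Position taken from the RangeError message in the report.
REPORTED_POS = 2_147_501_889

# How far the reported position lies beyond the int range.
OVERSHOOT = REPORTED_POS - INT_MAX

puts "INT_MAX=#{INT_MAX} overshoot=#{OVERSHOOT} bytes"
```

Under that assumption, the parser failed once it had buffered just over 2 GiB of input.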