Can't parse large XML #150

kou · 2024-06-17T04:53:49Z

For example, we can't parse the Wikipedia XML dump: https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

REXML::ParseException: #<RangeError: integer 2147501889 too big to convert to 'int'>
/tmp/local/lib/ruby/gems/3.4.0+0/gems/rexml-3.3.0/lib/rexml/source.rb:127:in 'StringScanner#pos='
/tmp/local/lib/ruby/gems/3.4.0+0/gems/rexml-3.3.0/lib/rexml/source.rb:127:in 'REXML::Source#position='
/tmp/local/lib/ruby/gems/3.4.0+0/gems/rexml-3.3.0/lib/rexml/parsers/baseparser.rb:447:in 'REXML::Parsers::BaseParser#pull_event'
/tmp/local/lib/ruby/gems/3.4.0+0/gems/rexml-3.3.0/lib/rexml/parsers/baseparser.rb:207:in 'REXML::Parsers::BaseParser#pull'
/tmp/local/lib/ruby/gems/3.4.0+0/gems/rexml-3.3.0/lib/rexml/parsers/streamparser.rb:20:in 'REXML::Parsers::StreamParser#parse'
/tmp/local/lib/ruby/gems/3.4.0+0/gems/red-datasets-0.1.7/lib/datasets/wikipedia.rb:51:in 'block in Datasets::Wikipedia#each'
/tmp/local/lib/ruby/gems/3.4.0+0/gems/red-datasets-0.1.7/lib/datasets/dataset.rb:78:in 'block (2 levels) in Datasets::Dataset#extract_bz2'
/tmp/local/lib/ruby/gems/3.4.0+0/gems/red-datasets-0.1.7/lib/datasets/dataset.rb:56:in 'IO.pipe'
/tmp/local/lib/ruby/gems/3.4.0+0/gems/red-datasets-0.1.7/lib/datasets/dataset.rb:56:in 'block in Datasets::Dataset#extract_bz2'
/tmp/local/lib/ruby/gems/3.4.0+0/gems/red-datasets-0.1.7/lib/datasets/dataset.rb:55:in 'IO.pipe'
/tmp/local/lib/ruby/gems/3.4.0+0/gems/red-datasets-0.1.7/lib/datasets/dataset.rb:55:in 'Datasets::Dataset#extract_bz2'
/tmp/local/lib/ruby/gems/3.4.0+0/gems/red-datasets-0.1.7/lib/datasets/wikipedia.rb:71:in 'Datasets::Wikipedia#open_data'
/tmp/local/lib/ruby/gems/3.4.0+0/gems/red-datasets-0.1.7/lib/datasets/wikipedia.rb:48:in 'Datasets::Wikipedia#each'
...
Exception parsing
Line: -1
Position: -1
Last 80 unconsumed characters:
/text>

(ruby -r datasets -e 'Datasets::Wikipedia.new.each {}' will reproduce this.)

To parse large XML, REXML::Source needs to drop already-parsed content from its StringScanner.
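The idea of dropping parsed content can be sketched with plain `StringScanner` calls. This is a minimal illustration, not the code in REXML: the `drop_consumed` helper and the `DROP_THRESHOLD` value are hypothetical names chosen for this example. It relies only on `StringScanner#rest`, `StringScanner#pos`, and `StringScanner#string=` (which resets the scan position to 0).

```ruby
require "strscan"

# Hypothetical threshold for this sketch; not taken from REXML.
DROP_THRESHOLD = 1024

# Discards the already-consumed prefix once the scan position passes
# the threshold. StringScanner#string= resets the position to 0, so the
# position stays small no matter how much input has been read in total.
def drop_consumed(scanner, threshold = DROP_THRESHOLD)
  return false if scanner.pos <= threshold

  scanner.string = scanner.rest
  true
end

scanner = StringScanner.new("<a>" + "x" * 2000 + "</a><b/>")
scanner.scan_until(/<\/a>/)  # consume the first element (pos is now 2007)
drop_consumed(scanner)       # keep only the unconsumed tail
puts scanner.pos             # 0
puts scanner.rest            # <b/>
```

One consequence of this approach is that absolute offsets into the original input can no longer be read from the scanner once the prefix has been dropped.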


GitHub: fix GH-150 If a parsed XML is larger than `2 ** 31 - 1` bytes, we can't parse it, because `StringScanner`'s position is stored as `int`. We can avoid the restriction by dropping large parsed content. Co-authored-by: Sutou Kouhei <kou@clear-code.com>
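As a quick check of the arithmetic, the position reported in the `RangeError` can be compared against the limit. This sketch assumes a 32-bit signed C `int`; `C_INT_MAX` and `fits_in_c_int?` are hypothetical names used only here.

```ruby
# Assumption: a 32-bit signed C int, whose maximum is 2**31 - 1.
C_INT_MAX = 2**31 - 1

# Returns true when n can be represented as a 32-bit signed integer.
def fits_in_c_int?(n)
  n.between?(-(2**31), C_INT_MAX)
end

reported = 2_147_501_889       # value from the RangeError in this issue
puts C_INT_MAX                 # 2147483647
puts fits_in_c_int?(reported)  # false
puts reported - C_INT_MAX      # 18242
```

The reported position is only 18242 past the limit, which is consistent with the parser failing shortly after crossing the 2 GiB mark of the decompressed dump.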

naitoh added a commit to naitoh/rexml that referenced this issue Jun 19, 2024

If the size of the content parsed by StringScanner to parse huge XML …

70bce7e

…exceeds a certain size, have it removed. See: ruby#150

naitoh mentioned this issue Jun 19, 2024

Fix a bug that a large XML can't be parsed #154

Merged

naitoh added a commit to naitoh/rexml that referenced this issue Jun 20, 2024

If the size of the content parsed by StringScanner to parse huge XML …

57fa969

…exceeds a certain size, have it removed. See: ruby#150

naitoh added a commit to naitoh/rexml that referenced this issue Jun 22, 2024

If the size of the content parsed by StringScanner to parse huge XML …

71a4f82

…exceeds a certain size, have it removed. See: ruby#150 --------- Co-authored-by: Sutou Kouhei <kou@clear-code.com>

naitoh added a commit to naitoh/rexml that referenced this issue Jun 22, 2024

If the size of the content parsed by StringScanner to parse huge XML …

758b6b4

…exceeds a certain size, have it removed. See: ruby#150 --------- Co-authored-by: Sutou Kouhei <kou@clear-code.com>

kou closed this as completed in #154 Jun 22, 2024
