Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix a bug that a large XML can't be parsed #154

Merged
merged 1 commit into from
Jun 22, 2024

Conversation

naitoh
Copy link
Contributor

@naitoh naitoh commented Jun 19, 2024

GitHub: fix GH-150

If a parsed XML is later than 2 ** 31 - 1, we can't parse it. Because StringScanners position is stored as int. We can avoid the restriction by dropping large parsed content.

@naitoh naitoh marked this pull request as draft June 20, 2024 01:23
lib/rexml/source.rb Show resolved Hide resolved
test/test_pullparser.rb Outdated Show resolved Hide resolved
@naitoh naitoh marked this pull request as ready for review June 20, 2024 14:30
@naitoh naitoh requested a review from kou June 20, 2024 14:30
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you move this to test/parser/test_base_parser.rb?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK
I see.


module REXMLTests
class BaseParserTester < Test::Unit::TestCase
include REXML
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is needless.

Suggested change
include REXML

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK
I see.

test/test_baseparser.rb Outdated Show resolved Hide resolved
parser = REXML::Parsers::BaseParser.new(xml)
while parser.has_next?
parser.pull
assert_compare REXML::Source::SCANNER_RESET_SIZE + N_STRING.size, ">", parser.position
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

!!! Can we access SCANNER_REST_SIZE via REXML::Source even when we use private_constant!?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

    20| require 'debug'
    21|     def test_parse_large_xml
    22|       xml = build_xml(N_ELEMENTS)
    23|       parser = REXML::Parsers::BaseParser.new(xml)
=>  24|       debugger
    25|       while parser.has_next?
    26|         parser.pull
    27|         assert_compare REXML::Source::SCANNER_RESET_SIZE + N_STRING.size, ">", parser.position
    28|       end
=>#0	REXMLTests::BaseParserTester#test_parse_large_xml at ~/ghq/github.com/naitoh/rexml/test/parser/test_base.rb:24
  #1	Test::Unit::TestCase#run_test at ~/.rbenv/versions/3.3.2/lib/ruby/gems/3.3.0/gems/test-unit-3.6.1/lib/test/unit/testcase.rb:871
  # and 28 frames (use `bt' command for all frames)
(ruby) puts REXML::Source::SCANNER_RESET_SIZE
100000
nil

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh... It seems that we should remove include Private (in a separated PR)...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes...

test/test_baseparser.rb Outdated Show resolved Hide resolved
…exceeds a certain size, have it removed.

See: ruby#150

---------

Co-authored-by: Sutou Kouhei <kou@clear-code.com>
@naitoh naitoh requested a review from kou June 22, 2024 01:14
@kou kou changed the title If the size of the content parsed by StringScanner to parse huge XML exceeds a certain size, have it removed. Fix a bug that a large XML can't be parsed Jun 22, 2024
@kou kou merged commit 4c28808 into ruby:master Jun 22, 2024
61 checks passed
@kou
Copy link
Member

kou commented Jun 22, 2024

Thanks.

@naitoh naitoh deleted the drop_parsed_content branch June 22, 2024 03:01
naitoh added a commit to naitoh/rexml that referenced this pull request Jun 22, 2024
kou pushed a commit that referenced this pull request Jun 22, 2024
Included constants are not private. So private constants in private
module aren't private.

See also: #154 (comment)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

Can't parse large XML
2 participants