Use more StringScanner based API to parse XML #114
c9a3bf2 to 8a24e85
I changed to that style and modified the whole REXML::Parsers::BaseParser#pull_event to use StringScanner.
The format For this reason, I do not use
I think that
(I'll review implementation later...)
I found a workaround with "".freeze.
8a24e85 to a7c9491
## Why?

Because `s.check("a")` is slower than `s.check("a".freeze)`.

- benchmark/stringscan_2.yaml
```
loop_count: 100000
contexts:
  - name: No YJIT
    prelude: |
      $LOAD_PATH.unshift(File.expand_path("lib"))
      require 'rexml'

prelude: |
  require 'strscan'
  s = StringScanner.new('abcdefg hijklmn opqrstu vwxyz')
  ptn = "a"

benchmark:
  'check("a")' : s.check("a")
  'check("a".freeze)' : s.check("a".freeze)
  'ptn="a";s.check(ptn)' : |
    ptn="a"
    s.check(ptn)
  'check(ptn)' : s.check(ptn)
```

```
$benchmark-driver benchmark/stringscan_2.yaml
Comparison:
          check(ptn):  13524479.4 i/s
   check("a".freeze):  13433638.1 i/s - 1.01x  slower
          check("a"):  10231225.8 i/s - 1.32x  slower
ptn="a";s.check(ptn):  10013017.0 i/s - 1.35x  slower
```
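A likely reason for the gap measured above is allocation: in a file without `# frozen_string_literal: true`, a plain `"a"` literal builds a new String on every evaluation, while `"a".freeze` reuses one shared frozen String. The following is a minimal sketch of that behaviour, assuming CRuby and no frozen-string-literal magic comment; `distinct_objects` is a hypothetical helper for illustration and is not part of REXML or the benchmark.

```ruby
require 'strscan'

# Hypothetical helper (not part of REXML): counts how many distinct String
# objects a block yields across n calls.
def distinct_objects(n)
  Array.new(n) { yield.object_id }.uniq.size
end

# Assumption: this file has no `# frozen_string_literal: true` comment.
# A plain literal is then evaluated into a new String each time, while
# `"a".freeze` hands back the same frozen String on every call.
plain_count  = distinct_objects(3) { "a" }
frozen_count = distinct_objects(3) { "a".freeze }
puts "plain literal objects:  #{plain_count}"
puts "frozen literal objects: #{frozen_count}"

# Both forms match identically; check does not advance the scan pointer.
s = StringScanner.new('abcdefg hijklmn opqrstu vwxyz')
matched = s.check("a".freeze)
puts "matched=#{matched.inspect} pos=#{s.pos}"
```

Hoisting the pattern into a variable once (`check(ptn)` in the benchmark) avoids the repeated allocation in the same way, which is consistent with it scoring about the same as `"a".freeze`.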
a7c9491 to 9a075bf
--- a/lib/rexml/parsers/baseparser.rb (9a075bf)
+++ b/lib/rexml/parsers/baseparser.rb (after)
@@ -373,7 +373,7 @@ module REXML
return process_instruction
else
# Get the next tag
- md = @source.match(TAG_MATCH, true)
+ md = @source.match(/((?>#{QNAME_STR}))/umo, true)
unless md
@source.string = "<" + @source.buffer
raise REXML::ParseException.new("malformed XML: missing tag start", @source)
In 8d7fc13, updated handling of
8d7fc13 to bc05d26
Fixed what was pointed out.
…lar expression to processing using StringScanner.

## Why

Improve maintainability by optimizing the process so that the parsing process proceeds using StringScanner#scan.

# Changed

- Added Source#string= method for error message output.
- Added TestParseDocumentTypeDeclaration#test_no_name test case.
- Of the `intSubset` of DOCTYPE, "<!" added consideration for processing `Comments` that begin with "<!".

[intSubset Spec]

https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-doctypedecl
> [28] doctypedecl ::= '<!DOCTYPE' S Name (S ExternalID)? S? ('[' intSubset ']' S?)? '>'

https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-intSubset
> [28b] intSubset ::= (markupdecl | DeclSep)*

https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-markupdecl
> [29] markupdecl ::= elementdecl | AttlistDecl | EntityDecl | NotationDecl | PI | Comment

https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-elementdecl
> [45] elementdecl ::= '<!ELEMENT' S Name S contentspec S? '>'

https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-AttlistDecl
> [52] AttlistDecl ::= '<!ATTLIST' S Name AttDef* S? '>'

https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-EntityDecl
> [70] EntityDecl ::= GEDecl | PEDecl
> [71] GEDecl ::= '<!ENTITY' S Name S EntityDef S? '>'
> [72] PEDecl ::= '<!ENTITY' S '%' S Name S PEDef S? '>'

https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-NotationDecl
> [82] NotationDecl ::= '<!NOTATION' S Name S (ExternalID | PublicID) S? '>'

https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-PI
> [16] PI ::= '<?' PITarget (S (Char* - (Char* '?>' Char*)))? '?>'

https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-Comment
> [15] Comment ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'

https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-DeclSep
> [28a] DeclSep ::= PEReference | S

https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-PEReference
> [69] PEReference ::= '%' Name ';'

[Benchmark]

```
RUBYLIB= BUNDLER_ORIG_RUBYLIB= /Users/naitoh/.rbenv/versions/3.3.0/bin/ruby -v -S benchmark-driver /Users/naitoh/ghq/github.com/naitoh/rexml/benchmark/parse.yaml
ruby 3.3.0 (2023-12-25 revision 5124f9ac75) [arm64-darwin22]
Calculating -------------------------------------
                         before       after  before(YJIT)  after(YJIT)
                 dom     11.240      10.569        17.173       18.219 i/s -     100.000 times in 8.896882s 9.461267s 5.823007s 5.488884s
                 sax     31.812      30.716        48.383       52.532 i/s -     100.000 times in 3.143500s 3.255655s 2.066861s 1.903600s
                pull     36.855      36.354        56.718       61.443 i/s -     100.000 times in 2.713300s 2.750693s 1.763099s 1.627523s
              stream     34.176      34.758        49.801       54.622 i/s -     100.000 times in 2.925991s 2.877065s 2.008003s 1.830779s

Comparison:
                              dom
         after(YJIT):        18.2 i/s
        before(YJIT):        17.2 i/s - 1.06x  slower
              before:        11.2 i/s - 1.62x  slower
               after:        10.6 i/s - 1.72x  slower

                              sax
         after(YJIT):        52.5 i/s
        before(YJIT):        48.4 i/s - 1.09x  slower
              before:        31.8 i/s - 1.65x  slower
               after:        30.7 i/s - 1.71x  slower

                             pull
         after(YJIT):        61.4 i/s
        before(YJIT):        56.7 i/s - 1.08x  slower
              before:        36.9 i/s - 1.67x  slower
               after:        36.4 i/s - 1.69x  slower

                           stream
         after(YJIT):        54.6 i/s
        before(YJIT):        49.8 i/s - 1.10x  slower
               after:        34.8 i/s - 1.57x  slower
              before:        34.2 i/s - 1.60x  slower
```

- YJIT=ON : 1.06x - 1.10x faster
- YJIT=OFF : 0.94x - 1.01x faster

Co-authored-by: Sutou Kouhei <kou@clear-code.com>
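To make the `intSubset` grammar quoted above more concrete, here is a rough sketch of walking an internal subset with `StringScanner#scan`, trying the comment form before the other "<!" declarations. This is not REXML's implementation: `classify_int_subset` is a hypothetical helper, and its patterns are deliberately simplified (for example, they assume no ">" inside quoted values).

```ruby
require 'strscan'

# Hypothetical helper, not REXML's code: labels each item of a DOCTYPE
# internal subset (markupdecl | DeclSep) using StringScanner#scan.
# Simplification: declarations are assumed to contain no ">" before their end.
def classify_int_subset(text)
  scanner = StringScanner.new(text)
  kinds = []
  until scanner.eos?
    if scanner.scan(/\s+/)
      next                                   # DeclSep: whitespace
    elsif scanner.scan(/%[^;\s]+;/)
      kinds << :pe_reference                 # DeclSep: PEReference
    elsif scanner.scan(/<!--.*?-->/m)
      kinds << :comment                      # also starts with "<!", so try it first
    elsif scanner.scan(/<!(ELEMENT|ATTLIST|ENTITY|NOTATION)\b[^>]*>/)
      kinds << scanner[1].downcase.to_sym    # elementdecl, AttlistDecl, ...
    elsif scanner.scan(/<\?.*?\?>/m)
      kinds << :pi
    else
      raise ArgumentError, "malformed internal subset at offset #{scanner.pos}"
    end
  end
  kinds
end

subset = '<!ELEMENT a (#PCDATA)> <!-- note --> %pe; <?pi x?>'
p classify_int_subset(subset)
```

Because each branch consumes input only when its pattern matches at the current position, the scan pointer itself tracks progress and no intermediate strings need to be sliced off the buffer.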
bc05d26 to 54b0298
Fixed what was pointed out.
Thanks!
## Why?

We want to change only the scan pointer.

#114 (comment)
> I want to just change scan pointer (`StringScanner#pos=`) instead of changing `@scanner.string`.
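The idea in the quoted review comment can be sketched as follows: remember the position, scan speculatively, and rewind with `StringScanner#pos=` rather than assigning a rebuilt string to the scanner. This is an illustration only, with an assumed sample input; it does not show how REXML's `Source` class is actually written.

```ruby
require 'strscan'

scanner = StringScanner.new("<!DOCTYPE root>")
saved_pos = scanner.pos            # remember where this attempt started

scanner.scan(/<!/)                 # consume a prefix speculatively
consumed = scanner.pos

scanner.pos = saved_pos            # rewind by moving the pointer only
rest_after_rewind = scanner.rest

puts "consumed #{consumed} bytes, rest after rewind: #{rest_after_rewind.inspect}"
```

Note that `pos` counts bytes, so a saved position stays valid only for the same underlying string.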
Why?
Improve maintainability by optimizing the process so that the parsing process proceeds using StringScanner#scan.
Changed
- Changed `REXML::Parsers::BaseParser` from `frozen_string_literal: false` to `frozen_string_literal: true`.
- Added `Source#string=` method for error message output.
- In the `intSubset` of DOCTYPE, added handling for `Comments`, which begin with "<!".

[Benchmark]