-
Notifications
You must be signed in to change notification settings - Fork 241
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ElementParser
and probably slightly increase performance
#754
Conversation
Codecov ReportAttention: Patch coverage is
❗ Your organization needs to install the Codecov GitHub app to enable full functionality. Additional details and impacted files@@ Coverage Diff @@
## master #754 +/- ##
==========================================
+ Coverage 61.24% 61.83% +0.58%
==========================================
Files 39 41 +2
Lines 16277 16720 +443
==========================================
+ Hits 9969 10338 +369
- Misses 6308 6382 +74
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
0b00592
to
5a03d13
Compare
src/reader/mod.rs
Outdated
/// Returns a slice of data read up to end of processing instruction (`>`), | ||
/// which does not include into result (`?` at the end included). | ||
/// Returns a slice of data read up to end of a chunk, which does not include | ||
/// into result. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
replace with
Returns a slice of data read up to the end of a chunk, which is not included in the result
But it also now sounds a bit overgeneralized. It's not clear what constitutes "the end of a chunk" - that depends on the type of parser obviously, but it could be clearer.
Maybe instead of "the end of a chunk" say "the end of the {object / article / a synonym that isn't element or entity since they already mean something in XML} being parsed"
I'm just giving suggestions, feel free to adjust if you see fit
src/reader/mod.rs
Outdated
/// | ||
/// If input (`Self`) is exhausted and nothing was read, returns `None`. | ||
/// If input (`Self`) is exhausted and nothing was read, returns `SyntaxError`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"nothing was read" is likewise not very specific without the context that was removed. Whichever thing this parser was trying to read, was not read completely.
src/reader/slice_reader.rs
Outdated
@@ -275,8 +275,8 @@ impl<'a> XmlSource<'a, ()> for &'a [u8] { | |||
} | |||
} | |||
|
|||
fn read_pi(&mut self, _buf: (), position: &mut usize) -> Result<&'a [u8]> { | |||
let mut parser = PiParser::default(); | |||
fn read<P: Parser>(&mut self, _buf: (), position: &mut usize) -> Result<&'a [u8]> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
read<P: Parser>()
is a very general name, perhaps overly so - how will this type of pattern work with TextEvent
s? Do you plan on a TextParser
? If it only applies to a few specific circumstances, then read()
is an awkward name for it .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not known yet is there will be TextParser
or not. I use "parser" structs to store state which is required to correctly understand where the corresponding event ends. Right now when we parse text there is no need to store any state, the first <
starts markup. This probably will be changed, because entity references (&something;
) can insert XML nodes into the text (that is why they are called "parsed entities" in the specification, because the result of their resolution is a text which is parsed into XML node) and it seems that for correct parsing we should resolve them immediately when their are encountered.
For now this is a my experimentation to modularize parser to reuse parts of them in DTD parsing. So the selected name could be a bit random, although I think, I can name it read_with
which would even look more good from grammatical point of view.
src/reader/element.rs
Outdated
/// After successful search the parser will return [`Some`] with position of | ||
/// found symbol. If search is unsuccessful, a [`None`] will be returned. You | ||
/// typically would expect positive result of search, so that you should feed | ||
/// new data until yo'll get it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// new data until yo'll get it. | |
/// new data until you get it. |
I quite like the idea of separating parsing from IO to the greatest extent possible, perhaps it will allow simplifying the duplication (and macros) caused by the async / sync split.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that was my thoughts too. I also have implemented CommentParser
and CDataParser
which are looked for -->
and ]]>
correspondently (you can see them in #690) and now trying to integrate them into the parsing loop without performance drawdown if possible (I found that this was hard to do).
…ctions The parser search the end of processing instruction and this is the last byte of it
Return error from read_pi function instead of returning flag and later converting it to the error
(Review in whitespace changes ignored mode)
(Review in whitespace changes ignored mode)
@dralley, I hope I addressed all your comments, please look again. I also realized that requiring |
They are identical except different type of parser used.
Related: tafia#678 All methods called only once or two and inlining them in most cases increases performance of our benchmarks: > critcmp master element-parser -t 5 group element-parser master ----- -------------- ------ NsReader::read_resolved_event_into/trim_text = false 1.00 398.9±6.30µs ? ?/sec 1.05 419.6±7.94µs ? ?/sec NsReader::read_resolved_event_into/trim_text = true 1.00 382.1±7.06µs ? ?/sec 1.06 404.0±7.44µs ? ?/sec One event/CData 1.00 56.3±0.97ns ? ?/sec 1.21 68.1±1.35ns ? ?/sec One event/Comment 1.00 141.2±2.52ns ? ?/sec 1.14 161.4±2.79ns ? ?/sec decode_and_parse_document_with_namespaces/rpm_filelists.xml 1.00 95.1±1.45µs 115.5 MB/sec 1.07 102.2±1.65µs 107.5 MB/sec escape_text/escaped_chars_long 1.42 1806.4±34.20ns ? ?/sec 1.00 1275.0±23.98ns ? ?/sec escape_text/escaped_chars_short 1.00 491.5±8.35ns ? ?/sec 1.07 526.6±10.80ns ? ?/sec escape_text/no_chars_to_escape_long 2.06 1831.1±36.31ns ? ?/sec 1.00 887.1±17.00ns ? ?/sec parse_document_nocopy_with_namespaces/libreoffice_document.fodt 1.00 507.2±8.56µs 107.6 MB/sec 1.08 546.2±10.20µs 100.0 MB/sec parse_document_nocopy_with_namespaces/rpm_filelists.xml 1.00 87.2±1.64µs 126.0 MB/sec 1.14 99.2±1.74µs 110.7 MB/sec parse_document_nocopy_with_namespaces/rpm_other.xml 1.00 139.6±2.83µs 158.5 MB/sec 1.07 148.7±2.71µs 148.9 MB/sec parse_document_nocopy_with_namespaces/rpm_primary.xml 1.00 190.5±3.43µs 106.4 MB/sec 1.09 207.9±3.79µs 97.5 MB/sec parse_document_nocopy_with_namespaces/rpm_primary2.xml 1.00 61.7±1.10µs 116.2 MB/sec 1.09 67.5±1.28µs 106.2 MB/sec parse_document_nocopy_with_namespaces/sample_1.xml 1.00 10.5±0.20µs 105.0 MB/sec 1.06 11.1±0.21µs 99.3 MB/sec parse_document_nocopy_with_namespaces/sample_ns.xml 1.00 8.4±0.16µs 86.5 MB/sec 1.08 9.0±0.18µs 80.0 MB/sec parse_document_nocopy_with_namespaces/sample_rss.xml 1.00 786.4±13.46µs 239.8 MB/sec 1.09 859.9±12.82µs 219.3 MB/sec parse_document_nocopy_with_namespaces/test_writer_ident.xml 1.00 29.0±0.55µs 146.4 MB/sec 1.06 30.8±0.55µs 138.0 MB/sec read_event/trim_text = false 1.00 199.3±3.59µs ? ?/sec 1.10 218.5±3.98µs ? ?/sec read_event/trim_text = true 1.00 190.4±3.76µs ? ?/sec 1.11 211.7±4.11µs ? ?/sec unescape_text/no_chars_to_unescape_short 1.00 11.8±0.21ns ? ?/sec 1.06 12.4±0.23ns ? ?/sec
This is follow-up PR to #753 which polish API of introduced
PiParser
(which could be used to parse DTD) and add the similar parser to handle elements (which also could be used to parse<!ELEMENT>
and<!ATTLIST>
DTD tags). Also, this PR reduces amount of duplicated code and probably slightly increases performance as showed by benchmarks (master -- 385a1f8). However, in some tests there is a deviation of > = 5% even on consecutive runs on master, so most likely the acceleration is quite insignificant.escape_test
are also quite sensitive to noise, which reaches 15% there.Full diff:
Only >=5% diff:
escape_text
results floats even at masterThis is the last PR before release
0.32.0
.