Support SAX style event parser (not just DOM) #824
Comments
Considering similar stuff here. Glad to know I'm not alone. I experienced sub-optimal parsing performance a few days ago when using jsoup in my Android app. I can think of a few scenarios where a SAX-like parser can be advantageous:
I may be able to start doing something in days.
Hey Wang. We're also going to do some SAX vs JDOM performance benchmarking internally. I think what would be nice is maybe some raw Element (without emitting parent/children).
I have been trying to build a SAX-like parser from jsoup's existing code. For HTML that is known to be perfectly well-formed, I was able to construct partial DOM trees with a trivial state machine and a CSS-like selector. However, such a simplistic SAX parser would have very limited use for real-world HTML. I'm considering another plan: see if I can change jsoup's DOM parser to (1) emit each new Element as soon as it is found and (2) be interruptible on demand.
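To make the "emit on found, interrupt when wanted" idea concrete, here is a minimal sketch. It does not use jsoup at all: it assumes well-formed (XML-compatible) markup and relies on the JDK's built-in SAX parser, so it illustrates the technique rather than the change proposed above. The class and method names (`FirstMatch`, `find`, `Result`) are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: returns the text of the first <tag> element and stops parsing there.
class FirstMatch {
    static final class Result {
        final String text;       // text of the first match, or null if none was found
        final int elementsSeen;  // number of start tags processed before parsing ended
        Result(String text, int elementsSeen) {
            this.text = text;
            this.elementsSeen = elementsSeen;
        }
    }

    // A two-state machine: depth == 0 means "searching", depth > 0 means "inside the match".
    private static final class Handler extends DefaultHandler {
        final String tag;
        final StringBuilder text = new StringBuilder();
        int depth = 0;
        int seen = 0;
        boolean done = false;

        Handler(String tag) { this.tag = tag; }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            seen++;
            if (depth > 0 || tag.equals(qName)) depth++;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (depth > 0) text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            if (depth > 0 && --depth == 0) {
                done = true;
                // Throwing from a callback is the usual way to abort a SAX parse early.
                throw new SAXException("stopped after first match");
            }
        }
    }

    static Result find(String markup, String tag) {
        Handler handler = new Handler(tag);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(markup)), handler);
        } catch (SAXException e) {
            // Only swallow the exception if it was our own deliberate stop signal.
            if (!handler.done) throw new IllegalStateException("parse failed", e);
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return new Result(handler.done ? handler.text.toString() : null, handler.seen);
    }

    public static void main(String[] args) {
        Result r = find("<html><head><title>Hi</title></head><body><p>x</p></body></html>", "title");
        System.out.println(r.text + " after " + r.elementsSeen + " start tags");
    }
}
```

Because the parse is aborted at the closing tag of the match, nothing after it is read. This is also why the approach only suits known-good input: a strict parser like this one rejects the malformed markup that jsoup's tree builder is designed to repair.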
Hi folks - I have implemented a StreamParser interface in #2096 which should address the needs discussed here. Would welcome your reviews and feedback. (And yes, I know it's seven years later... :) )
Hear me out... I know that jsoup is mostly DOM oriented but I'd like to recommend that we consider adding an optional SAX feature.
This is partially related to this issue:
#758
but I would like to expand on a description.
Right now the TreeBuilder functionality implements the parsing we need; however, there is no way to listen to the parse as it happens.
Some of our queries would be better and faster if they were evaluated as the document is parsed (rather than with CSS selectors afterwards).
This way we could fold the searching into the parse itself and avoid a follow-up pass that recurses into the tree.
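As a rough illustration of what "listening to the parse" buys, the sketch below evaluates a simple query (all `href` values of `<a>` elements) inside the parse callbacks, with no tree built and no second pass. It uses the JDK's SAX parser on well-formed markup as a stand-in, since TreeBuilder exposes no such hook; `LinkCollector` and `collectLinks` are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: runs the "query" while the document is being parsed.
class LinkCollector {
    static List<String> collectLinks(String markup) {
        List<String> links = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                // The match is tested once per start tag, as the tag is encountered.
                if ("a".equals(qName)) {
                    String href = attrs.getValue("href");
                    if (href != null) links.add(href);
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(markup)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return links;
    }

    public static void main(String[] args) {
        String doc = "<html><body><a href=\"/one\">1</a><p><a href=\"/two\">2</a></p></body></html>";
        System.out.println(collectLinks(doc)); // prints [/one, /two]
    }
}
```

The equivalent DOM approach would parse the whole document into a tree and then walk it with a selector; here the only state kept is the result list.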
Perhaps "progress" can be implemented in the same framework:
#656