Support SAX style event parser (not just DOM) #824

Closed
burtonator opened this issue Feb 8, 2017 · 4 comments

@burtonator

Hear me out... I know that jsoup is mostly DOM-oriented, but I'd like to recommend that we consider adding an optional SAX feature.

This is partially related to this issue:

#758

but I would like to expand on the description.

Right now the TreeBuilder implements the parsing we need; however, there is no way to listen to the parse as it happens.

Some of our queries would be better and faster if they were evaluated while the document is being parsed, rather than run afterwards as CSS selectors.

This way we could fold the searching into the parse itself and avoid having to follow up by recursing into the tree.

Perhaps "progress" can be implemented in the same framework:

#656

@jokester

jokester commented Feb 24, 2017

Considering similar stuff here. Glad to know I'm not alone.

I experienced sub-optimal parsing performance a few days ago when using jsoup in my Android app.

I can think of a few scenarios where a SAX-like parser can be advantageous:

  • when the sole aim of parsing is to extract data from HTML and a complete DOM tree is not required.
  • when latency matters. A SAX parser can emit a portion of data immediately after it is found. This should play well with streaming paradigms like RxJava.
  • when calling Element.select() is too slow. If I read correctly, jsoup's selector may traverse the whole subtree without much pruning.
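
To make the first two scenarios concrete, here is a minimal sketch of event-driven extraction that never builds a tree. It uses the JDK's built-in SAX parser rather than jsoup, so it only accepts well-formed XML/XHTML and will reject typical real-world HTML; the `LinkCollector` class and its `collect` helper are hypothetical names chosen for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Illustrative only: collects href values from <a> start-tag events, with no DOM. */
final class LinkCollector extends DefaultHandler {
    final List<String> hrefs = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        // Called by the parser as soon as each start tag is read.
        if ("a".equals(qName)) {
            String href = attrs.getValue("href");
            if (href != null) {
                hrefs.add(href);
            }
        }
    }

    /** Parses well-formed XHTML and returns the hrefs in document order. */
    static List<String> collect(String xhtml) {
        LinkCollector handler = new LinkCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            // Assumption for the sketch: surface any parse failure as unchecked.
            throw new IllegalStateException("input was not well-formed", e);
        }
        return handler.hrefs;
    }
}
```

Each href is available the moment its start tag is seen, which is the property that would let results be pushed into a streaming pipeline before the rest of the document has been read.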

I may be able to start doing something in a few days.

@burtonator
Author

Hey Wang. We're also going to do some SAX vs JDOM performance benchmarking internally.

I think what would be nice is maybe some raw Element (without emitting parent/children).

@jokester

jokester commented Mar 23, 2017

I have been trying to build a SAX-like parser from jsoup's existing code.

For HTML that is known to be well-formed, I was able to construct partial DOM trees with a trivial state machine and a CSS-like selector such as >html >body >div#main >ul.class-1 >li
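
A state machine of this kind might look roughly like the sketch below. It is not the code described in this comment: `ChildPathMatcher` and its methods are hypothetical, it supports only child combinators with one optional `#id` or `.class` per step, and it assumes perfectly nested start/end tag events.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: matches a fixed child-combinator path against start/end tag events. */
final class ChildPathMatcher {
    // Each step is {tag, id-or-null, class-or-null}.
    private final List<String[]> steps = new ArrayList<>();
    private int depth;   // number of currently open elements
    private int matched; // how many leading steps the open ancestor chain satisfies

    ChildPathMatcher(String selector) {
        for (String raw : selector.split(">")) {
            String s = raw.trim();
            if (s.isEmpty()) continue;
            String tag = s, id = null, cls = null;
            int hash = s.indexOf('#');
            int dot = s.indexOf('.');
            if (hash >= 0) {
                tag = s.substring(0, hash);
                id = s.substring(hash + 1);
            } else if (dot >= 0) {
                tag = s.substring(0, dot);
                cls = s.substring(dot + 1);
            }
            steps.add(new String[] {tag, id, cls});
        }
    }

    /** Call on every start tag; returns true when this element completes the path. */
    boolean startTag(String tag, String id, String classAttr) {
        boolean advanced = false;
        // Only an element directly below the last matched step can extend the match.
        if (depth == matched && matched < steps.size()) {
            String[] step = steps.get(matched);
            boolean ok = step[0].equals(tag)
                    && (step[1] == null || step[1].equals(id))
                    && (step[2] == null || hasClass(classAttr, step[2]));
            if (ok) {
                matched++;
                advanced = true;
            }
        }
        depth++;
        return advanced && matched == steps.size();
    }

    /** Call on every end tag; unwinds the match if the closing element was part of it. */
    void endTag() {
        if (depth == 0) return;
        if (depth == matched) matched--;
        depth--;
    }

    private static boolean hasClass(String classAttr, String wanted) {
        return classAttr != null
                && Arrays.asList(classAttr.trim().split("\\s+")).contains(wanted);
    }
}
```

Because the matcher keeps only two counters, anything outside the path can be skipped without allocating nodes. The same simplicity is why it breaks down on malformed markup, where tags do not nest cleanly.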

However, such a simplistic SAX parser would be of very limited use for real-world HTML.
There is actually a complex rule set for parsing real-world HTML, including how to recover from mismatched tags and so on. Adapting those rules to build a completely new standards-compliant SAX parser would be beyond my available time.

I'm considering another plan: see if I can change jsoup's DOM parser to emit each new Element as soon as it is found, and to be interruptible when wanted.
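
The "emit and interrupt" contract can be sketched independently of jsoup's internals. In the hypothetical `EmittingScanner` below, a listener receives each start-tag name as it is found and returns `false` to stop the scan. The regex stands in for a real tokenizer purely for illustration and is not a correct way to parse HTML.

```java
import java.util.Locale;
import java.util.function.Predicate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: pushes start-tag names to a listener that can stop the scan early. */
final class EmittingScanner {
    // Stand-in for a real tokenizer: matches start tags such as <p> or <a href="...">.
    private static final Pattern START_TAG =
            Pattern.compile("<([a-zA-Z][a-zA-Z0-9]*)\\b[^>]*>");

    /**
     * Emits each start-tag name, lowercased, as soon as it is found.
     * Stops when the listener returns false. Returns the number of tags emitted.
     */
    static int scan(String html, Predicate<String> listener) {
        Matcher m = START_TAG.matcher(html);
        int emitted = 0;
        while (m.find()) {
            emitted++;
            boolean keepGoing = listener.test(m.group(1).toLowerCase(Locale.ROOT));
            if (!keepGoing) {
                break; // the caller has what it needs; the rest of the input is never scanned
            }
        }
        return emitted;
    }
}
```

A caller that only wants the page title could return `false` once it sees `title`, so the remainder of the input is left untouched.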

@jhy
Owner

jhy commented Jul 1, 2024

Hi folks - I have implemented a StreamParser interface in #2096 which should address the needs discussed here. Would welcome your reviews and feedback. (And yes, I know it's seven years later... :) )

@jhy jhy closed this as completed Jul 1, 2024
@jhy jhy added this to the 1.18.1 milestone Jul 1, 2024
@jhy jhy added the improvement label Jul 1, 2024