Use parse5 as a default parser (closes #863) #985

Merged

fb55 merged 5 commits into cheeriojs:master from inikulin:parse5

Dec 20, 2020

Contributor

inikulin commented Feb 21, 2017

This PR makes parse5 the default parser for cheerio. If any parsing option is set to a non-default value, cheerio falls back to htmlparser2. Additionally, a useHtmlParser2 option was introduced to force the use of htmlparser2.

closes Make it possible to use parse5 as optional(?) parser #863
closes Greater than followed by equal sign is lost #860
closes Parse failure #126 (it seems this was already fixed in htmlparser2, btw)
closes Recusion in parse.js: Maximum call stack size exceeded #240 (not related to this issue, but was fixed by removing recursive node decoration)
closes Loading https://support.microsoft.com/ using Cheerio dom manipulation not working properly. #405
closes <td> tags with no <tr> are not properly handled #422
closes Less-than character in text node breaks parser #522
closes incorrect html4 parsing #694 (wasn't a bug, htmlparser2 handled it correctly)
closes Unexpected behaviour with custom tag "line" #746
closes Cheerio failing with lax HTML structure? #826
closes Fails to parse datauris properly #907
closes Incorrect innerHTML of <xmp> tag #915
closes text function duplicates characters when it ends in &# ampersand hash #937
closes childCombinator does breaks when there is an unclosed tag #971
closes pre tag parsing bugs #997
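
The fallback rule described in this PR could be sketched roughly as follows. This is an illustration only, not the code from the PR: `chooseParser` and `DEFAULT_OPTIONS` are hypothetical names, and the default values are assumed to be the ones listed in cheerio's README.

```javascript
// Hypothetical sketch of the fallback rule; not the actual code from this PR.
// Default values are assumed from the options listed in cheerio's README.
const DEFAULT_OPTIONS = {
  withDomLvl1: true,
  normalizeWhitespace: false,
  xmlMode: false,
  decodeEntities: true,
};

function chooseParser(options = {}) {
  // An explicit opt-in always forces htmlparser2.
  if (options.useHtmlParser2) return 'htmlparser2';

  // Any parsing option set to a non-default value triggers the fallback.
  const hasNonDefault = Object.keys(DEFAULT_OPTIONS).some(
    (key) => key in options && options[key] !== DEFAULT_OPTIONS[key]
  );
  return hasNonDefault ? 'htmlparser2' : 'parse5';
}

console.log(chooseParser());                          // parse5
console.log(chooseParser({ decodeEntities: false })); // htmlparser2
console.log(chooseParser({ useHtmlParser2: true }));  // htmlparser2
```

Under this rule, passing `useHtmlParser2: false` together with a non-default option still selects htmlparser2, because the non-default option alone is enough to trigger the fallback.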

coveralls commented Feb 21, 2017

Coverage increased (+0.006%) to 98.956% when pulling 0ac3996 on inikulin:parse5 into 51e3645 on cheeriojs:master.

inikulin changed the title ~~Use parse5 as a default parser (closes #863)~~ [WIP] Use parse5 as a default parser (closes #863)

Contributor Author

inikulin commented Feb 25, 2017

Please, do not merge yet. I'll go through open issues and reference the ones that will be closed by this PR along with regression tests.

Update: done (see the original post). /cc @jugglinmike @fb55


          Use parse5 as a default parser (closes cheeriojs#863)

4fa8547

inikulin force-pushed the parse5 branch from 0ac3996 to 4fa8547 Compare

February 25, 2017 13:44

coveralls commented Feb 25, 2017

Coverage increased (+0.006%) to 98.956% when pulling 4fa8547 on inikulin:parse5 into 51e3645 on cheeriojs:master.

Member

jugglinmike commented Feb 25, 2017

This is awesome, thank you! I will give it a look as soon as I am able.

inikulin changed the title ~~[WIP] Use parse5 as a default parser (closes #863)~~ Use parse5 as a default parser (closes #863)

tommaton commented Feb 27, 2017

Hey guys, I know everyone is very busy, but I'm wondering if there is any timescale for when this will be merged, as I would really like to start using parse5 with cheerio in a new project.

Member

jugglinmike commented Feb 27, 2017

Not currently, no. Please remember that this is a non-trivial change and calls for careful review.

fb55 requested changes

Member

fb55 left a comment

Looking good so far!

Readme.md

@@ -164,14 +164,15 @@ $ = cheerio.load('<ul id="fruits">...</ul>', {
               });
               ```
-              These parsing options are taken directly from [htmlparser2](https://github.com/fb55/htmlparser2/wiki/Parser-options), therefore any options that can be used in `htmlparser2` are valid in cheerio as well. The default options are:
+              These parsing options are taken directly from [htmlparser2](https://github.com/fb55/htmlparser2/wiki/Parser-options), therefore any options that can be used in `htmlparser2` are valid in cheerio as well. If any of these options is set to non-default value cheerio will implicitly use `htmlparser2` as an underlying parser. In addition, you can use `useHtmlParser2` option to force cheerio use `htmlparser2` instead of `parse5`. The default options are:
               ```js
               {
                   withDomLvl1: true,

Member

fb55 Feb 27, 2017

Is withDomLvl1 implemented in the parse5 tree adapter?

Member

jugglinmike Mar 5, 2017

I had the same question, so I experimented a bit and found that Dom Level 1 support is analogous. Here's the relevant source code. Great!

lib/parse.js Outdated

                 // Update the dom using the root
                 exports.update(dom, root);
                 return root;
               };
+              function parseWithParse5 (content) {
+                var parseAsDocument = /^(\s|<!--.*?-->)*?<(!doctype|html|head|body)(.*?)>/i.test(content),

Member

fb55 Feb 27, 2017

This is a smart hack, I'm not sure if it's expected though. IMHO we should only parse fragments right now and switch to full parsing in cheerio@1.0

Contributor Author

inikulin Feb 27, 2017

If we always use fragment parsing, the parser will omit <html>, <body> and <head> tags and doctypes, so it will basically return an empty AST for full documents. On the other hand, enabling full parsing will always add these tags. Even in 1.0 you will still need this hack.
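
For illustration, the detection check from the diff above can be exercised on its own. The regular expression is copied from the `parseWithParse5` snippet; the `looksLikeDocument` wrapper name is hypothetical and exists only for this sketch.

```javascript
// Regular expression copied from the diff above; the wrapper name is hypothetical.
const DOCUMENT_RE = /^(\s|<!--.*?-->)*?<(!doctype|html|head|body)(.*?)>/i;

function looksLikeDocument(content) {
  // True when the input starts (after whitespace or comments) with a
  // doctype, <html>, <head> or <body> tag.
  return DOCUMENT_RE.test(content);
}

console.log(looksLikeDocument('<!DOCTYPE html><html><body></body></html>')); // true
console.log(looksLikeDocument('<!-- note -->\n<body><p>hi</p></body>'));      // true
console.log(looksLikeDocument('<ul id="fruits">...</ul>'));                   // false
```

Inputs that pass this check would be parsed as full documents, and everything else as a fragment, which is what preserves the existing behavior.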

Contributor Author

inikulin Feb 27, 2017

To be clear, it just preserves the behavior that we currently have in cheerio. The alternative would be to add an explicit API for document and fragment parsing, but as far as I can tell, that would be a breaking change.

Member

fb55 commented Feb 27, 2017 via email

Okay, I didn't realize that's the case. My preferred final behavior would be to have .load load documents and $() parse fragments. That suggestion was only to keep backwards compatibility for now. Given that that won't be possible, we might consider a semver major bump for this. At the same time, we could also adopt the parse5 DOM structure and add hooks for the old DOM structure (getters for type, name etc.) for backwards compatibility, with eventually an option to turn it off for better performance.

Contributor Author

inikulin commented Feb 27, 2017

@fb55 OK, let's stick with this plan:

- use full parsing for .load()
- use fragment parsing for $
- switch to the parse5 AST
- add optional node decorators for backward compatibility
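
One possible shape for the optional node decorators is a set of getters that expose the old property names on top of a parse5-style node. The sketch below rests on several assumptions: `decorateNode` is a hypothetical helper, the input is a hand-built object mimicking the element shape of parse5's default tree adapter (`tagName`, `attrs`, `childNodes`, `parentNode`), and only a few htmlparser2-style names (`name`, `attribs`, `children`, `parent`) are covered.

```javascript
// Hypothetical sketch: expose htmlparser2-style names on a parse5-style node.
function decorateNode(node) {
  Object.defineProperties(node, {
    name: { get() { return this.tagName; }, configurable: true },
    children: { get() { return this.childNodes; }, configurable: true },
    parent: { get() { return this.parentNode; }, configurable: true },
    attribs: {
      // parse5 stores attributes as [{ name, value }]; htmlparser2 uses a plain object.
      get() {
        const result = {};
        for (const attr of this.attrs) result[attr.name] = attr.value;
        return result;
      },
      configurable: true,
    },
  });
  return node;
}

// Hand-built stand-in for a parse5 element node (no parser is needed to run this).
const li = {
  nodeName: 'li',
  tagName: 'li',
  attrs: [{ name: 'class', value: 'apple' }],
  childNodes: [],
  parentNode: null,
};
decorateNode(li);
console.log(li.name, li.attribs.class); // li apple
```

This gives a read-only view: `attribs` builds a fresh object on each access, so writes to it would not reach the underlying `attrs` array. A real compatibility layer would need setters as well.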

Contributor Author

inikulin commented Feb 27, 2017

I've just realized that it will not be possible to switch to parse5 AST, since we use htmlparser2 for XML.

Member

jugglinmike commented Mar 5, 2017

Let me start by reiterating: this is really great work! You've clearly done your homework when it comes to Cheerio's known bugs and backwards compatibility.

This is a significant change to the core behavior of the library. While the value is clear, it's also a risk. Our test suite is strong, but I'm not convinced it is thorough enough to identify all possible regressions.

It's also kind of surprising that any change to the parsing options triggers the use of a completely different parser, especially when one of the options is useHtmlParser2. For instance:

$ = cheerio.load('<ul id="fruits">...</ul>', {
    xmlMode: true,
    useHtmlParser2: false
});

It's not at all clear that example would actually use htmlparser2. It would be nice to use parse5 for XML also, but I can see why that's not possible.

I agree with @fb55's thoughts about releasing this as part of a new major version for Cheerio. That would:

- shield most consumers from regressions, unlikely as they seem (woe to those that use the * version specifier)
- give us the opportunity to make a "clean break" from the unfortunate "implicit document fragment" strategy we currently follow
- allow us to introduce a more intuitive parser configuration object. I'm open to suggestions about how this should look, but maybe something like this:

  {
    "useHtmlParser2": {
      "withDomLvl1": true,
      "normalizeWhitespace": false,
      "xmlMode": false,
      "decodeEntities": true
    }
  }

Does that sound good to you two?


          Use documents via $.load

45963f5

coveralls commented Mar 6, 2017

Coverage decreased (-1.7%) to 97.29% when pulling 45963f5 on inikulin:parse5 into 51e3645 on cheeriojs:master.

Contributor Author

inikulin commented Mar 6, 2017

OK, I've just implemented the changes we discussed with @fb55 earlier: $.load always parses in document mode, while $ is used for fragment parsing.

Regarding @jugglinmike's suggestion about parser options: I agree that implicit parser switching based on options doesn't look good. I like the idea of a useHtmlParser2 options object, but I'm not sure about moving the xmlMode option into it. It's commonly used, and it's not quite intuitive to look for it inside the useHtmlParser2 object. Also, setting xmlMode to true already implies that parsing behavior will change. I believe it would be better if users didn't have to care about the underlying parser when they use this option.

Member

jugglinmike commented Mar 12, 2017

Good point. I don't want to let difficulty choosing names determine what
features we support, but I'm starting to think that this challenge is actually
a result of feature creep.

Do we really want to commit to simultaneously supporting both parsers for HTML
documents? Instead of allowing the user to control xmlMode and switch
between parse5 and htmlparser2, maybe it makes more sense to just say,
"parse5 is always used for HTML documents, and htmlparser2 is always used
for XML documents."

This would certainly simplify configuration (basically, xmlMode could be
true or an object of options for htmlparser2). But while that resolves the
issue we're currently discussing, I think it might be a wiser choice for the
long-term maintenance of the project. That's because a hard delineation between
document parsers would tend to limit the variation in future bug reports. We
won't have to ask, "which HTML parser are you using?" and users won't have to
wonder the same thing when debugging their own code.

@fb55 I'd love your feedback here, too. Does it seem wise to support a single
parser for each document type? Or am I getting carried away by a preoccupation
with a minor detail about the options object?

Contributor Author

inikulin commented Mar 20, 2017

Any feedback on this?

Member

fb55 commented Mar 21, 2017

Sorry for the delay :/

As far as I remember the main reason to keep htmlparser2 was to have a parser that does not modify the DOM structure, or supports weird DOM structures (self-closing tags!). I would agree that that's in the realm of feature-creep. For XML, isaacs/sax-js is the better option, although it is stricter with what it considers valid XML. (It's also what jsdom is using for xml.)

I'm not sure how we handle the DOM transition best. One option would be to continue to support the old names (ideally with the option to disable it), another to provide a codemod that tries to rename all property accesses. For the latter option we could just stick with parse5's DOM tree and continue to use withDomLvl1 in htmlparser2.

Contributor Author

inikulin commented Mar 24, 2017

To conclude, let's figure out features which should get into this PR and thus in 1.0. I believe we should use options structure proposed by @jugglinmike in #985 (comment), but make xmlMode top-level option. Regarding tree format and switch to sax parser I'm not quite sure if it should get into release. Maybe it makes sense to do changes iteratively? Or we'll need to make major version bump once again after tree format switch?

Member

jugglinmike commented Mar 25, 2017

(@inikulin) I believe we should use options structure proposed by @jugglinmike in #985 (comment), but make xmlMode top-level option.

Actually, I'm not in favor of this modification to my recommendation. This is
what I was describing above when I wrote, "It's also kind of surprising that
any change to the parsing options triggers the use of a completely different
parser [...]" and it also puts us "on the hook" for supporting two different
HTML parsers at the same time (something else I argued against earlier in this
conversation). What I'd like to see is simply:

{ xml: true }

or

{ xml: { /* htmlparser2 options */ } }

Although we may end up removing the second form prior to releasing this.

(@inikulin) Regarding tree format and switch to sax parser I'm not quite sure
if it should get into release. Maybe it makes sense to do changes
iteratively? Or we'll need to make major version bump once again after tree
format switch?

This sounds good to me. You've done a great job here, but I don't want to keep
you "on the hook" for even more. Let's plan to land this in a v1.0 feature
branch. Then @fb55 and I (and anyone else, including yourself) can iterate on
top of it.

...and on that note:

(@fb55) I'm not sure how we handle the DOM transition best. One option would
be to continue to support the old names (ideally with the option to disable
it), another to provide a codemod that tries to rename all property accesses.
For the latter option we could just stick with parse5's DOM tree and continue
to use withDomLvl1 in htmlparser2.

This may be callous of me, but I'm not particularly concerned with the end-user
transition here. Prior to the introduction of DOM level 1 attributes in
htmlparser2 (which occurred over 3 years ago, believe it or not), Cheerio was
silent about the "node-like" objects that Cheerio collections contained. Since
then, we've carefully documented only the standard attributes.
So in SemVer terms, removing the non-standard attributes would technically be
valid for even a patch release. Even I can see that would be a little harsh,
but it does make the idea of removing them in a major release seem fair enough.

So to sum up, here's what I'm proposing for our road map:

1. We update this patch to allow parsing configuration such that htmlparser2 is only used for XML parsing
2. We land this as a backwards-breaking change to a dedicated v1.0 branch
3. We decide whether we want to switch to sax-js
4. We release version 1.0

Does that sound reasonable to you two?

Contributor Author

inikulin commented Mar 27, 2017

Does that sound reasonable to you two?

Works for me. @fb55, do you have any objections?

Member

fb55 commented Mar 27, 2017 via email

Sounds reasonable, let's do it :)

Contributor

stevenvachon commented Jun 12, 2017

@jugglinmike It's my opinion that normalizeWhitespace does not belong in cheerio. Cheerio is for parsing and traversing/scraping. Optimizations should be done with something like html-minifier or htmlnano.

xymbol mentioned this pull request

Automatically adding <html><head><body> ? #1031

Closed

inikulin mentioned this pull request

Unexpected serialization of document fragments inikulin/parse5#199

Closed

Member

jugglinmike commented Jun 24, 2017

The first implementation of this PR used a fallback to htmlparser2 when these
options are specified. Then we decided to avoid such implicit switching of
parsers. Any ideas how we can handle it now to avoid confusion?

Now that we're disallowing htmlparser2 use generally, xmlMode is
superfluous. Parsing options only apply when parsing as XML, so I think the
best way to avoid confusion is to combine the xml and xmlMode options. We
can support all the use cases by simply exposing an option named xml that is
used as follows:

| `xml` value | Effect |
| --- | --- |
| `undefined` | parse input as HTML using parse5 |
| `true` | parse input as XML using htmlparser2 with the default parsing options |
| Object | parse input as XML using htmlparser2 with the options specified by the object |

I'm hoping this will avoid the kind of confusion voiced in this
discussion:

Then I am really confused because the README says the default parser is
parse5 then we say the options are htmlparser2, hard to make sense of this.

Internally, we'll still merge the parsing options to the top-level object and
set xmlMode: true (in order to satisfy htmlparser2), but that is an
implementation detail that we don't need to document.
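
The merging described above might look something like this. It is a sketch under assumptions, not the commit pushed to #1033: `flattenOptions` is a hypothetical name, the returned `_useHtmlParser2` flag is an invented internal marker, and treating `xml: false` the same as `undefined` is a guess.

```javascript
// Hypothetical sketch of the option handling described above.
function flattenOptions(options = {}) {
  const { xml, ...rest } = options;

  // `xml` absent (assumed: or false): parse as HTML with parse5.
  if (!xml) return { ...rest, _useHtmlParser2: false };

  // `xml: true` means htmlparser2 defaults; an object supplies htmlparser2 options.
  const parserOptions = xml === true ? {} : xml;

  // Merge to the top level and force xmlMode so htmlparser2 parses as XML.
  return { ...rest, ...parserOptions, xmlMode: true, _useHtmlParser2: true };
}

console.log(flattenOptions({ xml: true }));
console.log(flattenOptions({ xml: { decodeEntities: false } }));
```

Placing `xmlMode: true` last in the merge means a user-supplied `xmlMode` inside the `xml` object cannot switch XML parsing back off.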

I've pushed a commit to #1033 that
implements this change and updates the documentation accordingly. Please let me
know what you think!

jugglinmike mentioned this pull request

V1 release notes 2 #1033

Closed

guyisra mentioned this pull request

Is this still maintained? #1121

Closed

fb55 mentioned this pull request

Merged

4 tasks

gajus commented May 8, 2018

Is there a maintained fork with this PR?

macedigital mentioned this pull request

doctype becomes uppercase after using gulp-sri-hash macedigital/gulp-sri-hash#21

Closed

hansottowirtz mentioned this pull request

Support htmlparser2 Automattic/juice#352

Merged


          Merge branch 'master' into parse5

f78de59

Member

fb55 commented Dec 20, 2020

I'm merging this into master to resolve all of the linked issues. Afterwards, I will remove the master branch, rename the current 1.0.0 branch to main and have that be the new top branch.

These changes have already been merged into 1.0.0, so this PR is resolved. Thanks a lot @inikulin for putting it up in the first place!

fb55 merged commit 6e115ee into cheeriojs:master

This was referenced Dec 22, 2020

Greater than followed by equal sign is lost #860

Closed

Recusion in parse.js: Maximum call stack size exceeded #240

Closed

<td> tags with no <tr> are not properly handled #422

Closed

Less-than character in text node breaks parser #522

Closed

Unexpected behaviour with custom tag "line" #746

Closed

Cheerio failing with lax HTML structure? #826

Closed

Incorrect innerHTML of <xmp> tag #915

Closed

childCombinator does breaks when there is an unclosed tag #971

Closed

pre tag parsing bugs #997

Closed

If the text contains '<', the following text will be recognized as beginning of another tag #929

Closed

the # symbol inside a text breaks the parser #1075

Closed

Direct Child after body #905

Closed

An immediately closed HTML tag does not parse properly #1200

Closed

HenryQW mentioned this pull request

chore: optimize performance (replace he with entities) DIYgod/RSSHub#6497

Merged

dor-benatia commented Apr 12, 2021

@inikulin can you please explain how this removed the recursive node decoration calls?

Member

fb55 commented Apr 14, 2021

@dor-benatia This is an independent change; see #1559.

PeachScript mentioned this pull request

fix(preset-umi): modifyHTML hook oom in large static site app umijs/umi#12215

Merged
