Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use parse5 as a default parser (closes #863) #985

Merged
merged 5 commits into from
Dec 20, 2020
Merged

Conversation

inikulin
Copy link
Contributor

@inikulin inikulin commented Feb 21, 2017

This PR makes parse5 a default parser for cheerio. If any parsing options are set to non-default values cheerio fallbacks to htmlparser2. Additionally, useHtmlParser2 option was introduced to force usage of htmlparser2.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.006%) to 98.956% when pulling 0ac3996 on inikulin:parse5 into 51e3645 on cheeriojs:master.

1 similar comment
@coveralls
Copy link

coveralls commented Feb 21, 2017

Coverage Status

Coverage increased (+0.006%) to 98.956% when pulling 0ac3996 on inikulin:parse5 into 51e3645 on cheeriojs:master.

@inikulin inikulin changed the title Use parse5 as a default parser (closes #863) [WIP] Use parse5 as a default parser (closes #863) Feb 25, 2017
@inikulin
Copy link
Contributor Author

inikulin commented Feb 25, 2017

Please, do not merge yet. I'll go through open issues and reference the ones that will be closed by this PR along with regression tests.

Update: done (see OP post). \cc @jugglinmike @fb55

@coveralls
Copy link

Coverage Status

Coverage increased (+0.006%) to 98.956% when pulling 4fa8547 on inikulin:parse5 into 51e3645 on cheeriojs:master.

4 similar comments
@coveralls
Copy link

Coverage Status

Coverage increased (+0.006%) to 98.956% when pulling 4fa8547 on inikulin:parse5 into 51e3645 on cheeriojs:master.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.006%) to 98.956% when pulling 4fa8547 on inikulin:parse5 into 51e3645 on cheeriojs:master.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.006%) to 98.956% when pulling 4fa8547 on inikulin:parse5 into 51e3645 on cheeriojs:master.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.006%) to 98.956% when pulling 4fa8547 on inikulin:parse5 into 51e3645 on cheeriojs:master.

@jugglinmike
Copy link
Member

This is awesome, thank you! I will give it a look as soon as I am able.

@inikulin inikulin changed the title [WIP] Use parse5 as a default parser (closes #863) Use parse5 as a default parser (closes #863) Feb 25, 2017
@tommaton
Copy link

tommaton commented Feb 27, 2017

Hey guys, I know everyone on is very busy, but wondering if there any timescale on when this will be merged in as would really like to start using parse5 with cheerio in a new project.

@jugglinmike
Copy link
Member

Not currently, no. Please remember that this is a non-trivial change and calls for careful review.

Copy link
Member

@fb55 fb55 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good so far!

@@ -164,14 +164,15 @@ $ = cheerio.load('<ul id="fruits">...</ul>', {
});
```

These parsing options are taken directly from [htmlparser2](https://github.com/fb55/htmlparser2/wiki/Parser-options), therefore any options that can be used in `htmlparser2` are valid in cheerio as well. The default options are:
These parsing options are taken directly from [htmlparser2](https://github.com/fb55/htmlparser2/wiki/Parser-options), therefore any options that can be used in `htmlparser2` are valid in cheerio as well. If any of these options is set to non-default value cheerio will implicitly use `htmlparser2` as an underlying parser. In addition, you can use `useHtmlParser2` option to force cheerio use `htmlparser2` instead of `parse5`. The default options are:

```js
{
withDomLvl1: true,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is withDomLvl1 implemented in the parse5 tree adapter?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had the same question, so I experimented a bit and found that Dom Level 1 support is analogous. Here's the relevant source code. Great!

lib/parse.js Outdated

// Update the dom using the root
exports.update(dom, root);

return root;
};

function parseWithParse5 (content) {
var parseAsDocument = /^(\s|<!--.*?-->)*?<(!doctype|html|head|body)(.*?)>/i.test(content),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a smart hack, I'm not sure if it's expected though. IMHO we should only parse fragments right now and switch to full parsing in cheerio@1.0

Copy link
Contributor Author

@inikulin inikulin Feb 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we'll always use fragment parsing parser will omit <html>, <body> and <head> tags and doctypes, so it will basically will return empty AST for full documents. On the other hand, enabling full parsing will always add these tags. Even in 1.0 you will still need this hack.

Copy link
Contributor Author

@inikulin inikulin Feb 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Speaking clearly, it just enables the behavior that we currently have in cheerio. Unless we'll not add explicit API for document and fragment parsing, but as far as I can tell, it will lead to breaking change.

@fb55
Copy link
Member

fb55 commented Feb 27, 2017 via email

@inikulin
Copy link
Contributor Author

inikulin commented Feb 27, 2017

@fb OK, let's stick with this plan:

  • use full-parsing for .load()
  • use fragment parsing for $
  • switch to parse5 AST
  • add optional node decorators for backward compatibility

@inikulin
Copy link
Contributor Author

inikulin commented Feb 27, 2017

I've just realized that it will not be possible to switch to parse5 AST, since we use htmlparser2 for XML.

@jugglinmike
Copy link
Member

Let me start be re-iterating: this is really great work! You've clearly done your homework when it comes to Cheerio's known bugs and backwards compatibility.

This is a significant change to the core behavior of the library. While the value is clear, it's also a risk. Our test suite is strong, but I'm not convinced it is thorough enough to identify all possible regressions.

It's also kind of surprising that any change to the parsing options triggers the use of a completely different parser, especially when one of the options is
useHtmlParser2. For instance:

$ = cheerio.load('<ul id="fruits">...</ul>', {
    xmlMode: true,
    useHtmlParser2: false
});

It's not at all clear that example would actually use htmlparser2. It would be nice to use parse5 for XML also, but I can see why that's not possible.

I agree with @fb55's thoughts about releasing this as part of a new major version for Cheerio. That would:

  • shield most consumers from regressions, unlikely as they seem (woe to those that use the * version specifier)

  • give us the opportunity to make a "clean break" from the unfortunate "implicit document fragment" strategy we currently follow

  • allow us to introduce a more intuitive parser configuration object. I'm open to suggestion about how this should look, but maybe something like this:

      {
        "useHtmlParser2": {
          "withDomLvl1": true,
          "normalizeWhitespace": false,
          "xmlMode": false,
          "decodeEntities": true
        }
      }

Does that sound good to you two?

@coveralls
Copy link

coveralls commented Mar 6, 2017

Coverage Status

Coverage decreased (-1.7%) to 97.29% when pulling 45963f5 on inikulin:parse5 into 51e3645 on cheeriojs:master.

@inikulin
Copy link
Contributor Author

inikulin commented Mar 6, 2017

OK, I've just implemented changes we've discussed with @fb55 before: $.load always parses in document mode, while $ is used for fragment parsing.

Regarding @jugglinmike's suggestion about parser options: I agree that implicit switching of options doesn't look good. I like the idea of useHtmlParser2 options object. But I'm not sure about moving xmlMode option in it, though. It's commonly used it it not quite intuitive to search it in useHtmlParser2 object. Also, setting xmlMode to true implies that parsing behavior will be changed. I believe it would be better to not care about underlying parser when they use this option.

@jugglinmike
Copy link
Member

Good point. I don't want to let difficulty choosing names determine what
features we support, but I'm starting to think that this challenge is actually
a result of feature creep.

Do we really want to commit to simultaneously supporting both parsers for HTML
documents? Instead of allowing the user to control xmlMode and switch
between parse5 and htmlparser2, maybe it makes more sense to just say,
"parse5 is always used for HTML documents, and htmlparser2 is always used
for XML documents."

This would certainly simplify configuration (basically, xmlMode could be
true or an object of options for htmlparser2). But while that resolves the
issue we're currently discussing, I think it might be a wiser choice for the
long-term maintenance of the project. That's because a hard delineation between
document parsers would tend to limit the variation in future bug reports. We
won't have to ask, "which HTML parser are you using?" and users won't have to
wonder the same thing when debugging their own code.

@fb55 I'd love your feedback here, too. Does it seem wise to support a single
parser for each document type? Or am I getting carried away by a preoccupation
with a minor detail about the options object?

@inikulin
Copy link
Contributor Author

Any feedback on this?

@fb55
Copy link
Member

fb55 commented Mar 21, 2017

Sorry for the delay :/

As far as I remember the main reason to keep htmlparser2 was to have a parser that does not modify the DOM structure, or supports weird DOM structures (self-closing tags!). I would agree that that's in the realm of feature-creep. For XML, isaacs/sax-js is the better option, although it is stricter with what it considers valid XML. (It's also what jsdom is using for xml.)

I'm not sure how we handle the DOM transition best. One option would be to continue to support the old names (ideally with the option to disable it), another to provide a codemod that tries to rename all property accesses. For the latter option we could just stick with parse5's DOM tree and continue to use withDomLvl1 in htmlparser2.

@inikulin
Copy link
Contributor Author

To conclude, let's figure out features which should get into this PR and thus in 1.0. I believe we should use options structure proposed by @jugglinmike in #985 (comment), but make xmlMode top-level option. Regarding tree format and switch to sax parser I'm not quite sure if it should get into release. Maybe it makes sense to do changes iteratively? Or we'll need to make major version bump once again after tree format switch?

@jugglinmike
Copy link
Member

(@inikulin) I believe we should use options structure proposed by @jugglinmike in #985 (comment), but make xmlMode top-level option.

Actually, I'm not in favor of this modification to my recommendation. This is
what I was describing above when I wrote, "It's also kind of surprising that
any change to the parsing options triggers the use of a completely different
parser [...]" and it also puts us "on the hook" for supporting two different
HTML parsers at the same time (something else I argued against earlier in this
conversation). What I'd like to see is simply:

{ xml: true }

or

{ xml: { /* htmlparser2 options */ } }

Although we may end up removing the second form prior to releasing this.

(@inikulin) Regarding tree format and switch to sax parser I'm not quite sure
if it should get into release. Maybe it makes sense to do changes
iteratively? Or we'll need to make major version bump once again after tree
format switch?

This sounds good to me. You've done a great job here, but I don't want to keep
you "on the hook" for even more. Let's plan to land this in a v1.0 feature
branch. Then @fb55 and I (and anyone else, including yourself) can iterate on
top of it.

...and on that note:

(@fb55) I'm not sure how we handle the DOM transition best. One option would
be to continue to support the old names (ideally with the option to disable
it), another to provide a codemod that tries to rename all property accesses.
For the latter option we could just stick with parse5's DOM tree and continue
to use withDomLvl1 in htmlparser2.

This may be callous of me, but I'm not particularly concerned with the end-user
transition here. Prior to the introduction of DOM level 1 attributes in
htmlparser2 (which occurred over 3 years
ago
, believe it or not), Cheerio was
silent about the "node-like" objects that Cheerio collections contained. Since
then, we've carefully documented only the standard
attributes
.
So in SemVer terms, removing the non-standard attributes would technically be
valid for even a patch release. Even I can see that would be a little harsh,
but it does make the idea of removing them in a major release seem fair enough.

So to sum up, here's what I'm proposing for our road map:

  1. We update this patch to allow parsing configuration such that htmlparser2
    is only used for XML parsing
  2. We land this as a backwards-breaking change to a dedicated v1.0 branch,
  3. We decide whether we want to switch to sax-js
  4. We release version 1.0

Does that sound reasonable to you two?

@inikulin
Copy link
Contributor Author

Does that sound reasonable to you two?

Works for me. @fb55, do you have any objections?

@fb55
Copy link
Member

fb55 commented Mar 27, 2017 via email

@stevenvachon
Copy link
Contributor

@jugglinmike It's my opinion that normalizeWhitespacedoes not belong in cheerio. Cheerio is for parsing and traversing/scraping. Optimizations should be done with something like html-minifier or htmlnano.

@jugglinmike
Copy link
Member

The first implementation of this PR used a fallback to htmlparser2 when these
options are specified. Then we decided to avoid such implicit switching of
parsers. Any ideas how we can handle it now to avoid confusion?

Now that we're disallowing htmlparser2 use generally, xmlMode is
superfluous. Parsing options only apply when parsing as XML, so I think the
best way to avoid confusion is to combine the xml and xmlMode options. We
can support all the use cases by simply exposing an option named xml that is
used as follows:

xml value Effect
undefined parse input as HTML using parse5
true parse input as XML using htmlparser2 with the default parsing options
Object parse input as XML using htmlparser2 with the options specified by the object

I'm hoping this will avoid the kind of confusion voice in this
discussion
:

Then I am really confused because the README says the default parser is
parse5 then we say the options are htmlparser2, hard to make sense of this.

Internally, we'll still merge the parsing options to the top-level object and
set xmlMode: true (in order to satisfy htmlparser2), but that is an
implementation detail that we don't need to document.

I've pushed a commit to #1033 that
implements this change and updates the documentation accordingly. Please let me
know what you think!

@jugglinmike jugglinmike mentioned this pull request Jun 24, 2017
@fb55 fb55 mentioned this pull request Feb 21, 2018
4 tasks
@gajus
Copy link

gajus commented May 8, 2018

Is there a maintained fork with this PR?

@fb55
Copy link
Member

fb55 commented Dec 20, 2020

I'm merging this into master to resolve all of the linked issues. Afterwards, I will remove the master branch, rename the current 1.0.0 branch to main and have that be the new top branch.

These changes have already been merged into 1.0.0, so this PR resolved. Thanks a lot @inikulin for putting it up in the first place!

@dor-benatia
Copy link

@inikulin can you please explain how this removed the recursive node decoration calls ?

@fb55
Copy link
Member

fb55 commented Apr 14, 2021

@dor-benatia This is an independent change; see #1559.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment