LexicalTreeParser #132

inikulin · 2016-05-19T20:38:42Z

What?

A parser that preserves the lexical nesting order of nodes in the source document. Put simply, we will just run simple nesting logic on top of the SAXParser machinery.

Why?

For some use scenarios (e.g. a content-modifying proxy) it's important to preserve page semantics / layout, which may break during reparsing (see examples in whatwg/html#1280). Other use cases that come to mind: code folding in editors, element attribute instrumentation, and text data extraction. Also, Cheerio is willing to replace htmlparser2 with parse5, so I guess it would be nice to have an optional "forgiving parsing" (I know... 🐐) mode.

domenic · 2016-05-19T21:00:30Z

Seems bad... very un-parse5 like

inikulin · 2016-05-19T21:19:09Z

@domenic Well, the content-modifying proxy scenario really bothers me. I guess we can go with a separate package named forgive-me-whatwg which will use SAXParser + tree adapter + black magic under the hood. But maybe someone has a better idea?

domenic · 2016-05-19T21:51:33Z

I guess maybe the serialization algorithm is what needs fixing for the content-modifying proxy scenario?

RReverser · 2016-05-19T21:51:44Z

@inikulin Is there a possibility to do this via custom external adapter or, if not, expose APIs that could allow it to do this? (Ideally, I'd like to fix the serialization in spec, but apparently that's unlikely)

inikulin · 2016-05-19T22:02:48Z

@RReverser Yes, I have an idea how we can do this.

zcorpan · 2016-05-20T08:20:18Z

I think this is actually a reasonable approach for parsing things you know have been serialized. Maybe it can raise a fatal error if it hits an unexpected end tag? Also, you still need to handle at least the same special cases as the HTML serializer does: void elements, title/textarea/style/script, plaintext (for this one you also need to change the serializer to just stop serializing before </plaintext>), template, etc. But you also need to deal with foreign content. Consider <title><x></x></title><svg><title><x></x></title></svg>: the HTML title has a text node child, the SVG title has an element x child.
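To make the "fatal error on an unexpected end tag" idea concrete, here is a minimal sketch. The token shape (`{ type, tagName }`) and the function name `assertWellNested` are hypothetical and are not parse5 APIs. It only covers void elements; the other special cases listed above (title/textarea/style/script, plaintext, template, foreign content) are deliberately left out.

```javascript
// Void elements never get an end tag in serialized HTML, so they are never pushed.
const VOID_ELEMENTS = new Set([
  'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
  'link', 'meta', 'param', 'source', 'track', 'wbr'
]);

// Hypothetical token shape: { type: 'startTag' | 'endTag' | 'text', tagName?: string }
// Throws on the first end tag that does not close the innermost open element.
// Returns the names of elements still open at the end of input.
function assertWellNested(tokens) {
  const open = [];
  for (const token of tokens) {
    if (token.type === 'startTag') {
      if (!VOID_ELEMENTS.has(token.tagName)) open.push(token.tagName);
    } else if (token.type === 'endTag') {
      const expected = open.pop();
      if (expected !== token.tagName) {
        const hint = expected ? `expected </${expected}>` : 'no element is open';
        throw new Error(`Unexpected end tag </${token.tagName}> (${hint})`);
      }
    }
    // Text and other token types do not affect nesting.
  }
  return open;
}
```

Because serializer output closes every non-void element explicitly, a strict "innermost element only" check is enough for this sketch; input that was not produced by a serializer would trip it quickly.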

If you want to make form association survive, you could, after the first parse, check each form control that is not a descendant of a form element and doesn't have a form attribute already; if it has a form owner, set a form attribute on it that points to the form element (and set an id on that form if it doesn't have one already).
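The form-association step could look roughly like the sketch below. Everything about the node shape is an assumption: plain objects with `tagName`, `attrs`, `children`, and a `formOwner` reference recorded by the first (spec-compliant) parse. parse5 trees do not expose a form owner, so that field would have to be tracked separately. The set of controls is not exhaustive and the generated id prefix is made up.

```javascript
// Hypothetical, non-exhaustive set of controls that accept a `form` attribute.
const FORM_CONTROLS = new Set(['button', 'fieldset', 'input', 'output', 'select', 'textarea']);

// Walks the tree and pins down the form owner of every control that is not
// nested inside a <form> and has no explicit `form` attribute yet.
function preserveFormOwners(node, insideForm = false, state = { nextId: 0 }) {
  const isOrphanControl =
    FORM_CONTROLS.has(node.tagName) && !insideForm && node.attrs.form === undefined;

  if (isOrphanControl && node.formOwner) {
    const form = node.formOwner;
    // Give the form an id if it does not have one already.
    // Assumption: the generated id does not collide with an existing one.
    if (form.attrs.id === undefined) {
      form.attrs.id = `lexical-form-${state.nextId++}`;
    }
    node.attrs.form = form.attrs.id;
  }

  const childInsideForm = insideForm || node.tagName === 'form';
  for (const child of node.children) {
    preserveFormOwners(child, childInsideForm, state);
  }
}
```

After this pass, reserializing and reparsing should keep the association through the explicit `form` attribute rather than through the parser's form element pointer.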

If you want to make the document mode survive (quirks mode, almost quirks or no-quirks), that also needs something for cases like <!doctype html />.

zcorpan · 2016-05-20T08:22:48Z

Hmm, maybe we shouldn't try to fix plaintext. I think it's unfixable for the foster parenting case.

<table><tr><td>foo</td></tr>
<plaintext>bar

inikulin · 2016-05-20T09:37:47Z

Also you still need to handle the same special cases as the HTML serializer does, at least - void elements, title/textarea/style/script, plaintext (for this one you need to change the serializer also to just stop serializing before </plaintext>), template, etc. But you also need to deal with foreign content, consider <title><x></x></title><svg><title><x></x></title></svg> - the HTML title has a text node child, the SVG title has an element x child.

We have ParserFeedbackSimulator, which can handle all these cases. My idea was to run the tokenizer + feedback simulator and just maintain a simple open elements stack (an end tag closes all elements in the stack up to and including the element with this tag name; void elements are popped automatically).
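The open elements stack described here can be sketched as follows. This is not the parse5 implementation: the token shape, the node shape, and the name `buildLexicalTree` are hypothetical, and the tokens are assumed to have already passed through something like the tokenizer + feedback simulator. Silently ignoring an end tag with no matching open element is also an assumption; the comment above does not say what should happen in that case.

```javascript
const VOID_TAGS = new Set([
  'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
  'link', 'meta', 'param', 'source', 'track', 'wbr'
]);

// Hypothetical token shape:
//   { type: 'startTag' | 'endTag', tagName } or { type: 'text', value }
function buildLexicalTree(tokens) {
  const root = { name: '#root', children: [] };
  const stack = [root]; // open elements; the root is never popped

  for (const token of tokens) {
    const parent = stack[stack.length - 1];

    if (token.type === 'text') {
      parent.children.push({ name: '#text', value: token.value });
    } else if (token.type === 'startTag') {
      const node = { name: token.tagName, children: [] };
      parent.children.push(node);
      // Void elements are "popped automatically": they are never pushed.
      if (!VOID_TAGS.has(token.tagName)) stack.push(node);
    } else if (token.type === 'endTag') {
      // Find the nearest open element with this tag name.
      let i = stack.length - 1;
      while (i > 0 && stack[i].name !== token.tagName) i--;
      // Close it together with everything opened after it.
      // Assumption: a stray end tag (no match) is ignored.
      if (i > 0) stack.length = i;
    }
  }
  return root;
}
```

For example, feeding it the tokens for `<div><p>a<br>b</div>c` yields a `div` containing a `p` (with children `a`, `br`, `b`), followed by the text `c` as a sibling of the `div`: the `</div>` end tag closes the still-open `p` as well.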

jescalan · 2016-07-28T15:54:34Z

Hi there! Coming on over from #144 to add my support for this. I'm working on posthtml, which is essentially a plugin-based transform system for HTML, much like (you guessed it) postcss is for CSS. I really like parse5: it's extremely thorough and stable, and its line location info is a huge help for plugin error messages and source maps.
But it needs just this small amount of extra flexibility in order to be viable for plugin authors and users.

Right now this is a blocking issue for me, as I'm working full time trying to get to a stable release. I'm more than happy to help out with the implementation here if that would make this happen faster, if someone could give me a little walkthrough of the codebase!

wooorm · 2016-09-04T19:00:52Z

I’m also in need of the lexical tree. Plus, I want it to patch that tree with automatically inserted* elements, optionally.

I’m not sure if it’s possible to unlink the two, but if it is, I don’t see how it’s “un-parse5 like” to do that if the core API can stay the same?

* What’s the proper term here?

domenic · 2016-09-04T20:12:34Z

Well, an HTML parser is something that follows the HTML Standard and produces a DOM tree. I guess you might be looking for something like an HTML lexer or tokenizer, although I don't know what kind of object that would produce (not a DOM), and there's no standard governing its behavior. That's why it's fairly un-parse5-like to attempt to add such features to parse5, which is an HTML parser library.

inikulin · 2016-10-19T19:47:45Z

I'll return to this topic later with a better solution. Meanwhile, for such scenarios I suggest using https://github.com/reshape/parser

inikulin added the proposal label May 19, 2016

inikulin mentioned this issue Jun 6, 2016

Properly test serializer #137

Closed

inikulin mentioned this issue Jul 28, 2016

Fragment / Document Question #144

Closed

This was referenced Aug 26, 2016

Make it possible to use parse5 as optional(?) parser cheeriojs/cheerio#863

Closed

White space between nodes in #document, html #150

Closed

inikulin closed this as completed Oct 19, 2016

inikulin mentioned this issue Jun 16, 2017

Suggestion: Incremental Parsing #201

Closed
