Fragment / Document Question #144

Closed
jescalan opened this issue Jul 28, 2016 · 21 comments

@jescalan

Hi there! Thanks so much for making this fantastic library first of all 💖

So I have a use case where I am wrapping it around a library that essentially allows partials/includes, so you could do something like this:

<!doctype html>
<html>
  <include src='./head.html'>
  <body>
    <p>hello world!</p>
  </body>
</html>

And let's say for example that head.html was:

<head>
  <title>Example Page</title>
  <!-- other meta info -->
</head>

I'm trying to figure out how I can get parse5 to be able to handle this situation. Using the html fragment parse appears to remove head, body, and html tags, but using a normal document parse adds in a bunch of extra tags (doctype, head, body) that are not really necessary for this situation (although I do understand why they are added).

Is it possible to use parse5 for a task like this? Is there some type of parse mode that won't alter the tags, or a way for me to get the fragment parse not to strip tags? Also is there documentation anywhere on which tags are stripped by fragment parse mode, and/or added by full document mode?

@inikulin
Owner

inikulin commented Jul 28, 2016

Hi!
Thank you for the kind words, really appreciate it!

Is it possible to use parse5 for a task like this? Is there some type of parse mode that won't alter the tags, or a way for me to get the fragment parse not to strip tags?

We're discussing this in #132. It's still under debate, but you can upvote the feature and add your scenario as an argument there.

Also is there documentation anywhere on which tags are stripped by fragment parse mode, and/or added by full document mode?

As far as I remember, this is not documented explicitly. Regarding fragment parsing: if you parse without a context element, a <template> context will be used. In that case <html>, <head> and <body> will be stripped.

The following tags will always be added implicitly in full document parsing mode:

  • <html> if missing
  • <head> if missing
  • <body> if missing
  • <p> if </p> occurs without an open tag
  • <colgroup> if <col> is added directly to <table>
  • <tbody> if <td>, <th> or <tr> is added directly to <table>
  • <tr> if <td> or <th> is added directly to <table>

I hope I didn't forget anything.

@jescalan
Author

@inikulin perfect, thank you for the quick and thorough response! Just pitched in at the linked issue. Would be happy to help out as well if someone is willing to hand-hold me a bit at the beginning, just bc this is a large and unfamiliar code base.

Also about what you were saying with the context, would it be possible to work around the issue in the meantime by providing a different context explicitly, maybe something like an <html> element? I feel like probably not, but worth a shot!

@inikulin
Owner

would it be possible to work around the issue in the meantime by providing a different context explicitly, maybe something like an <html> element?

Yeah, you can pass <html> as context, but <body> and <head> will be generated implicitly anyway if they are missing.

perfect, thank you for the quick and thorough response! Just pitched in at the linked issue. Would be happy to help out as well if someone is willing to hand-hold me a bit at the beginning, just bc this is a large and unfamiliar code base.

I need some time to figure out how this will actually be done (most likely it will be a separate package on top of parse5, but we need to expose some API first). Unfortunately, I'm extremely busy right now and have stepped away from parse5 development for some time. I hope I'll be back in late August, but meanwhile maybe @RReverser could help?

@jescalan
Author

jescalan commented Aug 2, 2016

Hi, just following up quickly, @RReverser would you be able to help a little? This issue is time sensitive for me, but I am willing to put time into helping 😁

@inikulin
Owner

inikulin commented Aug 2, 2016

@jescalan I'll try to release a new parse5 version on Thursday, which includes some great updates to our SAXParser made by @RReverser, and we will try to build a basic solution on top of it.

@jescalan
Author

jescalan commented Aug 2, 2016

@inikulin would be amazing. even if it's just a patch for now that's ok 😁

I'm looking through the code now and there's really quite a lot to navigate. I'll keep trying though!

@jescalan
Author

jescalan commented Aug 2, 2016

Wait I'm messing with the SAXParser right now, is there any reason that I wouldn't be able to build what I'm after here out of the SAXParser without any additional updates? It seems like it handles all tags already...

@inikulin
Owner

inikulin commented Aug 2, 2016

@jescalan Yeah, the new release will not bring any API changes, but it contains some important fixes for the SAXParser. Anyway, you can already start prototyping. The idea is quite simple: maintain your own stack of open elements. On the SAXParser's startTag event, create an element using a tree adapter and, if it's not in the list of void elements, push it onto the stack. Append nodes and elements to the top element on the stack (put a document or documentFragment on the stack before you start parsing). Once you encounter an end tag, pop elements up to the matching element or until only the document is left on the stack. To work with the tree you can use one of the provided tree adapters; their API is described in the docs.
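
A minimal sketch of the open-element-stack idea from the comment above, for readers who want to see it in code. It does not use parse5: the event objects (`startTag`, `endTag`, `text`), the node shapes, and the `buildTree` and `createElement` names are hypothetical stand-ins for SAXParser events and a tree adapter, chosen only for illustration.

```javascript
// Void elements never have children, so they are never pushed on the stack.
const VOID_ELEMENTS = new Set([
  'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
  'input', 'link', 'meta', 'source', 'track', 'wbr',
]);

// Hypothetical node shape; a real implementation would call a tree adapter here.
function createElement(name, attrs) {
  return { type: 'element', name, attrs, children: [] };
}

// Builds a tree from a flat list of tag/text events without correcting the structure.
function buildTree(events) {
  const root = { type: 'fragment', children: [] };
  const stack = [root]; // the root stays at the bottom for the whole run

  for (const ev of events) {
    const top = stack[stack.length - 1];
    if (ev.type === 'startTag') {
      const el = createElement(ev.name, ev.attrs || []);
      top.children.push(el);
      // Only non-void elements can receive children later.
      if (!VOID_ELEMENTS.has(ev.name)) stack.push(el);
    } else if (ev.type === 'endTag') {
      // Pop up to the matching element, or until only the root is left.
      while (stack.length > 1) {
        const popped = stack.pop();
        if (popped.name === ev.name) break;
      }
    } else if (ev.type === 'text') {
      top.children.push({ type: 'text', value: ev.text });
    }
  }
  return root;
}

// Usage: the events a tokenizer might emit for the head.html example.
const tree = buildTree([
  { type: 'startTag', name: 'head' },
  { type: 'startTag', name: 'title' },
  { type: 'text', text: 'Example Page' },
  { type: 'endTag', name: 'title' },
  { type: 'startTag', name: 'meta' },
  { type: 'endTag', name: 'head' },
]);
console.log(JSON.stringify(tree, null, 2));
```

Because nothing is inserted or removed implicitly, `<head>` survives as the only child of the root. Note that, following the description literally, an end tag with no matching open element pops everything except the root; a stricter variant could search the stack for a match first.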

@jescalan
Author

jescalan commented Aug 2, 2016

@inikulin Great, I have this mostly built out and it's working pretty well 🎉 Will post here when it's entirely finished. I'm running into an issue with self-closing tags though. It seems like it will only detect them if using the closing slash like <br />. If it's missing the closing slash (which is still valid html), it doesn't mark the tag as self-closing. Is this a bug, or am I missing something?

EDIT: It's only doing this when I don't have a doctype set in the same fragment. This is still an issue for me though, as it's possible that I'll need to parse a fragment which doesn't explicitly contain a doctype. Is there a way to set the doctype manually? I don't see one in the docs...

@inikulin
Owner

inikulin commented Aug 2, 2016

@jescalan It shouldn't be related to the doctype. Regarding self-closing tags, see https://github.com/inikulin/parse5/wiki/Documentation#q-im-parsing-img-srcfoo-with-the-saxparser-and-i-expect-the-selfclosing-flag-to-be-true-for-the-img-tag-but-its-not-is-there-something-wrong-with-the-parser - you need to check against the list of void elements, as I mentioned in the comment above.
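
As a small illustration of the void-element check mentioned above: the self-closing flag only records whether the tag was literally written with `/>`, so `<br>` and `<br />` need to be treated the same by looking up the tag name instead. The `isVoidElement` helper and the `VOID_TAG_NAMES` set below are hypothetical names, not part of parse5; the list is the set of void elements from the HTML Standard.

```javascript
// Void elements as listed in the HTML Standard.
const VOID_TAG_NAMES = new Set([
  'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
  'input', 'link', 'meta', 'source', 'track', 'wbr',
]);

// Returns true when the element can never have content,
// regardless of whether the source used "<br>" or "<br />".
function isVoidElement(tagName) {
  return VOID_TAG_NAMES.has(tagName.toLowerCase());
}

// Usage: decide whether a start tag should be pushed on the open element stack.
for (const name of ['br', 'IMG', 'div']) {
  console.log(name, isVoidElement(name) ? 'void, do not push' : 'push on stack');
}
```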

@jescalan
Author

jescalan commented Aug 2, 2016

@inikulin ah, i didn't really get what you meant with the void elements at first, now it makes a lot more sense. thanks for clearing that up!

@jescalan
Author

jescalan commented Aug 2, 2016

@inikulin Ok working well with the void tags 😄 One more question -- I have a test in here to see if it will parse plain text that's not inside any tags, and I'm not getting anything back from the parser on this one. Does the SAXParser parse plain text, or does it need to be contained inside a tag?

@inikulin
Owner

inikulin commented Aug 2, 2016

Hmm, it should parse plain text as is: https://tonicdev.com/57a10ab6594ef21300a7a1ad/57a11834d2ab3913009ee831

@jescalan
Author

jescalan commented Aug 2, 2016

Working now with your method of pushing a string into the stream. I was using a different way that was not, for some reason 👍

@stevenvachon
Contributor

@inikulin will SAXParser be on par with the regular parser in terms of "back-checking" DOM corrections?

<html>
<body>

<div><body class="addition"></body></div>

</body>
</html>

@inikulin
Owner

inikulin commented Aug 4, 2016

@stevenvachon It will not perform any tree structure correction, that's the point of this whole thread.

@jescalan
Author

Just as a wrap-up, did end up getting this working in the end, thanks to the brilliant @inikulin's help. Result can be seen here: https://github.com/reshape/parser 🎉

@thisconnect

@jescalan have you considered link rel=import instead of a custom include element?

http://webcomponents.org/articles/introduction-to-html-imports/

@jescalan
Author

jescalan commented Sep 2, 2016

@thisconnect absolutely, but you need to be using http/2 (preferably with server push) in order for that to make sense, and not everyone has fully made that transition yet. As soon as http/2 with push becomes more standard, the include element will probably be used much less often, if ever.

@inikulin
Owner

inikulin commented Sep 2, 2016

@thisconnect @jescalan Guys, I have a feeling that this discussion doesn't belong in parse5. Could you please continue the conversation somewhere else, so as not to spam those who are watching this repo?

@thisconnect

Sure sorry
