Re-parsing of parser-produced HTML causes change of document semantics #1280

Closed
inikulin opened this issue May 19, 2016 · 9 comments


@inikulin
Member

inikulin commented May 19, 2016

Consider the following markup:

<html>
    <head></head>
    <body>
        <form action="url1">
            <div>
        </form>
        <form action="url2">
            <input type="hidden" id="yo" />
        </form>
    </body>
</html>

It will be parsed as:

<html>
    <head></head>
    <body>
      <form action="url1">
        <div>
          <form action="url2">
            <input type="hidden" id="yo">
          </form>
        </div>
      </form>
    </body>
</html>

This is invalid markup, but it works just fine in its DOM representation.
Now, if you serialize it back to HTML and re-parse it, e.g. by doing:

document.body.innerHTML = document.body.innerHTML;

You will get:

<html>
    <head></head>
    <body>
        <form action="url1">
           <div>
               <input type="hidden" id="yo">
           </div>
        </form>
    </body>
</html>

This completely changes the semantics of the document: the input's associated form is now the one with url1.
As a possible workaround we could add a foster-parenting-style algorithm for forms: we would just bail out any nested forms.
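
The behaviour above follows from the parser's form element pointer. A form start tag is ignored while the pointer is set. A form end tag clears the pointer and removes the form from the stack of open elements without closing elements opened after it. A control is associated with whichever form the pointer references when the control is created. Below is a minimal sketch of just those rules. It is a simplified model, not the full tree-construction algorithm, and the token format and the `parse` and `toTokens` helpers are made up for illustration:

```javascript
// Simplified model of the HTML parser's "form element pointer".
// Handles only form, input and generic container elements.
function parse(tokens) {
  const body = { name: 'body', attrs: {}, children: [] };
  const stack = [body];   // stack of open elements
  let formPointer = null; // the form element pointer
  const owners = {};      // input id -> action of its form owner

  const insert = (name, attrs) => {
    const el = { name, attrs: attrs || {}, children: [] };
    stack[stack.length - 1].children.push(el);
    return el;
  };

  for (const t of tokens) {
    if (t.type === 'start' && t.name === 'form') {
      if (formPointer) continue;            // nested form start tag is ignored
      formPointer = insert('form', t.attrs);
      stack.push(formPointer);
    } else if (t.type === 'end' && t.name === 'form') {
      const node = formPointer;
      formPointer = null;
      const i = stack.indexOf(node);
      if (i > 0) stack.splice(i, 1);        // remove the form only; later elements stay open
    } else if (t.type === 'start' && t.name === 'input') {
      const el = insert('input', t.attrs);  // void element: never pushed on the stack
      owners[el.attrs.id] = formPointer ? formPointer.attrs.action : null;
    } else if (t.type === 'start') {
      stack.push(insert(t.name, t.attrs));
    } else {
      let i = stack.length - 1;
      while (i > 0 && stack[i].name !== t.name) i--;
      if (i > 0) stack.length = i;          // pop up to and including the match
    }
  }
  return { body, owners };
}

// Token stream that serializing the tree and tokenizing it again would yield.
function toTokens(el) {
  return el.children.flatMap((c) => {
    const start = { type: 'start', name: c.name, attrs: c.attrs };
    return c.name === 'input'
      ? [start]
      : [start, ...toTokens(c), { type: 'end', name: c.name }];
  });
}

// The markup from this issue: <form url1><div></form><form url2><input></form>
const source = [
  { type: 'start', name: 'form', attrs: { action: 'url1' } },
  { type: 'start', name: 'div' },
  { type: 'end', name: 'form' },
  { type: 'start', name: 'form', attrs: { action: 'url2' } },
  { type: 'start', name: 'input', attrs: { type: 'hidden', id: 'yo' } },
  { type: 'end', name: 'form' },
];
const first = parse(source);
const second = parse(toTokens(first.body));
console.log(first.owners.yo, second.owners.yo); // url2 url1
```

On the second pass the inner form start tag arrives while the pointer still references the outer form, so it is dropped and the input is created under the outer form.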

@inikulin inikulin changed the title Re-parsing of parser produced HTML causes change of document semantics Re-parsing of parser-produced HTML causes change of document semantics May 19, 2016
@zcorpan
Member

zcorpan commented May 19, 2016

This is indeed a problem. Unfortunately we can't change how the original markup is parsed. Your proposed workaround also doesn't help in general. Consider <table><form><tr><td><input>. On the first pass, the form is empty and a child of the table, but the input is still associated with it. On serialize-reparse, the DOM is the same but the association is lost. There are no doubt more examples.

Also, come to think of it, foster parenting can also cause nested p or nested a elements, and I think AAA (the adoption agency algorithm) can also cause nested a elements, which don't round-trip either.
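
For the <table><form> case, one way to spot controls at risk is to look for a control whose form owner is not one of its ancestors and which has no form attribute. Such a control was most likely associated by the parser, and a serialize-and-reparse pass cannot reproduce that association. A rough sketch, assuming a browser DOM (the helper name `parserOnlyAssociations` is hypothetical, and it does not cover the nested-form case from the original report):

```javascript
// Returns controls whose form owner is neither an ancestor nor named by a
// form attribute, i.e. associations that exist only because of how the
// markup was originally parsed.
function parserOnlyAssociations(root) {
  const controls = root.querySelectorAll('button, input, select, textarea');
  return Array.from(controls).filter(
    (el) => el.form && !el.form.contains(el) && !el.hasAttribute('form')
  );
}

// Only runs where a DOM is available (e.g. in a browser console).
if (typeof document !== 'undefined') {
  console.log(parserOnlyAssociations(document));
}
```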

@inikulin
Member Author

inikulin commented May 19, 2016

> Also, come to think of it, foster parenting can also cause nested p or nested a elements, and I think AAA (the adoption agency algorithm) can also cause nested a elements, which don't round-trip either.

But at least a nested a will just bail out and work fine, while here things break completely.

@inikulin
Member Author

inikulin commented May 19, 2016

> Consider <table><form><tr><td><input>. On the first pass, the form is empty and a child of the table, but the input is still associated with it.

We could remove the association on foster parenting if a <form> has non-descendant associated <input>s. Or would that not be web-compatible? As an alternative, we could move such associated non-descendant inputs into the <form>. This will break layout but will keep the form functional even after reparsing.

@zcorpan
Member

zcorpan commented May 19, 2016

Whether things work just fine is very much dependent on what the page is doing. Pages break completely for seemingly minor things.

My example doesn't do foster parenting. But also breaking form association or moving form controls in the DOM is certainly going to break pages.

Note that legacy IE did usually roundtrip form association (but it didn't use a tree for the DOM), and it was a known problem that form association didn't work with innerHTML. But it was rare enough that browsers got away with breaking it.

@inikulin
Member Author

> My example doesn't do foster parenting.

Indeed, tr triggers the insertion of a fake tbody, which clears the stack of open elements back to a table context, so any further content is appended to the tbody.

> But also breaking form association or moving form controls in the DOM is certainly going to break pages.

So I guess we don't have any options left except keeping it as is?

@RReverser
Member

Can't we change the serialization algorithm itself?

@zcorpan
Member

zcorpan commented May 20, 2016

How?

@RReverser
Member

That's another question 😄

@zcorpan
Member

zcorpan commented May 20, 2016

I think we basically can't change the spec to fix this. It would require substantial changes with a high risk of breaking Web compat. However, the spec can do a better job of pointing out that this issue exists, and maybe what can be done about it outside of the browser itself, similar to https://html.spec.whatwg.org/multipage/syntax.html#coercing-an-html-dom-into-an-infoset
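
As one example of a check that could live outside the parser, a page or test harness can compare each control's form owner before and after a serialize-and-reparse pass. This is a minimal sketch, assuming a browser environment with DOMParser. The helper names are made up, forms are identified only by their action attribute, and matching controls by document order assumes the reparse neither drops nor reorders controls:

```javascript
const CONTROLS = 'button, input, select, textarea';

// For each control, in document order, record the action of its form owner
// (or null when it has none).
function snapshotOwners(doc) {
  return Array.from(doc.querySelectorAll(CONTROLS), (el) =>
    el.form ? el.form.getAttribute('action') : null);
}

// List the positions at which the owner differs between two snapshots.
function changedOwners(before, after) {
  const changes = [];
  before.forEach((owner, index) => {
    if (owner !== after[index]) {
      changes.push({ index, before: owner, after: after[index] });
    }
  });
  return changes;
}

// Serialize the document, reparse it, and report changed associations.
function checkRoundTrip(doc) {
  const html = doc.documentElement.outerHTML;
  const reparsed = new DOMParser().parseFromString(html, 'text/html');
  return changedOwners(snapshotOwners(doc), snapshotOwners(reparsed));
}

// Only runs where a DOM is available (e.g. in a browser console).
if (typeof document !== 'undefined') {
  console.log(checkRoundTrip(document));
}
```

An empty result means no association changed for the controls that were compared; it does not prove that the document round-trips in other respects.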
