-
Notifications
You must be signed in to change notification settings - Fork 299
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Valid/Invalid characters in document.createElement() #849
Comments
I agree with your reading of the spec. Normally I would say we should change the spec to match. However, note #449 and the various issues linked from there; web developers would really like it if the rules were more permissive. (Although I think what they really want is for the DOM APIs to accept at least a superset of the HTML parser; I haven't checked what the HTML parser does with your example.) So, maybe we could take this as an opportunity to solve that long-standing issue? :) |
Here is the reporter of the Chromium issue mentioned above. Just wanted to note that removing the character range customElements.define("emotion-😍", HTMLElement); // OK
document.createElement("emotion-😍"); // Allowed per spec, error in Chrome and Firefox So if the consense is that supplementary Unicode characters should be banned, I would recommend to also change the HTML rules accordingly. |
Thanks for the response @domenic, I can see how this is just a part of making all the DOM APIs accept more characters. I saw another crbug about a similar issue here, and found a corresponding spec bug here. Thanks for the elaboration @fasttime, I agree that it would make sense for customElements.define() to match document.createElement() |
I think it's probably not that big. In #449 and previous issues we were aiming for alignment, i.e. making the parser and the DOM APIs match. It took us a good amount of work (use counters, etc.) to discover that was likely not web-compatible. (Also, looking at the bug, it seems the process also involved educating me about namespace-related things... fun times.) But with that foundational work out of the way, I think this becomes simpler. I.e., now we know it's just a matter of updating the spec, writing some tests, and filing/fixing some browser bugs. The change is just one line in the spec, to accept more things than were accepted before. And there shouldn't be any compat worries. I think most people are for such a change, although it's been low priority for browser vendors. (Until now? 😄) |
Sounds great! I'd *imagine* that it wouldn't be that hard to implement either. |
I think the APIs are in two sets: elements and attributes. You could do the two sets together, or separately.
|
Thanks, that makes sense. Should we just change the spec and implement it? Should we ask for support from firefox and safari? |
My suggestion is to do a spec pull request to make it more concrete, and iterate on that (I'm happy to help) to make sure we've got it right. Then we can ask for broader support/review. As for when to work on implementation and tests, that's a judgment call. The potential downside is doing implementation/test work and finding out that we missed something that makes it show-stoppingly objectionable, which would mean the implementation/test work is "wasted". But in this case I think it'd be relatively safe to do so in parallel, as I am hopeful that this change will not be controversial. |
Here is an initial stab at rules that take the union of what current DOM APIs allow and what the parser allows. Someone double-checking would be great; if I can get confirmation that they seem right, I can probably spend some time on a spec PR.
Probably we should not touch custom element name rules. We could in theory make PCENChar similarly lenient to LenientNameChar, but I'm not sure that leniency actually is a good idea for them, since Although I've phrased the above in terms of hypothetical grammar productions (e.g. LenientElementNameStartChar) the actual spec would probably be better as algorithms that loop over code units/code points, since that is how they're implemented. And per the OP of this thread the current implementations have bugs, which I suspect might be due to the attempt at translating from grammar specifications into algorithms. |
Alternate approach suggested by @annevk: be maximally lenient in the DOM APIs. The difference from the above would be something like:
The upside of this is that it simplifies the spec and implementation a good bit. The intention of the current restrictions from the XML spec seems to be a misguided segmentation of Unicode which hasn't really panned out in practice. It might also give small performance wins; it's hard to say since on the one hand this is definitely a microoptimization, but on the other hand The downside is that this allows more cases where DOM APIs can produce unparseable HTML. I.e., currently you can produce HTML which does not survive a parse-serialize roundtrip using something like Additionally, if we're giving up on parse-serialize roundtrips anyway, it's not clear whether disallowing tab, LF, CR, FF, space, /, >, NULL is important either! We could just allow any string... |
So I'm generally supportive of the second, "maximally lenient" approach. It makes the spec, the implementation, and the developer understanding of this behavior straightforward, which is good for performance, bugs, and happiness. Except I do worry about this downside:
My concern isn't so much that it'll allow non-roundtrippable DOM, but more that this might introduce mXSS type security concerns. For example, I'm quite sure we don't want to ever allow |
For others who weren't aware of mXSS, it apparently is about malicious users supplying strings which are stored in a database, and then passed to So I guess the relevance here is, if someone does const el = document.createElement(userSuppliedString);
await storeInServer(el.outerHTML);
// ... later ...
const html = await getFromServer();
container.innerHTML = html; and If we disallow tab, LF, CR, FF, space, /, >, NULL, then nothing immediately sticks out to me as problematic. But I worry that I'm just not doing enough security analysis. It would be a shame if this simplification fix ended up causing everyone security problems. From that perspective, the #849 (comment) should be generally safer. It would make Hmm. |
That's a great point, though I'd like to actually solve this a bit more before considering options. It seems that currently you can put the parser in a different state by using The HTML parser on the other hand requires What does this mean? I'm not entirely sure. It could be that inputs such as I wonder if we can get to more lenient rules safely if we better account for ASCII. If the first code point is If the first code point is not This is both stricter and more lenient than what @domenic wrote above. It's more strict when it comes to ASCII which I think is a good thing as that is the risky area of potential state transitions. It's "everything goes" for non-ASCII which I think is good mainly from the point of reducing complexity of the types of checks we need to perform. Thoughts? |
@annevk I think that it would also be reasonable to disallow lone surrogates. It’s possible to get them into the input stream with document.write and they’re accepted as chardata unchanged via innerHTML, etc, but they’d normally end up converted to U+FFFD in html document parsing scenarios (e.g. if one “appeared” in a document parsed as UTF-8, WTF-8 style). Given the translation there would be non-ascii→non-ascii, it may not be a major hazard, but it’s pretty hard to imagine any benefit to allowing them in names and not too hard to imagine it causing problems. |
@mfreed7 asked me about my thoughts on security implications of being more lenient in tag names. So here are my two cents. @domenic wrote:
I think it's a reasonable assumption that the user would also be able to control at least some parts of the content if they can supply the tag name. If that's the case, then That being said, I've never been under impression that the design goal of DOM APIs is to disallow reparsing attacks. There are countless examples of "manually" created DOM trees that will execute JavaScript after reparsing. Even the HTML spec admits the problem. Three examples of that: // Example 1
const div = document.createElement('div');
div.append(document.createComment('--><script>alert(1)</script>'));
console.log("<div><!----><img src onerror=alert(1)>--></div>");
// logs: "<div><!----><img src onerror=alert(1)>--></div>"
// Example 2
const style = document.createElement('style');
style.textContent = '</style><img src onerror=alert(1)>';
console.log(style.outerHTML);
// logs: "<style></style><img src onerror=alert(1)></style>"
// Example 3
const textarea = document.createElement('textarea');
const div2 = document.createElement('div');
div2.setAttribute('title', '</textarea><img src onerror=alert(1)>');
textarea.append(div2);
// logs: "<textarea><div title="</textarea><img src onerror=alert(1)>"></div></textarea>" Furthermore, going back to the example given by @domenic:
I'd say that no matter what happens in So my personal take is that if we disallow tab, LF, CR, FF, space, /, >, NULL, we should be fine. This way, we disallow creating a reparsing issue from a single call to |
For clarity, I was asked to write down #849 (comment) a bit more formally. I still think that idea is reasonable as although I agree we do not defend against reparsing attacks, it still seems good to not introduce novel reparsing attacks. Using XML EBNF's notation: NewCreateElementName ::= HTMLParserCompatibleName | CreateElementCompatibleName
HTMLParserCompatibleName ::= [a-zA-Z] [^#x00#x09#x0A#0xCx0D#x20/>]*
CreateElementCompatibleName ::= CreateElementCompatibleNameStartChar (CreateElementCompatibleNameChar)*
CreateElementCompatibleNameStartChar ::= ":" | "_" | [#x80-#x10FFFF]
CreateElementCompatibleNameChar ::= CreateElementCompatibleNameStartChar | [a-zA-Z] | "-" | "." | [0-9] (I would be okay with attempting to ban surrogates, but it feels a bit murky given that you can create elements in the HTML parser that contain them still. I'd be more comfortable closing that hole if we closed it there at the same time.) |
I'd prefer if API + HTML parser + CSS naming rules would become more aligned, rather than less, so I think newly allowed character sets should be the same across HTML parser and I'd quite strongly prefer that no existing HTML/XML meta characters would be newly allowed. E.g. several proposals above allow "<" as part of element names, or single quotes. The Unicode replacement character (U+FFFD) should probably be disallowed. This has caused browser bugs before. (Examples in the reference.) Not sure if this already exists, but there should probably be some language about which unicode canonicalization (not) to do, and how equality of names is determined. Ideally, this would also be aligned between HTML, JS, and CSS, where I care less about the actual rules than about whether they're the same or not. (CVE-2000-0884 is a bug at the OS level, where one part of the system canonicalizes this way, another that way.) I believe ECMAScript has fairly specific rules about this already. |
We're operating under the assumption that the HTML parser won't change so I'm not sure what your comment means. Could you elaborate? |
Even if these characters are allowed, they shouldn't really undermine security, as currently it is also possible to have them in tag names created by HTML parser. Consider the following example: const doc = new DOMParser().parseFromString(`<elem<'abc>test`, 'text/html');
const tagName = doc.body.firstChild.tagName;
// tagName is now equal to "ELEM<'ABC" Even if this HTML is serialized and reparsed, the tag would still be reparsed as |
A difference between the spec and browser implementations was raised here: https://bugs.chromium.org/p/chromium/issues/detail?id=1061436
In the spec, document.createElement should allow names that "match the Name production," which links to this definition, which allows characters in many ranges, including
[#x10000-#xEFFFF]
, which I am guessing includes emojis and would therefore allowdocument.createElement('test-\u{1f602}')
However, Chrome, Firefox, and Safari all throw exceptions when this code snippet is run.
Should the browsers update to match the spec, or should the spec be updated?
Or am I misinterpreting the spec? If the spec should be updated, what should it say instead?
The text was updated successfully, but these errors were encountered: