-
Notifications
You must be signed in to change notification settings - Fork 141
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Editorial: make everything use percent-encode sets #518
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to get whatwg-url fully aligned with these changed sections if possible. (Note that it has a application/x-www-form-urlencoded
serializer/parser in https://github.com/jsdom/whatwg-url/blob/master/src/urlencoded.js.)
It seems query state with Before this update, result of encoding U+00A5 in query is: |
True, I agree. That's because the old query state logic performed encoding, then checked whether the byte was in the query percent encode set in deciding whether to percent-encode it. The new "percent-encode after encoding" logic checks whether the code point is in the percent encode set, then encodes it, then percent-encodes those bytes unconditionally. Fixing this is a bit messy. I think we need to shuffle around the "percent-encode after encoding" algorithm:
This involved a fair amount of byte/codepoint gymnastics, which is unavoidable if we want to preserve the existing behaviour (which literally runs an encoder, then represents a selection of bytes using their ASCII equivalents). I'm using the "code point whose value is byte" language, borrowed from the old query state algorithm. This change should only affect non-UTF-8 encodings. When using UTF-8, all ASCII code points encode to the corresponding ASCII byte (so either way, characters in the encode set will be encoded or not according to the percent-encode set), and all non-ASCII code points encode to a sequence of non-ASCII bytes (so either way, all of those bytes will be percent-encoded). You might note that in fact this is the only usage of percent-encode sets, which are sets of code points. It would make the above algorithm easier if we converted all the percent-encode sets into sets of bytes (e.g., "U+0023 (#), U+003F (?)" becomes "0x23 (#), 0x3F (?)"), and then "the code point whose value is byte is not in percentEncodeSet" becomes "byte is not in percentEncodeSet". But I would rather not do this, since fundamentally the percent encode sets should represent characters, not bytes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just leaving some comments to address @rmisev 's feedback.
I haven't gone over the rest in fine detail, but a quick read says this looks fantastic! This makes the parser a lot cleaner (at the expense of the percent-encode logic, but I'd rather move complexity out of the parser and into the percent encoder).
<li><p>Return <var>output</var>. | ||
</ol> | ||
|
||
<p class="note no-backref">This can happen when <var>encoding</var> is not <a>UTF-8</a>. | ||
|
||
<li><p>Let <var>output</var> be the empty string.</p></li> | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replace this with:
<li><p>For each <var>byte</var> of <var>bytes, if <var>byte</var> is not an <a>ASCII byte</a>,
or if the code point whose value is <var>byte</var> is not in <var>percentEncodeSet</var>,
<a for=byte>percent-encode</a> <var>byte</var> and append the result to <var>output</var>.
Otherwise, append the code point whose value is <var>byte</var> to <var>output</var>.
(Possibly splitting into some nested list items.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The additional ASCII byte check doesn't seem to be needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, anything that's "not an ASCII byte" is always going to be in every percent-encode set (because the C0 control set includes "all code points greater than U+007E (~).").
It's a little dubious to rely on this, however, especially since if we removed that check, we'd be doing these weird comparisons between non-ASCII bytes and non-ASCII code points. For example, if UTF-8 encoder gives us a byte 0xD2, we would be consulting the percent-encode set for the character U+00D2, which will "work", but it has nothing to do with that particular code point. So I put the ASCII byte check, so that then we straight-up guarantee that any non-ASCII byte will be encoded without being converted to an unrelated code point.
It might also be worth adding a note next to the percent-encode sets to say that the encoding algorithm assumes (either way we decide to do this) that all non-ASCII bytes are in every percent-encode set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, created whatwg/infra#305 to help with this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But also, I think you got the steps the wrong way around, if it is in the set, it should be encoded, if it's not in the set, it should not be.
This would be useful for whatwg/url#518 and also clarifies isomorphic decode/encode a bit.
This would be useful for whatwg/url#518, the Encoding Standard, and also clarifies isomorphic decode/encode a bit. This also requires all code points to be denoted in the same way, using "U+".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed that it's a bit strange that all callers in this spec are to the application/x-www-form-urlencoded string parser
, but the non-string parser is positioned as primary. Does anyone call the non-string parser?
Follows whatwg/url#518. This generally tries to make the code correspond more explicitly to the specification in a few ways.
I've implemented this in whatwg-url in jsdom/whatwg-url#158, so am happy to give my usual sign-off. However, that doesn't test non-UTF-8 encodings, so manual review would be needed for those code branches. |
@domenic https://fetch.spec.whatwg.org/#concept-body-package-data calls it. I think the main reason I did it this way is that it seems natural to me that for a MIME-type-based format the parser starts from bytes. Though to be fair I guess that should also mean that the serializer goes to bytes which isn't the case (though happens to match the HTML parser). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This LGTM; ready to see it merge
This switches the URL parser's query state and the application/x-www-form-urlencoded's serializer to also use percent-encode sets. Closes #411.
91bead4
to
22db5e2
Compare
Isn't it better to define percent encode sets in terms of bytes (not code points)? This will simplify percent encoding algorithm a bit. |
The percent-encode algorithms take code points or strings. It seems a bit weird to contrast that with sets of bytes. But it's also not clear what would be simplified. We'd still need to turn the byte into a code point. I guess we could maybe drop the assert? |
Yes, drop assert. But maybe it is not worth changing. |
percent encoding takes a sequence of bytes and produces a sequence of (ASCII) characters |
Keep in mind that percent encoding is used outside of URL processing as a kind of content-encoding. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
I'm super happy with this change!! Thanks for taking the time to rework the algorithm based on @rmisev and my comments.
Note: I still have not had time to go through with a fine tooth comb and check that the members of the query percent-encode set are exactly right (for compatibility with the old version), but I checked over the rest of the changes. I trust you :)
<li><p>Let <var>isomorph</var> be a <a for=/>code point</a> whose <a for="code point">value</a> | ||
is <var>byte</var>'s <a for=byte>value</a>. | ||
|
||
<li><p>Assert: <var>percentEncodeSet</var> includes all non-<a>ASCII code points</a>. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Link to Assert
<td>"<code>1+1 ≡ 2%20‽</code>" | ||
<td>"<code>1+1+%81%DF+2%20%26%238253%3B</code>" | ||
<tr> | ||
<td rowspan=2><a for="code point">UTF-8 percent-encode</a> <var>input</var> using the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Perhaps move the UTF-8 examples up above Shift-JIS since they are the far more common example.
@@ -246,9 +305,28 @@ a <var>percentEncodeSet</var>, run these steps: | |||
<td>"<code>‽%25%2E</code>" | |||
<td>0xE2 0x80 0xBD 0x25 0x2E | |||
<tr> | |||
<td><a for="code point">UTF-8 percent-encode</a> <var>input</var> using the | |||
<td rowspan=3><a for="code point">Percent-encode after encoding</a> with <a>Shift_JIS</a>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to see @rmisev 's example added here, since it's a key case (i.e., all of these examples present would work in your old algorithm and your new one; your fixed algorithm differs only by that special case where some of the bytes that encode the code point are < 128, which seems to happen in ISO-2022-JP but not UTF-8 or Shift-JIS).
So, the example is:
- Operation: Percent-encode after encoding with ISO-2022-JP, input, and the userinfo percent-encode set
- Input: U+00A5 (¥)
- Output: "
%1B(J\%1B(B
"
This spec defines both bytes-to-ASCII and codepoints-to-ASCII percent encoding functions. Both are needed by the URL parser (the latter calls the former).
I don't believe these terms are exported. The way it is defined here is specifically for use inside the URL parser / encoder. (For example, JavaScript's Either way, I believe the set of functions defined here are adequate to serve any external use. If you want to encode a byte sequence (rather than a CP sequence), use the "percent-decode a byte sequence" algorithm above. |
Follows whatwg/url#518. This generally tries to make the code correspond more explicitly to the specification in a few ways. It doesn't handle non-UTF-8 encodings yet, though. Supersedes #152.
This switches the URL parser's query state and the application/x-www-form-urlencoded's serializer to also use percent-encode sets.
Closes #411.
I'll see about writing a whatwg-url patch.
Preview | Diff