Editorial: make everything use percent-encode sets #518

annevk · 2020-05-14T10:01:20Z

This switches the URL parser's query state and the application/x-www-form-urlencoded's serializer to also use percent-encode sets.

Closes #411.

I'll see about writing a whatwg-url patch.

Preview | Diff

url.bs

domenic

I'd like to get whatwg-url fully aligned with these changed sections if possible. (Note that it has a application/x-www-form-urlencoded serializer/parser in https://github.com/jsdom/whatwg-url/blob/master/src/urlencoded.js.)

url.bs

rmisev · 2020-05-14T18:26:36Z

It seems query state with ISO-2022-JP encoding will not work well after this update. For example ISO-2022-JP encodes U+00A5 to bytes: 0x1B 0x28 0x4A 0x5C 0x1B 0x28 0x42.

Before this update, result of encoding U+00A5 in query is: %1B(J\%1B(B
After this update - result is: %1B%28%4A%5C%1B%28%42

mgiuca · 2020-05-15T04:36:13Z

It seems query state with ISO-2022-JP encoding will not work well after this update. For example ISO-2022-JP encodes U+00A5 to bytes: 0x1B 0x28 0x4A 0x5C 0x1B 0x28 0x42.

Before this update, result of encoding U+00A5 in query is: %1B(J\%1B(B
After this update - result is: %1B%28%4A%5C%1B%28%42

True, I agree. That's because the old query state logic performed encoding, then checked whether the byte was in the query percent encode set in deciding whether to percent-encode it. The new "percent-encode after encoding" logic checks whether the code point is in the percent encode set, then encodes it, then percent-encodes those bytes unconditionally.

Fixing this is a bit messy. I think we need to shuffle around the "percent-encode after encoding" algorithm:

Delete 1. "If codePoint is not in percentEncodeSet, then return codePoint."
Replace 5. "For each byte of bytes, percent-encode byte and append the result to output." with "For each byte of bytes, if byte is not an ASCII byte, or if the code point whose value is byte is not in percentEncodeSet, percent-encode byte and append the result to output. Otherwise, append the code point whose value is byte to output.

This involved a fair amount of byte/codepoint gymnastics, which is unavoidable if we want to preserve the existing behaviour (which literally runs an encoder, then represents a selection of bytes using their ASCII equivalents). I'm using the "code point whose value is byte" language, borrowed from the old query state algorithm.

This change should only affect non-UTF-8 encodings. When using UTF-8, all ASCII code points encode to the corresponding ASCII byte (so either way, characters in the encode set will be encoded or not according to the percent-encode set), and all non-ASCII code points encode to a sequence of non-ASCII bytes (so either way, all of those bytes will be percent-encoded).

You might note that in fact this is the only usage of percent-encode sets, which are sets of code points. It would make the above algorithm easier if we converted all the percent-encode sets into sets of bytes (e.g., "U+0023 (#), U+003F (?)" becomes "0x23 (#), 0x3F (?)"), and then "the code point whose value is byte is not in percentEncodeSet" becomes "byte is not in percentEncodeSet". But I would rather not do this, since fundamentally the percent encode sets should represent characters, not bytes.

mgiuca

Just leaving some comments to address @rmisev 's feedback.

I haven't gone over the rest in fine detail, but a quick read says this looks fantastic! This makes the parser a lot cleaner (at the expense of the percent-encode logic, but I'd rather move complexity out of the parser and into the percent encoder).

url.bs

mgiuca · 2020-05-15T04:40:20Z

url.bs

+   <li><p>Return <var>output</var>.
+  </ol>
+
+  <p class="note no-backref">This can happen when <var>encoding</var> is not <a>UTF-8</a>.

 <li><p>Let <var>output</var> be the empty string.</p></li>



Replace this with:

<li><p>For each <var>byte</var> of <var>bytes, if <var>byte</var> is not an <a>ASCII byte</a>, or if the code point whose value is <var>byte</var> is not in <var>percentEncodeSet</var>, <a for=byte>percent-encode</a> <var>byte</var> and append the result to <var>output</var>. Otherwise, append the code point whose value is <var>byte</var> to <var>output</var>.

(Possibly splitting into some nested list items.)

The additional ASCII byte check doesn't seem to be needed?

True, anything that's "not an ASCII byte" is always going to be in every percent-encode set (because the C0 control set includes "all code points greater than U+007E (~).").

It's a little dubious to rely on this, however, especially since if we removed that check, we'd be doing these weird comparisons between non-ASCII bytes and non-ASCII code points. For example, if UTF-8 encoder gives us a byte 0xD2, we would be consulting the percent-encode set for the character U+00D2, which will "work", but it has nothing to do with that particular code point. So I put the ASCII byte check, so that then we straight-up guarantee that any non-ASCII byte will be encoded without being converted to an unrelated code point.

It might also be worth adding a note next to the percent-encode sets to say that the encoding algorithm assumes (either way we decide to do this) that all non-ASCII bytes are in every percent-encode set.

Yeah, created whatwg/infra#305 to help with this.

But also, I think you got the steps the wrong way around, if it is in the set, it should be encoded, if it's not in the set, it should not be.

This would be useful for whatwg/url#518 and also clarifies isomorphic decode/encode a bit.

This would be useful for whatwg/url#518, the Encoding Standard, and also clarifies isomorphic decode/encode a bit. This also requires all code points to be denoted in the same way, using "U+".

url.bs

domenic

I noticed that it's a bit strange that all callers in this spec are to the application/x-www-form-urlencoded string parser, but the non-string parser is positioned as primary. Does anyone call the non-string parser?

Follows whatwg/url#518. This generally tries to make the code correspond more explicitly to the specification in a few ways.

domenic · 2020-05-19T23:24:20Z

I've implemented this in whatwg-url in jsdom/whatwg-url#158, so am happy to give my usual sign-off. However, that doesn't test non-UTF-8 encodings, so manual review would be needed for those code branches.

annevk · 2020-05-20T05:26:26Z

@domenic https://fetch.spec.whatwg.org/#concept-body-package-data calls it. I think the main reason I did it this way is that it seems natural to me that for a MIME-type-based format the parser starts from bytes. Though to be fair I guess that should also mean that the serializer goes to bytes which isn't the case (though happens to match the HTML parser).

domenic

This LGTM; ready to see it merge

This switches the URL parser's query state and the application/x-www-form-urlencoded's serializer to also use percent-encode sets. Closes #411.

Co-authored-by: Rimas Misevičius <rmisev3@gmail.com>

…g is very tricky

annevk · 2020-06-16T06:10:29Z

@rmisev @mgiuca any final thoughts?

rmisev · 2020-06-16T14:10:24Z

Isn't it better to define percent encode sets in terms of bytes (not code points)? This will simplify percent encoding algorithm a bit.

annevk · 2020-06-16T14:18:23Z

The percent-encode algorithms take code points or strings. It seems a bit weird to contrast that with sets of bytes. But it's also not clear what would be simplified. We'd still need to turn the byte into a code point. I guess we could maybe drop the assert?

rmisev · 2020-06-16T15:52:59Z

Yes, drop assert. But maybe it is not worth changing.

masinter · 2020-06-16T16:25:46Z

percent encoding takes a sequence of bytes and produces a sequence of (ASCII) characters
A charset like UTF-8 or ISO8859-1 or shift-jis maps a sequence of characters to a sequence of bytes.
no codepoints for shift-jis, for example.

masinter · 2020-06-16T17:31:15Z

Keep in mind that percent encoding is used outside of URL processing as a kind of content-encoding.

mgiuca

LGTM.

I'm super happy with this change!! Thanks for taking the time to rework the algorithm based on @rmisev and my comments.

Note: I still have not had time to go through with a fine tooth comb and check that the members of the query percent-encode set are exactly right (for compatibility with the old version), but I checked over the rest of the changes. I trust you :)

mgiuca · 2020-06-17T07:27:26Z

url.bs

+   <li><p>Let <var>isomorph</var> be a <a for=/>code point</a> whose <a for="code point">value</a>
+   is <var>byte</var>'s <a for=byte>value</a>.
+
+   <li><p>Assert: <var>percentEncodeSet</var> includes all non-<a>ASCII code points</a>.


Nit: Link to Assert

mgiuca · 2020-06-17T07:42:08Z

url.bs

+   <td>"<code>1+1 ≡ 2%20‽</code>"
+   <td>"<code>1+1+%81%DF+2%20%26%238253%3B</code>"
+  <tr>
+   <td rowspan=2><a for="code point">UTF-8 percent-encode</a> <var>input</var> using the


Nit: Perhaps move the UTF-8 examples up above Shift-JIS since they are the far more common example.

mgiuca · 2020-06-17T07:44:50Z

url.bs

@@ -246,9 +305,28 @@ a <var>percentEncodeSet</var>, run these steps:
   <td>"<code>‽%25%2E</code>"
   <td>0xE2 0x80 0xBD 0x25 0x2E
  <tr>
-   <td><a for="code point">UTF-8 percent-encode</a> <var>input</var> using the
+   <td rowspan=3><a for="code point">Percent-encode after encoding</a> with <a>Shift_JIS</a>,


I'd like to see @rmisev 's example added here, since it's a key case (i.e., all of these examples present would work in your old algorithm and your new one; your fixed algorithm differs only by that special case where some of the bytes that encode the code point are < 128, which seems to happen in ISO-2022-JP but not UTF-8 or Shift-JIS).

So, the example is:

Operation: Percent-encode after encoding with ISO-2022-JP, input, and the userinfo percent-encode set

Input: U+00A5 (¥)

Output: "%1B(J\%1B(B"

mgiuca · 2020-06-17T07:56:22Z

@masinter :

percent encoding takes a sequence of bytes and produces a sequence of (ASCII) characters

This spec defines both bytes-to-ASCII and codepoints-to-ASCII percent encoding functions. Both are needed by the URL parser (the latter calls the former).

Keep in mind that percent encoding is used outside of URL processing as a kind of content-encoding.

I don't believe these terms are exported. The way it is defined here is specifically for use inside the URL parser / encoder. (For example, JavaScript's encodeURIComponent function defines its own encoding that we believe is compatible with this one, but that doesn't call into here.)

Either way, I believe the set of functions defined here are adequate to serve any external use. If you want to encode a byte sequence (rather than a CP sequence), use the "percent-decode a byte sequence" algorithm above.

Follows whatwg/url#518. This generally tries to make the code correspond more explicitly to the specification in a few ways. It doesn't handle non-UTF-8 encodings yet, though. Supersedes #152.

annevk requested review from domenic, mgiuca and rmisev May 14, 2020 10:01

annevk mentioned this pull request May 14, 2020

Refactor parse query and percent-encode sets jsdom/whatwg-url#152

Closed

rmisev reviewed May 14, 2020

View reviewed changes

url.bs Outdated Show resolved Hide resolved

domenic reviewed May 14, 2020

View reviewed changes

url.bs Outdated Show resolved Hide resolved

url.bs Outdated Show resolved Hide resolved

mgiuca reviewed May 15, 2020

View reviewed changes

annevk added a commit to whatwg/infra that referenced this pull request May 15, 2020

Give a name to byte and code point's underlying number

459d58e

This would be useful for whatwg/url#518 and also clarifies isomorphic decode/encode a bit.

annevk mentioned this pull request May 15, 2020

Give a name to byte and code point's underlying number whatwg/infra#305

Merged

domenic reviewed May 19, 2020

View reviewed changes

url.bs Outdated Show resolved Hide resolved

domenic reviewed May 19, 2020

View reviewed changes

domenic added a commit to jsdom/whatwg-url that referenced this pull request May 19, 2020

Spec update: make everything use percent-encode sets

a4812b6

Follows whatwg/url#518. This generally tries to make the code correspond more explicitly to the specification in a few ways.

domenic mentioned this pull request May 19, 2020

Spec update: make everything use percent-encode sets jsdom/whatwg-url#158

Merged

annevk mentioned this pull request Jun 2, 2020

Percent encoding does not round-trip #523

Closed

domenic approved these changes Jun 15, 2020

View reviewed changes

annevk and others added 5 commits June 16, 2020 08:04

Editorial: make everything use percent-encode sets

a6b7d24

This switches the URL parser's query state and the application/x-www-form-urlencoded's serializer to also use percent-encode sets. Closes #411.

oops

e556c13

Co-authored-by: Rimas Misevičius <rmisev3@gmail.com>

(this will fail pending Infra changes) percent-encoding after encodin…

7db5b69

…g is very tricky

nit

a6f62de

rename inner variable

22db5e2

annevk force-pushed the annevk/more-percent-encode-sets branch from 91bead4 to 22db5e2 Compare June 16, 2020 06:05

mgiuca approved these changes Jun 17, 2020

View reviewed changes

add example

300c3c4

annevk merged commit e5334dd into master Jun 24, 2020

annevk deleted the annevk/more-percent-encode-sets branch June 24, 2020 11:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Editorial: make everything use percent-encode sets #518

Editorial: make everything use percent-encode sets #518

annevk commented May 14, 2020 •

edited by pr-preview bot

Loading

domenic left a comment

rmisev commented May 14, 2020

mgiuca commented May 15, 2020

mgiuca left a comment

mgiuca May 15, 2020

annevk May 15, 2020

mgiuca May 15, 2020

annevk May 15, 2020

annevk May 15, 2020

domenic left a comment

domenic commented May 19, 2020

annevk commented May 20, 2020

domenic left a comment

annevk commented Jun 16, 2020

rmisev commented Jun 16, 2020

annevk commented Jun 16, 2020

rmisev commented Jun 16, 2020

masinter commented Jun 16, 2020

masinter commented Jun 16, 2020

mgiuca left a comment

mgiuca Jun 17, 2020

mgiuca Jun 17, 2020

mgiuca Jun 17, 2020

mgiuca commented Jun 17, 2020

Editorial: make everything use percent-encode sets #518

Editorial: make everything use percent-encode sets #518

Conversation

annevk commented May 14, 2020 • edited by pr-preview bot Loading

domenic left a comment

Choose a reason for hiding this comment

rmisev commented May 14, 2020

mgiuca commented May 15, 2020

mgiuca left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

domenic left a comment

Choose a reason for hiding this comment

domenic commented May 19, 2020

annevk commented May 20, 2020

domenic left a comment

Choose a reason for hiding this comment

annevk commented Jun 16, 2020

rmisev commented Jun 16, 2020

annevk commented Jun 16, 2020

rmisev commented Jun 16, 2020

masinter commented Jun 16, 2020

masinter commented Jun 16, 2020

mgiuca left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mgiuca commented Jun 17, 2020

annevk commented May 14, 2020 •

edited by pr-preview bot

Loading