Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Editorial: make everything use percent-encode sets #518

Merged
merged 6 commits into from
Jun 24, 2020

Conversation

annevk
Copy link
Member

@annevk annevk commented May 14, 2020

This switches the URL parser's query state and the application/x-www-form-urlencoded's serializer to also use percent-encode sets.

Closes #411.

I'll see about writing a whatwg-url patch.


Preview | Diff

url.bs Outdated Show resolved Hide resolved
Copy link
Member

@domenic domenic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to get whatwg-url fully aligned with these changed sections if possible. (Note that it has a application/x-www-form-urlencoded serializer/parser in https://github.com/jsdom/whatwg-url/blob/master/src/urlencoded.js.)

url.bs Outdated Show resolved Hide resolved
url.bs Outdated Show resolved Hide resolved
@rmisev
Copy link
Member

rmisev commented May 14, 2020

It seems query state with ISO-2022-JP encoding will not work well after this update. For example ISO-2022-JP encodes U+00A5 to bytes: 0x1B 0x28 0x4A 0x5C 0x1B 0x28 0x42.

Before this update, result of encoding U+00A5 in query is: %1B(J\%1B(B
After this update - result is: %1B%28%4A%5C%1B%28%42

@mgiuca
Copy link
Collaborator

mgiuca commented May 15, 2020

It seems query state with ISO-2022-JP encoding will not work well after this update. For example ISO-2022-JP encodes U+00A5 to bytes: 0x1B 0x28 0x4A 0x5C 0x1B 0x28 0x42.

Before this update, result of encoding U+00A5 in query is: %1B(J\%1B(B
After this update - result is: %1B%28%4A%5C%1B%28%42

True, I agree. That's because the old query state logic performed encoding, then checked whether the byte was in the query percent encode set in deciding whether to percent-encode it. The new "percent-encode after encoding" logic checks whether the code point is in the percent encode set, then encodes it, then percent-encodes those bytes unconditionally.

Fixing this is a bit messy. I think we need to shuffle around the "percent-encode after encoding" algorithm:

  • Delete 1. "If codePoint is not in percentEncodeSet, then return codePoint."
  • Replace 5. "For each byte of bytes, percent-encode byte and append the result to output." with "For each byte of bytes, if byte is not an ASCII byte, or if the code point whose value is byte is not in percentEncodeSet, percent-encode byte and append the result to output. Otherwise, append the code point whose value is byte to output.

This involved a fair amount of byte/codepoint gymnastics, which is unavoidable if we want to preserve the existing behaviour (which literally runs an encoder, then represents a selection of bytes using their ASCII equivalents). I'm using the "code point whose value is byte" language, borrowed from the old query state algorithm.

This change should only affect non-UTF-8 encodings. When using UTF-8, all ASCII code points encode to the corresponding ASCII byte (so either way, characters in the encode set will be encoded or not according to the percent-encode set), and all non-ASCII code points encode to a sequence of non-ASCII bytes (so either way, all of those bytes will be percent-encoded).

You might note that in fact this is the only usage of percent-encode sets, which are sets of code points. It would make the above algorithm easier if we converted all the percent-encode sets into sets of bytes (e.g., "U+0023 (#), U+003F (?)" becomes "0x23 (#), 0x3F (?)"), and then "the code point whose value is byte is not in percentEncodeSet" becomes "byte is not in percentEncodeSet". But I would rather not do this, since fundamentally the percent encode sets should represent characters, not bytes.

Copy link
Collaborator

@mgiuca mgiuca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just leaving some comments to address @rmisev 's feedback.

I haven't gone over the rest in fine detail, but a quick read says this looks fantastic! This makes the parser a lot cleaner (at the expense of the percent-encode logic, but I'd rather move complexity out of the parser and into the percent encoder).

url.bs Outdated Show resolved Hide resolved
<li><p>Return <var>output</var>.
</ol>

<p class="note no-backref">This can happen when <var>encoding</var> is not <a>UTF-8</a>.

<li><p>Let <var>output</var> be the empty string.</p></li>

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replace this with:

<li><p>For each <var>byte</var> of <var>bytes, if <var>byte</var> is not an <a>ASCII byte</a>,
or if the code point whose value is <var>byte</var> is not in <var>percentEncodeSet</var>,
<a for=byte>percent-encode</a> <var>byte</var> and append the result to <var>output</var>.
Otherwise, append the code point whose value is <var>byte</var> to <var>output</var>.

(Possibly splitting into some nested list items.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The additional ASCII byte check doesn't seem to be needed?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, anything that's "not an ASCII byte" is always going to be in every percent-encode set (because the C0 control set includes "all code points greater than U+007E (~).").

It's a little dubious to rely on this, however, especially since if we removed that check, we'd be doing these weird comparisons between non-ASCII bytes and non-ASCII code points. For example, if UTF-8 encoder gives us a byte 0xD2, we would be consulting the percent-encode set for the character U+00D2, which will "work", but it has nothing to do with that particular code point. So I put the ASCII byte check, so that then we straight-up guarantee that any non-ASCII byte will be encoded without being converted to an unrelated code point.

It might also be worth adding a note next to the percent-encode sets to say that the encoding algorithm assumes (either way we decide to do this) that all non-ASCII bytes are in every percent-encode set.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, created whatwg/infra#305 to help with this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But also, I think you got the steps the wrong way around, if it is in the set, it should be encoded, if it's not in the set, it should not be.

annevk added a commit to whatwg/infra that referenced this pull request May 15, 2020
This would be useful for whatwg/url#518 and also clarifies isomorphic decode/encode a bit.
annevk added a commit to whatwg/infra that referenced this pull request May 18, 2020
This would be useful for whatwg/url#518, the Encoding Standard, and also clarifies isomorphic decode/encode a bit.

This also requires all code points to be denoted in the same way, using "U+".
url.bs Outdated Show resolved Hide resolved
Copy link
Member

@domenic domenic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed that it's a bit strange that all callers in this spec are to the application/x-www-form-urlencoded string parser, but the non-string parser is positioned as primary. Does anyone call the non-string parser?

domenic added a commit to jsdom/whatwg-url that referenced this pull request May 19, 2020
Follows whatwg/url#518.

This generally tries to make the code correspond more explicitly to the specification in a few ways.
@domenic
Copy link
Member

domenic commented May 19, 2020

I've implemented this in whatwg-url in jsdom/whatwg-url#158, so am happy to give my usual sign-off. However, that doesn't test non-UTF-8 encodings, so manual review would be needed for those code branches.

@annevk
Copy link
Member Author

annevk commented May 20, 2020

@domenic https://fetch.spec.whatwg.org/#concept-body-package-data calls it. I think the main reason I did it this way is that it seems natural to me that for a MIME-type-based format the parser starts from bytes. Though to be fair I guess that should also mean that the serializer goes to bytes which isn't the case (though happens to match the HTML parser).

Copy link
Member

@domenic domenic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM; ready to see it merge

annevk and others added 5 commits June 16, 2020 08:04
This switches the URL parser's query state and the application/x-www-form-urlencoded's serializer to also use percent-encode sets.

Closes #411.
Co-authored-by: Rimas Misevičius <rmisev3@gmail.com>
@annevk annevk force-pushed the annevk/more-percent-encode-sets branch from 91bead4 to 22db5e2 Compare June 16, 2020 06:05
@annevk
Copy link
Member Author

annevk commented Jun 16, 2020

@rmisev @mgiuca any final thoughts?

@rmisev
Copy link
Member

rmisev commented Jun 16, 2020

Isn't it better to define percent encode sets in terms of bytes (not code points)? This will simplify percent encoding algorithm a bit.

@annevk
Copy link
Member Author

annevk commented Jun 16, 2020

The percent-encode algorithms take code points or strings. It seems a bit weird to contrast that with sets of bytes. But it's also not clear what would be simplified. We'd still need to turn the byte into a code point. I guess we could maybe drop the assert?

@rmisev
Copy link
Member

rmisev commented Jun 16, 2020

Yes, drop assert. But maybe it is not worth changing.

@masinter
Copy link

percent encoding takes a sequence of bytes and produces a sequence of (ASCII) characters
A charset like UTF-8 or ISO8859-1 or shift-jis maps a sequence of characters to a sequence of bytes.
no codepoints for shift-jis, for example.

@masinter
Copy link

Keep in mind that percent encoding is used outside of URL processing as a kind of content-encoding.

Copy link
Collaborator

@mgiuca mgiuca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

I'm super happy with this change!! Thanks for taking the time to rework the algorithm based on @rmisev and my comments.

Note: I still have not had time to go through with a fine tooth comb and check that the members of the query percent-encode set are exactly right (for compatibility with the old version), but I checked over the rest of the changes. I trust you :)

<li><p>Let <var>isomorph</var> be a <a for=/>code point</a> whose <a for="code point">value</a>
is <var>byte</var>'s <a for=byte>value</a>.

<li><p>Assert: <var>percentEncodeSet</var> includes all non-<a>ASCII code points</a>.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Link to Assert

<td>"<code>1+1 ≡ 2%20‽</code>"
<td>"<code>1+1+%81%DF+2%20%26%238253%3B</code>"
<tr>
<td rowspan=2><a for="code point">UTF-8 percent-encode</a> <var>input</var> using the
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Perhaps move the UTF-8 examples up above Shift-JIS since they are the far more common example.

@@ -246,9 +305,28 @@ a <var>percentEncodeSet</var>, run these steps:
<td>"<code>‽%25%2E</code>"
<td>0xE2 0x80 0xBD 0x25 0x2E
<tr>
<td><a for="code point">UTF-8 percent-encode</a> <var>input</var> using the
<td rowspan=3><a for="code point">Percent-encode after encoding</a> with <a>Shift_JIS</a>,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to see @rmisev 's example added here, since it's a key case (i.e., all of these examples present would work in your old algorithm and your new one; your fixed algorithm differs only by that special case where some of the bytes that encode the code point are < 128, which seems to happen in ISO-2022-JP but not UTF-8 or Shift-JIS).

So, the example is:

  • Operation: Percent-encode after encoding with ISO-2022-JP, input, and the userinfo percent-encode set
  • Input: U+00A5 (¥)
  • Output: "%1B(J\%1B(B"

@mgiuca
Copy link
Collaborator

mgiuca commented Jun 17, 2020

@masinter :

percent encoding takes a sequence of bytes and produces a sequence of (ASCII) characters

This spec defines both bytes-to-ASCII and codepoints-to-ASCII percent encoding functions. Both are needed by the URL parser (the latter calls the former).

Keep in mind that percent encoding is used outside of URL processing as a kind of content-encoding.

I don't believe these terms are exported. The way it is defined here is specifically for use inside the URL parser / encoder. (For example, JavaScript's encodeURIComponent function defines its own encoding that we believe is compatible with this one, but that doesn't call into here.)

Either way, I believe the set of functions defined here are adequate to serve any external use. If you want to encode a byte sequence (rather than a CP sequence), use the "percent-decode a byte sequence" algorithm above.

@annevk annevk merged commit e5334dd into master Jun 24, 2020
@annevk annevk deleted the annevk/more-percent-encode-sets branch June 24, 2020 11:55
domenic added a commit to jsdom/whatwg-url that referenced this pull request Jun 25, 2020
Follows whatwg/url#518.

This generally tries to make the code correspond more explicitly to the specification in a few ways. It doesn't handle non-UTF-8 encodings yet, though.

Supersedes #152.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

Clarify query state percent-encoding (parsing)
5 participants