From a6b7d24850f5d776280153682559d897d22feb3a Mon Sep 17 00:00:00 2001
From: Anne van Kesteren
Date: Thu, 14 May 2020 12:00:52 +0200
Subject: [PATCH 1/6] Editorial: make everything use percent-encode sets

This switches the URL parser's query state and the application/x-www-form-urlencoded's serializer to also use percent-encode sets.

Closes #411.
---
url.bs | 225 +++++++++++++++++++++++++++------------------------------
1 file changed, 107 insertions(+), 118 deletions(-)

diff --git a/url.bs b/url.bs
index 48d4d553..8b4faa92 100644
--- a/url.bs
+++ b/url.bs
@@ -180,27 +180,56 @@ the input, and percent-decoding results in a byte sequence with less 0x25 (%) by
and all code points greater than U+007E (~).

The fragment percent-encode set is the C0 control percent-encode set and -U+0020 SPACE, U+0022 ("), U+003C (<), U+003E (>), and U+0060 (`). +U+0020 SPACE, U+0022 ("), U+003C (<), U+003E (>), and U+0060 (`). + +

The query percent-encode set is the C0 control percent-encode set and +U+0020 SPACE, U+0022 ("), U+0023 (#), U+003C (<), and U+003E (>). + +

The query percent-encode set cannot be defined in terms of the +fragment percent-encode set due to the omission of U+0060 (`). + +

The special-query percent-encode set is the query percent-encode set and +U+0027 (').

The path percent-encode set is the -fragment percent-encode set and U+0023 (#), U+003F (?), U+007B ({), and U+007D (}). +query percent-encode set and U+003F (?), U+0060 (`), U+007B ({), and U+007D (}).

The userinfo percent-encode set is the path percent-encode set and U+002F (/), U+003A (:), U+003B (;), U+003D (=), U+0040 (@), U+005B ([) to U+005E (^), inclusive, and U+007C (|). -

The application/x-www-form-urlencoded format's -byte serializer and the URL parser's -query state use percent-encode directly without any of these sets. +

The application/x-www-form-urlencoded percent-encode set is the +userinfo percent-encode set and U+0021 (!), U+0024 ($) to U+0029 RIGHT PARENTHESIS, +inclusive, U+002B (+), U+002C (,), and U+007E (~). -

To UTF-8 percent-encode a -code point codePoint using a percentEncodeSet, run these steps: +

The application/x-www-form-urlencoded percent-encode set contains all code +points, except the ASCII alphanumeric, U+002A (*), U+002D (-), U+002E (.), and U+005F (_). + +

To percent-encode after encoding, given an encoding +encoding, code point codePoint, and a +percentEncodeSet, run these steps:

  1. If codePoint is not in percentEncodeSet, then return codePoint. -

  2. Let bytes be the result of running UTF-8 encode on codePoint. +

  3. Let bytes be the result of encoding codePoint using + encoding. + +

  4. +

    If bytes starts with 0x26 (&) 0x23 (#) and ends with 0x3B (;), then: + +

      +
    1. Let output be bytes, isomorphic decoded. + +

    2. Replace the first two code points of output with "%26%23". + +

    3. Replace the last code point of output with "%3B". + +

    4. Return output. +

    + +

    This can happen when encoding is not UTF-8.

  5. Let output be the empty string.

  6. @@ -210,19 +239,38 @@ U+005B ([) to U+005E (^), inclusive, and U+007C (|).
  7. Return output.

-

To UTF-8 percent-encode a string input using -a percentEncodeSet, run these steps: +

To percent-encode after encoding, given an encoding +encoding, string input, a percentEncodeSet, and a +boolean spaceAsPlus, run these steps:

  1. Let output be the empty string.

  2. -
  3. For each codePoint of input, - UTF-8 percent-encode codePoint using percentEncodeSet - and append the result to output. +

  4. +

    For each codePoint of input: + +

      +
    1. If spaceAsPlus is true and codePoint is U+0020, then append + U+002B (+) to output. + +

    2. Otherwise, run percent-encode after encoding with + encoding, codePoint, and percentEncodeSet, and append the result + to output. +

  5. Return output.

+

To UTF-8 percent-encode a +code point codePoint using a percentEncodeSet, return the result +of running percent-encode after encoding with UTF-8, +codePoint, and percentEncodeSet. + +

To UTF-8 percent-encode a string input using +a percentEncodeSet, return the result of running +percent-encode after encoding with UTF-8, input, and +percentEncodeSet. +


@@ -246,9 +294,28 @@ a percentEncodeSet, run these steps:
"‽%25%2E" 0xE2 0x80 0xBD 0x25 0x2E
- UTF-8 percent-encode input using the
+ Percent-encode after encoding with Shift_JIS,
+ input, and the userinfo percent-encode set
+ U+0020
+ "%20"
+
+ U+2261 (≡)
+ "%81%DF"
+
+ U+203D (‽)
+ "%26%238253%3B"
+
+ Percent-encode after encoding with Shift_JIS, input, the
+ userinfo percent-encode set, and true
+ "1+1 ≡ 2%20‽"
+ "1+1+%81%DF+2%20%26%238253%3B"
+
+ UTF-8 percent-encode input using the userinfo percent-encode set
- U+203D
+ U+2261 (≡)
+ "%E2%89%A1"
+
+ U+203D (‽) "%E2%80%BD" UTF-8 percent-encode input using the
@@ -2362,46 +2429,12 @@ string input, optionally with a base URL base, opti
  • If c is U+0025 (%) and remaining does not start with two ASCII hex digits, validation error. -

  • Let bytes be the result of encoding c using - encoding. - -

  • -

    If bytes starts with `&#` and ends with 0x3B (;), then: +

  • Let queryPercentEncodeSet be the special-query percent-encode set if + url is special; otherwise the query percent-encode set. -

      -
    1. Replace `&#` at the start of bytes with - `%26%23`. - -

    2. Replace 0x3B (;) at the end of bytes with `%3B`. - -

    3. Append bytes, isomorphic decoded, to url's - query. -

    - -

    This can happen when encoding code points using - a non-UTF-8 encoding. - -

  • -

    Otherwise, for each byte in bytes: - -

      -
    1. -

      If one of the following is true: - -

        -
      • byte is less than 0x21 (!) -

      • byte is greater than 0x7E (~) -

      • byte is 0x22 ("), 0x23 (#), 0x3C (<), or 0x3E (>) -

      • byte is 0x27 (') and url is special -

      - - -

      then append byte, percent-encoded, to - url's query. - -

    2. Otherwise, append a code point whose value is byte to - url's query. -

    +
  • Percent-encode after encoding, with encoding,
+ c, and queryPercentEncodeSet, and append the result to url's
+ query.
@@ -2716,50 +2749,6 @@ takes a byte sequence input, and then runs these steps:

    application/x-www-form-urlencoded serializing

    -

    The -application/x-www-form-urlencoded byte serializer -takes a byte sequence input and then runs these steps: - -

      -
    1. Let output be the empty string. -

    2. -

      For each byte in input, depending on - byte: - -

      -
      0x20 (SP) -

      Append U+002B (+) to output. - -

      0x2A (*) -
      0x2D (-) -
      0x2E (.) -
      0x30 (0) to 0x39 (9) -
      0x41 (A) to 0x5A (Z) -
      0x5F (_) -
      0x61 (a) to 0x7A (z) -

      Append a code point whose value is byte to - output. - -

      Otherwise -

      Append byte, - percent-encoded, to - output. -

      -
    3. Return output. -

    - -

    The application/x-www-form-urlencoded serializer
takes a list of name-value tuples tuples, optionally with an encoding
@@ -2768,27 +2757,30 @@ takes a list of name-value tuples tuples, optionally with an

  • Let encoding be UTF-8. -

  • If encoding override is given, set encoding to the result of +

  • If encoding override is given, then set encoding to the result of getting an output encoding from encoding override.

  • Let output be the empty string.

  • -

    For each tuple in tuples: +

    For each tuple of tuples:

      -
    1. Let name be the result of serializing - the result of encoding tuple's name, using encoding. +

    2. Let name be the result of running + percent-encode after encoding with encoding, + tuple's name, the + application/x-www-form-urlencoded percent-encode set, and true.

    3. Let value be tuple's value.

    4. If value is a file, then set value to value's filename. -

    5. Set value to the result of serializing - the result of encoding value, using encoding. +

    6. Set value to the result of running + percent-encode after encoding with encoding, value, the + application/x-www-form-urlencoded percent-encode set, and true. -

    7. If tuple is not the first pair in tuples, then append - U+0026 (&) to output. +

    8. If tuple is not tuples[0], then append U+0026 (&) to + output.

    9. Append name, followed by U+003D (=), followed by value, to output.
@@ -3179,18 +3171,13 @@ console.log(url.search); // "?a=~&b=%7E"
console.log(url.searchParams.get('a')); // "~"
console.log(url.searchParams.get('b')); // "~"
-

      {{URLSearchParams}} objects will percent-encode: C0 controls, U+0021 (!) to - U+0029 RIGHT PARENTHESIS, inclusive, U+002B (+), U+002C (,), U+002F (/), U+003A (:) to U+0040 (@), - inclusive, U+005B ([) to U+005E (^), inclusive, U+0060 (`), and anything greater than U+007A (z). - And will encode U+0020 SPACE as U+002B (+). - - -

      Ignoring encodings (use UTF-8), {{URL/search}} will percent-encode U+0000 NULL to - U+0020 SPACE, inclusive, U+0022 ("), U+0023 (#), U+0027 (') varying on is special, - U+003C (<), U+003E (>), and anything greater than U+007E (~). - +

      {{URLSearchParams}} objects will percent-encode anything in the + application/x-www-form-urlencoded percent-encode set. And will encode + U+0020 SPACE as U+002B (+). + +

      Ignoring encodings (use UTF-8), {{URL/search}} will percent-encode anything in the + query percent-encode set or the special-query percent-encode set (depending on + whether or not the URL is special).

  • A {{URLSearchParams}} object has an associated:
@@ -3430,10 +3417,12 @@ Marijn Kruisselbrink,
Martin Dürst,
Mathias Bynens,
Matt Falkenhagen,
+Matt Giuca,
Michael Peick,
Michael™ Smith,
Michal Bukovský,
Michel Suignard,
+Mikaël Geljić,
Noah Levitt,
Peter Occil,
Philip Jägenstedt,

From e556c133fe5457d0a710b740c4594c0eb9c2136f Mon Sep 17 00:00:00 2001
From: Anne van Kesteren
Date: Thu, 14 May 2020 16:09:07 +0200
Subject: [PATCH 2/6] oops
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Co-authored-by: Rimas Misevičius
---
url.bs | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/url.bs b/url.bs
index 8b4faa92..13a8fdcf 100644
--- a/url.bs
+++ b/url.bs
@@ -268,8 +268,8 @@ of running percent-encode after encoding with U

    To UTF-8 percent-encode a string input using a percentEncodeSet, return the result of running -percent-encode after encoding with UTF-8, input, and -percentEncodeSet. +percent-encode after encoding with UTF-8, input, +percentEncodeSet, and false.
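Review aid, not part of the patch: a minimal Python sketch of the percent-encode sets as PATCH 1/6 defines them, modelled as predicates over code point values. The helper names (`_extend`, `in_query_set`, and so on) are hypothetical and do not appear in url.bs. The checks at the end illustrate two claims from the patch: the note about what the application/x-www-form-urlencoded set leaves out, and that redefining the path set on top of the query set keeps its membership unchanged.

```python
import string

def in_c0_control_set(cp: int) -> bool:
    # C0 controls and all code points greater than U+007E (~).
    return cp <= 0x1F or cp > 0x7E

def _extend(base, extra: str):
    # Hypothetical helper: a new set is "base set and these code points".
    extra_cps = frozenset(map(ord, extra))
    return lambda cp: base(cp) or cp in extra_cps

in_fragment_set = _extend(in_c0_control_set, ' "<>`')
in_query_set = _extend(in_c0_control_set, ' "#<>')
in_special_query_set = _extend(in_query_set, "'")
in_path_set = _extend(in_query_set, "?`{}")
in_userinfo_set = _extend(in_path_set, "/:;=@[\\]^|")
in_form_urlencoded_set = _extend(in_userinfo_set, "!$%&'()+,~")

# Everything above U+007E is in every set, so checking ASCII is sufficient.
not_encoded = {cp for cp in range(0x80) if not in_form_urlencoded_set(cp)}
assert not_encoded == set(map(ord, string.ascii_letters + string.digits + "*-._"))

# Old definition of the path set: fragment set and # ? { }.
old_path_set = _extend(in_fragment_set, "#?{}")
assert all(old_path_set(cp) == in_path_set(cp) for cp in range(0x80))
```

Predicates are used instead of materialized sets because every set includes the open-ended range above U+007E.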


From 7db5b69ae34dfe426e178e66b43cb73204c12151 Mon Sep 17 00:00:00 2001
From: Anne van Kesteren
Date: Fri, 15 May 2020 08:59:36 +0200
Subject: [PATCH 3/6] (this will fail pending Infra changes) percent-encoding after encoding is very tricky

---
url.bs | 26 ++++++++++++++++++--------
1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/url.bs b/url.bs
index 13a8fdcf..154e7cad 100644
--- a/url.bs
+++ b/url.bs
@@ -202,17 +202,15 @@ U+005B ([) to U+005E (^), inclusive, and U+007C (|).
userinfo percent-encode set and U+0021 (!), U+0024 ($) to U+0029 RIGHT PARENTHESIS,
inclusive, U+002B (+), U+002C (,), and U+007E (~).
-

    The application/x-www-form-urlencoded percent-encode set contains all code -points, except the ASCII alphanumeric, U+002A (*), U+002D (-), U+002E (.), and U+005F (_). +

    The application/x-www-form-urlencoded percent-encode set contains +all code points, except the ASCII alphanumeric, U+002A (*), U+002D (-), U+002E (.), and +U+005F (_).

    To percent-encode after encoding, given an encoding encoding, code point codePoint, and a percentEncodeSet, run these steps:

      -
    1. If codePoint is not in percentEncodeSet, then return - codePoint. -

    2. Let bytes be the result of encoding codePoint using encoding. @@ -233,8 +231,20 @@ points, except the ASCII alphanumeric, U+002A (*), U+002D (-), U+002E (.)

    3. Let output be the empty string.

    4. -
    5. For each byte of bytes, percent-encode - byte and append the result to output. +

    6. +

      For each byte of bytes: + +

        +
      1. Let codePoint be a code point whose value + is byte's value. + +

      2. Assert: percentEncodeSet includes all non-ASCII code points. + +

      3. If codePoint is not in percentEncodeSet, then append + codePoint to output. + +

      4. Otherwise, percent-encode byte and append the result to + output.

      5. Return output.

@@ -3172,7 +3182,7 @@ console.log(url.searchParams.get('a')); // "~"
console.log(url.searchParams.get('b')); // "~"

      {{URLSearchParams}} objects will percent-encode anything in the - application/x-www-form-urlencoded percent-encode set. And will encode + application/x-www-form-urlencoded percent-encode set, and will encode U+0020 SPACE as U+002B (+).

      Ignoring encodings (use UTF-8), {{URL/search}} will percent-encode anything in the

From a6f62de345bd2deb78d160ebfdfec741c180fd16 Mon Sep 17 00:00:00 2001
From: Anne van Kesteren
Date: Fri, 15 May 2020 09:01:55 +0200
Subject: [PATCH 4/6] nit

---
url.bs | 1 +
1 file changed, 1 insertion(+)

diff --git a/url.bs b/url.bs
index 154e7cad..795e3823 100644
--- a/url.bs
+++ b/url.bs
@@ -245,6 +245,7 @@ U+005F (_).

    7. Otherwise, percent-encode byte and append the result to output. +

  • Return output.

From 22db5e2181c947b724dda4e9a1ee3dc6bba09b1e Mon Sep 17 00:00:00 2001
From: Anne van Kesteren
Date: Wed, 20 May 2020 07:17:55 +0200
Subject: [PATCH 5/6] rename inner variable

---
url.bs | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/url.bs b/url.bs
index 795e3823..0d1401ed 100644
--- a/url.bs
+++ b/url.bs
@@ -235,13 +235,13 @@ U+005F (_).

    For each byte of bytes:

      -
    1. Let codePoint be a code point whose value +

    2. Let isomorph be a code point whose value is byte's value.

    3. Assert: percentEncodeSet includes all non-ASCII code points. -

    4. If codePoint is not in percentEncodeSet, then append - codePoint to output. +

    5. If isomorph is not in percentEncodeSet, then append + isomorph to output.

    6. Otherwise, percent-encode byte and append the result to output.

From 300c3c43b5c4cc0649e80fc3c29e6e3d907fda3b Mon Sep 17 00:00:00 2001
From: Anne van Kesteren
Date: Thu, 18 Jun 2020 16:19:22 +0200
Subject: [PATCH 6/6] add example

---
url.bs | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/url.bs b/url.bs
index 0d1401ed..008aff56 100644
--- a/url.bs
+++ b/url.bs
@@ -290,8 +290,8 @@ a percentEncodeSet, return the result of running
+
      Operation - Example input - Example output + Input + Output
      Percent-encode input 0x7F @@ -315,6 +315,11 @@ a percentEncodeSet, return the result of running
      U+203D (‽) "%26%238253%3B" +
      Percent-encode after encoding with ISO-2022-JP, + input, and the userinfo percent-encode set + U+00A5 (¥) + "%1B(J\%1B(B"
      Percent-encode after encoding with Shift_JIS, input, the userinfo percent-encode set, and true
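
Review aid, not part of the patch: a rough Python sketch of "percent-encode after encoding" as it stands once the whole series is applied, that is, without the early return that PATCH 3/6 removes. It rests on three assumptions. The function names are hypothetical. Python's `shift_jis` codec is taken as a stand-in for the Encoding Standard's encoder, and the two can differ in edge cases. Python's `xmlcharrefreplace` error handler is taken as approximating the `&#...;` fallback the algorithm checks for. The userinfo percent-encode set is flattened into a single predicate.

```python
from typing import Callable

def in_userinfo_set(cp: int) -> bool:
    # userinfo percent-encode set, flattened from the definitions in PATCH 1/6.
    return cp <= 0x1F or cp > 0x7E or chr(cp) in ' "#<>?`{}/:;=@[\\]^|'

def percent_encode_code_point(encoding: str, code_point: str,
                              in_set: Callable[[int], bool]) -> str:
    # Unencodable code points become b"&#NNNN;" (assumed fallback behaviour).
    data = code_point.encode(encoding, errors="xmlcharrefreplace")
    if data.startswith(b"&#") and data.endswith(b";"):
        output = data.decode("latin-1")  # isomorphic decode
        return "%26%23" + output[2:-1] + "%3B"
    output = ""
    for byte in data:
        # "isomorph" in the patch: the code point whose value is the byte.
        # The set is asserted to contain all non-ASCII code points, so any
        # byte above 0x7F is always escaped.
        if not in_set(byte):
            output += chr(byte)
        else:
            output += f"%{byte:02X}"
    return output

def percent_encode_string(encoding: str, text: str,
                          in_set: Callable[[int], bool],
                          space_as_plus: bool = False) -> str:
    output = ""
    for code_point in text:
        if space_as_plus and code_point == " ":
            output += "+"
        else:
            output += percent_encode_code_point(encoding, code_point, in_set)
    return output

# Rows from the example table in the patch.
print(percent_encode_code_point("shift_jis", "\u2261", in_userinfo_set))
print(percent_encode_string("shift_jis", "1+1 \u2261 2%20\u203d",
                            in_userinfo_set, space_as_plus=True))
```

The membership test runs on each encoded byte, not on the input code point. This is why the two bytes Shift_JIS produces for U+2261 are each escaped.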