Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

should we attempt to URL canonicalize patterns? #33

Closed
wanderview opened this issue Dec 9, 2020 · 35 comments
Closed

should we attempt to URL canonicalize patterns? #33

wanderview opened this issue Dec 9, 2020 · 35 comments
Labels

Comments

@wanderview
Copy link
Member

wanderview commented Dec 9, 2020

The URL() constructor does a number of things to canonicalize the input string. For example:

  • Illegal characters are automatically URL encoded. For most components this is percent encoding, but for hostname it appears to use IDNA encoding, etc.
  • Some characters, like {, normally get percent encoded by URL encoding, but URLPattern uses it as one of its special characters.
  • For some protocols (http/https/etc) the pathname is required to consist of at least /. The slash is added if missing.
  • For some protocols (http/https/etc) the pathname is flattened to remove .. and ..
  • If the port matches the default port for a given protocol, then it is coerced to the empty string.
  • Probably many others...

So the question is can we and should we apply these transformations to component patterns. My original intent was to try to do so, but I have identified a number of problems:

  1. Often the transformations are not safe within the custom regexp pattern groupings. For example, if you try to percent encode a unicode character within a regexp [a-z] character list it won't work correctly since each character in the percent encoding is considered independently. Similar difficulties arise with the .. and . flattening since those characters may appear within a regexp. In general we don't want to have to duplicate the regexp parser in URLPattern, so solving this seems quite difficult.
  2. A URLPattern may often not have a fixed protocol. This makes it difficult to apply transformations that are conditional on protocol.
  3. Developers may confused that some characters are URL encoded, but others are not.

Therefore I'm leaning towards not applying any canonicalization or automatic encoding for patterns. For example, if a non-ascii character is included in the pattern we would throw. In contrast, however, URL values passed to test() and exec() would be fully URL canonicalized. Therefore developers would be required to write patterns to match canonical URLs, but we would not fully enforce or automatically help with that at URLPattern construction time.

I intend to implement the above to start, but if it becomes a problem we could fall back to the a half measure. The URL canonicalization could be applied to patterns, but only outside of custom regexp groups. If you include a unicode character within a custom regexp then URLPattern would throw, but otherwise would be automatically URL encoded. Of course, pattern special characters like { would still need to be exempted from encoding, so it would not be quite equivalent to URL constructor behavior.

Generally it will be easier to move from the "no canonicalization" approach to "canonicalize outside of regexp" without breaking existing patterns.

Thoughts?

@asutherland
Copy link

"no canonicalization" with aggressive throwing seems most straightforward and thereby understandable/document-able/reason-about-able.

@annevk
Copy link
Member

annevk commented Dec 9, 2020

While understandable in a way, it would make the design rather Latin-script centric. I suppose we also have that problem with the Access-Control-Allow-Origin header, but generally there are no such constraints when it comes to using URLs.

@wanderview
Copy link
Member Author

Well, we could try the URL encoding of characters outside of the custom regexp groups. To better handle special characters like {, we could make an escaped \{ result in a percent encoded {.

It would still be a bit inconsistent and magical in places, but maybe ergonomic enough?

For first implementation, though, I'm going to start with no canonicalization in patterns.

@asutherland
Copy link

My motivating scenario is a developer struggling to get a given URLPattern to work but the problem is obscured because of canonicalization and the user assumes the problem is in the pattern match parts of the string rather than the constant parts because they're expecting the pattern matching magic but not the canonicalization magic. Specifically, situations like copying and pasting picking up fancy text styling, for example the pattern prefix "/raw–source" which would canonicalize to "/foo%E2%80%93bar" where "/raw-source" was what was intended.

I assume the pre-canonicalized version would never be exposed, so maybe this is fine as long as people don't use the idiom (new URLPattern(...)).test("foo") from the developer console (which would not pretty print the URLPattern and reveal the canonicalization versus using a temporary). Also, I suppose this is exactly the kind of thing where devtools would do well to detect what's going on and active some kind of specialized tool that breaks down exactly what's going on under the hood.

@wanderview
Copy link
Member Author

Yea, I expect we will expose any canonicalized patterns on the URLPattern either directly or via an object stringifier.

@wanderview
Copy link
Member Author

I realized one bad ergonomic result of not supporting . flattening. Many developers start relative URLs with ./. The dot ultimately gets flattened after the path is appended to the base. If patterns don't work the same way then I expect we will be creating a decent sized footgun.

@wanderview
Copy link
Member Author

I found another canonicalization. If a port number matches the default port for a protocol, then its coerced to the empty string. I'll add this to the list at the top of this issue.

This is one we could try to handle for a pattern. If the pattern parses as the default number, then do the same. It basically would only work for number literals and would not work if any pattern syntax is included.

@annevk
Copy link
Member

annevk commented Dec 11, 2020

You can only normalize that if there is a scheme though, right? Otherwise you could only do that during matching it seems.

@wanderview
Copy link
Member Author

Correct. Default ports would only be canonicalized to empty string if the protocol is unambiguously known. So a protocol of exactly "https", etc.

I would note this also applies to the input of test() and exec() when using structured input like #27. If you do p.test({ port: '80' }) the port is not canonicalized to default string because we don't know the protocol. I don't expect setting a port, but not a protocol, to be very common, though.

@wanderview
Copy link
Member Author

Note, I opened a github "discussion" to collect community feedback on this issue. See #37.

@wanderview
Copy link
Member Author

I have canonicalization prototyped for patterns now. What I settled on was running flat strings from that pattern through the browser's relevant component canonicalization routine.

For example, the pattern { pathname: '/foo/../bar/:name.html' } would canonicalize /foo/../bar, /, and .html separately. So in this example you end up with a pattern that matches a pathname /bar/something.html correctly.

It is possible for pattern authors to break this normalization by splitting strings up with groupings an what not. For example, /foo/{..}/:name.html would prevent the canonicalization routine from flatting the .. directory structure since it doesn't see the /../ sequence. While a bit unfortunate, I think its reasonable to accept this limitation and put this in the "don't do that" category.

Those kinds of corner cases aside, I think this will make it more convenient for pattern authors than having to manually do canonicalization in advance for patterns themselves.

I should also mention that "run the url component canonicalization" routine on each part of the pattern is also not quite true. I have a couple of exceptions. For example, I don't apply the part about adding a leading / that chromium's canonicalizer tries to add. I also don't do the bit for port where it flattens the default port to the empty string. We can't do that on partial parts of the input, but instead of have to apply it to the whole pattern, etc.

@wanderview
Copy link
Member Author

Oh, a side note. One advantage of running the canonicalizer on the post-parse output of the pattern is that we can re-use the existing canonicalization routine without worrying about conflicting with special pattern syntax. All that syntax has been parsed out already. Similarly, escaped characters have either been treated specially already or flattened into the pattern string so they will be encoded if necessary; e.g. { is a special character in patterns, but \{ should be encoded as %7B.

chromium-wpt-export-bot pushed a commit to web-platform-tests/wpt that referenced this issue Mar 1, 2021
This CL adds an encoding callback to liburlpattern::Parse().  The
parse will invoke the given callback for plaintext parts of the pattern
to validate and encode the characters.  This callback mechanism is then
used to apply the chromium url canonicalization code for each component
pattern.

There are a couple of behaviors in the canonicalizer that do not play
well with this approach that the CL works around:

1. The port canonicalizer will replace an exact default port with the
   empty string.  Since the liburlpattern::Parse() callback is invoked
   for partial values this CL instead implements this canoncilization
   separately before pattern compilation.
2. The URL canonicalizer will prepend a leading `/` character if there
   isn't one.  Again, this behavior does not make sense when operating
   on partial values.  In addition, URLPattern doesn't want this
   behavior at all since we often don't know the protocol and some types
   of URLs do not require the pathname to begin with `/`.  This is
   worked around by simply detecting the addition of the `/` and
   stripping it off.

The CL adds a number of additional WPT test cases validating the new
canonicalization behavior.

The behavior in this test has been discussed in this spec issue:

whatwg/urlpattern#33

Bug: 1141510
Change-Id: I388be5d0cc57b125d44465b283050df5ed0b5321
chromium-wpt-export-bot pushed a commit to web-platform-tests/wpt that referenced this issue Mar 2, 2021
This CL adds an encoding callback to liburlpattern::Parse().  The
parse will invoke the given callback for plaintext parts of the pattern
to validate and encode the characters.  This callback mechanism is then
used to apply the chromium url canonicalization code for each component
pattern.

There are a couple of behaviors in the canonicalizer that do not play
well with this approach that the CL works around:

1. The port canonicalizer will replace an exact default port with the
   empty string.  Since the liburlpattern::Parse() callback is invoked
   for partial values this CL instead implements this canoncilization
   separately before pattern compilation.
2. The URL canonicalizer will prepend a leading `/` character if there
   isn't one.  Again, this behavior does not make sense when operating
   on partial values.  In addition, URLPattern doesn't want this
   behavior at all since we often don't know the protocol and some types
   of URLs do not require the pathname to begin with `/`.  This is
   worked around by simply detecting the addition of the `/` and
   stripping it off.

The CL adds a number of additional WPT test cases validating the new
canonicalization behavior.

The behavior in this test has been discussed in this spec issue:

whatwg/urlpattern#33

Bug: 1141510
Change-Id: I388be5d0cc57b125d44465b283050df5ed0b5321
chromium-wpt-export-bot pushed a commit to web-platform-tests/wpt that referenced this issue Mar 2, 2021
This CL adds an encoding callback to liburlpattern::Parse().  The
parse will invoke the given callback for plaintext parts of the pattern
to validate and encode the characters.  This callback mechanism is then
used to apply the chromium url canonicalization code for each component
pattern.

There are a couple of behaviors in the canonicalizer that do not play
well with this approach that the CL works around:

1. The port canonicalizer will replace an exact default port with the
   empty string.  Since the liburlpattern::Parse() callback is invoked
   for partial values this CL instead implements this canoncilization
   separately before pattern compilation.
2. The URL canonicalizer will prepend a leading `/` character if there
   isn't one.  Again, this behavior does not make sense when operating
   on partial values.  In addition, URLPattern doesn't want this
   behavior at all since we often don't know the protocol and some types
   of URLs do not require the pathname to begin with `/`.  This is
   worked around by simply detecting the addition of the `/` and
   stripping it off.

The CL adds a number of additional WPT test cases validating the new
canonicalization behavior.

The behavior in this test has been discussed in this spec issue:

whatwg/urlpattern#33

Bug: 1141510
Change-Id: I388be5d0cc57b125d44465b283050df5ed0b5321
chromium-wpt-export-bot pushed a commit to web-platform-tests/wpt that referenced this issue Mar 3, 2021
This CL adds an encoding callback to liburlpattern::Parse().  The
parse will invoke the given callback for plaintext parts of the pattern
to validate and encode the characters.  This callback mechanism is then
used to apply the chromium url canonicalization code for each component
pattern.

There are a couple of behaviors in the canonicalizer that do not play
well with this approach that the CL works around:

1. The port canonicalizer will replace an exact default port with the
   empty string.  Since the liburlpattern::Parse() callback is invoked
   for partial values this CL instead implements this canoncilization
   separately before pattern compilation.
2. The URL canonicalizer will prepend a leading `/` character if there
   isn't one.  Again, this behavior does not make sense when operating
   on partial values.  In addition, URLPattern doesn't want this
   behavior at all since we often don't know the protocol and some types
   of URLs do not require the pathname to begin with `/`.  This is
   worked around by simply detecting the addition of the `/` and
   stripping it off.

The CL adds a number of additional WPT test cases validating the new
canonicalization behavior.

The behavior in this test has been discussed in this spec issue:

whatwg/urlpattern#33

Bug: 1141510
Change-Id: I388be5d0cc57b125d44465b283050df5ed0b5321
chromium-wpt-export-bot pushed a commit to web-platform-tests/wpt that referenced this issue Mar 3, 2021
This CL adds an encoding callback to liburlpattern::Parse().  The
parse will invoke the given callback for plaintext parts of the pattern
to validate and encode the characters.  This callback mechanism is then
used to apply the chromium url canonicalization code for each component
pattern.

There are a couple of behaviors in the canonicalizer that do not play
well with this approach that the CL works around:

1. The port canonicalizer will replace an exact default port with the
   empty string.  Since the liburlpattern::Parse() callback is invoked
   for partial values this CL instead implements this canoncilization
   separately before pattern compilation.
2. The URL canonicalizer will prepend a leading `/` character if there
   isn't one.  Again, this behavior does not make sense when operating
   on partial values.  In addition, URLPattern doesn't want this
   behavior at all since we often don't know the protocol and some types
   of URLs do not require the pathname to begin with `/`.  This is
   worked around by simply detecting the addition of the `/` and
   stripping it off.

The CL adds a number of additional WPT test cases validating the new
canonicalization behavior.

The behavior in this test has been discussed in this spec issue:

whatwg/urlpattern#33

Bug: 1141510
Change-Id: I388be5d0cc57b125d44465b283050df5ed0b5321
chromium-wpt-export-bot pushed a commit to web-platform-tests/wpt that referenced this issue Mar 3, 2021
This CL adds an encoding callback to liburlpattern::Parse().  The
parse will invoke the given callback for plaintext parts of the pattern
to validate and encode the characters.  This callback mechanism is then
used to apply the chromium url canonicalization code for each component
pattern.

There are a couple of behaviors in the canonicalizer that do not play
well with this approach that the CL works around:

1. The port canonicalizer will replace an exact default port with the
   empty string.  Since the liburlpattern::Parse() callback is invoked
   for partial values this CL instead implements this canoncilization
   separately before pattern compilation.
2. The URL canonicalizer will prepend a leading `/` character if there
   isn't one.  Again, this behavior does not make sense when operating
   on partial values.  In addition, URLPattern doesn't want this
   behavior at all since we often don't know the protocol and some types
   of URLs do not require the pathname to begin with `/`.  This is
   worked around by simply detecting the addition of the `/` and
   stripping it off.

The CL adds a number of additional WPT test cases validating the new
canonicalization behavior.

The behavior in this test has been discussed in this spec issue:

whatwg/urlpattern#33

Bug: 1141510
Change-Id: I388be5d0cc57b125d44465b283050df5ed0b5321
chromium-wpt-export-bot pushed a commit to web-platform-tests/wpt that referenced this issue Mar 3, 2021
This CL adds an encoding callback to liburlpattern::Parse().  The
parse will invoke the given callback for plaintext parts of the pattern
to validate and encode the characters.  This callback mechanism is then
used to apply the chromium url canonicalization code for each component
pattern.

There are a couple of behaviors in the canonicalizer that do not play
well with this approach that the CL works around:

1. The port canonicalizer will replace an exact default port with the
   empty string.  Since the liburlpattern::Parse() callback is invoked
   for partial values this CL instead implements this canoncilization
   separately before pattern compilation.
2. The URL canonicalizer will prepend a leading `/` character if there
   isn't one.  Again, this behavior does not make sense when operating
   on partial values.  Therefore this CL exposes the internal partial
   path canonicalization routine so that we can use it in URLPattern.

In addition, this CL removes a DCHECK from url's DoPartialPath() that
asserted there was always a character preceeding a dot.  The DCHECK has
had a runtime check checking the same behavior since 2013 so it seems
safe to remove the DCHECK.  And in this case we want to be able to
run the canonicalize partial paths that do start with dots.

The CL adds a number of additional WPT test cases validating the new
canonicalization behavior.

The behavior in this test has been discussed in this spec issue:

whatwg/urlpattern#33

Bug: 1141510
Change-Id: I388be5d0cc57b125d44465b283050df5ed0b5321
chromium-wpt-export-bot pushed a commit to web-platform-tests/wpt that referenced this issue Mar 4, 2021
This CL adds an encoding callback to liburlpattern::Parse().  The
parse will invoke the given callback for plaintext parts of the pattern
to validate and encode the characters.  This callback mechanism is then
used to apply the chromium url canonicalization code for each component
pattern.

There are a couple of behaviors in the canonicalizer that do not play
well with this approach that the CL works around:

1. The port canonicalizer will replace an exact default port with the
   empty string.  Since the liburlpattern::Parse() callback is invoked
   for partial values this CL instead implements this canoncilization
   separately before pattern compilation.
2. The URL canonicalizer will prepend a leading `/` character if there
   isn't one.  Again, this behavior does not make sense when operating
   on partial values.  Therefore this CL exposes the internal partial
   path canonicalization routine so that we can use it in URLPattern.

In addition, this CL removes a DCHECK from url's DoPartialPath() that
asserted there was always a character preceeding a dot.  The DCHECK has
had a runtime check checking the same behavior since 2013 so it seems
safe to remove the DCHECK.  And in this case we want to be able to
run the canonicalize partial paths that do start with dots.

The CL adds a number of additional WPT test cases validating the new
canonicalization behavior.

The behavior in this test has been discussed in this spec issue:

whatwg/urlpattern#33

Bug: 1141510
Change-Id: I388be5d0cc57b125d44465b283050df5ed0b5321
chromium-wpt-export-bot pushed a commit to web-platform-tests/wpt that referenced this issue Mar 4, 2021
This CL adds an encoding callback to liburlpattern::Parse().  The
parse will invoke the given callback for plaintext parts of the pattern
to validate and encode the characters.  This callback mechanism is then
used to apply the chromium url canonicalization code for each component
pattern.

There are a couple of behaviors in the canonicalizer that do not play
well with this approach that the CL works around:

1. The port canonicalizer will replace an exact default port with the
   empty string.  Since the liburlpattern::Parse() callback is invoked
   for partial values this CL instead implements this canoncilization
   separately before pattern compilation.
2. The URL canonicalizer will prepend a leading `/` character if there
   isn't one.  Again, this behavior does not make sense when operating
   on partial values.  Therefore this CL exposes the internal partial
   path canonicalization routine so that we can use it in URLPattern.

In addition, this CL removes a DCHECK from url's DoPartialPath() that
asserted there was always a character preceding a dot.  The DCHECK has
had a runtime check checking the same behavior since 2013 so it seems
safe to remove the DCHECK.  And in this case we want to be able to
run the canonicalize partial paths that do start with dots.

The CL adds a number of additional WPT test cases validating the new
canonicalization behavior.

The behavior in this test has been discussed in this spec issue:

whatwg/urlpattern#33

Bug: 1141510
Change-Id: I388be5d0cc57b125d44465b283050df5ed0b5321
chromium-wpt-export-bot pushed a commit to web-platform-tests/wpt that referenced this issue Mar 5, 2021
This CL adds an encoding callback to liburlpattern::Parse().  The
parse will invoke the given callback for plaintext parts of the pattern
to validate and encode the characters.  This callback mechanism is then
used to apply the chromium url canonicalization code for each component
pattern.

There are a couple of behaviors in the canonicalizer that do not play
well with this approach that the CL works around:

1. The port canonicalizer will replace an exact default port with the
   empty string.  Since the liburlpattern::Parse() callback is invoked
   for partial values this CL instead implements this canoncilization
   separately before pattern compilation.
2. The URL canonicalizer will prepend a leading `/` character if there
   isn't one.  Again, this behavior does not make sense when operating
   on partial values.  Therefore this CL exposes the internal partial
   path canonicalization routine so that we can use it in URLPattern.

In addition, this CL removes a DCHECK from url's DoPartialPath() that
asserted there was always a character preceding a dot.  The DCHECK has
had a runtime check checking the same behavior since 2013 so it seems
safe to remove the DCHECK.  And in this case we want to be able to
run the canonicalize partial paths that do start with dots.

The CL adds a number of additional WPT test cases validating the new
canonicalization behavior.

The behavior in this test has been discussed in this spec issue:

whatwg/urlpattern#33

Bug: 1141510
Change-Id: I388be5d0cc57b125d44465b283050df5ed0b5321
chromium-wpt-export-bot pushed a commit to web-platform-tests/wpt that referenced this issue Mar 5, 2021
This CL adds an encoding callback to liburlpattern::Parse().  The
parse will invoke the given callback for plaintext parts of the pattern
to validate and encode the characters.  This callback mechanism is then
used to apply the chromium url canonicalization code for each component
pattern.

There are a couple of behaviors in the canonicalizer that do not play
well with this approach that the CL works around:

1. The port canonicalizer will replace an exact default port with the
   empty string.  Since the liburlpattern::Parse() callback is invoked
   for partial values this CL instead implements this canoncilization
   separately before pattern compilation.
2. The URL canonicalizer will prepend a leading `/` character if there
   isn't one.  Again, this behavior does not make sense when operating
   on partial values.  Therefore this CL exposes the internal partial
   path canonicalization routine so that we can use it in URLPattern.

In addition, this CL removes a DCHECK from url's DoPartialPath() that
asserted there was always a character preceding a dot.  The DCHECK has
had a runtime check checking the same behavior since 2013 so it seems
safe to remove the DCHECK.  And in this case we want to be able to
run the canonicalize partial paths that do start with dots.

The CL adds a number of additional WPT test cases validating the new
canonicalization behavior.

The behavior in this test has been discussed in this spec issue:

whatwg/urlpattern#33

Bug: 1141510
Change-Id: I388be5d0cc57b125d44465b283050df5ed0b5321
chromium-wpt-export-bot pushed a commit to web-platform-tests/wpt that referenced this issue Mar 5, 2021
This CL adds an encoding callback to liburlpattern::Parse().  The
parse will invoke the given callback for plaintext parts of the pattern
to validate and encode the characters.  This callback mechanism is then
used to apply the chromium url canonicalization code for each component
pattern.

There are a couple of behaviors in the canonicalizer that do not play
well with this approach that the CL works around:

1. The port canonicalizer will replace an exact default port with the
   empty string.  Since the liburlpattern::Parse() callback is invoked
   for partial values this CL instead implements this canoncilization
   separately before pattern compilation.
2. The URL canonicalizer will prepend a leading `/` character if there
   isn't one.  Again, this behavior does not make sense when operating
   on partial values.  Therefore this CL exposes the internal partial
   path canonicalization routine so that we can use it in URLPattern.

In addition, this CL removes a DCHECK from url's DoPartialPath() that
asserted there was always a character preceding a dot.  The DCHECK has
had a runtime check checking the same behavior since 2013 so it seems
safe to remove the DCHECK.  And in this case we want to be able to
run the canonicalize partial paths that do start with dots.

The CL adds a number of additional WPT test cases validating the new
canonicalization behavior.

The behavior in this test has been discussed in this spec issue:

whatwg/urlpattern#33

Bug: 1141510
Change-Id: I388be5d0cc57b125d44465b283050df5ed0b5321
chromium-wpt-export-bot pushed a commit to web-platform-tests/wpt that referenced this issue Mar 5, 2021
This CL adds an encoding callback to liburlpattern::Parse().  The
parse will invoke the given callback for plaintext parts of the pattern
to validate and encode the characters.  This callback mechanism is then
used to apply the chromium url canonicalization code for each component
pattern.

There are a couple of behaviors in the canonicalizer that do not play
well with this approach that the CL works around:

1. The port canonicalizer will replace an exact default port with the
   empty string.  Since the liburlpattern::Parse() callback is invoked
   for partial values this CL instead implements this canoncilization
   separately before pattern compilation.
2. The URL canonicalizer will prepend a leading `/` character if there
   isn't one.  Again, this behavior does not make sense when operating
   on partial values.  Therefore this CL exposes the internal partial
   path canonicalization routine so that we can use it in URLPattern.

In addition, this CL removes a DCHECK from url's DoPartialPath() that
asserted there was always a character preceding a dot.  The DCHECK has
had a runtime check checking the same behavior since 2013 so it seems
safe to remove the DCHECK.  And in this case we want to be able to
run the canonicalize partial paths that do start with dots.

The CL adds a number of additional WPT test cases validating the new
canonicalization behavior.

The behavior in this test has been discussed in this spec issue:

whatwg/urlpattern#33

Bug: 1141510
Change-Id: I388be5d0cc57b125d44465b283050df5ed0b5321
chromium-wpt-export-bot pushed a commit to web-platform-tests/wpt that referenced this issue Mar 5, 2021
This CL adds an encoding callback to liburlpattern::Parse().  The
parse will invoke the given callback for plaintext parts of the pattern
to validate and encode the characters.  This callback mechanism is then
used to apply the chromium url canonicalization code for each component
pattern.

There are a couple of behaviors in the canonicalizer that do not play
well with this approach that the CL works around:

1. The port canonicalizer will replace an exact default port with the
   empty string.  Since the liburlpattern::Parse() callback is invoked
   for partial values this CL instead implements this canoncilization
   separately before pattern compilation.
2. The URL canonicalizer will prepend a leading `/` character if there
   isn't one.  Again, this behavior does not make sense when operating
   on partial values.  Therefore this CL exposes the internal partial
   path canonicalization routine so that we can use it in URLPattern.

In addition, this CL removes a DCHECK from url's DoPartialPath() that
asserted there was always a character preceding a dot.  The DCHECK has
had a runtime check checking the same behavior since 2013 so it seems
safe to remove the DCHECK.  And in this case we want to be able to
run the canonicalize partial paths that do start with dots.

The CL adds a number of additional WPT test cases validating the new
canonicalization behavior.

The behavior in this test has been discussed in this spec issue:

whatwg/urlpattern#33

Bug: 1141510
Change-Id: I388be5d0cc57b125d44465b283050df5ed0b5321
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/2720702
Reviewed-by: Jeremy Roman <jbroman@chromium.org>
Reviewed-by: Charlie Harrison <csharrison@chromium.org>
Commit-Queue: Ben Kelly <wanderview@chromium.org>
Cr-Commit-Position: refs/heads/master@{#860399}
chromium-wpt-export-bot pushed a commit to web-platform-tests/wpt that referenced this issue Mar 5, 2021
This CL adds an encoding callback to liburlpattern::Parse().  The
parse will invoke the given callback for plaintext parts of the pattern
to validate and encode the characters.  This callback mechanism is then
used to apply the chromium url canonicalization code for each component
pattern.

There are a couple of behaviors in the canonicalizer that do not play
well with this approach that the CL works around:

1. The port canonicalizer will replace an exact default port with the
   empty string.  Since the liburlpattern::Parse() callback is invoked
   for partial values this CL instead implements this canoncilization
   separately before pattern compilation.
2. The URL canonicalizer will prepend a leading `/` character if there
   isn't one.  Again, this behavior does not make sense when operating
   on partial values.  Therefore this CL exposes the internal partial
   path canonicalization routine so that we can use it in URLPattern.

In addition, this CL removes a DCHECK from url's DoPartialPath() that
asserted there was always a character preceding a dot.  The DCHECK has
had a runtime check checking the same behavior since 2013 so it seems
safe to remove the DCHECK.  And in this case we want to be able to
run the canonicalize partial paths that do start with dots.

The CL adds a number of additional WPT test cases validating the new
canonicalization behavior.

The behavior in this test has been discussed in this spec issue:

whatwg/urlpattern#33

Bug: 1141510
Change-Id: I388be5d0cc57b125d44465b283050df5ed0b5321
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/2720702
Reviewed-by: Jeremy Roman <jbroman@chromium.org>
Reviewed-by: Charlie Harrison <csharrison@chromium.org>
Commit-Queue: Ben Kelly <wanderview@chromium.org>
Cr-Commit-Position: refs/heads/master@{#860399}
gecko-dev-updater pushed a commit to marco-c/gecko-dev-wordified-and-comments-removed that referenced this issue Mar 15, 2021
…g., a=testonly

Automatic update from web-platform-tests
URLPattern: Canonicalize pattern encoding.

This CL adds an encoding callback to liburlpattern::Parse().  The
parse will invoke the given callback for plaintext parts of the pattern
to validate and encode the characters.  This callback mechanism is then
used to apply the chromium url canonicalization code for each component
pattern.

There are a couple of behaviors in the canonicalizer that do not play
well with this approach that the CL works around:

1. The port canonicalizer will replace an exact default port with the
   empty string.  Since the liburlpattern::Parse() callback is invoked
   for partial values this CL instead implements this canoncilization
   separately before pattern compilation.
2. The URL canonicalizer will prepend a leading `/` character if there
   isn't one.  Again, this behavior does not make sense when operating
   on partial values.  Therefore this CL exposes the internal partial
   path canonicalization routine so that we can use it in URLPattern.

In addition, this CL removes a DCHECK from url's DoPartialPath() that
asserted there was always a character preceding a dot.  The DCHECK has
had a runtime check checking the same behavior since 2013 so it seems
safe to remove the DCHECK.  And in this case we want to be able to
run the canonicalize partial paths that do start with dots.

The CL adds a number of additional WPT test cases validating the new
canonicalization behavior.

The behavior in this test has been discussed in this spec issue:

whatwg/urlpattern#33

Bug: 1141510
Change-Id: I388be5d0cc57b125d44465b283050df5ed0b5321
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/2720702
Reviewed-by: Jeremy Roman <jbromanchromium.org>
Reviewed-by: Charlie Harrison <csharrisonchromium.org>
Commit-Queue: Ben Kelly <wanderviewchromium.org>
Cr-Commit-Position: refs/heads/master{#860399}

--

wpt-commits: e0754272c378bf4fd9bc5e0d6161dc87fff8f9cf
wpt-pr: 27849

UltraBlame original commit: 550960e2208324a0f4da6725d2cd61cd27c42c69
@annevk
Copy link
Member

annevk commented Mar 16, 2021

I was mainly thinking about differences in query (what gets percent-encoded) and host (uses a different parser). As for path, all URLs do ./.. flattening, but not all handle \ as opposed to / as path separator.

There's also a special URL state if the path does not start with / and the scheme is not special (cannot-be-a-base-URL). That's what javascript: and data: URLs would typically end up in.

@wanderview
Copy link
Member Author

wanderview commented Mar 16, 2021

My intent is to have the different components invoke the corresponding url component canonicalization. So in my prototype, it does search canonicalization for the search component and host canonicalization for the hostname component. So, for example, you get punycode encoding for hostname, but percent encoding in other components.

BTW, you can play around with this in chrome canary right now with experimental web features (1) enabled.

I'll have to look more at what the special URL state does for pathname.

Edit: But canary doesn't yet let you inspect the canonicalized patterns. That should be available in the next couple days. Till then you have to test it via matching against urls that need similar canonicalization. Sorry for forgetting this.

(1): chrome://flags/#enable-experimental-web-platform-features

@annevk
Copy link
Member

annevk commented Mar 16, 2021

I think part of the problem I have is that specification-wise (and in some implementations) there is not really a concept of URL component canonicalization so it's not entirely clear to me what that means.

@wanderview
Copy link
Member Author

wanderview commented Mar 16, 2021

Is there not a state machine for parsing urls where entering a component moves from one state to the next? My impression would be it amounts to entering the state machine with the state set to the start of that component and stopping when you get to the next component's state. But I have not gotten to looking at the spec yet, so maybe I don't fully understand the difficulties.

@annevk
Copy link
Member

annevk commented Mar 16, 2021

Yeah, if you roughly want to use the same entry points as the API that could work, though changes are likely needed as the URL parser assumes it is operating on a URL and not a URLPattern.

@wanderview
Copy link
Member Author

I think I've run into a problem with trying to use the protocol as a switch into different pathname canonicalization rules. It seems the URL spec uses the number of slashes after the :, like ://, to determine if the url is standard or "cannot-be-a-base-url". Pathnames for the "cannot-be-a-base-url" should be canonicalized different according to:

https://url.spec.whatwg.org/#cannot-be-a-base-url-path-state

The problem is when you do new URLPattern({ protocol: 'foo', pathname: 'bar' }) we have no idea whether the author intended this to represent foo://bar or foo:bar. So we don't know which canonicalization routine to use on bar.

One option would be to provide a constructor option to choose; e.g. standard:true. We would automatically set standard:true for things using the special protocol names in the URL spec. We should automatically set it false for known protocols like javascript, data, and blob. Its unclear what we should set the default to for other unknown protocols. I guess standard:false to match the URL spec. (I chose standard:<value> here because cannotBeABaseURL:<value> seems like it would confuse developers.)

(Note, devs can also work around this problem other ways. They could pre-canonicalize their URLs manually. They could also do new URLPattern({ pathname: 'bar', baseURL: 'foo://' }) for standard urls.)

Any other ideas?

@annevk
Copy link
Member

annevk commented Mar 16, 2021

Well, foo://bar means a host of bar, right? So I think it's somewhat clear in that case. It seems likely that there are problematic cases though.

@wanderview
Copy link
Member Author

Ok. But we can't tell the difference between foo:bar and foo://somehost/bar. We just match any hostname since it gets a pattern of * by default.

@annevk
Copy link
Member

annevk commented Mar 16, 2021

I guess you have to decide if host:* includes null or not. If it does not, then host:null might be a way to disambiguate. (Or the default is "any-or-null" and in that case it just won't construct and you have to pick either "any" or "null".)

@wanderview
Copy link
Member Author

We don't distinguish null from empty string. My testing of new URL('foo:bar').hostname is consistent with that.

I guess we could make new URLPattern({ protocol: 'foo', hostname: null, pathname: 'bar' }) change the canonicalization of bar, but I think it might be easier for developers to write standard: false than hostname: null. It avoids having to understand that a null hostname somehow implies canonicalization of other parts of the URL. A standard:<value> switch seems more explicit.

Another advantage of standard:<value> would be that it could be used to override other heuristics. For example, I am thinking of using this kind of heuristic:

  1. If protocol pattern matches one of the standard protocols (http, https, etc) then treat as a standard url.
  2. If protocol pattern matches 'file' then treat as a file url.
  3. Else treat as a cannot-be-a-base-url for canonicalization purposes.

The developer could use standard:false to override rule (1) here if for example a protocol:* pattern is used, but the developer really wants cannot-be-a-base-url behavior. The developer could use standard:true to override (3) if they do want standard behavior.

@wanderview
Copy link
Member Author

I should also note, we already treat empty/missing properties in the constructor init as the * pattern by default.

@wanderview
Copy link
Member Author

wanderview commented Mar 16, 2021

Since I think the main concern is over-canonicalization, maybe we should instead just have a very simple canonicalize:false option instead of standard:false.

Edit: This seems like a big hammer, though.

@annevk
Copy link
Member

annevk commented Mar 17, 2021

We don't distinguish null from empty string.

Not in the getter, but we do in the serialization of the full URL. The API getters do not cover the full URL model.

I don't understand why standard is somehow clear here, that name has nothing to do with the URL model or the existing URL API?

(I could see wanting to offer static methods that end up setting various properties to other values to abstract some of this.)

@domenic
Copy link
Member

domenic commented Mar 17, 2021

+1 for standard being particularly unclear... All of this is covered by standards.

@wanderview
Copy link
Member Author

Well, happy to bikeshed different names for an escape-hatch option if we decide we need one.

I think, though, maybe I will go back to my direction at the top of the thread. The canonicalization in URLPattern should work well for common cases, but can only be best effort because of the nature of the API.

Therefore I'm leaning towards making the heuristic avoid over-canonicalization by default without any escape-hatch option. So like the heuristic in #33 (comment) without the standard option.

@wanderview
Copy link
Member Author

Just some more notes:

  • I'm not planning to do anything for c: type normalization at the front of file URLs since the canonicalization for patterns happens on partial bits of the component. Also, such url would need to be customized by the author anyway because the : needs to be escaped for a pattern.
  • I'm not planning to do any special canonicalization for filesystem: urls.

chromium-wpt-export-bot pushed a commit to web-platform-tests/wpt that referenced this issue Mar 19, 2021
This CL refines the pathname canonicalization routines to account for
differences between "standard" URLs and what chrome calls "path" URLs.
"Path" URLs are referred to as "cannot-be-a-base" URLs in the spec.

To choose the canonicalization routine we look at the protocol pattern
or string.  If the protocol matches a "standard" scheme, then we choose
"standard" canonicalization.  We also ensure that empty string protocol
values default to "standard" canonicalization.  Otherwise we use the
"path" URL canonicalization which is more lenient; e.g. it makes it
easier to write javascript: URL paths, etc.

This CL also exposes two new functions from the url component:

1. GetStandardSchemeList() returns the list of all known standard
schemes.  We need this in order to match them against our protocol
pattern.  For example, `http{s}?` should match as a standard protocol
pattern since it matches both `http` and `https`.  We can't use the
IsStandard() function for this so we need to expose the whole list to
iterate and check against.
2. CanonicalizePathURLPath() exposes the per-component canonicalization
routine for "path" URLs.  Previously only a full url parsing function
was exposed.

Discussed in:

whatwg/urlpattern#33

Bug: 1141510
Change-Id: I3176a36d1e0eb2f8a0ccdf65fde346a4a623f9dd
chromium-wpt-export-bot pushed a commit to web-platform-tests/wpt that referenced this issue Mar 19, 2021
This CL refines the pathname canonicalization routines to account for
differences between "standard" URLs and what chrome calls "path" URLs.
"Path" URLs are referred to as "cannot-be-a-base" URLs in the spec.

To choose the canonicalization routine we look at the protocol pattern
or string.  If the protocol matches a "standard" scheme, then we choose
"standard" canonicalization.  We also ensure that empty string protocol
values default to "standard" canonicalization.  Otherwise we use the
"path" URL canonicalization which is more lenient; e.g. it makes it
easier to write javascript: URL paths, etc.

This CL also exposes two new functions from the url component:

1. GetStandardSchemeList() returns the list of all known standard
schemes.  We need this in order to match them against our protocol
pattern.  For example, `http{s}?` should match as a standard protocol
pattern since it matches both `http` and `https`.  We can't use the
IsStandard() function for this so we need to expose the whole list to
iterate and check against.
2. CanonicalizePathURLPath() exposes the per-component canonicalization
routine for "path" URLs.  Previously only a full url parsing function
was exposed.

Discussed in:

whatwg/urlpattern#33

Bug: 1141510
Change-Id: I3176a36d1e0eb2f8a0ccdf65fde346a4a623f9dd
chromium-wpt-export-bot pushed a commit to web-platform-tests/wpt that referenced this issue Mar 22, 2021
This CL refines the pathname canonicalization routines to account for
differences between "standard" URLs and what chrome calls "path" URLs.
"Path" URLs are referred to as "cannot-be-a-base" URLs in the spec.

To choose the canonicalization routine we look at the protocol pattern
or string.  If the protocol matches a "standard" scheme, then we choose
"standard" canonicalization.  We also ensure that empty string protocol
values default to "standard" canonicalization.  Otherwise we use the
"path" URL canonicalization which is more lenient; e.g. it makes it
easier to write javascript: URL paths, etc.

This CL also exposes two new functions from the url component:

1. GetStandardSchemeList() returns the list of all known standard
schemes.  We need this in order to match them against our protocol
pattern.  For example, `http{s}?` should match as a standard protocol
pattern since it matches both `http` and `https`.  We can't use the
IsStandard() function for this so we need to expose the whole list to
iterate and check against.
2. CanonicalizePathURLPath() exposes the per-component canonicalization
routine for "path" URLs.  Previously only a full url parsing function
was exposed.

Discussed in:

whatwg/urlpattern#33

Bug: 1141510
Change-Id: I3176a36d1e0eb2f8a0ccdf65fde346a4a623f9dd
chromium-wpt-export-bot pushed a commit to web-platform-tests/wpt that referenced this issue Mar 22, 2021
This CL refines the pathname canonicalization routines to account for
differences between "standard" URLs and what chrome calls "path" URLs.
"Path" URLs are referred to as "cannot-be-a-base" URLs in the spec.

To choose the canonicalization routine we look at the protocol pattern
or string.  If the protocol matches a "standard" scheme, then we choose
"standard" canonicalization.  We also ensure that empty string protocol
values default to "standard" canonicalization.  Otherwise we use the
"path" URL canonicalization which is more lenient; e.g. it makes it
easier to write javascript: URL paths, etc.

This CL also exposes two new functions from the url component:

1. GetStandardSchemeList() returns the list of all known standard
schemes.  We need this in order to match them against our protocol
pattern.  For example, `http{s}?` should match as a standard protocol
pattern since it matches both `http` and `https`.  We can't use the
IsStandard() function for this so we need to expose the whole list to
iterate and check against.
2. CanonicalizePathURLPath() exposes the per-component canonicalization
routine for "path" URLs.  Previously only a full url parsing function
was exposed.

Discussed in:

whatwg/urlpattern#33

Bug: 1141510
Change-Id: I3176a36d1e0eb2f8a0ccdf65fde346a4a623f9dd
chromium-wpt-export-bot pushed a commit to web-platform-tests/wpt that referenced this issue Mar 22, 2021
This CL refines the pathname canonicalization routines to account for
differences between "standard" URLs and what chrome calls "path" URLs.
"Path" URLs are referred to as "cannot-be-a-base" URLs in the spec.

To choose the canonicalization routine we look at the protocol pattern
or string.  If the protocol matches a "standard" scheme, then we choose
"standard" canonicalization.  We also ensure that empty string protocol
values default to "standard" canonicalization.  Otherwise we use the
"path" URL canonicalization which is more lenient; e.g. it makes it
easier to write javascript: URL paths, etc.

This CL also exposes two new functions from the url component:

1. GetStandardSchemeList() returns the list of all known standard
schemes.  We need this in order to match them against our protocol
pattern.  For example, `http{s}?` should match as a standard protocol
pattern since it matches both `http` and `https`.  We can't use the
IsStandard() function for this so we need to expose the whole list to
iterate and check against.
2. CanonicalizePathURLPath() exposes the per-component canonicalization
routine for "path" URLs.  Previously only a full url parsing function
was exposed.

Discussed in:

whatwg/urlpattern#33

Bug: 1141510
Change-Id: I3176a36d1e0eb2f8a0ccdf65fde346a4a623f9dd
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/2769358
Commit-Queue: Ben Kelly <wanderview@chromium.org>
Reviewed-by: Charlie Harrison <csharrison@chromium.org>
Reviewed-by: Jeremy Roman <jbroman@chromium.org>
Cr-Commit-Position: refs/heads/master@{#865276}
chromium-wpt-export-bot pushed a commit to web-platform-tests/wpt that referenced this issue Mar 22, 2021
This CL refines the pathname canonicalization routines to account for
differences between "standard" URLs and what chrome calls "path" URLs.
"Path" URLs are referred to as "cannot-be-a-base" URLs in the spec.

To choose the canonicalization routine we look at the protocol pattern
or string.  If the protocol matches a "standard" scheme, then we choose
"standard" canonicalization.  We also ensure that empty string protocol
values default to "standard" canonicalization.  Otherwise we use the
"path" URL canonicalization which is more lenient; e.g. it makes it
easier to write javascript: URL paths, etc.

This CL also exposes two new functions from the url component:

1. GetStandardSchemeList() returns the list of all known standard
schemes.  We need this in order to match them against our protocol
pattern.  For example, `http{s}?` should match as a standard protocol
pattern since it matches both `http` and `https`.  We can't use the
IsStandard() function for this so we need to expose the whole list to
iterate and check against.
2. CanonicalizePathURLPath() exposes the per-component canonicalization
routine for "path" URLs.  Previously only a full url parsing function
was exposed.

Discussed in:

whatwg/urlpattern#33

Bug: 1141510
Change-Id: I3176a36d1e0eb2f8a0ccdf65fde346a4a623f9dd
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/2769358
Commit-Queue: Ben Kelly <wanderview@chromium.org>
Reviewed-by: Charlie Harrison <csharrison@chromium.org>
Reviewed-by: Jeremy Roman <jbroman@chromium.org>
Cr-Commit-Position: refs/heads/master@{#865276}
pull bot pushed a commit to Mu-L/chromium that referenced this issue Mar 22, 2021
This CL refines the pathname canonicalization routines to account for
differences between "standard" URLs and what chrome calls "path" URLs.
"Path" URLs are referred to as "cannot-be-a-base" URLs in the spec.

To choose the canonicalization routine we look at the protocol pattern
or string.  If the protocol matches a "standard" scheme, then we choose
"standard" canonicalization.  We also ensure that empty string protocol
values default to "standard" canonicalization.  Otherwise we use the
"path" URL canonicalization which is more lenient; e.g. it makes it
easier to write javascript: URL paths, etc.

This CL also exposes two new functions from the url component:

1. GetStandardSchemeList() returns the list of all known standard
schemes.  We need this in order to match them against our protocol
pattern.  For example, `http{s}?` should match as a standard protocol
pattern since it matches both `http` and `https`.  We can't use the
IsStandard() function for this so we need to expose the whole list to
iterate and check against.
2. CanonicalizePathURLPath() exposes the per-component canonicalization
routine for "path" URLs.  Previously only a full url parsing function
was exposed.

Discussed in:

whatwg/urlpattern#33

Bug: 1141510
Change-Id: I3176a36d1e0eb2f8a0ccdf65fde346a4a623f9dd
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/2769358
Commit-Queue: Ben Kelly <wanderview@chromium.org>
Reviewed-by: Charlie Harrison <csharrison@chromium.org>
Reviewed-by: Jeremy Roman <jbroman@chromium.org>
Cr-Commit-Position: refs/heads/master@{#865276}
moz-v2v-gh pushed a commit to mozilla/gecko-dev that referenced this issue Mar 27, 2021
…tion based on protocol., a=testonly

Automatic update from web-platform-tests
URLPattern: Adjust pathname canonicalization based on protocol.

This CL refines the pathname canonicalization routines to account for
differences between "standard" URLs and what chrome calls "path" URLs.
"Path" URLs are referred to as "cannot-be-a-base" URLs in the spec.

To choose the canonicalization routine we look at the protocol pattern
or string.  If the protocol matches a "standard" scheme, then we choose
"standard" canonicalization.  We also ensure that empty string protocol
values default to "standard" canonicalization.  Otherwise we use the
"path" URL canonicalization which is more lenient; e.g. it makes it
easier to write javascript: URL paths, etc.

This CL also exposes two new functions from the url component:

1. GetStandardSchemeList() returns the list of all known standard
schemes.  We need this in order to match them against our protocol
pattern.  For example, `http{s}?` should match as a standard protocol
pattern since it matches both `http` and `https`.  We can't use the
IsStandard() function for this so we need to expose the whole list to
iterate and check against.
2. CanonicalizePathURLPath() exposes the per-component canonicalization
routine for "path" URLs.  Previously only a full url parsing function
was exposed.

Discussed in:

whatwg/urlpattern#33

Bug: 1141510
Change-Id: I3176a36d1e0eb2f8a0ccdf65fde346a4a623f9dd
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/2769358
Commit-Queue: Ben Kelly <wanderview@chromium.org>
Reviewed-by: Charlie Harrison <csharrison@chromium.org>
Reviewed-by: Jeremy Roman <jbroman@chromium.org>
Cr-Commit-Position: refs/heads/master@{#865276}

--

wpt-commits: 7508c0db81216e57d0856aaf3ae5d36562313b3c
wpt-pr: 28154
@wanderview
Copy link
Member Author

This is now codified in the spec.

mjfroman pushed a commit to mjfroman/moz-libwebrtc-third-party that referenced this issue Oct 14, 2022
This CL adds an encoding callback to liburlpattern::Parse().  The
parse will invoke the given callback for plaintext parts of the pattern
to validate and encode the characters.  This callback mechanism is then
used to apply the chromium url canonicalization code for each component
pattern.

There are a couple of behaviors in the canonicalizer that do not play
well with this approach that the CL works around:

1. The port canonicalizer will replace an exact default port with the
   empty string.  Since the liburlpattern::Parse() callback is invoked
   for partial values this CL instead implements this canoncilization
   separately before pattern compilation.
2. The URL canonicalizer will prepend a leading `/` character if there
   isn't one.  Again, this behavior does not make sense when operating
   on partial values.  Therefore this CL exposes the internal partial
   path canonicalization routine so that we can use it in URLPattern.

In addition, this CL removes a DCHECK from url's DoPartialPath() that
asserted there was always a character preceding a dot.  The DCHECK has
had a runtime check checking the same behavior since 2013 so it seems
safe to remove the DCHECK.  And in this case we want to be able to
run the canonicalize partial paths that do start with dots.

The CL adds a number of additional WPT test cases validating the new
canonicalization behavior.

The behavior in this test has been discussed in this spec issue:

whatwg/urlpattern#33

Bug: 1141510
Change-Id: I388be5d0cc57b125d44465b283050df5ed0b5321
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/2720702
Reviewed-by: Jeremy Roman <jbroman@chromium.org>
Reviewed-by: Charlie Harrison <csharrison@chromium.org>
Commit-Queue: Ben Kelly <wanderview@chromium.org>
Cr-Commit-Position: refs/heads/master@{#860399}
GitOrigin-RevId: d18c0e82f3c2fcf27830a3412257adae88c5c39f
mjfroman pushed a commit to mjfroman/moz-libwebrtc-third-party that referenced this issue Oct 14, 2022
This CL refines the pathname canonicalization routines to account for
differences between "standard" URLs and what chrome calls "path" URLs.
"Path" URLs are referred to as "cannot-be-a-base" URLs in the spec.

To choose the canonicalization routine we look at the protocol pattern
or string.  If the protocol matches a "standard" scheme, then we choose
"standard" canonicalization.  We also ensure that empty string protocol
values default to "standard" canonicalization.  Otherwise we use the
"path" URL canonicalization which is more lenient; e.g. it makes it
easier to write javascript: URL paths, etc.

This CL also exposes two new functions from the url component:

1. GetStandardSchemeList() returns the list of all known standard
schemes.  We need this in order to match them against our protocol
pattern.  For example, `http{s}?` should match as a standard protocol
pattern since it matches both `http` and `https`.  We can't use the
IsStandard() function for this so we need to expose the whole list to
iterate and check against.
2. CanonicalizePathURLPath() exposes the per-component canonicalization
routine for "path" URLs.  Previously only a full url parsing function
was exposed.

Discussed in:

whatwg/urlpattern#33

Bug: 1141510
Change-Id: I3176a36d1e0eb2f8a0ccdf65fde346a4a623f9dd
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/2769358
Commit-Queue: Ben Kelly <wanderview@chromium.org>
Reviewed-by: Charlie Harrison <csharrison@chromium.org>
Reviewed-by: Jeremy Roman <jbroman@chromium.org>
Cr-Commit-Position: refs/heads/master@{#865276}
GitOrigin-RevId: f8605390fff8380e590880ddd09ec8ed7b3b0ca1
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

No branches or pull requests

4 participants