should we attempt to URL canonicalize patterns? #33

wanderview · 2020-12-09T16:45:38Z

The URL() constructor does a number of things to canonicalize the input string. For example:

Illegal characters are automatically URL encoded. For most components this is percent encoding, but for hostname it appears to use IDNA encoding, etc.
Some characters, like {, normally get percent encoded by URL encoding, but URLPattern uses it as one of its special characters.
For some protocols (http/https/etc) the pathname is required to consist of at least /. The slash is added if missing.
For some protocols (http/https/etc) the pathname is flattened to remove .. and ..
If the port matches the default port for a given protocol, then it is coerced to the empty string.
Probably many others...

So the question is can we and should we apply these transformations to component patterns. My original intent was to try to do so, but I have identified a number of problems:

Often the transformations are not safe within the custom regexp pattern groupings. For example, if you try to percent encode a unicode character within a regexp [a-z] character list it won't work correctly since each character in the percent encoding is considered independently. Similar difficulties arise with the .. and . flattening since those characters may appear within a regexp. In general we don't want to have to duplicate the regexp parser in URLPattern, so solving this seems quite difficult.
A URLPattern may often not have a fixed protocol. This makes it difficult to apply transformations that are conditional on protocol.
Developers may confused that some characters are URL encoded, but others are not.

Therefore I'm leaning towards not applying any canonicalization or automatic encoding for patterns. For example, if a non-ascii character is included in the pattern we would throw. In contrast, however, URL values passed to test() and exec() would be fully URL canonicalized. Therefore developers would be required to write patterns to match canonical URLs, but we would not fully enforce or automatically help with that at URLPattern construction time.

I intend to implement the above to start, but if it becomes a problem we could fall back to the a half measure. The URL canonicalization could be applied to patterns, but only outside of custom regexp groups. If you include a unicode character within a custom regexp then URLPattern would throw, but otherwise would be automatically URL encoded. Of course, pattern special characters like { would still need to be exempted from encoding, so it would not be quite equivalent to URL constructor behavior.

Generally it will be easier to move from the "no canonicalization" approach to "canonicalize outside of regexp" without breaking existing patterns.

Thoughts?

The text was updated successfully, but these errors were encountered:

asutherland · 2020-12-09T18:03:18Z

"no canonicalization" with aggressive throwing seems most straightforward and thereby understandable/document-able/reason-about-able.

annevk · 2020-12-09T18:45:17Z

While understandable in a way, it would make the design rather Latin-script centric. I suppose we also have that problem with the Access-Control-Allow-Origin header, but generally there are no such constraints when it comes to using URLs.

wanderview · 2020-12-09T19:53:22Z

Well, we could try the URL encoding of characters outside of the custom regexp groups. To better handle special characters like {, we could make an escaped \{ result in a percent encoded {.

It would still be a bit inconsistent and magical in places, but maybe ergonomic enough?

For first implementation, though, I'm going to start with no canonicalization in patterns.

asutherland · 2020-12-09T21:36:07Z

My motivating scenario is a developer struggling to get a given URLPattern to work but the problem is obscured because of canonicalization and the user assumes the problem is in the pattern match parts of the string rather than the constant parts because they're expecting the pattern matching magic but not the canonicalization magic. Specifically, situations like copying and pasting picking up fancy text styling, for example the pattern prefix "/raw–source" which would canonicalize to "/foo%E2%80%93bar" where "/raw-source" was what was intended.

I assume the pre-canonicalized version would never be exposed, so maybe this is fine as long as people don't use the idiom (new URLPattern(...)).test("foo") from the developer console (which would not pretty print the URLPattern and reveal the canonicalization versus using a temporary). Also, I suppose this is exactly the kind of thing where devtools would do well to detect what's going on and active some kind of specialized tool that breaks down exactly what's going on under the hood.

wanderview · 2020-12-09T21:51:42Z

Yea, I expect we will expose any canonicalized patterns on the URLPattern either directly or via an object stringifier.

wanderview · 2020-12-09T21:53:04Z

I realized one bad ergonomic result of not supporting . flattening. Many developers start relative URLs with ./. The dot ultimately gets flattened after the path is appended to the base. If patterns don't work the same way then I expect we will be creating a decent sized footgun.

wanderview · 2020-12-10T22:01:18Z

I found another canonicalization. If a port number matches the default port for a protocol, then its coerced to the empty string. I'll add this to the list at the top of this issue.

This is one we could try to handle for a pattern. If the pattern parses as the default number, then do the same. It basically would only work for number literals and would not work if any pattern syntax is included.

annevk · 2020-12-11T07:16:07Z

You can only normalize that if there is a scheme though, right? Otherwise you could only do that during matching it seems.

wanderview · 2020-12-11T16:18:50Z

Correct. Default ports would only be canonicalized to empty string if the protocol is unambiguously known. So a protocol of exactly "https", etc.

I would note this also applies to the input of test() and exec() when using structured input like #27. If you do p.test({ port: '80' }) the port is not canonicalized to default string because we don't know the protocol. I don't expect setting a port, but not a protocol, to be very common, though.

wanderview · 2020-12-16T22:13:39Z

Note, I opened a github "discussion" to collect community feedback on this issue. See #37.

wanderview · 2021-02-26T22:07:19Z

I have canonicalization prototyped for patterns now. What I settled on was running flat strings from that pattern through the browser's relevant component canonicalization routine.

For example, the pattern { pathname: '/foo/../bar/:name.html' } would canonicalize /foo/../bar, /, and .html separately. So in this example you end up with a pattern that matches a pathname /bar/something.html correctly.

It is possible for pattern authors to break this normalization by splitting strings up with groupings an what not. For example, /foo/{..}/:name.html would prevent the canonicalization routine from flatting the .. directory structure since it doesn't see the /../ sequence. While a bit unfortunate, I think its reasonable to accept this limitation and put this in the "don't do that" category.

Those kinds of corner cases aside, I think this will make it more convenient for pattern authors than having to manually do canonicalization in advance for patterns themselves.

I should also mention that "run the url component canonicalization" routine on each part of the pattern is also not quite true. I have a couple of exceptions. For example, I don't apply the part about adding a leading / that chromium's canonicalizer tries to add. I also don't do the bit for port where it flattens the default port to the empty string. We can't do that on partial parts of the input, but instead of have to apply it to the whole pattern, etc.

wanderview · 2021-02-26T22:10:47Z

Oh, a side note. One advantage of running the canonicalizer on the post-parse output of the pattern is that we can re-use the existing canonicalization routine without worrying about conflicting with special pattern syntax. All that syntax has been parsed out already. Similarly, escaped characters have either been treated specially already or flattened into the pattern string so they will be encoded if necessary; e.g. { is a special character in patterns, but \{ should be encoded as %7B.

This CL adds an encoding callback to liburlpattern::Parse(). The parse will invoke the given callback for plaintext parts of the pattern to validate and encode the characters. This callback mechanism is then used to apply the chromium url canonicalization code for each component pattern. There are a couple of behaviors in the canonicalizer that do not play well with this approach that the CL works around: 1. The port canonicalizer will replace an exact default port with the empty string. Since the liburlpattern::Parse() callback is invoked for partial values this CL instead implements this canoncilization separately before pattern compilation. 2. The URL canonicalizer will prepend a leading `/` character if there isn't one. Again, this behavior does not make sense when operating on partial values. In addition, URLPattern doesn't want this behavior at all since we often don't know the protocol and some types of URLs do not require the pathname to begin with `/`. This is worked around by simply detecting the addition of the `/` and stripping it off. The CL adds a number of additional WPT test cases validating the new canonicalization behavior. The behavior in this test has been discussed in this spec issue: whatwg/urlpattern#33 Bug: 1141510 Change-Id: I388be5d0cc57b125d44465b283050df5ed0b5321

This CL adds an encoding callback to liburlpattern::Parse(). The parse will invoke the given callback for plaintext parts of the pattern to validate and encode the characters. This callback mechanism is then used to apply the chromium url canonicalization code for each component pattern. There are a couple of behaviors in the canonicalizer that do not play well with this approach that the CL works around: 1. The port canonicalizer will replace an exact default port with the empty string. Since the liburlpattern::Parse() callback is invoked for partial values this CL instead implements this canoncilization separately before pattern compilation. 2. The URL canonicalizer will prepend a leading `/` character if there isn't one. Again, this behavior does not make sense when operating on partial values. Therefore this CL exposes the internal partial path canonicalization routine so that we can use it in URLPattern. In addition, this CL removes a DCHECK from url's DoPartialPath() that asserted there was always a character preceeding a dot. The DCHECK has had a runtime check checking the same behavior since 2013 so it seems safe to remove the DCHECK. And in this case we want to be able to run the canonicalize partial paths that do start with dots. The CL adds a number of additional WPT test cases validating the new canonicalization behavior. The behavior in this test has been discussed in this spec issue: whatwg/urlpattern#33 Bug: 1141510 Change-Id: I388be5d0cc57b125d44465b283050df5ed0b5321

This CL adds an encoding callback to liburlpattern::Parse(). The parse will invoke the given callback for plaintext parts of the pattern to validate and encode the characters. This callback mechanism is then used to apply the chromium url canonicalization code for each component pattern. There are a couple of behaviors in the canonicalizer that do not play well with this approach that the CL works around: 1. The port canonicalizer will replace an exact default port with the empty string. Since the liburlpattern::Parse() callback is invoked for partial values this CL instead implements this canoncilization separately before pattern compilation. 2. The URL canonicalizer will prepend a leading `/` character if there isn't one. Again, this behavior does not make sense when operating on partial values. Therefore this CL exposes the internal partial path canonicalization routine so that we can use it in URLPattern. In addition, this CL removes a DCHECK from url's DoPartialPath() that asserted there was always a character preceding a dot. The DCHECK has had a runtime check checking the same behavior since 2013 so it seems safe to remove the DCHECK. And in this case we want to be able to run the canonicalize partial paths that do start with dots. The CL adds a number of additional WPT test cases validating the new canonicalization behavior. The behavior in this test has been discussed in this spec issue: whatwg/urlpattern#33 Bug: 1141510 Change-Id: I388be5d0cc57b125d44465b283050df5ed0b5321

This CL adds an encoding callback to liburlpattern::Parse(). The parse will invoke the given callback for plaintext parts of the pattern to validate and encode the characters. This callback mechanism is then used to apply the chromium url canonicalization code for each component pattern. There are a couple of behaviors in the canonicalizer that do not play well with this approach that the CL works around: 1. The port canonicalizer will replace an exact default port with the empty string. Since the liburlpattern::Parse() callback is invoked for partial values this CL instead implements this canoncilization separately before pattern compilation. 2. The URL canonicalizer will prepend a leading `/` character if there isn't one. Again, this behavior does not make sense when operating on partial values. Therefore this CL exposes the internal partial path canonicalization routine so that we can use it in URLPattern. In addition, this CL removes a DCHECK from url's DoPartialPath() that asserted there was always a character preceding a dot. The DCHECK has had a runtime check checking the same behavior since 2013 so it seems safe to remove the DCHECK. And in this case we want to be able to run the canonicalize partial paths that do start with dots. The CL adds a number of additional WPT test cases validating the new canonicalization behavior. The behavior in this test has been discussed in this spec issue: whatwg/urlpattern#33 Bug: 1141510 Change-Id: I388be5d0cc57b125d44465b283050df5ed0b5321 Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/2720702 Reviewed-by: Jeremy Roman <jbroman@chromium.org> Reviewed-by: Charlie Harrison <csharrison@chromium.org> Commit-Queue: Ben Kelly <wanderview@chromium.org> Cr-Commit-Position: refs/heads/master@{#860399}

…g., a=testonly Automatic update from web-platform-tests URLPattern: Canonicalize pattern encoding. This CL adds an encoding callback to liburlpattern::Parse(). The parse will invoke the given callback for plaintext parts of the pattern to validate and encode the characters. This callback mechanism is then used to apply the chromium url canonicalization code for each component pattern. There are a couple of behaviors in the canonicalizer that do not play well with this approach that the CL works around: 1. The port canonicalizer will replace an exact default port with the empty string. Since the liburlpattern::Parse() callback is invoked for partial values this CL instead implements this canoncilization separately before pattern compilation. 2. The URL canonicalizer will prepend a leading `/` character if there isn't one. Again, this behavior does not make sense when operating on partial values. Therefore this CL exposes the internal partial path canonicalization routine so that we can use it in URLPattern. In addition, this CL removes a DCHECK from url's DoPartialPath() that asserted there was always a character preceding a dot. The DCHECK has had a runtime check checking the same behavior since 2013 so it seems safe to remove the DCHECK. And in this case we want to be able to run the canonicalize partial paths that do start with dots. The CL adds a number of additional WPT test cases validating the new canonicalization behavior. The behavior in this test has been discussed in this spec issue: whatwg/urlpattern#33 Bug: 1141510 Change-Id: I388be5d0cc57b125d44465b283050df5ed0b5321 Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/2720702 Reviewed-by: Jeremy Roman <jbromanchromium.org> Reviewed-by: Charlie Harrison <csharrisonchromium.org> Commit-Queue: Ben Kelly <wanderviewchromium.org> Cr-Commit-Position: refs/heads/master{#860399} -- wpt-commits: e0754272c378bf4fd9bc5e0d6161dc87fff8f9cf wpt-pr: 27849 UltraBlame original commit: 550960e2208324a0f4da6725d2cd61cd27c42c69

annevk · 2021-03-16T07:35:22Z

I was mainly thinking about differences in query (what gets percent-encoded) and host (uses a different parser). As for path, all URLs do ./.. flattening, but not all handle \ as opposed to / as path separator.

There's also a special URL state if the path does not start with / and the scheme is not special (cannot-be-a-base-URL). That's what javascript: and data: URLs would typically end up in.

wanderview · 2021-03-16T14:21:19Z

My intent is to have the different components invoke the corresponding url component canonicalization. So in my prototype, it does search canonicalization for the search component and host canonicalization for the hostname component. So, for example, you get punycode encoding for hostname, but percent encoding in other components.

BTW, you can play around with this in chrome canary right now with experimental web features (1) enabled.

I'll have to look more at what the special URL state does for pathname.

Edit: But canary doesn't yet let you inspect the canonicalized patterns. That should be available in the next couple days. Till then you have to test it via matching against urls that need similar canonicalization. Sorry for forgetting this.

(1): chrome://flags/#enable-experimental-web-platform-features

annevk · 2021-03-16T14:27:15Z

I think part of the problem I have is that specification-wise (and in some implementations) there is not really a concept of URL component canonicalization so it's not entirely clear to me what that means.

wanderview · 2021-03-16T14:33:31Z

Is there not a state machine for parsing urls where entering a component moves from one state to the next? My impression would be it amounts to entering the state machine with the state set to the start of that component and stopping when you get to the next component's state. But I have not gotten to looking at the spec yet, so maybe I don't fully understand the difficulties.

annevk · 2021-03-16T14:58:23Z

Yeah, if you roughly want to use the same entry points as the API that could work, though changes are likely needed as the URL parser assumes it is operating on a URL and not a URLPattern.

wanderview · 2021-03-16T18:29:19Z

I think I've run into a problem with trying to use the protocol as a switch into different pathname canonicalization rules. It seems the URL spec uses the number of slashes after the :, like ://, to determine if the url is standard or "cannot-be-a-base-url". Pathnames for the "cannot-be-a-base-url" should be canonicalized different according to:

https://url.spec.whatwg.org/#cannot-be-a-base-url-path-state

The problem is when you do new URLPattern({ protocol: 'foo', pathname: 'bar' }) we have no idea whether the author intended this to represent foo://bar or foo:bar. So we don't know which canonicalization routine to use on bar.

One option would be to provide a constructor option to choose; e.g. standard:true. We would automatically set standard:true for things using the special protocol names in the URL spec. We should automatically set it false for known protocols like javascript, data, and blob. Its unclear what we should set the default to for other unknown protocols. I guess standard:false to match the URL spec. (I chose standard:<value> here because cannotBeABaseURL:<value> seems like it would confuse developers.)

(Note, devs can also work around this problem other ways. They could pre-canonicalize their URLs manually. They could also do new URLPattern({ pathname: 'bar', baseURL: 'foo://' }) for standard urls.)

Any other ideas?

annevk · 2021-03-16T18:46:34Z

Well, foo://bar means a host of bar, right? So I think it's somewhat clear in that case. It seems likely that there are problematic cases though.

wanderview · 2021-03-16T18:48:57Z

Ok. But we can't tell the difference between foo:bar and foo://somehost/bar. We just match any hostname since it gets a pattern of * by default.

annevk · 2021-03-16T18:52:13Z

I guess you have to decide if host:* includes null or not. If it does not, then host:null might be a way to disambiguate. (Or the default is "any-or-null" and in that case it just won't construct and you have to pick either "any" or "null".)

wanderview · 2021-03-16T19:12:12Z

We don't distinguish null from empty string. My testing of new URL('foo:bar').hostname is consistent with that.

I guess we could make new URLPattern({ protocol: 'foo', hostname: null, pathname: 'bar' }) change the canonicalization of bar, but I think it might be easier for developers to write standard: false than hostname: null. It avoids having to understand that a null hostname somehow implies canonicalization of other parts of the URL. A standard:<value> switch seems more explicit.

Another advantage of standard:<value> would be that it could be used to override other heuristics. For example, I am thinking of using this kind of heuristic:

If protocol pattern matches one of the standard protocols (http, https, etc) then treat as a standard url.
If protocol pattern matches 'file' then treat as a file url.
Else treat as a cannot-be-a-base-url for canonicalization purposes.

The developer could use standard:false to override rule (1) here if for example a protocol:* pattern is used, but the developer really wants cannot-be-a-base-url behavior. The developer could use standard:true to override (3) if they do want standard behavior.

wanderview · 2021-03-16T19:14:28Z

I should also note, we already treat empty/missing properties in the constructor init as the * pattern by default.

wanderview · 2021-03-16T20:45:55Z

Since I think the main concern is over-canonicalization, maybe we should instead just have a very simple canonicalize:false option instead of standard:false.

Edit: This seems like a big hammer, though.

annevk · 2021-03-17T05:50:14Z

We don't distinguish null from empty string.

Not in the getter, but we do in the serialization of the full URL. The API getters do not cover the full URL model.

I don't understand why standard is somehow clear here, that name has nothing to do with the URL model or the existing URL API?

(I could see wanting to offer static methods that end up setting various properties to other values to abstract some of this.)

domenic · 2021-03-17T14:43:30Z

+1 for standard being particularly unclear... All of this is covered by standards.

wanderview · 2021-03-17T15:13:00Z

Well, happy to bikeshed different names for an escape-hatch option if we decide we need one.

I think, though, maybe I will go back to my direction at the top of the thread. The canonicalization in URLPattern should work well for common cases, but can only be best effort because of the nature of the API.

Therefore I'm leaning towards making the heuristic avoid over-canonicalization by default without any escape-hatch option. So like the heuristic in #33 (comment) without the standard option.

wanderview · 2021-03-17T21:10:38Z

Just some more notes:

I'm not planning to do anything for c: type normalization at the front of file URLs since the canonicalization for patterns happens on partial bits of the component. Also, such url would need to be customized by the author anyway because the : needs to be escaped for a pattern.
I'm not planning to do any special canonicalization for filesystem: urls.

This CL refines the pathname canonicalization routines to account for differences between "standard" URLs and what chrome calls "path" URLs. "Path" URLs are referred to as "cannot-be-a-base" URLs in the spec. To choose the canonicalization routine we look at the protocol pattern or string. If the protocol matches a "standard" scheme, then we choose "standard" canonicalization. We also ensure that empty string protocol values default to "standard" canonicalization. Otherwise we use the "path" URL canonicalization which is more lenient; e.g. it makes it easier to write javascript: URL paths, etc. This CL also exposes two new functions from the url component: 1. GetStandardSchemeList() returns the list of all known standard schemes. We need this in order to match them against our protocol pattern. For example, `http{s}?` should match as a standard protocol pattern since it matches both `http` and `https`. We can't use the IsStandard() function for this so we need to expose the whole list to iterate and check against. 2. CanonicalizePathURLPath() exposes the per-component canonicalization routine for "path" URLs. Previously only a full url parsing function was exposed. Discussed in: whatwg/urlpattern#33 Bug: 1141510 Change-Id: I3176a36d1e0eb2f8a0ccdf65fde346a4a623f9dd

This CL refines the pathname canonicalization routines to account for differences between "standard" URLs and what chrome calls "path" URLs. "Path" URLs are referred to as "cannot-be-a-base" URLs in the spec. To choose the canonicalization routine we look at the protocol pattern or string. If the protocol matches a "standard" scheme, then we choose "standard" canonicalization. We also ensure that empty string protocol values default to "standard" canonicalization. Otherwise we use the "path" URL canonicalization which is more lenient; e.g. it makes it easier to write javascript: URL paths, etc. This CL also exposes two new functions from the url component: 1. GetStandardSchemeList() returns the list of all known standard schemes. We need this in order to match them against our protocol pattern. For example, `http{s}?` should match as a standard protocol pattern since it matches both `http` and `https`. We can't use the IsStandard() function for this so we need to expose the whole list to iterate and check against. 2. CanonicalizePathURLPath() exposes the per-component canonicalization routine for "path" URLs. Previously only a full url parsing function was exposed. Discussed in: whatwg/urlpattern#33 Bug: 1141510 Change-Id: I3176a36d1e0eb2f8a0ccdf65fde346a4a623f9dd Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/2769358 Commit-Queue: Ben Kelly <wanderview@chromium.org> Reviewed-by: Charlie Harrison <csharrison@chromium.org> Reviewed-by: Jeremy Roman <jbroman@chromium.org> Cr-Commit-Position: refs/heads/master@{#865276}

This CL refines the pathname canonicalization routines to account for differences between "standard" URLs and what chrome calls "path" URLs. "Path" URLs are referred to as "cannot-be-a-base" URLs in the spec. To choose the canonicalization routine we look at the protocol pattern or string. If the protocol matches a "standard" scheme, then we choose "standard" canonicalization. We also ensure that empty string protocol values default to "standard" canonicalization. Otherwise we use the "path" URL canonicalization which is more lenient; e.g. it makes it easier to write javascript: URL paths, etc. This CL also exposes two new functions from the url component: 1. GetStandardSchemeList() returns the list of all known standard schemes. We need this in order to match them against our protocol pattern. For example, `http{s}?` should match as a standard protocol pattern since it matches both `http` and `https`. We can't use the IsStandard() function for this so we need to expose the whole list to iterate and check against. 2. CanonicalizePathURLPath() exposes the per-component canonicalization routine for "path" URLs. Previously only a full url parsing function was exposed. Discussed in: whatwg/urlpattern#33 Bug: 1141510 Change-Id: I3176a36d1e0eb2f8a0ccdf65fde346a4a623f9dd Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/2769358 Commit-Queue: Ben Kelly <wanderview@chromium.org> Reviewed-by: Charlie Harrison <csharrison@chromium.org> Reviewed-by: Jeremy Roman <jbroman@chromium.org> Cr-Commit-Position: refs/heads/master@{#865276}

…tion based on protocol., a=testonly Automatic update from web-platform-tests URLPattern: Adjust pathname canonicalization based on protocol. This CL refines the pathname canonicalization routines to account for differences between "standard" URLs and what chrome calls "path" URLs. "Path" URLs are referred to as "cannot-be-a-base" URLs in the spec. To choose the canonicalization routine we look at the protocol pattern or string. If the protocol matches a "standard" scheme, then we choose "standard" canonicalization. We also ensure that empty string protocol values default to "standard" canonicalization. Otherwise we use the "path" URL canonicalization which is more lenient; e.g. it makes it easier to write javascript: URL paths, etc. This CL also exposes two new functions from the url component: 1. GetStandardSchemeList() returns the list of all known standard schemes. We need this in order to match them against our protocol pattern. For example, `http{s}?` should match as a standard protocol pattern since it matches both `http` and `https`. We can't use the IsStandard() function for this so we need to expose the whole list to iterate and check against. 2. CanonicalizePathURLPath() exposes the per-component canonicalization routine for "path" URLs. Previously only a full url parsing function was exposed. Discussed in: whatwg/urlpattern#33 Bug: 1141510 Change-Id: I3176a36d1e0eb2f8a0ccdf65fde346a4a623f9dd Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/2769358 Commit-Queue: Ben Kelly <wanderview@chromium.org> Reviewed-by: Charlie Harrison <csharrison@chromium.org> Reviewed-by: Jeremy Roman <jbroman@chromium.org> Cr-Commit-Position: refs/heads/master@{#865276} -- wpt-commits: 7508c0db81216e57d0856aaf3ae5d36562313b3c wpt-pr: 28154

wanderview · 2021-08-12T19:39:00Z

This is now codified in the spec.

This CL adds an encoding callback to liburlpattern::Parse(). The parse will invoke the given callback for plaintext parts of the pattern to validate and encode the characters. This callback mechanism is then used to apply the chromium url canonicalization code for each component pattern. There are a couple of behaviors in the canonicalizer that do not play well with this approach that the CL works around: 1. The port canonicalizer will replace an exact default port with the empty string. Since the liburlpattern::Parse() callback is invoked for partial values this CL instead implements this canoncilization separately before pattern compilation. 2. The URL canonicalizer will prepend a leading `/` character if there isn't one. Again, this behavior does not make sense when operating on partial values. Therefore this CL exposes the internal partial path canonicalization routine so that we can use it in URLPattern. In addition, this CL removes a DCHECK from url's DoPartialPath() that asserted there was always a character preceding a dot. The DCHECK has had a runtime check checking the same behavior since 2013 so it seems safe to remove the DCHECK. And in this case we want to be able to run the canonicalize partial paths that do start with dots. The CL adds a number of additional WPT test cases validating the new canonicalization behavior. The behavior in this test has been discussed in this spec issue: whatwg/urlpattern#33 Bug: 1141510 Change-Id: I388be5d0cc57b125d44465b283050df5ed0b5321 Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/2720702 Reviewed-by: Jeremy Roman <jbroman@chromium.org> Reviewed-by: Charlie Harrison <csharrison@chromium.org> Commit-Queue: Ben Kelly <wanderview@chromium.org> Cr-Commit-Position: refs/heads/master@{#860399} GitOrigin-RevId: d18c0e82f3c2fcf27830a3412257adae88c5c39f

This CL refines the pathname canonicalization routines to account for differences between "standard" URLs and what chrome calls "path" URLs. "Path" URLs are referred to as "cannot-be-a-base" URLs in the spec. To choose the canonicalization routine we look at the protocol pattern or string. If the protocol matches a "standard" scheme, then we choose "standard" canonicalization. We also ensure that empty string protocol values default to "standard" canonicalization. Otherwise we use the "path" URL canonicalization which is more lenient; e.g. it makes it easier to write javascript: URL paths, etc. This CL also exposes two new functions from the url component: 1. GetStandardSchemeList() returns the list of all known standard schemes. We need this in order to match them against our protocol pattern. For example, `http{s}?` should match as a standard protocol pattern since it matches both `http` and `https`. We can't use the IsStandard() function for this so we need to expose the whole list to iterate and check against. 2. CanonicalizePathURLPath() exposes the per-component canonicalization routine for "path" URLs. Previously only a full url parsing function was exposed. Discussed in: whatwg/urlpattern#33 Bug: 1141510 Change-Id: I3176a36d1e0eb2f8a0ccdf65fde346a4a623f9dd Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/2769358 Commit-Queue: Ben Kelly <wanderview@chromium.org> Reviewed-by: Charlie Harrison <csharrison@chromium.org> Reviewed-by: Jeremy Roman <jbroman@chromium.org> Cr-Commit-Position: refs/heads/master@{#865276} GitOrigin-RevId: f8605390fff8380e590880ddd09ec8ed7b3b0ca1

wanderview mentioned this issue Dec 10, 2020

should we make empty string port values and the default port number equivalent? #31

Closed

chromium-wpt-export-bot mentioned this issue Mar 1, 2021

URLPattern: Canonicalize pattern encoding. web-platform-tests/wpt#27849

Merged

wanderview added the decided label Mar 12, 2021

chromium-wpt-export-bot mentioned this issue Mar 19, 2021

URLPattern: Adjust pathname canonicalization based on protocol. web-platform-tests/wpt#28154

Merged

wanderview closed this as completed Aug 12, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

should we attempt to URL canonicalize patterns? #33

should we attempt to URL canonicalize patterns? #33

wanderview commented Dec 9, 2020 •

edited

Loading

asutherland commented Dec 9, 2020

annevk commented Dec 9, 2020

wanderview commented Dec 9, 2020

asutherland commented Dec 9, 2020

wanderview commented Dec 9, 2020

wanderview commented Dec 9, 2020

wanderview commented Dec 10, 2020

annevk commented Dec 11, 2020

wanderview commented Dec 11, 2020

wanderview commented Dec 16, 2020

wanderview commented Feb 26, 2021

wanderview commented Feb 26, 2021

annevk commented Mar 16, 2021 •

edited

Loading

wanderview commented Mar 16, 2021 •

edited

Loading

annevk commented Mar 16, 2021

wanderview commented Mar 16, 2021 •

edited

Loading

annevk commented Mar 16, 2021

wanderview commented Mar 16, 2021

annevk commented Mar 16, 2021

wanderview commented Mar 16, 2021

annevk commented Mar 16, 2021 •

edited

Loading

wanderview commented Mar 16, 2021

wanderview commented Mar 16, 2021

wanderview commented Mar 16, 2021 •

edited

Loading

annevk commented Mar 17, 2021

domenic commented Mar 17, 2021

wanderview commented Mar 17, 2021

wanderview commented Mar 17, 2021

wanderview commented Aug 12, 2021

should we attempt to URL canonicalize patterns? #33

should we attempt to URL canonicalize patterns? #33

Comments

wanderview commented Dec 9, 2020 • edited Loading

asutherland commented Dec 9, 2020

annevk commented Dec 9, 2020

wanderview commented Dec 9, 2020

asutherland commented Dec 9, 2020

wanderview commented Dec 9, 2020

wanderview commented Dec 9, 2020

wanderview commented Dec 10, 2020

annevk commented Dec 11, 2020

wanderview commented Dec 11, 2020

wanderview commented Dec 16, 2020

wanderview commented Feb 26, 2021

wanderview commented Feb 26, 2021

annevk commented Mar 16, 2021 • edited Loading

wanderview commented Mar 16, 2021 • edited Loading

annevk commented Mar 16, 2021

wanderview commented Mar 16, 2021 • edited Loading

annevk commented Mar 16, 2021

wanderview commented Mar 16, 2021

annevk commented Mar 16, 2021

wanderview commented Mar 16, 2021

annevk commented Mar 16, 2021 • edited Loading

wanderview commented Mar 16, 2021

wanderview commented Mar 16, 2021

wanderview commented Mar 16, 2021 • edited Loading

annevk commented Mar 17, 2021

domenic commented Mar 17, 2021

wanderview commented Mar 17, 2021

wanderview commented Mar 17, 2021

wanderview commented Aug 12, 2021

wanderview commented Dec 9, 2020 •

edited

Loading

annevk commented Mar 16, 2021 •

edited

Loading

wanderview commented Mar 16, 2021 •

edited

Loading

wanderview commented Mar 16, 2021 •

edited

Loading

annevk commented Mar 16, 2021 •

edited

Loading

wanderview commented Mar 16, 2021 •

edited

Loading