-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
should we attempt to URL canonicalize patterns? #33
Comments
"no canonicalization" with aggressive throwing seems most straightforward and thereby understandable/document-able/reason-about-able. |
While understandable in a way, it would make the design rather Latin-script centric. I suppose we also have that problem with the |
Well, we could try the URL encoding of characters outside of the custom regexp groups. To better handle special characters like It would still be a bit inconsistent and magical in places, but maybe ergonomic enough? For first implementation, though, I'm going to start with no canonicalization in patterns. |
My motivating scenario is a developer struggling to get a given URLPattern to work but the problem is obscured because of canonicalization and the user assumes the problem is in the pattern match parts of the string rather than the constant parts because they're expecting the pattern matching magic but not the canonicalization magic. Specifically, situations like copying and pasting picking up fancy text styling, for example the pattern prefix "/raw–source" which would canonicalize to "/foo%E2%80%93bar" where "/raw-source" was what was intended. I assume the pre-canonicalized version would never be exposed, so maybe this is fine as long as people don't use the idiom |
Yea, I expect we will expose any canonicalized patterns on the URLPattern either directly or via an object stringifier. |
I realized one bad ergonomic result of not supporting |
I found another canonicalization. If a port number matches the default port for a protocol, then its coerced to the empty string. I'll add this to the list at the top of this issue. This is one we could try to handle for a pattern. If the pattern parses as the default number, then do the same. It basically would only work for number literals and would not work if any pattern syntax is included. |
You can only normalize that if there is a scheme though, right? Otherwise you could only do that during matching it seems. |
Correct. Default ports would only be canonicalized to empty string if the protocol is unambiguously known. So a protocol of exactly "https", etc. I would note this also applies to the input of test() and exec() when using structured input like #27. If you do |
Note, I opened a github "discussion" to collect community feedback on this issue. See #37. |
I have canonicalization prototyped for patterns now. What I settled on was running flat strings from that pattern through the browser's relevant component canonicalization routine. For example, the pattern It is possible for pattern authors to break this normalization by splitting strings up with groupings an what not. For example, Those kinds of corner cases aside, I think this will make it more convenient for pattern authors than having to manually do canonicalization in advance for patterns themselves. I should also mention that "run the url component canonicalization" routine on each part of the pattern is also not quite true. I have a couple of exceptions. For example, I don't apply the part about adding a leading |
Oh, a side note. One advantage of running the canonicalizer on the post-parse output of the pattern is that we can re-use the existing canonicalization routine without worrying about conflicting with special pattern syntax. All that syntax has been parsed out already. Similarly, escaped characters have either been treated specially already or flattened into the pattern string so they will be encoded if necessary; e.g. |
This CL adds an encoding callback to liburlpattern::Parse(). The parse will invoke the given callback for plaintext parts of the pattern to validate and encode the characters. This callback mechanism is then used to apply the chromium url canonicalization code for each component pattern. There are a couple of behaviors in the canonicalizer that do not play well with this approach that the CL works around: 1. The port canonicalizer will replace an exact default port with the empty string. Since the liburlpattern::Parse() callback is invoked for partial values this CL instead implements this canoncilization separately before pattern compilation. 2. The URL canonicalizer will prepend a leading `/` character if there isn't one. Again, this behavior does not make sense when operating on partial values. In addition, URLPattern doesn't want this behavior at all since we often don't know the protocol and some types of URLs do not require the pathname to begin with `/`. This is worked around by simply detecting the addition of the `/` and stripping it off. The CL adds a number of additional WPT test cases validating the new canonicalization behavior. The behavior in this test has been discussed in this spec issue: whatwg/urlpattern#33 Bug: 1141510 Change-Id: I388be5d0cc57b125d44465b283050df5ed0b5321
This CL adds an encoding callback to liburlpattern::Parse(). The parse will invoke the given callback for plaintext parts of the pattern to validate and encode the characters. This callback mechanism is then used to apply the chromium url canonicalization code for each component pattern. There are a couple of behaviors in the canonicalizer that do not play well with this approach that the CL works around: 1. The port canonicalizer will replace an exact default port with the empty string. Since the liburlpattern::Parse() callback is invoked for partial values this CL instead implements this canoncilization separately before pattern compilation. 2. The URL canonicalizer will prepend a leading `/` character if there isn't one. Again, this behavior does not make sense when operating on partial values. In addition, URLPattern doesn't want this behavior at all since we often don't know the protocol and some types of URLs do not require the pathname to begin with `/`. This is worked around by simply detecting the addition of the `/` and stripping it off. The CL adds a number of additional WPT test cases validating the new canonicalization behavior. The behavior in this test has been discussed in this spec issue: whatwg/urlpattern#33 Bug: 1141510 Change-Id: I388be5d0cc57b125d44465b283050df5ed0b5321
This CL adds an encoding callback to liburlpattern::Parse(). The parse will invoke the given callback for plaintext parts of the pattern to validate and encode the characters. This callback mechanism is then used to apply the chromium url canonicalization code for each component pattern. There are a couple of behaviors in the canonicalizer that do not play well with this approach that the CL works around: 1. The port canonicalizer will replace an exact default port with the empty string. Since the liburlpattern::Parse() callback is invoked for partial values this CL instead implements this canoncilization separately before pattern compilation. 2. The URL canonicalizer will prepend a leading `/` character if there isn't one. Again, this behavior does not make sense when operating on partial values. In addition, URLPattern doesn't want this behavior at all since we often don't know the protocol and some types of URLs do not require the pathname to begin with `/`. This is worked around by simply detecting the addition of the `/` and stripping it off. The CL adds a number of additional WPT test cases validating the new canonicalization behavior. The behavior in this test has been discussed in this spec issue: whatwg/urlpattern#33 Bug: 1141510 Change-Id: I388be5d0cc57b125d44465b283050df5ed0b5321
This CL adds an encoding callback to liburlpattern::Parse(). The parse will invoke the given callback for plaintext parts of the pattern to validate and encode the characters. This callback mechanism is then used to apply the chromium url canonicalization code for each component pattern. There are a couple of behaviors in the canonicalizer that do not play well with this approach that the CL works around: 1. The port canonicalizer will replace an exact default port with the empty string. Since the liburlpattern::Parse() callback is invoked for partial values this CL instead implements this canoncilization separately before pattern compilation. 2. The URL canonicalizer will prepend a leading `/` character if there isn't one. Again, this behavior does not make sense when operating on partial values. In addition, URLPattern doesn't want this behavior at all since we often don't know the protocol and some types of URLs do not require the pathname to begin with `/`. This is worked around by simply detecting the addition of the `/` and stripping it off. The CL adds a number of additional WPT test cases validating the new canonicalization behavior. The behavior in this test has been discussed in this spec issue: whatwg/urlpattern#33 Bug: 1141510 Change-Id: I388be5d0cc57b125d44465b283050df5ed0b5321
This CL adds an encoding callback to liburlpattern::Parse(). The parse will invoke the given callback for plaintext parts of the pattern to validate and encode the characters. This callback mechanism is then used to apply the chromium url canonicalization code for each component pattern. There are a couple of behaviors in the canonicalizer that do not play well with this approach that the CL works around: 1. The port canonicalizer will replace an exact default port with the empty string. Since the liburlpattern::Parse() callback is invoked for partial values this CL instead implements this canoncilization separately before pattern compilation. 2. The URL canonicalizer will prepend a leading `/` character if there isn't one. Again, this behavior does not make sense when operating on partial values. In addition, URLPattern doesn't want this behavior at all since we often don't know the protocol and some types of URLs do not require the pathname to begin with `/`. This is worked around by simply detecting the addition of the `/` and stripping it off. The CL adds a number of additional WPT test cases validating the new canonicalization behavior. The behavior in this test has been discussed in this spec issue: whatwg/urlpattern#33 Bug: 1141510 Change-Id: I388be5d0cc57b125d44465b283050df5ed0b5321
This CL adds an encoding callback to liburlpattern::Parse(). The parse will invoke the given callback for plaintext parts of the pattern to validate and encode the characters. This callback mechanism is then used to apply the chromium url canonicalization code for each component pattern. There are a couple of behaviors in the canonicalizer that do not play well with this approach that the CL works around: 1. The port canonicalizer will replace an exact default port with the empty string. Since the liburlpattern::Parse() callback is invoked for partial values this CL instead implements this canoncilization separately before pattern compilation. 2. The URL canonicalizer will prepend a leading `/` character if there isn't one. Again, this behavior does not make sense when operating on partial values. In addition, URLPattern doesn't want this behavior at all since we often don't know the protocol and some types of URLs do not require the pathname to begin with `/`. This is worked around by simply detecting the addition of the `/` and stripping it off. The CL adds a number of additional WPT test cases validating the new canonicalization behavior. The behavior in this test has been discussed in this spec issue: whatwg/urlpattern#33 Bug: 1141510 Change-Id: I388be5d0cc57b125d44465b283050df5ed0b5321
This CL adds an encoding callback to liburlpattern::Parse(). The parse will invoke the given callback for plaintext parts of the pattern to validate and encode the characters. This callback mechanism is then used to apply the chromium url canonicalization code for each component pattern. There are a couple of behaviors in the canonicalizer that do not play well with this approach that the CL works around: 1. The port canonicalizer will replace an exact default port with the empty string. Since the liburlpattern::Parse() callback is invoked for partial values this CL instead implements this canoncilization separately before pattern compilation. 2. The URL canonicalizer will prepend a leading `/` character if there isn't one. Again, this behavior does not make sense when operating on partial values. Therefore this CL exposes the internal partial path canonicalization routine so that we can use it in URLPattern. In addition, this CL removes a DCHECK from url's DoPartialPath() that asserted there was always a character preceeding a dot. The DCHECK has had a runtime check checking the same behavior since 2013 so it seems safe to remove the DCHECK. And in this case we want to be able to run the canonicalize partial paths that do start with dots. The CL adds a number of additional WPT test cases validating the new canonicalization behavior. The behavior in this test has been discussed in this spec issue: whatwg/urlpattern#33 Bug: 1141510 Change-Id: I388be5d0cc57b125d44465b283050df5ed0b5321
This CL adds an encoding callback to liburlpattern::Parse(). The parse will invoke the given callback for plaintext parts of the pattern to validate and encode the characters. This callback mechanism is then used to apply the chromium url canonicalization code for each component pattern. There are a couple of behaviors in the canonicalizer that do not play well with this approach that the CL works around: 1. The port canonicalizer will replace an exact default port with the empty string. Since the liburlpattern::Parse() callback is invoked for partial values this CL instead implements this canoncilization separately before pattern compilation. 2. The URL canonicalizer will prepend a leading `/` character if there isn't one. Again, this behavior does not make sense when operating on partial values. Therefore this CL exposes the internal partial path canonicalization routine so that we can use it in URLPattern. In addition, this CL removes a DCHECK from url's DoPartialPath() that asserted there was always a character preceeding a dot. The DCHECK has had a runtime check checking the same behavior since 2013 so it seems safe to remove the DCHECK. And in this case we want to be able to run the canonicalize partial paths that do start with dots. The CL adds a number of additional WPT test cases validating the new canonicalization behavior. The behavior in this test has been discussed in this spec issue: whatwg/urlpattern#33 Bug: 1141510 Change-Id: I388be5d0cc57b125d44465b283050df5ed0b5321
This CL adds an encoding callback to liburlpattern::Parse(). The parse will invoke the given callback for plaintext parts of the pattern to validate and encode the characters. This callback mechanism is then used to apply the chromium url canonicalization code for each component pattern. There are a couple of behaviors in the canonicalizer that do not play well with this approach that the CL works around: 1. The port canonicalizer will replace an exact default port with the empty string. Since the liburlpattern::Parse() callback is invoked for partial values this CL instead implements this canoncilization separately before pattern compilation. 2. The URL canonicalizer will prepend a leading `/` character if there isn't one. Again, this behavior does not make sense when operating on partial values. Therefore this CL exposes the internal partial path canonicalization routine so that we can use it in URLPattern. In addition, this CL removes a DCHECK from url's DoPartialPath() that asserted there was always a character preceding a dot. The DCHECK has had a runtime check checking the same behavior since 2013 so it seems safe to remove the DCHECK. And in this case we want to be able to run the canonicalize partial paths that do start with dots. The CL adds a number of additional WPT test cases validating the new canonicalization behavior. The behavior in this test has been discussed in this spec issue: whatwg/urlpattern#33 Bug: 1141510 Change-Id: I388be5d0cc57b125d44465b283050df5ed0b5321
This CL adds an encoding callback to liburlpattern::Parse(). The parse will invoke the given callback for plaintext parts of the pattern to validate and encode the characters. This callback mechanism is then used to apply the chromium url canonicalization code for each component pattern. There are a couple of behaviors in the canonicalizer that do not play well with this approach that the CL works around: 1. The port canonicalizer will replace an exact default port with the empty string. Since the liburlpattern::Parse() callback is invoked for partial values this CL instead implements this canoncilization separately before pattern compilation. 2. The URL canonicalizer will prepend a leading `/` character if there isn't one. Again, this behavior does not make sense when operating on partial values. Therefore this CL exposes the internal partial path canonicalization routine so that we can use it in URLPattern. In addition, this CL removes a DCHECK from url's DoPartialPath() that asserted there was always a character preceding a dot. The DCHECK has had a runtime check checking the same behavior since 2013 so it seems safe to remove the DCHECK. And in this case we want to be able to run the canonicalize partial paths that do start with dots. The CL adds a number of additional WPT test cases validating the new canonicalization behavior. The behavior in this test has been discussed in this spec issue: whatwg/urlpattern#33 Bug: 1141510 Change-Id: I388be5d0cc57b125d44465b283050df5ed0b5321
This CL adds an encoding callback to liburlpattern::Parse(). The parse will invoke the given callback for plaintext parts of the pattern to validate and encode the characters. This callback mechanism is then used to apply the chromium url canonicalization code for each component pattern. There are a couple of behaviors in the canonicalizer that do not play well with this approach that the CL works around: 1. The port canonicalizer will replace an exact default port with the empty string. Since the liburlpattern::Parse() callback is invoked for partial values this CL instead implements this canoncilization separately before pattern compilation. 2. The URL canonicalizer will prepend a leading `/` character if there isn't one. Again, this behavior does not make sense when operating on partial values. Therefore this CL exposes the internal partial path canonicalization routine so that we can use it in URLPattern. In addition, this CL removes a DCHECK from url's DoPartialPath() that asserted there was always a character preceding a dot. The DCHECK has had a runtime check checking the same behavior since 2013 so it seems safe to remove the DCHECK. And in this case we want to be able to run the canonicalize partial paths that do start with dots. The CL adds a number of additional WPT test cases validating the new canonicalization behavior. The behavior in this test has been discussed in this spec issue: whatwg/urlpattern#33 Bug: 1141510 Change-Id: I388be5d0cc57b125d44465b283050df5ed0b5321
This CL adds an encoding callback to liburlpattern::Parse(). The parse will invoke the given callback for plaintext parts of the pattern to validate and encode the characters. This callback mechanism is then used to apply the chromium url canonicalization code for each component pattern. There are a couple of behaviors in the canonicalizer that do not play well with this approach that the CL works around: 1. The port canonicalizer will replace an exact default port with the empty string. Since the liburlpattern::Parse() callback is invoked for partial values this CL instead implements this canoncilization separately before pattern compilation. 2. The URL canonicalizer will prepend a leading `/` character if there isn't one. Again, this behavior does not make sense when operating on partial values. Therefore this CL exposes the internal partial path canonicalization routine so that we can use it in URLPattern. In addition, this CL removes a DCHECK from url's DoPartialPath() that asserted there was always a character preceding a dot. The DCHECK has had a runtime check checking the same behavior since 2013 so it seems safe to remove the DCHECK. And in this case we want to be able to run the canonicalize partial paths that do start with dots. The CL adds a number of additional WPT test cases validating the new canonicalization behavior. The behavior in this test has been discussed in this spec issue: whatwg/urlpattern#33 Bug: 1141510 Change-Id: I388be5d0cc57b125d44465b283050df5ed0b5321
This CL adds an encoding callback to liburlpattern::Parse(). The parse will invoke the given callback for plaintext parts of the pattern to validate and encode the characters. This callback mechanism is then used to apply the chromium url canonicalization code for each component pattern. There are a couple of behaviors in the canonicalizer that do not play well with this approach that the CL works around: 1. The port canonicalizer will replace an exact default port with the empty string. Since the liburlpattern::Parse() callback is invoked for partial values this CL instead implements this canoncilization separately before pattern compilation. 2. The URL canonicalizer will prepend a leading `/` character if there isn't one. Again, this behavior does not make sense when operating on partial values. Therefore this CL exposes the internal partial path canonicalization routine so that we can use it in URLPattern. In addition, this CL removes a DCHECK from url's DoPartialPath() that asserted there was always a character preceding a dot. The DCHECK has had a runtime check checking the same behavior since 2013 so it seems safe to remove the DCHECK. And in this case we want to be able to run the canonicalize partial paths that do start with dots. The CL adds a number of additional WPT test cases validating the new canonicalization behavior. The behavior in this test has been discussed in this spec issue: whatwg/urlpattern#33 Bug: 1141510 Change-Id: I388be5d0cc57b125d44465b283050df5ed0b5321
This CL adds an encoding callback to liburlpattern::Parse(). The parse will invoke the given callback for plaintext parts of the pattern to validate and encode the characters. This callback mechanism is then used to apply the chromium url canonicalization code for each component pattern. There are a couple of behaviors in the canonicalizer that do not play well with this approach that the CL works around: 1. The port canonicalizer will replace an exact default port with the empty string. Since the liburlpattern::Parse() callback is invoked for partial values this CL instead implements this canoncilization separately before pattern compilation. 2. The URL canonicalizer will prepend a leading `/` character if there isn't one. Again, this behavior does not make sense when operating on partial values. Therefore this CL exposes the internal partial path canonicalization routine so that we can use it in URLPattern. In addition, this CL removes a DCHECK from url's DoPartialPath() that asserted there was always a character preceding a dot. The DCHECK has had a runtime check checking the same behavior since 2013 so it seems safe to remove the DCHECK. And in this case we want to be able to run the canonicalize partial paths that do start with dots. The CL adds a number of additional WPT test cases validating the new canonicalization behavior. The behavior in this test has been discussed in this spec issue: whatwg/urlpattern#33 Bug: 1141510 Change-Id: I388be5d0cc57b125d44465b283050df5ed0b5321 Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/2720702 Reviewed-by: Jeremy Roman <jbroman@chromium.org> Reviewed-by: Charlie Harrison <csharrison@chromium.org> Commit-Queue: Ben Kelly <wanderview@chromium.org> Cr-Commit-Position: refs/heads/master@{#860399}
This CL adds an encoding callback to liburlpattern::Parse(). The parse will invoke the given callback for plaintext parts of the pattern to validate and encode the characters. This callback mechanism is then used to apply the chromium url canonicalization code for each component pattern. There are a couple of behaviors in the canonicalizer that do not play well with this approach that the CL works around: 1. The port canonicalizer will replace an exact default port with the empty string. Since the liburlpattern::Parse() callback is invoked for partial values this CL instead implements this canoncilization separately before pattern compilation. 2. The URL canonicalizer will prepend a leading `/` character if there isn't one. Again, this behavior does not make sense when operating on partial values. Therefore this CL exposes the internal partial path canonicalization routine so that we can use it in URLPattern. In addition, this CL removes a DCHECK from url's DoPartialPath() that asserted there was always a character preceding a dot. The DCHECK has had a runtime check checking the same behavior since 2013 so it seems safe to remove the DCHECK. And in this case we want to be able to run the canonicalize partial paths that do start with dots. The CL adds a number of additional WPT test cases validating the new canonicalization behavior. The behavior in this test has been discussed in this spec issue: whatwg/urlpattern#33 Bug: 1141510 Change-Id: I388be5d0cc57b125d44465b283050df5ed0b5321 Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/2720702 Reviewed-by: Jeremy Roman <jbroman@chromium.org> Reviewed-by: Charlie Harrison <csharrison@chromium.org> Commit-Queue: Ben Kelly <wanderview@chromium.org> Cr-Commit-Position: refs/heads/master@{#860399}
…g., a=testonly Automatic update from web-platform-tests URLPattern: Canonicalize pattern encoding. This CL adds an encoding callback to liburlpattern::Parse(). The parse will invoke the given callback for plaintext parts of the pattern to validate and encode the characters. This callback mechanism is then used to apply the chromium url canonicalization code for each component pattern. There are a couple of behaviors in the canonicalizer that do not play well with this approach that the CL works around: 1. The port canonicalizer will replace an exact default port with the empty string. Since the liburlpattern::Parse() callback is invoked for partial values this CL instead implements this canoncilization separately before pattern compilation. 2. The URL canonicalizer will prepend a leading `/` character if there isn't one. Again, this behavior does not make sense when operating on partial values. Therefore this CL exposes the internal partial path canonicalization routine so that we can use it in URLPattern. In addition, this CL removes a DCHECK from url's DoPartialPath() that asserted there was always a character preceding a dot. The DCHECK has had a runtime check checking the same behavior since 2013 so it seems safe to remove the DCHECK. And in this case we want to be able to run the canonicalize partial paths that do start with dots. The CL adds a number of additional WPT test cases validating the new canonicalization behavior. The behavior in this test has been discussed in this spec issue: whatwg/urlpattern#33 Bug: 1141510 Change-Id: I388be5d0cc57b125d44465b283050df5ed0b5321 Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/2720702 Reviewed-by: Jeremy Roman <jbromanchromium.org> Reviewed-by: Charlie Harrison <csharrisonchromium.org> Commit-Queue: Ben Kelly <wanderviewchromium.org> Cr-Commit-Position: refs/heads/master{#860399} -- wpt-commits: e0754272c378bf4fd9bc5e0d6161dc87fff8f9cf wpt-pr: 27849 UltraBlame original commit: 550960e2208324a0f4da6725d2cd61cd27c42c69
I was mainly thinking about differences in query (what gets percent-encoded) and host (uses a different parser). As for path, all URLs do There's also a special URL state if the path does not start with |
My intent is to have the different components invoke the corresponding url component canonicalization. So in my prototype, it does search canonicalization for the search component and host canonicalization for the hostname component. So, for example, you get punycode encoding for hostname, but percent encoding in other components. BTW, you can play around with this in chrome canary right now with experimental web features (1) enabled. I'll have to look more at what the special URL state does for pathname. Edit: But canary doesn't yet let you inspect the canonicalized patterns. That should be available in the next couple days. Till then you have to test it via matching against urls that need similar canonicalization. Sorry for forgetting this. (1): chrome://flags/#enable-experimental-web-platform-features |
I think part of the problem I have is that specification-wise (and in some implementations) there is not really a concept of URL component canonicalization so it's not entirely clear to me what that means. |
Is there not a state machine for parsing urls where entering a component moves from one state to the next? My impression would be it amounts to entering the state machine with the state set to the start of that component and stopping when you get to the next component's state. But I have not gotten to looking at the spec yet, so maybe I don't fully understand the difficulties. |
Yeah, if you roughly want to use the same entry points as the API that could work, though changes are likely needed as the URL parser assumes it is operating on a URL and not a URLPattern. |
I think I've run into a problem with trying to use the protocol as a switch into different pathname canonicalization rules. It seems the URL spec uses the number of slashes after the https://url.spec.whatwg.org/#cannot-be-a-base-url-path-state The problem is when you do One option would be to provide a constructor option to choose; e.g. (Note, devs can also work around this problem other ways. They could pre-canonicalize their URLs manually. They could also do Any other ideas? |
Well, |
Ok. But we can't tell the difference between |
I guess you have to decide if host:* includes null or not. If it does not, then host:null might be a way to disambiguate. (Or the default is "any-or-null" and in that case it just won't construct and you have to pick either "any" or "null".) |
We don't distinguish I guess we could make Another advantage of
The developer could use |
I should also note, we already treat empty/missing properties in the constructor init as the |
Since I think the main concern is over-canonicalization, maybe we should instead just have a very simple Edit: This seems like a big hammer, though. |
Not in the getter, but we do in the serialization of the full URL. The API getters do not cover the full URL model. I don't understand why (I could see wanting to offer static methods that end up setting various properties to other values to abstract some of this.) |
+1 for |
Well, happy to bikeshed different names for an escape-hatch option if we decide we need one. I think, though, maybe I will go back to my direction at the top of the thread. The canonicalization in URLPattern should work well for common cases, but can only be best effort because of the nature of the API. Therefore I'm leaning towards making the heuristic avoid over-canonicalization by default without any escape-hatch option. So like the heuristic in #33 (comment) without the standard option. |
Just some more notes:
|
This CL refines the pathname canonicalization routines to account for differences between "standard" URLs and what chrome calls "path" URLs. "Path" URLs are referred to as "cannot-be-a-base" URLs in the spec. To choose the canonicalization routine we look at the protocol pattern or string. If the protocol matches a "standard" scheme, then we choose "standard" canonicalization. We also ensure that empty string protocol values default to "standard" canonicalization. Otherwise we use the "path" URL canonicalization which is more lenient; e.g. it makes it easier to write javascript: URL paths, etc. This CL also exposes two new functions from the url component: 1. GetStandardSchemeList() returns the list of all known standard schemes. We need this in order to match them against our protocol pattern. For example, `http{s}?` should match as a standard protocol pattern since it matches both `http` and `https`. We can't use the IsStandard() function for this so we need to expose the whole list to iterate and check against. 2. CanonicalizePathURLPath() exposes the per-component canonicalization routine for "path" URLs. Previously only a full url parsing function was exposed. Discussed in: whatwg/urlpattern#33 Bug: 1141510 Change-Id: I3176a36d1e0eb2f8a0ccdf65fde346a4a623f9dd
This CL refines the pathname canonicalization routines to account for differences between "standard" URLs and what chrome calls "path" URLs. "Path" URLs are referred to as "cannot-be-a-base" URLs in the spec. To choose the canonicalization routine we look at the protocol pattern or string. If the protocol matches a "standard" scheme, then we choose "standard" canonicalization. We also ensure that empty string protocol values default to "standard" canonicalization. Otherwise we use the "path" URL canonicalization which is more lenient; e.g. it makes it easier to write javascript: URL paths, etc. This CL also exposes two new functions from the url component: 1. GetStandardSchemeList() returns the list of all known standard schemes. We need this in order to match them against our protocol pattern. For example, `http{s}?` should match as a standard protocol pattern since it matches both `http` and `https`. We can't use the IsStandard() function for this so we need to expose the whole list to iterate and check against. 2. CanonicalizePathURLPath() exposes the per-component canonicalization routine for "path" URLs. Previously only a full url parsing function was exposed. Discussed in: whatwg/urlpattern#33 Bug: 1141510 Change-Id: I3176a36d1e0eb2f8a0ccdf65fde346a4a623f9dd
This CL refines the pathname canonicalization routines to account for differences between "standard" URLs and what chrome calls "path" URLs. "Path" URLs are referred to as "cannot-be-a-base" URLs in the spec. To choose the canonicalization routine we look at the protocol pattern or string. If the protocol matches a "standard" scheme, then we choose "standard" canonicalization. We also ensure that empty string protocol values default to "standard" canonicalization. Otherwise we use the "path" URL canonicalization which is more lenient; e.g. it makes it easier to write javascript: URL paths, etc. This CL also exposes two new functions from the url component: 1. GetStandardSchemeList() returns the list of all known standard schemes. We need this in order to match them against our protocol pattern. For example, `http{s}?` should match as a standard protocol pattern since it matches both `http` and `https`. We can't use the IsStandard() function for this so we need to expose the whole list to iterate and check against. 2. CanonicalizePathURLPath() exposes the per-component canonicalization routine for "path" URLs. Previously only a full url parsing function was exposed. Discussed in: whatwg/urlpattern#33 Bug: 1141510 Change-Id: I3176a36d1e0eb2f8a0ccdf65fde346a4a623f9dd
This CL refines the pathname canonicalization routines to account for differences between "standard" URLs and what chrome calls "path" URLs. "Path" URLs are referred to as "cannot-be-a-base" URLs in the spec. To choose the canonicalization routine we look at the protocol pattern or string. If the protocol matches a "standard" scheme, then we choose "standard" canonicalization. We also ensure that empty string protocol values default to "standard" canonicalization. Otherwise we use the "path" URL canonicalization which is more lenient; e.g. it makes it easier to write javascript: URL paths, etc. This CL also exposes two new functions from the url component: 1. GetStandardSchemeList() returns the list of all known standard schemes. We need this in order to match them against our protocol pattern. For example, `http{s}?` should match as a standard protocol pattern since it matches both `http` and `https`. We can't use the IsStandard() function for this so we need to expose the whole list to iterate and check against. 2. CanonicalizePathURLPath() exposes the per-component canonicalization routine for "path" URLs. Previously only a full url parsing function was exposed. Discussed in: whatwg/urlpattern#33 Bug: 1141510 Change-Id: I3176a36d1e0eb2f8a0ccdf65fde346a4a623f9dd
This CL refines the pathname canonicalization routines to account for differences between "standard" URLs and what chrome calls "path" URLs. "Path" URLs are referred to as "cannot-be-a-base" URLs in the spec. To choose the canonicalization routine we look at the protocol pattern or string. If the protocol matches a "standard" scheme, then we choose "standard" canonicalization. We also ensure that empty string protocol values default to "standard" canonicalization. Otherwise we use the "path" URL canonicalization which is more lenient; e.g. it makes it easier to write javascript: URL paths, etc. This CL also exposes two new functions from the url component: 1. GetStandardSchemeList() returns the list of all known standard schemes. We need this in order to match them against our protocol pattern. For example, `http{s}?` should match as a standard protocol pattern since it matches both `http` and `https`. We can't use the IsStandard() function for this so we need to expose the whole list to iterate and check against. 2. CanonicalizePathURLPath() exposes the per-component canonicalization routine for "path" URLs. Previously only a full url parsing function was exposed. Discussed in: whatwg/urlpattern#33 Bug: 1141510 Change-Id: I3176a36d1e0eb2f8a0ccdf65fde346a4a623f9dd Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/2769358 Commit-Queue: Ben Kelly <wanderview@chromium.org> Reviewed-by: Charlie Harrison <csharrison@chromium.org> Reviewed-by: Jeremy Roman <jbroman@chromium.org> Cr-Commit-Position: refs/heads/master@{#865276}
This CL refines the pathname canonicalization routines to account for differences between "standard" URLs and what chrome calls "path" URLs. "Path" URLs are referred to as "cannot-be-a-base" URLs in the spec. To choose the canonicalization routine we look at the protocol pattern or string. If the protocol matches a "standard" scheme, then we choose "standard" canonicalization. We also ensure that empty string protocol values default to "standard" canonicalization. Otherwise we use the "path" URL canonicalization which is more lenient; e.g. it makes it easier to write javascript: URL paths, etc. This CL also exposes two new functions from the url component: 1. GetStandardSchemeList() returns the list of all known standard schemes. We need this in order to match them against our protocol pattern. For example, `http{s}?` should match as a standard protocol pattern since it matches both `http` and `https`. We can't use the IsStandard() function for this so we need to expose the whole list to iterate and check against. 2. CanonicalizePathURLPath() exposes the per-component canonicalization routine for "path" URLs. Previously only a full url parsing function was exposed. Discussed in: whatwg/urlpattern#33 Bug: 1141510 Change-Id: I3176a36d1e0eb2f8a0ccdf65fde346a4a623f9dd Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/2769358 Commit-Queue: Ben Kelly <wanderview@chromium.org> Reviewed-by: Charlie Harrison <csharrison@chromium.org> Reviewed-by: Jeremy Roman <jbroman@chromium.org> Cr-Commit-Position: refs/heads/master@{#865276}
This CL refines the pathname canonicalization routines to account for differences between "standard" URLs and what chrome calls "path" URLs. "Path" URLs are referred to as "cannot-be-a-base" URLs in the spec. To choose the canonicalization routine we look at the protocol pattern or string. If the protocol matches a "standard" scheme, then we choose "standard" canonicalization. We also ensure that empty string protocol values default to "standard" canonicalization. Otherwise we use the "path" URL canonicalization which is more lenient; e.g. it makes it easier to write javascript: URL paths, etc. This CL also exposes two new functions from the url component: 1. GetStandardSchemeList() returns the list of all known standard schemes. We need this in order to match them against our protocol pattern. For example, `http{s}?` should match as a standard protocol pattern since it matches both `http` and `https`. We can't use the IsStandard() function for this so we need to expose the whole list to iterate and check against. 2. CanonicalizePathURLPath() exposes the per-component canonicalization routine for "path" URLs. Previously only a full url parsing function was exposed. Discussed in: whatwg/urlpattern#33 Bug: 1141510 Change-Id: I3176a36d1e0eb2f8a0ccdf65fde346a4a623f9dd Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/2769358 Commit-Queue: Ben Kelly <wanderview@chromium.org> Reviewed-by: Charlie Harrison <csharrison@chromium.org> Reviewed-by: Jeremy Roman <jbroman@chromium.org> Cr-Commit-Position: refs/heads/master@{#865276}
…tion based on protocol., a=testonly Automatic update from web-platform-tests URLPattern: Adjust pathname canonicalization based on protocol. This CL refines the pathname canonicalization routines to account for differences between "standard" URLs and what chrome calls "path" URLs. "Path" URLs are referred to as "cannot-be-a-base" URLs in the spec. To choose the canonicalization routine we look at the protocol pattern or string. If the protocol matches a "standard" scheme, then we choose "standard" canonicalization. We also ensure that empty string protocol values default to "standard" canonicalization. Otherwise we use the "path" URL canonicalization which is more lenient; e.g. it makes it easier to write javascript: URL paths, etc. This CL also exposes two new functions from the url component: 1. GetStandardSchemeList() returns the list of all known standard schemes. We need this in order to match them against our protocol pattern. For example, `http{s}?` should match as a standard protocol pattern since it matches both `http` and `https`. We can't use the IsStandard() function for this so we need to expose the whole list to iterate and check against. 2. CanonicalizePathURLPath() exposes the per-component canonicalization routine for "path" URLs. Previously only a full url parsing function was exposed. Discussed in: whatwg/urlpattern#33 Bug: 1141510 Change-Id: I3176a36d1e0eb2f8a0ccdf65fde346a4a623f9dd Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/2769358 Commit-Queue: Ben Kelly <wanderview@chromium.org> Reviewed-by: Charlie Harrison <csharrison@chromium.org> Reviewed-by: Jeremy Roman <jbroman@chromium.org> Cr-Commit-Position: refs/heads/master@{#865276} -- wpt-commits: 7508c0db81216e57d0856aaf3ae5d36562313b3c wpt-pr: 28154
This is now codified in the spec. |
This CL adds an encoding callback to liburlpattern::Parse(). The parse will invoke the given callback for plaintext parts of the pattern to validate and encode the characters. This callback mechanism is then used to apply the chromium url canonicalization code for each component pattern. There are a couple of behaviors in the canonicalizer that do not play well with this approach that the CL works around: 1. The port canonicalizer will replace an exact default port with the empty string. Since the liburlpattern::Parse() callback is invoked for partial values this CL instead implements this canoncilization separately before pattern compilation. 2. The URL canonicalizer will prepend a leading `/` character if there isn't one. Again, this behavior does not make sense when operating on partial values. Therefore this CL exposes the internal partial path canonicalization routine so that we can use it in URLPattern. In addition, this CL removes a DCHECK from url's DoPartialPath() that asserted there was always a character preceding a dot. The DCHECK has had a runtime check checking the same behavior since 2013 so it seems safe to remove the DCHECK. And in this case we want to be able to run the canonicalize partial paths that do start with dots. The CL adds a number of additional WPT test cases validating the new canonicalization behavior. The behavior in this test has been discussed in this spec issue: whatwg/urlpattern#33 Bug: 1141510 Change-Id: I388be5d0cc57b125d44465b283050df5ed0b5321 Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/2720702 Reviewed-by: Jeremy Roman <jbroman@chromium.org> Reviewed-by: Charlie Harrison <csharrison@chromium.org> Commit-Queue: Ben Kelly <wanderview@chromium.org> Cr-Commit-Position: refs/heads/master@{#860399} GitOrigin-RevId: d18c0e82f3c2fcf27830a3412257adae88c5c39f
This CL refines the pathname canonicalization routines to account for differences between "standard" URLs and what chrome calls "path" URLs. "Path" URLs are referred to as "cannot-be-a-base" URLs in the spec. To choose the canonicalization routine we look at the protocol pattern or string. If the protocol matches a "standard" scheme, then we choose "standard" canonicalization. We also ensure that empty string protocol values default to "standard" canonicalization. Otherwise we use the "path" URL canonicalization which is more lenient; e.g. it makes it easier to write javascript: URL paths, etc. This CL also exposes two new functions from the url component: 1. GetStandardSchemeList() returns the list of all known standard schemes. We need this in order to match them against our protocol pattern. For example, `http{s}?` should match as a standard protocol pattern since it matches both `http` and `https`. We can't use the IsStandard() function for this so we need to expose the whole list to iterate and check against. 2. CanonicalizePathURLPath() exposes the per-component canonicalization routine for "path" URLs. Previously only a full url parsing function was exposed. Discussed in: whatwg/urlpattern#33 Bug: 1141510 Change-Id: I3176a36d1e0eb2f8a0ccdf65fde346a4a623f9dd Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/2769358 Commit-Queue: Ben Kelly <wanderview@chromium.org> Reviewed-by: Charlie Harrison <csharrison@chromium.org> Reviewed-by: Jeremy Roman <jbroman@chromium.org> Cr-Commit-Position: refs/heads/master@{#865276} GitOrigin-RevId: f8605390fff8380e590880ddd09ec8ed7b3b0ca1
The URL() constructor does a number of things to canonicalize the input string. For example:
{
, normally get percent encoded by URL encoding, but URLPattern uses it as one of its special characters./
. The slash is added if missing...
and.
.So the question is can we and should we apply these transformations to component patterns. My original intent was to try to do so, but I have identified a number of problems:
[a-z]
character list it won't work correctly since each character in the percent encoding is considered independently. Similar difficulties arise with the..
and.
flattening since those characters may appear within a regexp. In general we don't want to have to duplicate the regexp parser in URLPattern, so solving this seems quite difficult.Therefore I'm leaning towards not applying any canonicalization or automatic encoding for patterns. For example, if a non-ascii character is included in the pattern we would throw. In contrast, however, URL values passed to
test()
andexec()
would be fully URL canonicalized. Therefore developers would be required to write patterns to match canonical URLs, but we would not fully enforce or automatically help with that at URLPattern construction time.I intend to implement the above to start, but if it becomes a problem we could fall back to the a half measure. The URL canonicalization could be applied to patterns, but only outside of custom regexp groups. If you include a unicode character within a custom regexp then URLPattern would throw, but otherwise would be automatically URL encoded. Of course, pattern special characters like
{
would still need to be exempted from encoding, so it would not be quite equivalent to URL constructor behavior.Generally it will be easier to move from the "no canonicalization" approach to "canonicalize outside of regexp" without breaking existing patterns.
Thoughts?
The text was updated successfully, but these errors were encountered: