URL Normalization #23

blakeembrey · 2020-10-15T05:02:16Z

Some touchy edge-cases with pathname matching that I've come up against that are worth considering are:

Handling of repeated / in URLs
- In some cases, some services handle multiple slashes the same as a single slash (e.g. https://github.com/WICG/////urlpattern resolves just fine)
URL encoding and patch matching
- When you write a pattern, should you write the unicode character or the encoded value? I understand people would like to just write /你好 and have it work instead of writing /%E4%BD%A0%E5%A5%BD.
String normalization
- Covered in the below link and MDN there is a number of ways one can write the same unicode character. It may be worth doing string normalization in some way to avoid this edge case, though it's probably reasonable if you don't since there would be performance implications

Some of the solutions to this is covered in https://github.com/pillarjs/path-to-regexp#process-pathname, but a lot of implementations ignore the edge cases which result in inconsistent handling across users. Both 1 and 3 likely depend on how "exact" matches want to be.

The text was updated successfully, but these errors were encountered:

wanderview · 2020-10-15T15:33:20Z

I've thought briefly about the encoding question. My current plan was to automatically URL encode illegal characters passed to the URLPattern constructor like the URL constructor. If the pattern or input URL has URL encoded characters that are normally legal we wouldn't do any special logic to try to match them.

It sounds like we'll have to give this more though, though. Thanks for filing!

wanderview · 2020-11-18T20:22:30Z

@domenic and @annevk, do you have any opinions here?

Currently I plan to:

Automatically URL encoding when creating the pattern and passing strings to match().
Not perform any automatic decoding when producing matched group values.

I'm not sure what to do about repeated / characters or string normalization. My inclination is not to do anything special for these for now. What do you think?

wanderview · 2020-11-18T20:24:34Z

Note, it doesn't appear the URL spec does anything special for repeated / characters:

https://jsdom.github.io/whatwg-url/#url=aHR0cHM6Ly9naXRodWIuY29tL1dJQ0cvLy8vL3VybHBhdHRlcm4=&base=YWJvdXQ6Ymxhbms=

wanderview · 2020-11-18T20:30:27Z

cc @jeffposnick and @jakearchibald for developer experience input.

annevk · 2020-11-19T08:15:27Z

I suspect some of this might have to be configurable, e.g., whether %61 and a match, but it's probably good enough if there is a way to enable that in the future. Note that 你好 do not constitute illegal characters, they get percent-encoded, but are perfectly valid input. (I do think that for the URL pattern parser we should not support encodings other than UTF-8 for the query, if it ends up supporting a query.)

wanderview · 2020-11-19T15:20:23Z

Right. For input I'm planning to convert to UTF-8 and percent encode automatically like URL does.

wanderview · 2020-11-19T15:24:29Z

I guess I could have URLPattern.match() and URLPattern.test() automatically normalize the input by default. This would call String.normalize(), perform UTF8 and percent encoding, and then remove any duplicate / characters. It would be expensive, but we could offer a normalize:false override option to turn it off.

domenic · 2020-11-19T15:27:51Z

I don't think string.normalize() should be involved... that's not at the URL layer.

wanderview · 2020-11-19T15:31:21Z

I guess I'm more thinking of how can we remove footguns for people. If there are ways to have multiple encodings that produce the same output, it seems matching those consistently with a single pattern would be a nice default. As someone who mainly works in English, though, I don't know how large of a concern this is.

annevk · 2020-11-19T15:56:07Z

I would prefer starting without any normalization (other than what the URL parser does) as that would reduce the expressiveness of URLs. It seems that any such thing should be opt-in and probably be added in later versions once it's more clear what the demand is and what the demand is for.

wanderview · 2020-11-19T16:26:57Z

I guess my concern is that some of these issues may occur outside the developers control. If a user types duplicate slashes in the URL bar, then a FetchEvent.request.url will also include those duplicate slashes. A fetch handler using URLPattern that did not test against this may break unexpectedly.

In these cases, what would we tell the developer to do? Remove duplicate slashes I think. And this advice would probably apply to all fetch handlers since any user could do this.

If we would offer such blanket guidance it seems it might make sense to be the default.

annevk · 2020-11-19T16:48:15Z

It really depends on the server what any given path+query will yield. If slashes would always be de-duplicated, the URL parser should do it as that would be more indicative of them not being significant. (But it doesn't because they are.)

wanderview · 2020-11-19T16:50:57Z

@jeffposnick does workbox do anything special to handle FetchEvents with duplicate slashes in the url?

jeffposnick · 2020-11-19T16:59:47Z

does workbox do anything special to handle FetchEvents with duplicate slashes in the url?

As a convenience in workbox-routing, we automatically do new URL(fetchEvent.request.url) and pass that URL object to the matchCallback, along with the original fetchEvent and fetchEvent.request values. But no, we don't attempt to normalize the URL by removing duplicated slashes or anything else.

domenic · 2020-11-19T17:00:26Z

The following Node.js server exhibits how servers can respond differently to one versus two slashes:

'use strict';
const http = require('http');

const server = http.createServer((req, res) => {
  res.statusCode = 200;
  res.setHeader('Content-Type', 'text/plain');
  
  if (req.url === '/test/') {
    res.end('One slash');
  } else if (req.url === '/test//') {
    res.end('Two slashes');
  } else {
    res.end('Another page');
  }
});

server.listen(8000, 'localhost');

Locally I can trigger different results by going to https://localhost:8000/test/ vs. https://localhost:8000/test//.

However, I tried to put this code into a Glitch, and I cannot get it to exhibit the two-slash result. My guess is that there's some load-balancer or other proxy in front of Glitch servers which normalizes away the two slashes before it reaches the Node.js layer.

(I also did a variant with /test/foo vs. /test//foo, with the same result.)

Overall this feels similar to the trailing slash question. Probably some portion of servers normalize /test and /test/ to the same thing before they reach application-specific server code. A larger portion of servers probably normalize /test/ and /test// to the same thing before they reach application-specific server code. It might be good to provide an option, but IMO the default should not perform such normalization since it's server-specific.

I'm also not sure where that option would be best located: in the URL parser, or in the URL pattern matcher. Or even in a new "URL normalizer" API.

wanderview · 2021-03-12T21:53:57Z

I think this has been mostly handled by #33. We follow URL canonicalization and don't do things like collapsing duplicate / characters or normalizing unicode string representation. I'm going to close this one in favor of #33 for now.

annevk mentioned this issue Nov 19, 2020

should URLPattern be case sensitive or insensitive #28

Closed

wanderview closed this as completed Mar 12, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

URL Normalization #23

URL Normalization #23

blakeembrey commented Oct 15, 2020 •

edited

Loading

wanderview commented Oct 15, 2020

wanderview commented Nov 18, 2020

wanderview commented Nov 18, 2020

wanderview commented Nov 18, 2020

annevk commented Nov 19, 2020

wanderview commented Nov 19, 2020

wanderview commented Nov 19, 2020

domenic commented Nov 19, 2020

wanderview commented Nov 19, 2020

annevk commented Nov 19, 2020

wanderview commented Nov 19, 2020

annevk commented Nov 19, 2020 •

edited

Loading

wanderview commented Nov 19, 2020 •

edited

Loading

jeffposnick commented Nov 19, 2020

domenic commented Nov 19, 2020 •

edited

Loading

wanderview commented Mar 12, 2021

URL Normalization #23

URL Normalization #23

Comments

blakeembrey commented Oct 15, 2020 • edited Loading

wanderview commented Oct 15, 2020

wanderview commented Nov 18, 2020

wanderview commented Nov 18, 2020

wanderview commented Nov 18, 2020

annevk commented Nov 19, 2020

wanderview commented Nov 19, 2020

wanderview commented Nov 19, 2020

domenic commented Nov 19, 2020

wanderview commented Nov 19, 2020

annevk commented Nov 19, 2020

wanderview commented Nov 19, 2020

annevk commented Nov 19, 2020 • edited Loading

wanderview commented Nov 19, 2020 • edited Loading

jeffposnick commented Nov 19, 2020

domenic commented Nov 19, 2020 • edited Loading

wanderview commented Mar 12, 2021

blakeembrey commented Oct 15, 2020 •

edited

Loading

annevk commented Nov 19, 2020 •

edited

Loading

wanderview commented Nov 19, 2020 •

edited

Loading

domenic commented Nov 19, 2020 •

edited

Loading