Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

URL Normalization #23

Closed
blakeembrey opened this issue Oct 15, 2020 · 16 comments
Closed

URL Normalization #23

blakeembrey opened this issue Oct 15, 2020 · 16 comments

Comments

@blakeembrey
Copy link

blakeembrey commented Oct 15, 2020

Some touchy edge-cases with pathname matching that I've come up against that are worth considering are:

  1. Handling of repeated / in URLs
  2. URL encoding and patch matching
    • When you write a pattern, should you write the unicode character or the encoded value? I understand people would like to just write /你好 and have it work instead of writing /%E4%BD%A0%E5%A5%BD.
  3. String normalization
    • Covered in the below link and MDN there is a number of ways one can write the same unicode character. It may be worth doing string normalization in some way to avoid this edge case, though it's probably reasonable if you don't since there would be performance implications

Some of the solutions to this is covered in https://github.com/pillarjs/path-to-regexp#process-pathname, but a lot of implementations ignore the edge cases which result in inconsistent handling across users. Both 1 and 3 likely depend on how "exact" matches want to be.

@wanderview
Copy link
Member

I've thought briefly about the encoding question. My current plan was to automatically URL encode illegal characters passed to the URLPattern constructor like the URL constructor. If the pattern or input URL has URL encoded characters that are normally legal we wouldn't do any special logic to try to match them.

It sounds like we'll have to give this more though, though. Thanks for filing!

@wanderview
Copy link
Member

@domenic and @annevk, do you have any opinions here?

Currently I plan to:

  1. Automatically URL encoding when creating the pattern and passing strings to match().
  2. Not perform any automatic decoding when producing matched group values.

I'm not sure what to do about repeated / characters or string normalization. My inclination is not to do anything special for these for now. What do you think?

@wanderview
Copy link
Member

Note, it doesn't appear the URL spec does anything special for repeated / characters:

https://jsdom.github.io/whatwg-url/#url=aHR0cHM6Ly9naXRodWIuY29tL1dJQ0cvLy8vL3VybHBhdHRlcm4=&base=YWJvdXQ6Ymxhbms=

@wanderview
Copy link
Member

cc @jeffposnick and @jakearchibald for developer experience input.

@annevk
Copy link
Member

annevk commented Nov 19, 2020

I suspect some of this might have to be configurable, e.g., whether %61 and a match, but it's probably good enough if there is a way to enable that in the future. Note that 你好 do not constitute illegal characters, they get percent-encoded, but are perfectly valid input. (I do think that for the URL pattern parser we should not support encodings other than UTF-8 for the query, if it ends up supporting a query.)

@wanderview
Copy link
Member

Right. For input I'm planning to convert to UTF-8 and percent encode automatically like URL does.

@wanderview
Copy link
Member

I guess I could have URLPattern.match() and URLPattern.test() automatically normalize the input by default. This would call String.normalize(), perform UTF8 and percent encoding, and then remove any duplicate / characters. It would be expensive, but we could offer a normalize:false override option to turn it off.

@domenic
Copy link
Member

domenic commented Nov 19, 2020

I don't think string.normalize() should be involved... that's not at the URL layer.

@wanderview
Copy link
Member

I guess I'm more thinking of how can we remove footguns for people. If there are ways to have multiple encodings that produce the same output, it seems matching those consistently with a single pattern would be a nice default. As someone who mainly works in English, though, I don't know how large of a concern this is.

@annevk
Copy link
Member

annevk commented Nov 19, 2020

I would prefer starting without any normalization (other than what the URL parser does) as that would reduce the expressiveness of URLs. It seems that any such thing should be opt-in and probably be added in later versions once it's more clear what the demand is and what the demand is for.

@wanderview
Copy link
Member

I guess my concern is that some of these issues may occur outside the developers control. If a user types duplicate slashes in the URL bar, then a FetchEvent.request.url will also include those duplicate slashes. A fetch handler using URLPattern that did not test against this may break unexpectedly.

In these cases, what would we tell the developer to do? Remove duplicate slashes I think. And this advice would probably apply to all fetch handlers since any user could do this.

If we would offer such blanket guidance it seems it might make sense to be the default.

@annevk
Copy link
Member

annevk commented Nov 19, 2020

It really depends on the server what any given path+query will yield. If slashes would always be de-duplicated, the URL parser should do it as that would be more indicative of them not being significant. (But it doesn't because they are.)

@wanderview
Copy link
Member

wanderview commented Nov 19, 2020

@jeffposnick does workbox do anything special to handle FetchEvents with duplicate slashes in the url?

@jeffposnick
Copy link

does workbox do anything special to handle FetchEvents with duplicate slashes in the url?

As a convenience in workbox-routing, we automatically do new URL(fetchEvent.request.url) and pass that URL object to the matchCallback, along with the original fetchEvent and fetchEvent.request values. But no, we don't attempt to normalize the URL by removing duplicated slashes or anything else.

@domenic
Copy link
Member

domenic commented Nov 19, 2020

The following Node.js server exhibits how servers can respond differently to one versus two slashes:

'use strict';
const http = require('http');

const server = http.createServer((req, res) => {
  res.statusCode = 200;
  res.setHeader('Content-Type', 'text/plain');
  
  if (req.url === '/test/') {
    res.end('One slash');
  } else if (req.url === '/test//') {
    res.end('Two slashes');
  } else {
    res.end('Another page');
  }
});

server.listen(8000, 'localhost');

Locally I can trigger different results by going to https://localhost:8000/test/ vs. https://localhost:8000/test//.

However, I tried to put this code into a Glitch, and I cannot get it to exhibit the two-slash result. My guess is that there's some load-balancer or other proxy in front of Glitch servers which normalizes away the two slashes before it reaches the Node.js layer.

(I also did a variant with /test/foo vs. /test//foo, with the same result.)

Overall this feels similar to the trailing slash question. Probably some portion of servers normalize /test and /test/ to the same thing before they reach application-specific server code. A larger portion of servers probably normalize /test/ and /test// to the same thing before they reach application-specific server code. It might be good to provide an option, but IMO the default should not perform such normalization since it's server-specific.

I'm also not sure where that option would be best located: in the URL parser, or in the URL pattern matcher. Or even in a new "URL normalizer" API.

@wanderview
Copy link
Member

I think this has been mostly handled by #33. We follow URL canonicalization and don't do things like collapsing duplicate / characters or normalizing unicode string representation. I'm going to close this one in favor of #33 for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

5 participants