Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Why does this test succeed with pathname "/foo/bar"? #240

Closed
anonrig opened this issue Dec 27, 2024 · 8 comments · Fixed by #243
Closed

Why does this test succeed with pathname "/foo/bar"? #240

anonrig opened this issue Dec 27, 2024 · 8 comments · Fixed by #243

Comments

@anonrig
Copy link
Contributor

anonrig commented Dec 27, 2024

What is the issue with the URL Pattern Standard?

Referencing a WPT test:

  {
    "pattern": [{ "pathname": "./foo/bar", "baseURL": "https://example.com" }],
    "inputs": [{ "pathname": "foo/bar", "baseURL": "https://example.com" }],
    "exactly_empty_components": [ "port" ],
    "expected_obj": {
      "pathname": "/foo/bar"
    },
    "expected_match": {
      "protocol": { "input": "https", "groups": {}},
      "hostname": { "input": "example.com", "groups": {}},
      "pathname": { "input": "/foo/bar", "groups": {}}
    }
  },

canonicalize_pathname method adds /- and removes /- from the method for several reasons. Ref: https://urlpattern.spec.whatwg.org/#canonicalize-a-pathname

Why does the expected_obj has a pathname of /foo/bar but not ./foo/bar?

cc @annevk @jeremyroman would you mind helping me understand the behavior here?

@jeremyroman
Copy link
Collaborator

jeremyroman commented Jan 2, 2025

Because ./ path components are ignored in URL path parsing, so the pathname is simply the result of resolving foo/bar relative to the / implied in the base URL, which is /foo/bar.

new URL('./foo/bar', 'https://example.com').toString()
// 'https://example.com/foo/bar'

Unless I'm misunderstanding your question?

@anonrig
Copy link
Contributor Author

anonrig commented Jan 2, 2025

@jeremyroman You're right but canonicalize pathname function prepends /- to the input and disables URL path parsing for this particular edge case.

Here's the ref for url pattern pathname canonicalize: https://urlpattern.spec.whatwg.org/#canonicalize-a-pathname

Particularly this hack should be removed:

The URL parser will automatically prepend a leading slash to the canonicalized pathname. This does not work here unfortunately. This algorithm is called for pieces of the pathname, instead of the entire pathname, when used as an encoding callback. Therefore we disable the prepending of the slash by inserting our own. An additional character is also inserted here in order to avoid inadvertantly collapsing a leading dot due to the fake leading slash being interpreted as a "/." sequence. These inserted characters are then removed from the result below.

Using URL parser example, - character is not removed:

> new URL('/-./foo/bar', 'https://example.com').toString()
'https://example.com/-./foo/bar'

@jeremyroman
Copy link
Collaborator

Right, and URLPattern matches that:

> new URLPattern('/-./foo/bar', 'https://example.com').pathname
'/-./foo/bar'

The hack adds a /- component and then removes it again afterward, to certain collapsing behavior when the canonicalization algorithm operates on a single fixed part. For example, consider new URLPattern({pathname: '/foo/:baz{bar}'}) which should match https://example.com/foo/1bar -- if bar were canonicalized to /bar then this would no longer match (but require an additional /). The trick with /- causes it to be internally canonicalized to /-bar and thus remain bar; without it, the URL parser inserts a leading slash.

@anonrig
Copy link
Contributor Author

anonrig commented Jan 3, 2025

I have some time to investigate this:

// Ref: https://github.com/whatwg/urlpattern/issues/240
TEST(wpt_urlpattern_tests, pathname_inconsistencies_test) {
  auto init = ada::url_pattern_init{};
  init.pathname = "./foo/bar";
  init.base_url = "https://example.com";
  auto url = ada::parse_url_pattern(init);
  ASSERT_TRUE(url);
  ASSERT_EQ(url->get_pathname(), "./foo/bar"); // this should be "/foo/bar" according to WPT
  SUCCEED();
}

Inside the URLPattern parser, we call process a pattern init function. Ref https://urlpattern.spec.whatwg.org/#process-a-urlpatterninit

For the given test, process a pattern init function section 17 (init pathname exists), subsection 1, sets the pathname as it is (as "./foo/bar"). Subsection 2 does not get executed because "https://example.com" does not have an opaque host. Hence, result pathname becomes init pathname.

Later it calls process pathname for init function stated in subsection 3: "Set result["pathname"] to the result of process pathname for init given result["pathname"], result["protocol"], and type." This function is defined in https://urlpattern.spec.whatwg.org/#process-pathname-for-init

Since type is "pattern", the pathname doesn't get changed.

Later we see that, canonicalize a pathname method is called: https://urlpattern.spec.whatwg.org/#canonicalize-a-pathname

This method prepends /- to the beginning because the first character is not a /. Referencing bullet point 3 inside canonicalize a pathname method. This results in URL parser to be called with /- prepended to the pathname, making the pathname . character not get removed

> new URL('/-./foo/bar', 'https://example.com').toString()
'https://example.com/-./foo/bar'

PS: @jeremyroman I reached out to you from Elements/Matrix.

@jeremyroman
Copy link
Collaborator

Ah, I see, and then you're saying that step 9 removes the first two characters, but leaves the . intact. Hmm.

@jeremyroman
Copy link
Collaborator

So, in the Chromium implementation we have a CanonicalizePartialPath method here that doesn't seem to have a direct counterpart in the WHATWG URL spec. kenchris/urlpattern-polyfill does a more direct translation of the spec, and I thought it passed the tests, too, though. Let me look at that.

@jeremyroman
Copy link
Collaborator

I think the issue may be in step 17 of process a URLPatternInit, where the spec says "baseURL has an opaque path" but I suspect should read "baseURL does not have an opaque path".

Chromium's implementation is baseURL.IsHierarchical() (which corresponds to this concept, modulo a feature change) and this check appears missing altogether from urlpattern-polyfill.

I think that makes more sense than what it currently says (opaque paths are ones where it is less sensible to try and remove ./ components).

I'm still a little suspicious about how principled what we're doing there is, but perhaps this at least explains the discrepancy.

@sisidovski
Copy link
Collaborator

sisidovski commented Jan 5, 2025

Agree to Jeremy's latest comment.

I think we can throw a TypeError if the baseURL has an opaque path in step 17 of process a URLPatternInit instead of saying "baseURL has an opaque path". This aligns with the current URL spec in https://url.spec.whatwg.org/#no-scheme-state

new URL('/foo', 'git:opaque')
> Uncaught TypeError: Failed to construct 'URL': Invalid URL

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging a pull request may close this issue.

3 participants