Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix some WPT url tests failing due to invalid surrogates in UTF8 #3554

Merged
merged 2 commits into from
Feb 27, 2025

Conversation

npaun
Copy link
Member

@npaun npaun commented Feb 14, 2025

  • Adds jsg::USVString, which is guaranteed not to contain invalid characters in UTF8. This type is intended to make it convenient and consistent to implement Web APIs which say they take a USVString without having to deal with this issue every time.
  • Updates url-standard to use USVStrings as specified by the WHATWG URL spec.

@npaun npaun requested review from jasnell and anonrig February 14, 2025 17:32
@npaun npaun requested review from a team as code owners February 14, 2025 17:32
@anonrig anonrig requested a review from kentonv February 14, 2025 17:33
@anonrig
Copy link
Member

anonrig commented Feb 14, 2025

Why can't we change kj::String to use correct utf-8 values? wouldn't it be safer? @kentonv @jasnell

@npaun
Copy link
Member Author

npaun commented Feb 14, 2025

@anonrig Some APIs specify that a value is a DOMString which allows those invalid characters.

@anonrig
Copy link
Member

anonrig commented Feb 14, 2025

@anonrig Some APIs specify that a value is a DOMString which allows those invalid characters.

I'm thinking out loud:

Maybe we can tackle the problem from a different perspective. Rather than allowing invalid characters for all inputs, we can just disallow by default, and introduce DOMString() rather than USVString().

Allowing invalid utf8 values by default can cause security issues.

@kentonv
Copy link
Member

kentonv commented Feb 14, 2025

Why can't we change kj::String to use correct utf-8 values? wouldn't it be safer?

kj::String accepts whatever you give it. It contains an array of chars. By convention it is supposed to contain UTF-8 chars, but it is not kj::String's job to validate this. Validating in kj::String would be a bad idea because:

  1. It would be incredibly slow as kj::String is used everywhere, so presumably any validation checks would happen everywhere.
  2. Over-validation can lead to serious problems. For example, most systems that are based around UTF-16 do not validate surrogate pairs. This includes Java, JavaScript, Windows NT, etc. On Windows in particular, filenames can contain invalid surrogate pairs. If kj::String aggressively validated surrogate pairs, then people would be unable to use the KJ filesystem APIs to list and manipulate filenames on Windows that contain invalid surrogate pairs. This would be bad! In fact, this could be a DoS attack vector: drop a file with a bad name in the right place and break any KJ program trying to look at that directory. Yes that's right: aggressive validation can create security problems; it is not safer.

Copy link
Member

@kentonv kentonv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this is one of those things that nobody cares about except for some pedantic test-writers. Do we really have to do this? What if we just didn't?

@npaun
Copy link
Member Author

npaun commented Feb 14, 2025

@kentonv The main motivation I had for fixing this is that the underlying library we use for URL stuff, ada-url, says the caller is responsible for never passing invalid UTF-8 characters.

I feel like this test is pretty pedantic but it does seem a bit unpleasant that we'd parse a URL differently than a browser or Node.

@kentonv
Copy link
Member

kentonv commented Feb 14, 2025

ada-url, says the caller is responsible for never passing invalid UTF-8 characters.

I think we need to distinguish between the specific case of invalid surrogate pairs, vs. lower-level invalid UTF-8 byte sequences.

As it stands today, JavaScript strings are UTF-16 -- or more accurately WTF-16, in that they may have unpaired surrogates. We convert them to UTF-8 ourselves; we don't accept UTF-8 from the application. So the only way that our UTF-8 might be "invalid" is specifically unpaired surrogates. Does ada-url barf on unpaired surrogates? Or does it just treat them like any regular 16-bit code point? I would expect the latter, in which case there's no real problem here?

@jasnell
Copy link
Member

jasnell commented Feb 15, 2025

... kj::String accepts whatever you give it. It contains an array of chars. By convention it is supposed to contain UTF-8 chars, but it is not kj::String's job to validate this

Couldn't we have a variation of the kj::String construction that performs the validation and replacement? It wouldn't incur the additional costs and would be explicitly opt in.

auto wellformed = kj::usvstr("...");

auto notWellformed = kj::str("...");
auto wellformed = kj::usvstr(kj::mv(notWellformed));

(the "WellFormed" terminology is borrowed by the recent API additions in JS to accomplish the same)

@jasnell
Copy link
Member

jasnell commented Feb 15, 2025

... On Windows in particular, filenames can contain invalid surrogate pairs. If kj::String aggressively validated surrogate pairs, then people would be unable to use the KJ filesystem APIs to list and manipulate filenames on Windows that contain invalid surrogate pairs

Side note... stuff like this is why Node.js' file system apis accept string or Buffer for filenames. Years ago I had a consulting client in Japan who was hitting up against the fact that their filesystem contained files and directory names that used multiple text encodings (since on some OS's file paths are just byte sequences). Node.js originally treated all file paths as UTF8. Sometimes we do have to just let invalid sequences be invalid sequences without validation or replacement

@npaun
Copy link
Member Author

npaun commented Feb 21, 2025

...

@npaun
Copy link
Member Author

npaun commented Feb 24, 2025

Hey guys, here's an update on this PR:

  • Creates USVString and DOMString to allow code to explicitly specify what behaviour it intends.
  • Reverted the behaviour where jsg::Dict<K = USVString> would silently ignore duplicate keys. Instead jsg::Dict is always a list of pairs and the caller must decide what to do in this case.

I propose we merge this PR at this point and follow-on with further PRs.

Here's what's left to discuss and implement:

  • Add a compat flag to make kj::String act as a USVString.
  • Fix the inbound case where V8 replaces a unpaired surrogate with three replacement chars (this might involve fixing stuff on their side).

@npaun npaun requested review from jasnell, anonrig and kentonv February 24, 2025 16:04
@npaun npaun requested a review from jasnell February 26, 2025 23:22
@npaun
Copy link
Member Author

npaun commented Feb 26, 2025

@jasnell It's ready now.

@kentonv kentonv dismissed their stale review February 27, 2025 20:04

I defer to @jasnell to review and decide if we need a compat flag or not.

@jasnell
Copy link
Member

jasnell commented Feb 27, 2025

I don't think a compat flag is necessary in this case.. other places we likely need to be more careful of but given the fairly extensive testing we were already doing with URL I think we're safe on this one. Updating other existing kj::String uses, however, will be a different story.

@npaun npaun closed this Feb 27, 2025
@npaun npaun reopened this Feb 27, 2025
@npaun npaun enabled auto-merge (squash) February 27, 2025 20:54
@npaun npaun merged commit 37ca2aa into main Feb 27, 2025
28 of 30 checks passed
@npaun npaun deleted the npaun/usvstring branch February 27, 2025 21:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants