-
Notifications
You must be signed in to change notification settings - Fork 343
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix some WPT url tests failing due to invalid surrogates in UTF8 #3554
Conversation
npaun
commented
Feb 14, 2025
•
edited
Loading
edited
- Adds jsg::USVString, which is guaranteed not to contain invalid characters in UTF8. This type is intended to make it convenient and consistent to implement Web APIs which say they take a USVString without having to deal with this issue every time.
- Updates url-standard to use USVStrings as specified by the WHATWG URL spec.
@anonrig Some APIs specify that a value is a DOMString which allows those invalid characters. |
I'm thinking out loud: Maybe we can tackle the problem from a different perspective. Rather than allowing invalid characters for all inputs, we can just disallow by default, and introduce DOMString() rather than USVString(). Allowing invalid utf8 values by default can cause security issues. |
kj::String accepts whatever you give it. It contains an array of chars. By convention it is supposed to contain UTF-8 chars, but it is not kj::String's job to validate this. Validating in kj::String would be a bad idea because:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like this is one of those things that nobody cares about except for some pedantic test-writers. Do we really have to do this? What if we just didn't?
@kentonv The main motivation I had for fixing this is that the underlying library we use for URL stuff, ada-url, says the caller is responsible for never passing invalid UTF-8 characters. I feel like this test is pretty pedantic but it does seem a bit unpleasant that we'd parse a URL differently than a browser or Node. |
I think we need to distinguish between the specific case of invalid surrogate pairs, vs. lower-level invalid UTF-8 byte sequences. As it stands today, JavaScript strings are UTF-16 -- or more accurately WTF-16, in that they may have unpaired surrogates. We convert them to UTF-8 ourselves; we don't accept UTF-8 from the application. So the only way that our UTF-8 might be "invalid" is specifically unpaired surrogates. Does ada-url barf on unpaired surrogates? Or does it just treat them like any regular 16-bit code point? I would expect the latter, in which case there's no real problem here? |
Couldn't we have a variation of the
(the "WellFormed" terminology is borrowed by the recent API additions in JS to accomplish the same) |
Side note... stuff like this is why Node.js' file system apis accept string or Buffer for filenames. Years ago I had a consulting client in Japan who was hitting up against the fact that their filesystem contained files and directory names that used multiple text encodings (since on some OS's file paths are just byte sequences). Node.js originally treated all file paths as UTF8. Sometimes we do have to just let invalid sequences be invalid sequences without validation or replacement |
d4b07f6
to
1aae6a6
Compare
... |
1aae6a6
to
9fd4c88
Compare
Hey guys, here's an update on this PR:
I propose we merge this PR at this point and follow-on with further PRs. Here's what's left to discuss and implement:
|
9fd4c88
to
b92f396
Compare
@jasnell It's ready now. |
I defer to @jasnell to review and decide if we need a compat flag or not.
I don't think a compat flag is necessary in this case.. other places we likely need to be more careful of but given the fairly extensive testing we were already doing with URL I think we're safe on this one. Updating other existing |