Need an "unreserved" character set (and better define how to percent-encode arbitrary strings) #369
This decoding can cause a reparse problem, see #87 (comment):
It's problematic because … alone
Update: Someone else coincidentally filed the registerProtocolHandler issue at whatwg/html#3377.
I see. This issue is more complex than I thought (mostly because of the nested-escape issue).
Yeah, we can't do that. It would break registerProtocolHandler, which is based on the (IMHO rather flimsy) assumption that "%s" parses as "%s" (despite a validation error). I was encouraged to rely on the same mechanism in w3c/web-share-target#31.

Having said that, I think we can solve it basically the way that Chrome solved it. Having proper equivalence defined for URLs is kind of important. I don't think we should jettison that concept because there is an edge case that causes trouble.

Here's what I'm proposing: a "decodable percent sequence" is a "%" followed by two hex digits representing a byte in the unreserved set.
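To make the proposal concrete, here is a rough sketch of the kind of normalization it implies (my own illustration, not the proposal text; it assumes the RFC 3986 unreserved set, though the exact set is debated below): decode only "decodable percent sequences" and uppercase the hex digits of anything that stays encoded.

```js
// Sketch: normalize percent-encoding in a URL component. Only %XX sequences
// whose byte is in the unreserved set are decoded; everything else keeps its
// percent-encoding, with the hex digits uppercased.
// Assumed unreserved set here: ASCII alphanumerics plus -._~ (RFC 3986).
function normalizePercentSequences(component) {
  return component.replace(/%([0-9A-Fa-f]{2})/g, (match, hex) => {
    const byte = parseInt(hex, 16);
    const ch = String.fromCharCode(byte);
    return /[A-Za-z0-9\-._~]/.test(ch) ? ch : '%' + hex.toUpperCase();
  });
}

// normalizePercentSequences('%4D%61%74%74') → 'Matt'
// normalizePercentSequences('%3d')          → '%3D'
// normalizePercentSequences('%2541')        → '%2541' (nested escape stays intact)
```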
Test cases:
I think that covers it, including nested cases. What we gain from this is that we can define a set of characters that must be considered equivalent. Let me further justify why we need this.
I agree that servers (let's call them "URL processors" -- any application that breaks down a URL and uses its pieces, whether mapping it onto a file system, or otherwise) should be free to treat certain characters, such as '$', as equivalent to their encoded counterparts, or not, as they wish. What we're missing is a mandate that URL processors must treat other characters, such as 'a', as equivalent to their encoded counterparts. Let's call these two character sets "reserved" and "unreserved". Encoding or decoding a reserved character may or may not change the meaning of the URL (depending on the processor). Encoding or decoding an unreserved character does not change the meaning of the URL. These sets impact rendering and encoding as follows:
The current status quo is essentially that the "unreserved" set is the null set. This means that:
Putting those two together, a URL with "name=%4D%61%74%74" has to be rendered as "name=%4D%61%74%74", so all URLs are ugly and impossible for a human to read.

Now you may be saying: "Come on, don't be so pedantic. No URL processor is going to treat 'a' differently to '%61', so surely we don't need to encode it!" OK, but how do I choose which characters need to be encoded and which don't? How do I know which characters will be treated equivalently to their encoded versions, and which won't? I have no more faith that a URL processor will consider "a" and "%61" equivalent than I do for "=" and "%3D". In order to know what characters need to be encoded, the URL specification needs to explicitly state which characters a URL processor is allowed to treat specially (the reserved set) and which it isn't (the unreserved set).

Corollary: if we don't define an unreserved set, we still need to define some set of characters that registerProtocolHandler (and Web Share Target) should encode. What is that set? It can't be any of the existing percent-encode sets, since they don't encode enough characters. Sure, we could throw in a few more characters like '&' and '=', but what is the right answer to "which characters need to be encoded to ensure the characters in the string aren't treated specially by the URL processor?" My search for an answer to this question led me to the conclusion that we need to bring back the RFC 3986 concept of an unreserved set.
Slightly beside the point of this issue, but here goes anyhow: is there any good reason to even allow %-encoding of ASCII alphanumerics? Is there actually enough legitimate usage, or an otherwise-impossible scenario reliant on this feature, to justify it? It seems to me like it primarily allows naïve filters to be bypassed, similar to overlong UTF-8 encodings -- which are thankfully banned on the web for reasons of security. Is there any reason we cannot likewise ban these?
Right. I don't see any reason to not normalize in the query and fragment as well, and update Chrome to match. Theoretically, it shouldn't matter whether any particular URL processor normalizes "%6D" as "m" or not, because "%6D" should be considered equivalent to "m". The only problem is a technicality that equivalence is defined by the URL parser, so we need to specify that "%6D" decodes to "m" in the parser, otherwise it isn't considered equivalent.
I'll go from my most pragmatic concern to most ideological:
If you're going to go down this path, I'd want other unreserved characters (like '_' and '~') treated the same. Otherwise you create three classes of character: reserved, unreserved and non-encodable, with the same problem just for a smaller set of characters.
I can't speak to whether this would mitigate any realistic security problems. My feeling is that it's been legal to encode unreserved characters for 20 years, and making it illegal now would break countless pieces of software -- especially since different encoders encode slightly different sets of characters (e.g., Python's urllib.parse.quote encodes '~', even though it's in the unreserved set, so if we made "%7E" illegal, URLs generated by Python would become illegal). (Also note that the UTF-8 standard itself banned overlong encodings from 2003 onwards; this isn't a web-specific restriction.)
@bsittler I strongly suspect it's not web-compatible, but I welcome Chrome demonstrating otherwise. @mgiuca there are many things not entirely logical on the web, but if you can solve this particular one I'm not opposed. As I said earlier, more normalization seems nice if we can get away with it. @valenting @achristensen07 @travisleithead would you be okay with more aggressive normalization of percent-encoded bits in URLs if Chrome (or some other entity) can demonstrate it's feasible?
Cool, I'll write a draft change but I won't polish it too much since this is still being debated. If we decide not to change normalization, I still think we need to solve the other two issues: rendering and encoding arbitrary strings. Rendering needs to state a certain set of characters to be decoded. Encoding needs to state a certain set of characters to be encoded (and, for example, registerProtocolHandler would use that set).
My personal take is that we should work to solve these issues in order of most practical to most theoretical, with corresponding urgency. Here I am using @mgiuca's enumeration in #369 (comment). So I would suggest:
(3) is the only area where I think we would make normative changes to URL parsing, based on the principle that URLs should be equivalent if and only if they parse the same. (Which IMO is a very good principle.) @mgiuca, @annevk, does this make any sense as an approach? Although I suppose @annevk has already pinged the other implementers for their take on (3), so maybe we're just going straight for solving everything at once :) In general I appreciate @mgiuca's thinking about how this standard applies in a larger context, and think we should definitely work on incorporating such suggestions.
@domenic Yes, that sounds like the right order of steps. (1) unblocks Web Share Target and fixes registerProtocolHandler. That's the main practical issue I'm trying to solve. (3) doesn't really have a pressing issue that I'm trying to solve, but I think it's nice to fix it. Edit: Although having said that, I'd like to come to an agreement on what the "unreserved" set would ultimately be in (3) (e.g., is it from RFC 2396, 3986, or some other set?), because that will inform the encode set in (1).
3986 seems safest. I just realized, though, that the problems you allude to will continue to exist for non-ASCII data, which is why I think I gave up on pursuing something grander here, since the producer and consumer will need to have some agreement at some level anyway.
The only problem is (as I said in my initial "essay") that many "arbitrary string encoders" (including encodeURIComponent and both Chrome's and Firefox's implementations of registerProtocolHandler) use the RFC 2396 set. I think we can work with either set, since the delta between them (`!'()*`) …

(Note that the default-encode set should include all reserved characters, but it's OK for it to be a superset of the reserved characters, and thus unnecessarily encode some unreserved characters. So I think it's safer actually to have a larger unreserved set from 2396.)
Actually, non-ASCII data is a non-issue. Both the current URL Standard and RFC 3987 treat any non-ASCII character equivalently to its encoded form (by virtue of normalizing it to percent-encoded form). The same is true of all characters in the C0 control set, which are normalized to encoded form. Any character that is normalized either to encoded or non-encoded form does not trigger any of the above issues. It doesn't matter if such a character is rendered encoded or non-encoded, because it has the same meaning. It doesn't matter if such a character is encoded by an "arbitrary string encoder" or not, because it has the same meaning. So as far as I can tell, this whole issue revolves around ASCII characters outside of the C0 control set, which are not normalized one way or the other.
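A quick illustration with the URL API (both forms serialize identically, because the parser always percent-encodes non-ASCII code points in the path):

```js
// Non-ASCII code points are always re-encoded by the parser, so both of these
// URL records serialize to the same thing:
new URL('https://example.com/é').pathname;      // "/%C3%A9"
new URL('https://example.com/%C3%A9').pathname; // "/%C3%A9"
```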
Well, it matters in the same sense, I think, as you don't know whether …

As for what the unreserved set should be, if you make it larger, don't you encroach on what 3986 considers reserved and for which changing the encoding would change the meaning? Aligning with JavaScript's notion seems nice too though, in a way. I guess I don't feel strongly.
While RFC 3986 (from 2005) has tried to re-reserve `!'()*` …
I agree with @LEW21. By making the reserved set match RFC 2396, we would be undoing the change of the newer RFC, but I believe that would align with the WHATWG mission of describing things as they are, not how they "should be". Most software that I've seen matches RFC 2396, including common programming language libraries. (Of course, there will be software that follows RFC 3986 also. It's a tough decision.)
Just trying a couple, JS seems very much in the minority here...

Python:

```python
>>> urllib.parse.quote("!'()*")
'%21%27%28%29%2A'
```

PHP:

```php
urlencode("!'()*");
=> "%21%27%28%29%2A"
```

Perl:

```perl
use URI::Escape;
print(uri_escape("!'()*"));
# %21%27%28%29%2A
```

Ruby:

```ruby
require "erb"
include ERB::Util
puts url_encode("!'()*")
# %21%27%28%29%2A
```

Go:

```go
package main

import (
	"fmt"
	"net/url"
)

func main() {
	fmt.Println(url.PathEscape("!'()*"))
	// Output: %21%27%28%29%2A
}
```

Java:

```java
import java.net.URLEncoder;

String e = URLEncoder.encode("!'()*", "UTF-8");
System.out.println(e);
// Output: %21%27%28%29*
```
Hmm, I didn't know all of those @mnot -- I thought Java at least was based on the old standard. It looks like Java is based on x-www-form-urlencoded, which is a different set again (it doesn't encode '*', as the output above shows).

The thing is, though, that it's safer to leave characters in the unreserved set, as long as they aren't used for any syntax. That way, encoders are free to encode them, or not, as they choose, without changing the semantics. If we chose … If we choose … That's why I suggested a possibly even wider set: `!$'()*,-.;_~`.
Python does not match any RFC - it encodes `~` but not `/`:

```python
quote('-._~')  # -._%7E
quote('/')     # /
```

I want to make a merge request to Python after this spec decides on what to do - but no idea if they will agree to change the behavior, or to add a new function.

PHP encodes both `~` and `/`:

```php
echo urlencode('-._~');  # -._%7E
echo urlencode('/');     # %2F
```

Perl:

```perl
use URI::Escape;
print(uri_escape("-._~"));  # -._~
print(uri_escape("/"));     # %2F
```

Ruby:

```ruby
require "erb"
include ERB::Util
puts url_encode("-._~")  # -._~
puts url_encode("/")     # %2F
```

Go:

```go
package main

import (
	"fmt"
	"net/url"
)

func main() {
	fmt.Println(url.PathEscape("-._~"))  // -._~
	fmt.Println(url.PathEscape("/"))     // %2F
}
```

Java encodes both `~` and `/`:

```java
import java.net.URLEncoder;
import java.io.UnsupportedEncodingException;

public class HelloWorld {
    public static void main(String[] args) {
        try {
            System.out.println(URLEncoder.encode("-._~", "UTF-8"));  // -._%7E
            System.out.println(URLEncoder.encode("/", "UTF-8"));     // %2F
        } catch (UnsupportedEncodingException e) {}
    }
}
```
In whatwg/html#3377 (comment) I have attempted to summarize what browsers do for registerProtocolHandler(). If after that folks still want to pursue more drastic URL parser changes, I suggest we do that in a new issue and keep this one focused on the external string/templating case and the precedent for that in implementations.
I think that a proposed "external string encode set" would exactly satisfy my request here. It wouldn't be specific to RPH; as you said, it's for any time you have a string with no idea of its contents and want to shove it in a URL (i.e., JavaScript's encodeURIComponent).
And use it internally. This is also an initial step for #369.
Complements whatwg/html#5524. Closes #369.
This is a bit of a jumble of related issues that all stem from one root problem: URL Standard (unlike RFC 3986) does not have a concept of an "unreserved" character set. Apologies that this is a bit of an essay, but since these are all inter-related, I thought I would just group them into one discussion.
Why an unreserved set?
To give some background, RFC 3986's unreserved set (ASCII alphanumeric plus `-._~`) is the set of characters that are interchangeable in their percent-encoded and non-encoded forms: "URIs that differ in the replacement of an unreserved character with its corresponding percent-encoded US-ASCII octet are equivalent: they identify the same resource." (The earlier RFC 2396 defined a slightly larger unreserved set: ASCII alphanumeric plus `!'()*-._~`, which will be relevant later.)

In other words, the RFC divides the set of valid URL characters into two subsets: reserved and unreserved. Percent-encoding or percent-decoding a reserved character may change the meaning of the URL (e.g., "?abc=def" and "?abc%3Ddef" have different meanings). Percent-encoding or percent-decoding an unreserved character does not change the meaning of the URL (e.g., "/abc/" and "/%61%62%63/" should be considered equivalent, with "/abc/" being the normalized representation).
URL Standard does not have an equivalent concept, and this manifests as several problems (each of which could have its own bug, but I think it helps to group these together):

… (In general, `Parse(Serialize(url)) == Parse(Render(url))` should be true for all URLs.) Right now, there is no code point (with a few exceptions, but not the cases I'm talking about) that can be decoded without changing the way the URL parses.

So how does adding an unreserved set help with these?
What should be in the unreserved set?
So what characters should be in the unreserved set? I propose three alternatives (from largest to smallest):

1. ASCII alphanumeric plus `!$'()*,-.;_~`. This reserves the bare minimum set of characters. I compiled this list by carefully reading the URL standard and deciding whether each ASCII character has any special meaning. The characters listed have no intrinsic meaning anywhere in the URL standard (note that '.' has special meaning in single- and double-dot path segments, but "." and "%2E" are already considered equivalent in that regard).
2. ASCII alphanumeric plus `!'()*-._~`. This matches RFC 2396, the older IETF standard.
3. ASCII alphanumeric plus `-._~`. This matches RFC 3986.

Of these, I prefer option 2 (match RFC 2396). Option 1 is the most "logical" because it can be directly derived from reading the rest of the spec, but it doesn't leave any room for either this spec, or any individual schemes, to add their own special meaning to any new characters (which was the purpose of reserved characters in the first place). Option 3 matches the most recent IETF URL specification, which deliberately moved `!'()*` into the reserved set, but I don't think this move had much impact on implementations. For example, encodeURIComponent still uses the reserved set from RFC 2396. Option 2 exactly matches the encode set of encodeURIComponent. Furthermore, choosing Option 2 more or less matches Chrome's current behaviour (though it differs from one context to another, as discussed below).

An open question is whether non-ASCII characters should appear in the unreserved set. This mostly doesn't matter, because all non-ASCII characters are in all of the percent-encode sets, so they always get normalized into encoded form. Technically, they act like unreserved characters because the URL semantics don't change as you encode/decode them. But I am leaving them out because the unreserved characters should be those that normalize to being decoded.
Study of current implementations
A WHATWG standard is supposed to describe how implementations actually behave. My experiments with Chrome 63 and Firefox 52 suggest that implementations do not follow the current URL standard at all, and are much closer to matching what I suggest above. (Disclaimer: I work for Google on the Chrome team.)
URL equivalence
I can't find a good built-in way on the browser side to test URL equivalence (since the `URL` class has no equivalence method). But we can use a function like the following to test equivalence of URL strings, based on the browser's implementation of URL parsing and serializing:
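A minimal sketch of such a helper (an assumption on my part: it resolves both strings against a fixed base with `new URL()` and compares the serializations, treating unparseable input as non-equivalent):

```js
// Sketch: two URL strings are "equivalent" if parsing and re-serializing them
// produces the same result. A base URL is assumed so bare strings like 'a'
// or '%61' parse as relative references.
function urlStringsEquivalent(a, b) {
  const base = 'https://example.com/';
  try {
    return new URL(a, base).href === new URL(b, base).href;
  } catch (e) {
    return false; // treat unparseable input as non-equivalent
  }
}
```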
Here, Chrome mostly matches RFC 3986's notion of syntax-based equivalence:
- `urlStringsEquivalent('a', 'a')`: true
- `urlStringsEquivalent('a', '%61')`: true (normalized to 'a')
- `urlStringsEquivalent('~', '%7E')`: true (normalized to '~')
- `urlStringsEquivalent('=', '%3D')`: false (not normalized)
- `urlStringsEquivalent('*', '%2A')`: false (not normalized)
- `urlStringsEquivalent('<', '%3C')`: true (normalized to '%3C')

Specifically, Chrome's URL parser decodes all characters in the RFC 3986 unreserved set: ASCII alphanum plus `-._~`.

But Chrome also fails to normalize case when it doesn't decode a percent-encoded sequence:

- `urlStringsEquivalent('%6e', '%6E')`: true (normalized to 'n')
- `urlStringsEquivalent('%3d', '%3D')`: false (not normalized)

Firefox, on the other hand, follows the current URL standard:

- `urlStringsEquivalent('a', 'a')`: true
- `urlStringsEquivalent('a', '%61')`: false (not normalized)
- `urlStringsEquivalent('~', '%7E')`: false (not normalized)
- `urlStringsEquivalent('=', '%3D')`: false (not normalized)
- `urlStringsEquivalent('*', '%2A')`: false (not normalized)
- `urlStringsEquivalent('<', '%3C')`: true (normalized to '%3C')
- `urlStringsEquivalent('%6e', '%6E')`: false (not normalized)
- `urlStringsEquivalent('%3d', '%3D')`: false (not normalized)

In my opinion, the spec and Firefox should change so that these "unreserved" characters (particularly alphanumeric) are equivalent to their percent-encoded counterparts. Though I think instead of using Chrome's set (RFC 3986), we should use RFC 2396, for compatibility with encodeURIComponent (hence `urlStringsEquivalent('*', '%2A')` should return true).

URL rendering
Paste this URL in the address bar:
Chrome decodes the following characters: ASCII alphanum, non-ASCII, and `"-.<>_~`. All other characters remain encoded. This is the RFC 3986 unreserved set, plus `"<>`, which is the intersection of the URL Standard fragment and query encode sets (those three characters are always encoded by the parser, so like unreserved characters, they have the same semantics whether encoded or not). `Parse(Serialize(url)) == Parse(Render(url))` is true for Chrome for all URLs.

Firefox decodes the following characters: ASCII alphanum, non-ASCII, backtick, and `!"'()*-.<>[\]^_{|}~`. All other characters remain encoded. This is the RFC 2396 unreserved set, plus backtick and `"<>[\]^_{|}`. I'm not sure what the rationale behind Firefox's decode set is. `Parse(Serialize(url)) == Parse(Render(url))` is not true for Firefox. For example, for the URL "https://example.com/%2A", `Parse(Serialize(url))` gives "https://example.com/%2A", while `Parse(Render(url))` gives "https://example.com/*".

Clearly, neither of these implementations follows the standard, which says to decode all characters. Therefore, the spec should change to more closely match implementations. Preferably RFC 2396's unreserved set, for consistency. We could also throw in `"`, `<` and `>`, since these will be re-encoded upon parsing. Whatever is decided, it should be the case that `Parse(Serialize(url)) == Parse(Render(url))`.
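As a sketch of what such a rendering algorithm could look like (my illustration only: it decodes bytes in the RFC 2396 unreserved set plus `"`, `<` and `>`, and omits non-ASCII decoding for brevity):

```js
// Sketch: decode a serialized URL for display. Only sequences that cannot
// change how the URL re-parses are decoded: the RFC 2396 unreserved set,
// plus " < > (which the parser always re-encodes anyway).
function renderForDisplay(serializedUrl) {
  return serializedUrl.replace(/%([0-9A-Fa-f]{2})/g, (match, hex) => {
    const ch = String.fromCharCode(parseInt(hex, 16));
    return /[A-Za-z0-9!'()*\-._~"<>]/.test(ch) ? ch : match;
  });
}

// renderForDisplay('https://example.com/%4D%61%74%74?q=%3D')
// → 'https://example.com/Matt?q=%3D'  (the '=' stays encoded)
```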
Encoding arbitrary strings

Let's take a look at registerProtocolHandler's escaping behaviour, where a URL is escaped before being substituted into the "%s" template string. The spec says to escape it with the "default encode set", which no longer exists but links to the path percent-encode set, which is: C0 control chars, space, backtick, non-ASCII, and `"#<>?{}`.

I'll test this by navigating to httpbin and running this code in the Console:
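The snippet would have been something along these lines (a reconstruction based on the URLs shown below, not the exact original code):

```js
// Register httpbin as a handler for mailto: links. "%s" is the placeholder
// that gets replaced with the (escaped) mailto: URL.
// (Reconstruction; assumed to be run from a page on https://httpbin.org.)
navigator.registerProtocolHandler(
  'mailto',
  'https://httpbin.org/get?address=%s',
  'httpbin mail handler' // title argument (required at the time this was written)
);
```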
Now a malicious site can inject other query parameters by linking you to "mailto:foo@example.com&launchmissiles=true".
According to the spec, this is supposed to open https://httpbin.org/get?address=mailto:foo@example.com&launchmissiles=true. That's a query parameter injection attack. httpbin displays:
Fortunately, Chrome and Firefox both encode many more characters. In both cases, they open https://httpbin.org/get?address=mailto%3Afoo%40example.com%26launchmissiles%3Dtrue, so the '&' and '=' are correctly interpreted as part of the email address, not separate arguments. httpbin displays:
Chrome uses the RFC 2396 reserved set, matching encodeURIComponent. Firefox leaves off a few more characters: `<>[\]{|}` (but nothing important).
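For comparison, encodeURIComponent applied to the example link produces exactly the encoding the two browsers generate above:

```js
encodeURIComponent('mailto:foo@example.com&launchmissiles=true');
// → "mailto%3Afoo%40example.com%26launchmissiles%3Dtrue"
```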
I think the correct fix is to change registerProtocolHandler's spec (in HTML) to match encodeURIComponent. However, there isn't an easy way to do that, short of calling into the ECMAScript-defined encodeURIComponent method, or explicitly listing all characters. If we had an appropriate "reserved set" or "default encode set" (the complement of the unreserved set) in the URL Standard, then registerProtocolHandler can just use that.

Note that I am developing the Web Share Target API and need basically the same thing as registerProtocolHandler. At the moment, I've written it as "userinfo percent-encode set", but that still doesn't capture all the characters I need (especially '&').
Recommendations

Given all of the above, I would like to make the following changes to URL Standard:

- Define an "unreserved" set: ASCII alphanumeric plus `!'()*-._~` (which matches RFC 2396, but the exact set is debatable).
- Define its complement as the "reserved" / "default encode set". (If the "default encode set" name is kept, registerProtocolHandler is automatically fixed. Otherwise we have to update registerProtocolHandler to use the reserved set's name instead.) This set matches the encode set of the encodeURIComponent function.
- Change URL rendering to decode unreserved characters, plus `"`, `<` and `>` (the intersection of the fragment and query encode sets). Note that this algorithm should satisfy `Parse(Serialize(url)) == Parse(Render(url))` for all URLs.

Doing so would solve a number of issues outlined above, and bring the spec much closer to existing implementations. It would then make sense to update implementations to match the new spec.
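To make the "encode an arbitrary string" operation concrete, here is a rough sketch (my own illustration, not proposed spec text) that percent-encodes every UTF-8 byte outside the RFC 2396 unreserved set; for inputs like the registerProtocolHandler example it behaves the same as encodeURIComponent:

```js
// Sketch only: encode an arbitrary string for inclusion in a URL component,
// leaving the proposed unreserved set (RFC 2396: alphanumerics and !'()*-._~)
// untouched and percent-encoding every other UTF-8 byte.
const UNRESERVED = /[A-Za-z0-9!'()*\-._~]/;

function encodeArbitraryString(input) {
  let out = '';
  for (const byte of new TextEncoder().encode(input)) {
    const ch = String.fromCharCode(byte);
    out += UNRESERVED.test(ch)
      ? ch
      : '%' + byte.toString(16).toUpperCase().padStart(2, '0');
  }
  return out;
}

// encodeArbitraryString("mailto:foo@example.com&launchmissiles=true")
// → "mailto%3Afoo%40example.com%26launchmissiles%3Dtrue" (same as encodeURIComponent)
```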
I am quite familiar with the URL Standard and am volunteering to make the required changes, if there is consensus. Also, I don't strictly need there to be a reserved / unreserved set. These three problems could be fixed individually. But it makes the most sense to conceptualize this as reserved vs unreserved, and then tie a bunch of other definitions off of those concepts.
Regards,
Matt