Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hyperlink should treat + in query as %20 #129

Closed
twm opened this issue Apr 27, 2020 · 24 comments · Fixed by #145
Closed

Hyperlink should treat + in query as %20 #129

twm opened this issue Apr 27, 2020 · 24 comments · Fixed by #145
Labels

Comments

@twm
Copy link
Collaborator

twm commented Apr 27, 2020

Via twisted/klein#339, Hyperlink seems to be percent-encoding the + when parsing the URL:

>>> from hyperlink import parse
>>> u = parse(b"https://localhost/getme?name=hello,%20big+world&value=4321".decode('ascii'))
>>> u
DecodedURL(url=URL.from_text('https://localhost/getme?name=hello,%20big%2Bworld&value=4321'))
>>> u.to_uri().to_text().encode('ascii')
b'https://localhost/getme?name=hello,%20big%2Bworld&value=4321'

This seems like a bug to me. + in the query part should be treated like %20. For example, this is what the web API does:

u = new URL("https://localhost/getme?name=hello,%20big+world&value=4321");
console.log(u.searchParams.get("name"));
// outputs: hello, big world
@wsanchez wsanchez added the bug label Apr 27, 2020
@wsanchez
Copy link
Contributor

Well, we aren't first.

Brief foray into Stack Overflow seems to yield a lack of consensus as to correctness.

The Query String Article on Wikipedia (under URL Encoding) suggests this is a thing in HTML, which I read as distinct from a web browser… which makes the desired behavior possibly client-specific.

I might be reading the above wrong, and I haven't dug through RFCs for a definitive answer, but it seems like most of the time, replacing + with space is the not right thing to do…?

@mahmoud
Copy link
Member

mahmoud commented Jul 28, 2020

Oh man, I thought I responded to this, but you basically read my mind @wsanchez. It's a web-ism. Might be worth making a default (https is the default scheme in hyperlink after all). But yeah, I grappled with this a bit in my initial work and left the behavior as it is today.

Could make it a flag in .normalize() or similar?

@twm
Copy link
Collaborator Author

twm commented Jul 28, 2020

Ah, okay, so I am not sure that it need even be a flag. Since the hex-encoded form is universal it is reasonable to canonicalize to that. Noting this behavior in the documentation seems enough of a fix to me.

@twm
Copy link
Collaborator Author

twm commented Aug 15, 2020

I seem to have confused myself here, so I'm going to retrace my steps. We're trying to make use of Hyperlink in both client and server-side libraries, with slightly different requirements.

  1. For a client library (treq) it seems safe to encode space as %20 in all circumstances, rather than using + in querystrings. Any server that parses querystrings will have support for percent-escaped characters.
  2. For a server library (Klein) we want to use a DecodedURL instances to represent the request URI, so it must be able to decode querystrings generated by web browsers, which encode spaces as +. That means that this is a problem:
>>> u = DecodedURL.from_text('/foo/?q=a+b')
>>> u.get('q')
['a+b']

In the server-side use case it would be fine if we could pre-normalize the URL or otherwise cause DecodedURL.get() to process + as space, though I think it is still surprising that this isn't the default for http and https schemes. Perhaps there should be an additional parameter to register_scheme() a-la uses_netloc?

@wsanchez
Copy link
Contributor

I agree with 1.

When you say "querystrings generated by web browsers", what's an example of how you get a web browser to generate a query string?

@mahmoud
Copy link
Member

mahmoud commented Aug 17, 2020

<form> submit with "GET" action should do it, yeah?

@wsanchez
Copy link
Contributor

Right, to confirm this awesomeness, I wrote this script:

from textwrap import dedent

from klein import run, route

@route("/")
def home(request):
    request.setHeader("content-type", "text/html")
    return dedent(
        """
        <html>
          <head><title>Form</title></head>
          <body>
            <form action="" method="GET">
              <div>
                <input type="text" name="name" id="name" required>
              </div>
              <div>
                <input type="submit" value="Submit">
              </div>
            </form>
          </body>
        </html>
        """
    )

run("localhost", 8080)

And ran it, entered "hello, world" into the text field, and the URL I'm sent to in the (Safari) browser's location bar is http://localhost:8080/?name=hello%2C+world. The Klein log spits out that request.uri is b'/?name=hello%2C+world'.

So… back to @twm's retracing of steps, it does seem like dealing with web browsers requires that one convert those + characters to spaces at some point.

If Hyperlink doesn't do this, then it would seem like Klein should do on on request URIs, but to do so, it would have to tase out the query part and only do this there, which seems more like Hyperlink's domain…

If Hyperlink does this… let's say by default for simplicity, then the downside is… that someone put a + into a query and meant for that to be a plus, not a space. That someone could have instead encoded that as %2B, so a client like treq always encoding plus and space seems prudent (item 1, again).

It remains unclear by way of standards what a client is supposed to do with an un-encoded plus, I guess "what web browsers do" is the next best thing. Though I'm nervous about what I see in the location bar as

Trying to take HTML out of the picture, here am in the Safari JavaScript console to replicate the Web API @twm referenced:

> url = new URL("http://localhost:8080/?name=hello%2C+world")
< URL {href: "http://localhost:8080/?name=hello%2C+world", origin: "http://localhost:8080", protocol: "http:", username: "", password: "", …}
> url.search
< "?name=hello%2C+world"
> url.searchParams.get("name")
< "hello, world"

So the string representation of the URL leaves + in place. But when you tease our query values, they get converted to space.

So I'm not sure whether the + should be left alone when parsed, or covered to %20. What does seem appropriate is that by the time you get it back via a query API like this:

>>> url = DecodedURL.fromText("https://localhost/getme?name=hello,%20big+world&value=4321")
>>> url.get("name")
['hello, big+world']

…the API should probably have converted the + to a space.

@wsanchez
Copy link
Contributor

@glyph should weigh in here because he wrote some tests in Klein which apparently agree with @twm about this.

@glyph
Copy link
Collaborator

glyph commented Sep 16, 2020

DecodedURL should absolutely be converting + to space in this context, but URL should preserve the input format.

@glyph
Copy link
Collaborator

glyph commented Sep 16, 2020

Figuring out how all of WhatWG's spec fits together is extremely tedious, but it does contain this gem:

Replace any 0x2B (+) in name and value with 0x20 (SP).

Which suggests... something.

@glyph
Copy link
Collaborator

glyph commented Sep 16, 2020

(My understanding from @mahmoud's previous statements is that we are preferring WhatWG to the RFCs in hyperlink, as it's somewhat more up-to-date and pragmatic.)

@mahmoud
Copy link
Member

mahmoud commented Sep 16, 2020

Oh dang, not sure where I gave that impression*. WhatWG is browser/web-focused, I've been trying to make hyperlink more vanilla, not adopting all the HTTP-adjacent assumptions. So, leaning toward the earlier RFCs, using other prominent URL use cases (e.g., git, db drivers) to temper web influences.

I'm not sure what other protocols do with +, but I can see hyperlink being of some help, at least through a flag on parse or URL.normalize. If it's got to go into default behavior, then yeah, DecodedURL would be the spot. + is special, so I'm guessing on serialization, we're all OK if it roundtrips to %20? A flag would make that specialness more explicit.

As an aside, WHATWG's "plain English" description of URL parsing is one of my go-to example of how not to document technical specs. Just give us a good reference implementation already.

@glyph
Copy link
Collaborator

glyph commented Sep 16, 2020

+ is special, so I'm guessing on serialization, we're all OK if it roundtrips to %20? A flag would make that specialness more explicit.

No, but, the underlying URL object can easily maintain the relevant state. We already have lots of places where equivalent-when-decoded URLs still manage to round-trip with no loss of information.

@glyph
Copy link
Collaborator

glyph commented Sep 16, 2020

(If you rebuilt the URL via DecodedURL directly, then yeah, getting %20 for spaces in the query string seems fine to me.)

@mahmoud
Copy link
Member

mahmoud commented Sep 16, 2020

So in the event of a mixed URL, the underlying state would track the index of every space and whether it should be serialized as a + or a %20? Sounds intense, but maybe we're thinking a top-level plus_means_space flag?

@glyph
Copy link
Collaborator

glyph commented Sep 16, 2020

We already do this in the underlying URL. Consider:

>>> hyperlink.parse("https://example.com/multiple ways to encode%20spaces").path[0]
'multiple ways to encode spaces'

@wsanchez
Copy link
Contributor

wsanchez commented Sep 17, 2020

@glyph, that's not an ideal example because the special treatment of "+" is, I think, limited to query arguments, not the path.

So… to capture the more relevant (I think) current state of affairs:

>>> import hyperlink
>>> u = hyperlink.URL.from_text('https://example.com/thing?value=multiple ways+to encode%20spaces')
>>> u
URL.from_text('https://example.com/thing?value=multiple ways%2Bto encode%20spaces')
>>> u.query
(('value', 'multiple ways+to encode%20spaces'),)
>>> du = hyperlink.DecodedURL.from_text('https://example.com/thing?value=multiple ways+to encode%20spaces')
>>> du
DecodedURL(url=URL.from_text('https://example.com/thing?value=multiple ways%2Bto encode%20spaces'))
>>> du.query
(('value', 'multiple ways+to encode spaces'),)

@glyph I think you are asserting that Hyperlink current handling is correct, except that du.query[0][1] should have replaced ' ' (space) with '+'. Right?

That would be achievable, I think, by calling .replace("+", " ")somewhere in the implementation of the query property.

I think @mahmoud's concern (and mine) is that this isn't always correct, even if it's usually correct for a lot of users of Hyperlink, the distinction being "I expect RFC 3986" versus "I think I'm handling an HTML form submission".

Of course, this is all about HTML YOLO nonsense screwing up the world, since there's no guarantee (I think?) that code handling a URL would know that it's getting the URL via a form submission versus a non-HTML client. But forms are probably a common enough case that you have to go that route, and really clients should just replace '+' with '%2B' or '%20' to ensure that they are communicating the right thing.

So I'm not sure I'm on board with "DecodedURL should absolutely be converting + to space", and I think I agree with @mahmoud that an option should exist. I'm less confident about what the most reasonable default is.

@glyph
Copy link
Collaborator

glyph commented Sep 17, 2020

glyph I think you are asserting that Hyperlink current handling is correct, except that du.query[0][1] should have replaced ' ' (space) with '+'. Right?

Yep!

I think @mahmoud's concern (and mine) is that this isn't always correct, even if it's usually correct for a lot of users of Hyperlink, the distinction being "I expect RFC 3986" versus "I think I'm handling an HTML form submission".

So mirror-universe mahmoud quite effectively convinced me that WhatWG should take precedence, since it describes how people are practically using URLs in the vast majority of applications. Who would want behavior derived from a 15-year-old RFC that doesn't accurately describe the behavior of current browsers? What user-base is that?

Wikipedia pretty authoritatively states that plus gets encoded to space, citing… w3schools, which I guess is normative now https://en.wikipedia.org/wiki/Query_string#cite_note-w3schools-10

But, for the sake of argument, taking the "standards have a moral weight" position:

rfc3986 is updated by rfc7320. rfc7320 section 2.4 concedes:

HTML [W3C.REC-html401-19991224] constrains the syntax of query strings used in form submission. New form languages SHOULD NOT emulate it, but instead allow creation of a broader variety of URIs (e.g., by allowing the form to create new path components, and so forth).

So it seems that the IETF is on board with query strings being described more or less by HTML, and other URL-modifying languages that support describing forms should just do something else entirely.

But the fact that "HTML" describes ultimately leaves the server holding the bag. Which means that HTML implicitly defines the behavior of the HTTP[S] schemes with respect to decoding URLs, since the server might be handling an HTML form post under those schemes.

So I feel strongly that the default should absolutely be to decode + this way, but perhaps DecodedURL should have a plus_as_space flag which can be set to False as well as a scheme-level registration of there are schemes that want to opt out of this. This does leave open an unfortunate hole with schemeless URLs, because now you need an implied scheme to decode a relative URL like ?a=b+c, but I think that just means registering the null scheme as being like HTTP since relative-URL resolution is rarely done with other schemes.

That said, I do at least have a counterexample of a scheme where this type of processing is not performed: on a mac, open 'mailto:glyph@twistedmatrix.com?subject=hello+world' suggests that mailto is not HTTP-like in this regard.

@wsanchez
Copy link
Contributor

rfc3986 is updated by rfc7320. rfc7320 section 2.4 concedes:

I'd say "acknowledges the existence of some stupidity in HTML" instead of "concedes" but OK. :-)

I'm coming to the conclusion that a flag is not super helpful, and that if you want to ensure that your application's query parts are decoded correctly, never use a '+' and instead use '%'-encoding for space or plus, but that a receiver of URLs more or less has to assume that HTML trumped URL here.

I think that means that the damage is limited to query strings, and no other part of URLs. Does it apply to both names and values? The examples I've seen only show '+'-as-space in values, but I assume it's allowed in names also?

I'm having a hard time finding useful things in the WHATWG living document about this, but are HTML form submissions limited to HTTP(S) schemes? The method attribute on <form> says it specified the HTTP method to use, so that sounds like a yes. In which case, we should only do '+'-as-space in query strings for the HTTP(s) schemes and call it a day, right?

That said, I do at least have a counterexample of a scheme where this type of processing is not performed: on a mac, open 'mailto:glyph@twistedmatrix.com?subject=hello+world' suggests that mailto is not HTTP-like in this regard.

No, what that suggests is that applications that are not creating URLs form HTML forms are different than applications that are, which is the problem. I don't think the scheme is the relevant difference, except that HTML seems to be limited to HTTP(S).

The action attribute on <form> says it takes a URL, and it doesn't say it's limited to HTTP URLs. If you could write <form action="mailto:glyph@twistedmatrix.com">, then we could not assume that "mailto" is special, because really what is special is "URLs credited by HTML" and a form submitting to a mailto URL may well encode space as '+'.

@glyph
Copy link
Collaborator

glyph commented Sep 17, 2020

rfc3986 is updated by rfc7320. rfc7320 section 2.4 concedes:

I'd say "acknowledges the existence of some stupidity in HTML" instead of "concedes" but OK. :-)

I'm coming to the conclusion that a flag is not super helpful, and that if you want to ensure that your application's query parts are decoded correctly, never use a '+' and instead use '%'-encoding for space or plus, but that a receiver of URLs more or less has to assume that HTML trumped URL here.

Yeah, same. Let's prefer to encode on the input side with %20 - it doesn't seem like there's any advantage to using + except for human URL aesthetics (for which we could have a flag).

I think that means that the damage is limited to query strings, and no other part of URLs. Does it apply to both names and values? The examples I've seen only show '+'-as-space in values, but I assume it's allowed in names also?

<!DOCTYPE html>
<html>
    <body>
        <form method="GET" action="">
            <input name="here's a name" value="here's a value">
        </form>
    </body>
</html>

results in

?here%27s+a+name=here%27s+a+value

so I'd say, yes, in names as well.

I'm having a hard time finding useful things in the WHATWG living document about this, but are HTML form submissions limited to HTTP(S) schemes? The method attribute on <form> says it specified the HTTP method to use, so that sounds like a yes. In which case, we should only do '+'-as-space in query strings for the HTTP(s) schemes and call it a day, right?

In theory, no, you could have a mailto: form.

In practice, I can only test this with a file:///-scheme form, because mailto: is an "insecure" URL scheme and browsers block it from any https-hosted page. You can see this on the "try it yourself" on https://www.w3docs.com/snippets/html/how-to-create-mailto-forms.html . In such file:/// tests, though, Safari, at least, violates the W3C spec here! https://www.w3.org/TR/html52/sec-forms.html#mail-with-headers says that you have to

Replace occurrences of U+002B PLUS SIGN characters (+) in headers with the string "%20"

when a browser GETs the mailto: scheme; but if I make a form with <form method="GET" action="mailto:"><input name="subject" value="here's a value">, I get a subject with "+" in it in mail.app. So the ecosystem here is just… broken around this edge case.

That said, I do at least have a counterexample of a scheme where this type of processing is not performed: on a mac, open 'mailto:glyph@twistedmatrix.com?subject=hello+world' suggests that mailto is not HTTP-like in this regard.

No, what that suggests is that applications that are not creating URLs form HTML forms are different than applications that are, which is the problem. I don't think the scheme is the relevant difference, except that HTML seems to be limited to HTTP(S).

Yeah, you're right. And while it's practically limited to HTTP(S), this problem does theoretically leak out if browser support for other schemes improves.

The action attribute on <form> says it takes a URL, and it doesn't say it's limited to HTTP URLs. If you could write <form action="mailto:glyph@twistedmatrix.com">, then we could not assume that "mailto" is special, because really what is special is "URLs credited by HTML" and a form submitting to a mailto URL may well encode space as '+'.

It's not supposed to, per the spec, but... it does.

@wsanchez
Copy link
Contributor

At this point my opinion is that using an unencoded + in a URL, for any scheme, is broken thanks to HTML, so don't use it, and both "solutions" to this problem are wrong.

Further, the "go with HTML" solution is more wrong but also more practical. I can't find anything in the WHATWG spec spaghetti to file a bug against, so I'm giving up on any sort of correctness.

That said, either Hyperlink needs to change, or Glyph's test in Klein needs to go away, and that needs to happen soon, because the next Twisted release is going to require Klein to upgrade Treq, which now uses Hyperlink.

@glyph
Copy link
Collaborator

glyph commented Oct 26, 2020

both "solutions" to this problem are wrong

What do you mean by 'both "solutions"'?

There are a couple of issues surrounding "+" right now.

First is the issue of input preservation. parse(x) should always return DecodedURL(url=URL.from_text(x)) unless it raises an exception; parse(x).to_text() not being x is always a bug. This is a big selling point of hyperlink as a library; you can parse and treat things as structured without changing them. Therefore, this is mangling and needs to be fixed, regardless of what else changes:

>>> parse("/?+")
DecodedURL(url=URL.from_text('/?%2B'))

Next up, there's the question of what parse("/?x=a+b").get('x') should be.

I think that it's pretty clear that the right answer by default needs to be ['a b'] for practical reasons, but I understand the source of the disagreement on that one. There should be some way to get it to give you ['a+b'] instead, and arguably that should be the result for parse('mailto:?x=a+b').get('x') but this is an edge case that a much smaller audience cares about.

@wsanchez
Copy link
Contributor

wsanchez commented Oct 28, 2020

I'm on board with assert parse(x).to_text() == x.

No answer to the default behavior strikes me as correct, but I've decided not to care because it's broken, don't use un-encoded + or prepare to deal with possible nonsense.

@glyph
Copy link
Collaborator

glyph commented Dec 28, 2020

In the branch:

>>> parse("/?x=a+b").get('x')
['a b']

Plus the escape hatch:

>>> DecodedURL.from_text("/?x=a+b", query_plus_is_space=False).get('x')
['a+b']

It looks like someone else already fixed the parse / to_text round tripping?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants