Don't quote boundary parameter for multipart unless necessary #2545

FichteFoll · 2017-11-22T02:42:01Z

What do these changes do?

Only quote multipart boundary when necessary.
Also accept byte-strings, although only in a limited form due to an
architectural restriction.
Also properly quote boundaries when they contain " or \ characters.

Are there changes in behavior for the user?

Boundaries provided by the user, which contain " or \ characters, are properly escaped and invalid characters such as \n cause an exception.

Related issue number

Fixes #2544.

Checklist

I think the code is well written
Unit tests for the changes exist
Documentation reflects the changes
If you provide code modification, please add yourself to CONTRIBUTORS.txt
- The format is <Name> <Surname>.
- Please keep alphabetical order, the file is sorted by names.
Add a new news fragment into the CHANGES folder
- name it <issue_id>.<type> for example (588.bug)
- if you don't have an issue_id change it to the pr id after creating the pr
- ensure type is one of the following:
  - .feature: Signifying a new feature.
  - .bugfix: Signifying a bug fix.
  - .doc: Signifying a documentation improvement.
  - .removal: Signifying a deprecation or removal of public API.
  - .misc: A ticket has been closed, but it is not of interest to users.
- Make sure to use full sentences with correct case and punctuation, for example: "Fix issue with non-ascii contents in doctest text files."

Is it okay for me to add myself to CONTRIBUTORS.txt with my pseudonym? I saw nobody else doing it, which is why I didn't for now.

Also accept byte-strings, although only in a limited form due to an architectural restriction. Also properly quote boundaries when they contain `"` or `\` characters. Some servers don't understand quoted boundaries in multipart documents, which is why it should only be used when necessary.

codecov-io · 2017-11-22T02:50:49Z

Codecov Report

Merging #2545 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #2545      +/-   ##
==========================================
+ Coverage   97.78%   97.78%   +<.01%     
==========================================
  Files          36       36              
  Lines        7225     7236      +11     
  Branches     1260     1262       +2     
==========================================
+ Hits         7065     7076      +11     
  Misses         55       55              
  Partials      105      105

Impacted Files	Coverage Δ
aiohttp/multipart.py	`95.47% <100%> (+0.09%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 999cf08...41d8450. Read the comment docs.

asvetlov · 2017-11-22T13:02:25Z

aiohttp/helpers.py

+    #                  ; any VCHAR, except delimiters
+    # VCHAR           = %x21-7E
+    valid_tchar_regex = br"^[!#$%&'*+\-.^_`|~\w]+$"
+    if re.match(valid_tchar_regex, value):


Please use compiled regular exception

Could you explain the rationale for this?

Compiled regexs are faster, isn't it?

Yes, but only if they are cached properly so they aren't recompiled every time. Python does some internal caching for regular expressions already, so unless we need a lot of regular expressions, the internal cache should cover most use cases.

In fact, if we don't cache the regex and compile it every time we need it, we actually degrade performance compared to having Python (or the re module) handle caching automatically.

I propose to cache regexps at module level, like [here](https://github.com/aio-libs/aiohttp/blob/master/aiohttp/helpers.py#L452
re module has 512 slots for LRU cache of compiled regexps.
Thus the program may degrade performance if other code uses non-complied exceptions for solving own tasks. If the total regexps number exceeds 512 -- re will recompile them constantly.

This function will only be run once for every multipart document sent, so I believe the performance impact will be minimal at ~~best~~worst, but we can do it anyway if that's your standard.

Thanks.

It is the aiohttp standard.

The function runs once per multipart document but for high-load site it could clash with other handlers and their functionality.

asvetlov · 2017-11-22T13:02:44Z

aiohttp/helpers.py

+        return value
+
+    invalid_qdtext_char_regex = br"[\x00-\x08\x0A-\x1F\x7F]"
+    if re.match(invalid_qdtext_char_regex, value):


Compile regex

kxepal · 2017-11-22T16:26:01Z

aiohttp/helpers.py

+    quoted_value_content = value.replace(b'\\', b'\\\\')  # %x5C
+    quoted_value_content = value.replace(b'"', b'\\"')  # %x22
+
+    return b'"' + quoted_value_content + b'"'


Why bytes here in and out?

The HTTP spec operates on bytes, which is why I designed this helper function to follow that. The only reason why we need to convert back to a string is because the underlying Payload API requires a string for header values. Or at least I think so. If it doesn't, I'll be glad to remove the conversion back to a unicode string.

I could of course special-case the function for just our situation here, but it should go into multipart.py then.

Also, we need to use bytes in the actual multipart body, so we should try to convert exactly that onto a potentially quoted parameter value.

kxepal · 2017-11-22T16:29:38Z

aiohttp/multipart.py

+        except (UnicodeEncodeError, UnicodeDecodeError):
+            raise ValueError('boundary should contain ASCII only chars') \
+                from None
+        boundary_value = format_parameter_value(self._boundary).decode('ascii')


Because it looks strange. We take string boundary, turns it into bytes, and then turn it back to stirng to apply formatting. This wouldn't be a bottleneck for sure, but looks like useless work.

kxepal · 2017-11-22T16:30:08Z

aiohttp/multipart.py

-        except UnicodeEncodeError:
-            raise ValueError('boundary should contains ASCII only chars')
-        ctype = 'multipart/{}; boundary="{}"'.format(subtype, boundary)
+            if isinstance(boundary, bytes):


What is the case for binary boundary input?

It is true to how it is actually used and defined in the spec. In this particular situations, users don't get more value out of it because we have to convert the boundary back to a unicode string for the Payload header. Otherwise characters above 0x80 could be used inside the boundary.

Ideally, all headers would work on byte strings (only or additionally), but I didn't consider this to be in the scope of this PR.

*bytes above 0x80 could be used. HTTP is an ascii-readable byte-based protocol.

kxepal · 2017-11-22T16:32:19Z

aiohttp/multipart.py

+            if isinstance(boundary, bytes):
+                boundary.decode('ascii')
+                self._boundary = boundary
+            else:


If this if-else still exists, would be good to move default boundary generation here into another branch. No point to do isinstance check if boundary passed as None since we know that it will not be a bytes.

It would complicate (or at least worsen code beauty) of the next part when we try to catch encoding errors.

kxepal · 2017-11-22T16:33:07Z

aiohttp/multipart.py

+                self._boundary = boundary
+            else:
+                self._boundary = boundary.encode('ascii')
+        except (UnicodeEncodeError, UnicodeDecodeError):


I would recommend to minimize exception catch scope.

FichteFoll · 2017-11-23T00:36:46Z

Regarding the header stuff, I think I can implement byte string headers and automatic conversion using ascii (and not utf-8!) reasonably by converting unicode strings in Payload's __init__, if necessary, and then changing MultipartWriter.append_payload to expect the headers to be byte strings already. I haven't looked at where else I would need to implement this especially regarding the main request/response's and would rather not persue that within this PR, but Payload seems to only be used by multipart related code (and its headers property is only accessed within MultipartWriter).

Regarding the motivation of that suggested change:
To my knowledge, the entire HEAD part (also in multipart) of HTTP is byte-based and only characters in the ASCII range are used by indicators (i.e. header names, values in the IANA encoding registry, etc.) while other bytes >=0x80 may be used in some locations, for example in the boundary parameter here. A such, I believe this internal representation should be reached as early as possible and only public API should accept unicode strings in the ASCII range because those translate to a single byte encoding transparently and to maintain backwards compatiblity, obviously. It's also more intuitive for most users because byte strings seem like an abstract concept to many, I think.

Also, what about my CONTIBUTORS.txt question?

fafhrd91 · 2017-11-23T02:30:56Z

Aiohttp treats header value as an utf-8. I think a lot of servers does the same nowadays.

asvetlov · 2017-11-23T07:17:17Z

Pseudonymin in CONTRIBUTORS -- it's up to you. You even can skip the step.
CONTRIBUTORS.txt is a kind of Hall of Fame. I you don't want to be here -- nobody insists.

asvetlov · 2017-11-23T07:23:23Z

Yes, historically aiohttp uses utf-8 for headers with a backdoor for getting bytes for communication with malformed client/server, maybe we need a hooks for sending custom bytes too.

Anyway for me there is no reason for switching to byte-based headers, especially for boundary -- the value perfectly fits in ascii, isn't it?

FichteFoll · 2017-11-23T16:49:52Z

No, technically bytes 0x80 to 0xff are allowed as well. When quoted, that is. It's the obs-text (or char?) rule in the abnf I included in the issue and in the comment for the helper function. (I'm on mobile, otherwise I'd quote it myself.)

Again, if you don't feel good about the weird conversion between bytes and unicode, I can remove it and it won't be backwards incompatible. It wouldn't allow any bytes higher than 0x7f, but the current implementation doesn't as well and it's perfectly backwards compatible. It just bothers me in a way that the specification isn't respected entirely.

asvetlov · 2017-11-23T20:06:33Z

@FichteFoll don't get me wrong.
The PR exists for fixing current 100% correct implementation for sake of better compatibility with software which is in turn not 100% follows the specification.
I have a guts feeling that allowing non-ascii boundary will break other software (but pretty good from RFC). Let's restrict boundary to ascii on multipart message generation -- this way is:

Backward compatible with current aiohttp implementation
Cannot break other software

We can return to discussion if somebody will complain aiohttp in not supporting non-ascii boundaries.

FichteFoll · 2017-11-24T19:58:53Z

I agree with this sentiment. Will adjust the PR tomorrow.

Do you want me to squash? Otherwise I'd just commit on top of this so we can keep the other commit in case this ever comes up. It might end up being useful, maybe.

asvetlov · 2017-11-24T20:07:25Z

Just merge master and add required changes.
PR is squashed before merging, PR's history doesn't matter.

asvetlov · 2017-11-25T10:16:33Z

aiohttp/helpers.py

+
+    invalid_qdtext_char_regex = br"[\x00-\x08\x0A-\x1F\x7F]"
+    if re.match(invalid_qdtext_char_regex, value):
+        raise ValueError("parameter value contains invalid characters")


The line is not covered by tests

Also - fix an error when escaping backslashes in the quoted parameter, - cache compiled regular expressions (by aiohttp standards), and - cover all lines in tests.

There is little use in having this property public. If a use case arises, it could easily be made public in the future.

asvetlov · 2017-11-29T16:18:16Z

@kxepal please review again

kxepal

LGFM, but there are two bits remains that concerns me.

kxepal · 2017-12-01T09:50:55Z

aiohttp/multipart.py

+        if re.match(self._valid_tchar_regex, value):
+            return value.decode('ascii')  # cannot fail
+
+        if re.match(self._invalid_qdtext_char_regex, value):


I think you mean re.search. Because match will not match anything if string starts with valid chars.

Good call. Fixed that and another issue with the previous regex not considering that $ also matches before a newline.

kxepal · 2017-12-01T10:17:00Z

tests/test_multipart.py

    def test_bad_boundary(self):
        with self.assertRaises(ValueError):
            aiohttp.multipart.MultipartWriter(boundary='тест')
+        with self.assertRaises(ValueError):
+            aiohttp.multipart.MultipartWriter(boundary='\n')


How about boundary=test\n? (;

kxepal · 2017-12-03T18:54:21Z

@FichteFoll, Thanks! @asvetlov?

asvetlov · 2017-12-04T11:40:09Z

docs/multipart_reference.rst

   .. attribute:: boundary

+      The byte-string representation of the boundary.


The type bothers me.
We pass optional str value into constructor but the property is byteish.
Maybe the method should return self._boundary.decode('ascii')?
It is a breaking change but we could do it in aiohttp 3.0.
Amount of affected user code is very low IMHO.

This behavior is unchanged. I just documented it.

https://github.com/aio-libs/aiohttp/pull/2545/files#diff-235eeadd1960735e1191879aa500b7b7L637

Yes, I see.
But maybe now is the time for behavior changing?
Current one is confusing a little.

Well, imo everything should be saved as byte-strings anyway (as I explained before), so having the boundary as a byte-string doesn't seem far off. It's only used that way, too, so either we'd add an additional property that converts the boundary from a unicode to a byte string, or we encode it at every point we need it (not a fan).

I'd say we leave this for now. I would rather spend time to convert all header strings to byte-strings rather than unwrapping the boundary here.

I fear it was bytes all the time. Let's change that for 3.0! This could be done with separate PR with breaking change notice.

Ok, let's merge.
Technically single Pull Request can have several records in CHANGES/ but I'm ok with separate PR.

@FichteFoll I have a feeling that all headers should be either strings or bytes.
All means really all, including request.headers etc.
This change in too intrusive, the ship has sailed many years ago.
Also I think 99% aiohttp users are happy with string headers but byte header values will confuse them.

I agree that this should ideally be standardized on a single header type. For compatibility, all unicode strings could be converted to byte-strings using ASCII encoding, so that the outgoing API doesn't change much except that it allows byte-strings for all headers too.

However, this goes way beyond the scope of this PR and would be a large architectural consideration that I'm sure you're going to have a much better time deciding on. It might be something worth considering for the next major release, i.e. 3.0, as long as someone puts in the work.

asvetlov · 2017-12-04T19:56:17Z

@FichteFoll thanks!

lock · 2019-10-28T12:03:48Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a [new issue] for related bugs.
If you feel like there's important points made in this discussion, please include those exceprts into that [new issue].
[new issue]: https://github.com/aio-libs/aiohttp/issues/new

FichteFoll force-pushed the pr/unquoted_boundary branch from c63ed75 to 1aaf422 Compare November 22, 2017 02:44

asvetlov requested changes Nov 22, 2017

View reviewed changes

Merge branch 'master' into pr/unquoted_boundary

b5b41ee

kxepal requested changes Nov 22, 2017

View reviewed changes

asvetlov reviewed Nov 25, 2017

View reviewed changes

FichteFoll added 5 commits November 26, 2017 18:00

Remove byte-string support for boundary parameter

2185786

Also - fix an error when escaping backslashes in the quoted parameter, - cache compiled regular expressions (by aiohttp standards), and - cover all lines in tests.

Merge remote-tracking branch 'upstream/master' into pr/unquoted_boundary

330a2ac

Add myself to contributors

90d91ed

Line length

44046ce

boundary_value doesn't need to be public

e8179a2

There is little use in having this property public. If a use case arises, it could easily be made public in the future.

asvetlov mentioned this pull request Nov 29, 2017

Wrong cookies format when sending request #2571

Closed

Merge branch 'master' into pr/unquoted_boundary

c27c624

kxepal requested changes Dec 1, 2017

View reviewed changes

Fix NL characters at the end of a boundary

70ee621

kxepal approved these changes Dec 3, 2017

View reviewed changes

asvetlov reviewed Dec 4, 2017

View reviewed changes

Merge branch 'master' into pr/unquoted_boundary

41d8450

asvetlov approved these changes Dec 4, 2017

View reviewed changes

asvetlov merged commit 52dc7f0 into aio-libs:master Dec 4, 2017

asvetlov mentioned this pull request Dec 5, 2017

Make multipart.boundary a str property #2589

Merged

asvetlov added this to the 3.0 milestone Dec 5, 2017

lock bot added the outdated label Oct 28, 2019

lock bot locked as resolved and limited conversation to collaborators Oct 28, 2019

psf-chronographer bot added the bot:chronographer:provided There is a change note present in this PR label Oct 28, 2019

		.. attribute:: boundary

		The byte-string representation of the boundary.

Don't quote boundary parameter for multipart unless necessary #2545

Don't quote boundary parameter for multipart unless necessary #2545

Conversation

FichteFoll commented Nov 22, 2017 • edited Loading

What do these changes do?

Are there changes in behavior for the user?

Related issue number

Checklist

codecov-io commented Nov 22, 2017 • edited Loading

Codecov Report

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

FichteFoll Nov 22, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

FichteFoll Nov 22, 2017 • edited Loading

Choose a reason for hiding this comment

FichteFoll Nov 22, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

FichteFoll commented Nov 23, 2017

fafhrd91 commented Nov 23, 2017

asvetlov commented Nov 23, 2017

asvetlov commented Nov 23, 2017

FichteFoll commented Nov 23, 2017

asvetlov commented Nov 23, 2017

FichteFoll commented Nov 24, 2017 • edited Loading

asvetlov commented Nov 24, 2017

Choose a reason for hiding this comment

asvetlov commented Nov 29, 2017

kxepal left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kxepal commented Dec 3, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

asvetlov commented Dec 4, 2017

lock bot commented Oct 28, 2019

FichteFoll commented Nov 22, 2017 •

edited

Loading

codecov-io commented Nov 22, 2017 •

edited

Loading

FichteFoll Nov 22, 2017 •

edited

Loading

FichteFoll Nov 22, 2017 •

edited

Loading

FichteFoll Nov 22, 2017 •

edited

Loading

FichteFoll commented Nov 24, 2017 •

edited

Loading