Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't quote boundary parameter for multipart unless necessary #2545

Merged
merged 10 commits into from
Dec 4, 2017

Conversation

FichteFoll
Copy link
Contributor

@FichteFoll FichteFoll commented Nov 22, 2017

What do these changes do?

Only quote multipart boundary when necessary.
Also accept byte-strings, although only in a limited form due to an
architectural restriction.

Also properly quote boundaries when they contain " or \ characters.

Are there changes in behavior for the user?

Boundaries provided by the user, which contain " or \ characters, are properly escaped and invalid characters such as \n cause an exception.

Related issue number

Fixes #2544.

Checklist

  • I think the code is well written
  • Unit tests for the changes exist
  • Documentation reflects the changes
  • If you provide code modification, please add yourself to CONTRIBUTORS.txt
    • The format is <Name> <Surname>.
    • Please keep alphabetical order, the file is sorted by names.
  • Add a new news fragment into the CHANGES folder
    • name it <issue_id>.<type> for example (588.bug)
    • if you don't have an issue_id change it to the pr id after creating the pr
    • ensure type is one of the following:
      • .feature: Signifying a new feature.
      • .bugfix: Signifying a bug fix.
      • .doc: Signifying a documentation improvement.
      • .removal: Signifying a deprecation or removal of public API.
      • .misc: A ticket has been closed, but it is not of interest to users.
    • Make sure to use full sentences with correct case and punctuation, for example: "Fix issue with non-ascii contents in doctest text files."

Is it okay for me to add myself to CONTRIBUTORS.txt with my pseudonym? I saw nobody else doing it, which is why I didn't for now.

Also accept byte-strings, although only in a limited form due to an
architectural restriction.
Also properly quote boundaries when they contain `"` or `\` characters.

Some servers don't understand quoted boundaries in multipart documents,
which is why it should only be used when necessary.
@codecov-io
Copy link

codecov-io commented Nov 22, 2017

Codecov Report

Merging #2545 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2545      +/-   ##
==========================================
+ Coverage   97.78%   97.78%   +<.01%     
==========================================
  Files          36       36              
  Lines        7225     7236      +11     
  Branches     1260     1262       +2     
==========================================
+ Hits         7065     7076      +11     
  Misses         55       55              
  Partials      105      105
Impacted Files Coverage Δ
aiohttp/multipart.py 95.47% <100%> (+0.09%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 999cf08...41d8450. Read the comment docs.

# ; any VCHAR, except delimiters
# VCHAR = %x21-7E
valid_tchar_regex = br"^[!#$%&'*+\-.^_`|~\w]+$"
if re.match(valid_tchar_regex, value):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use compiled regular exception

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain the rationale for this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Compiled regexs are faster, isn't it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but only if they are cached properly so they aren't recompiled every time. Python does some internal caching for regular expressions already, so unless we need a lot of regular expressions, the internal cache should cover most use cases.

In fact, if we don't cache the regex and compile it every time we need it, we actually degrade performance compared to having Python (or the re module) handle caching automatically.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I propose to cache regexps at module level, like [here](https://github.com/aio-libs/aiohttp/blob/master/aiohttp/helpers.py#L452
re module has 512 slots for LRU cache of compiled regexps.
Thus the program may degrade performance if other code uses non-complied exceptions for solving own tasks. If the total regexps number exceeds 512 -- re will recompile them constantly.

Copy link
Contributor Author

@FichteFoll FichteFoll Nov 22, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function will only be run once for every multipart document sent, so I believe the performance impact will be minimal at bestworst, but we can do it anyway if that's your standard.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

  1. It is the aiohttp standard.
  2. The function runs once per multipart document but for high-load site it could clash with other handlers and their functionality.

return value

invalid_qdtext_char_regex = br"[\x00-\x08\x0A-\x1F\x7F]"
if re.match(invalid_qdtext_char_regex, value):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Compile regex

quoted_value_content = value.replace(b'\\', b'\\\\') # %x5C
quoted_value_content = value.replace(b'"', b'\\"') # %x22

return b'"' + quoted_value_content + b'"'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why bytes here in and out?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The HTTP spec operates on bytes, which is why I designed this helper function to follow that. The only reason why we need to convert back to a string is because the underlying Payload API requires a string for header values. Or at least I think so. If it doesn't, I'll be glad to remove the conversion back to a unicode string.

I could of course special-case the function for just our situation here, but it should go into multipart.py then.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, we need to use bytes in the actual multipart body, so we should try to convert exactly that onto a potentially quoted parameter value.

except (UnicodeEncodeError, UnicodeDecodeError):
raise ValueError('boundary should contain ASCII only chars') \
from None
boundary_value = format_parameter_value(self._boundary).decode('ascii')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it looks strange. We take string boundary, turns it into bytes, and then turn it back to stirng to apply formatting. This wouldn't be a bottleneck for sure, but looks like useless work.

except UnicodeEncodeError:
raise ValueError('boundary should contains ASCII only chars')
ctype = 'multipart/{}; boundary="{}"'.format(subtype, boundary)
if isinstance(boundary, bytes):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the case for binary boundary input?

Copy link
Contributor Author

@FichteFoll FichteFoll Nov 22, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is true to how it is actually used and defined in the spec. In this particular situations, users don't get more value out of it because we have to convert the boundary back to a unicode string for the Payload header. Otherwise characters above 0x80 could be used inside the boundary.

Ideally, all headers would work on byte strings (only or additionally), but I didn't consider this to be in the scope of this PR.

Copy link
Contributor Author

@FichteFoll FichteFoll Nov 22, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*bytes above 0x80 could be used. HTTP is an ascii-readable byte-based protocol.

if isinstance(boundary, bytes):
boundary.decode('ascii')
self._boundary = boundary
else:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this if-else still exists, would be good to move default boundary generation here into another branch. No point to do isinstance check if boundary passed as None since we know that it will not be a bytes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would complicate (or at least worsen code beauty) of the next part when we try to catch encoding errors.

self._boundary = boundary
else:
self._boundary = boundary.encode('ascii')
except (UnicodeEncodeError, UnicodeDecodeError):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend to minimize exception catch scope.

@FichteFoll
Copy link
Contributor Author

Regarding the header stuff, I think I can implement byte string headers and automatic conversion using ascii (and not utf-8!) reasonably by converting unicode strings in Payload's __init__, if necessary, and then changing MultipartWriter.append_payload to expect the headers to be byte strings already. I haven't looked at where else I would need to implement this especially regarding the main request/response's and would rather not persue that within this PR, but Payload seems to only be used by multipart related code (and its headers property is only accessed within MultipartWriter).

Regarding the motivation of that suggested change:
To my knowledge, the entire HEAD part (also in multipart) of HTTP is byte-based and only characters in the ASCII range are used by indicators (i.e. header names, values in the IANA encoding registry, etc.) while other bytes >=0x80 may be used in some locations, for example in the boundary parameter here. A such, I believe this internal representation should be reached as early as possible and only public API should accept unicode strings in the ASCII range because those translate to a single byte encoding transparently and to maintain backwards compatiblity, obviously. It's also more intuitive for most users because byte strings seem like an abstract concept to many, I think.

Also, what about my CONTIBUTORS.txt question?

@fafhrd91
Copy link
Member

Aiohttp treats header value as an utf-8. I think a lot of servers does the same nowadays.

@asvetlov
Copy link
Member

Pseudonymin in CONTRIBUTORS -- it's up to you. You even can skip the step.
CONTRIBUTORS.txt is a kind of Hall of Fame. I you don't want to be here -- nobody insists.

@asvetlov
Copy link
Member

Yes, historically aiohttp uses utf-8 for headers with a backdoor for getting bytes for communication with malformed client/server, maybe we need a hooks for sending custom bytes too.

Anyway for me there is no reason for switching to byte-based headers, especially for boundary -- the value perfectly fits in ascii, isn't it?

@FichteFoll
Copy link
Contributor Author

No, technically bytes 0x80 to 0xff are allowed as well. When quoted, that is. It's the obs-text (or char?) rule in the abnf I included in the issue and in the comment for the helper function. (I'm on mobile, otherwise I'd quote it myself.)

Again, if you don't feel good about the weird conversion between bytes and unicode, I can remove it and it won't be backwards incompatible. It wouldn't allow any bytes higher than 0x7f, but the current implementation doesn't as well and it's perfectly backwards compatible. It just bothers me in a way that the specification isn't respected entirely.

@asvetlov
Copy link
Member

@FichteFoll don't get me wrong.
The PR exists for fixing current 100% correct implementation for sake of better compatibility with software which is in turn not 100% follows the specification.
I have a guts feeling that allowing non-ascii boundary will break other software (but pretty good from RFC). Let's restrict boundary to ascii on multipart message generation -- this way is:

  1. Backward compatible with current aiohttp implementation
  2. Cannot break other software

We can return to discussion if somebody will complain aiohttp in not supporting non-ascii boundaries.

@FichteFoll
Copy link
Contributor Author

FichteFoll commented Nov 24, 2017

I agree with this sentiment. Will adjust the PR tomorrow.

Do you want me to squash? Otherwise I'd just commit on top of this so we can keep the other commit in case this ever comes up. It might end up being useful, maybe.

@asvetlov
Copy link
Member

Just merge master and add required changes.
PR is squashed before merging, PR's history doesn't matter.


invalid_qdtext_char_regex = br"[\x00-\x08\x0A-\x1F\x7F]"
if re.match(invalid_qdtext_char_regex, value):
raise ValueError("parameter value contains invalid characters")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The line is not covered by tests

Also
- fix an error when escaping backslashes in the quoted parameter,
- cache compiled regular expressions (by aiohttp standards), and
- cover all lines in tests.
There is little use in having this property public. If a use case
arises, it could easily be made public in the future.
@asvetlov
Copy link
Member

@kxepal please review again

Copy link
Member

@kxepal kxepal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGFM, but there are two bits remains that concerns me.

if re.match(self._valid_tchar_regex, value):
return value.decode('ascii') # cannot fail

if re.match(self._invalid_qdtext_char_regex, value):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you mean re.search. Because match will not match anything if string starts with valid chars.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call. Fixed that and another issue with the previous regex not considering that $ also matches before a newline.

def test_bad_boundary(self):
with self.assertRaises(ValueError):
aiohttp.multipart.MultipartWriter(boundary='тест')
with self.assertRaises(ValueError):
aiohttp.multipart.MultipartWriter(boundary='\n')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about boundary=test\n? (;

@kxepal
Copy link
Member

kxepal commented Dec 3, 2017

@FichteFoll, Thanks! @asvetlov?

.. attribute:: boundary

The byte-string representation of the boundary.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type bothers me.
We pass optional str value into constructor but the property is byteish.
Maybe the method should return self._boundary.decode('ascii')?
It is a breaking change but we could do it in aiohttp 3.0.
Amount of affected user code is very low IMHO.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I see.
But maybe now is the time for behavior changing?
Current one is confusing a little.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, imo everything should be saved as byte-strings anyway (as I explained before), so having the boundary as a byte-string doesn't seem far off. It's only used that way, too, so either we'd add an additional property that converts the boundary from a unicode to a byte string, or we encode it at every point we need it (not a fan).

I'd say we leave this for now. I would rather spend time to convert all header strings to byte-strings rather than unwrapping the boundary here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fear it was bytes all the time. Let's change that for 3.0! This could be done with separate PR with breaking change notice.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, let's merge.
Technically single Pull Request can have several records in CHANGES/ but I'm ok with separate PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@FichteFoll I have a feeling that all headers should be either strings or bytes.
All means really all, including request.headers etc.
This change in too intrusive, the ship has sailed many years ago.
Also I think 99% aiohttp users are happy with string headers but byte header values will confuse them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that this should ideally be standardized on a single header type. For compatibility, all unicode strings could be converted to byte-strings using ASCII encoding, so that the outgoing API doesn't change much except that it allows byte-strings for all headers too.

However, this goes way beyond the scope of this PR and would be a large architectural consideration that I'm sure you're going to have a much better time deciding on. It might be something worth considering for the next major release, i.e. 3.0, as long as someone puts in the work.

@asvetlov asvetlov merged commit 52dc7f0 into aio-libs:master Dec 4, 2017
@asvetlov
Copy link
Member

asvetlov commented Dec 4, 2017

@FichteFoll thanks!

@asvetlov asvetlov added this to the 3.0 milestone Dec 5, 2017
@lock
Copy link

lock bot commented Oct 28, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a [new issue] for related bugs.
If you feel like there's important points made in this discussion, please include those exceprts into that [new issue].
[new issue]: https://github.com/aio-libs/aiohttp/issues/new

@lock lock bot added the outdated label Oct 28, 2019
@lock lock bot locked as resolved and limited conversation to collaborators Oct 28, 2019
@psf-chronographer psf-chronographer bot added the bot:chronographer:provided There is a change note present in this PR label Oct 28, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bot:chronographer:provided There is a change note present in this PR outdated
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Some servers don't understand quoted boundaries in multipart documents
5 participants