async payload parsing stores all in memory #263

Closed
tumb1er opened this issue Jan 29, 2015 · 22 comments

@tumb1er
Contributor

tumb1er commented Jan 29, 2015

In aiohttp.web.Request.post():

        body = yield from self.read() # All payload has been read to memory
        content_charset = self.charset or 'utf-8'

        environ = {'REQUEST_METHOD': self.method,
                   'CONTENT_LENGTH': str(len(body)),
                   'QUERY_STRING': '',
                   'CONTENT_TYPE': self.headers.get(hdrs.CONTENT_TYPE)}
        # And here Multipart files are parsed and saved to temporary files
        fs = cgi.FieldStorage(fp=io.BytesIO(body),
                              environ=environ,
                              keep_blank_values=True,
                              encoding=content_charset)

So to handle a 10 GB multipart upload, aiohttp will hold 10 GB of the request body in RAM.

IMHO, the payload parser should also be asynchronous.
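
A minimal sketch of the kind of handler this would enable, written in the yield from style used throughout this thread (the chunk size and handler name are illustrative): instead of calling request.post(), the handler drains request.content, aiohttp's StreamReader, chunk by chunk, so memory stays bounded no matter how large the upload is.

import asyncio
import aiohttp.web

@asyncio.coroutine
def handle_upload(request):
    total = 0
    while True:
        chunk = yield from request.content.read(8192)  # at most 8 KiB at a time
        if not chunk:  # b'' means the payload is exhausted
            break
        total += len(chunk)
        # feed `chunk` into an incremental parser or write it to disk here
    return aiohttp.web.Response(text='received %d bytes' % total)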

@fafhrd91
Member

That is a trade-off for a simpler implementation. I'd like a streaming parser, but it requires a lot of work. For now it could be implemented in an external library and then included in aiohttp.web later, once the API is finalized.

@tumb1er
Contributor Author

tumb1er commented Jan 29, 2015

I have a proof-of-concept async parser, but there is no hook in the Request.post() method to swap in a different implementation.

@asvetlov
Member

IMHO it should be a new method, not .post().

@kxepal
Member

kxepal commented Jan 29, 2015

.post_async()?

@asvetlov
Member

The question is not the method name but the API: what object should the method return?

@tumb1er
Contributor Author

tumb1er commented Jan 30, 2015

There is no obvious reason to add a new method (the only difference would be the behavior of the cgi multipart parser versus "the new one", if it turns out to be buggy). Request.post() has a DoS bug: memory consumption grows 1-to-1 with a huge multipart request containing files (because of cgi.FieldStorage).
The interface of the Request.post() result is good enough to replace the cgi implementation with an asynchronous one, e.g. https://github.com/andrew-d/python-multipart.
After the new async post() completes, the request should contain binary data for the form fields and open file descriptors for file fields that have been fully saved to temporary files.
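
A hedged sketch of the result shape described above, keeping the existing Request.post() interface (the field names 'comment' and 'upload' are illustrative, not part of any API): plain fields come back as in-memory values, file fields come back already spooled to temporary files behind an open file object.

import asyncio
import aiohttp.web

@asyncio.coroutine
def handle_form(request):
    data = yield from request.post()   # parsing would happen asynchronously
    comment = data['comment']          # plain field: value kept in memory
    upload = data['upload']            # file field: body already spooled to a temp file
    upload.file.seek(0)                # open file object backed by that temp file
    head = upload.file.read(1024)      # read from disk, not from a giant in-memory buffer
    return aiohttp.web.Response(
        text='comment=%r, %d bytes of %s' % (comment, len(head), upload.filename))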

By the way, Request._post_files_cache is only written to, never read. Why?

@asvetlov
Member

  1. A new post() should return an object with async streaming access to the multipart content, not a dict of FileField instances (see the sketch after this list).
  2. Request._post_files_cache is required to hold a strong reference to the parsed cgi.FieldStorage; without that attribute, the garbage collector releases all the fields.
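
A hedged sketch of what point 1 could look like from a handler's perspective; multipart_stream(), next(), and read_chunk() are hypothetical placeholders for the streaming-access object being discussed, not aiohttp APIs.

import asyncio
import aiohttp.web

@asyncio.coroutine
def handle(request):
    parts = yield from request.multipart_stream()   # hypothetical streaming-access object
    seen = {}                                       # bytes received per field name
    while True:
        part = yield from parts.next()              # None once every part is consumed
        if part is None:
            break
        while True:
            chunk = yield from part.read_chunk(8192)  # the body arrives incrementally
            if not chunk:
                break
            seen[part.name] = seen.get(part.name, 0) + len(chunk)
    return aiohttp.web.Response(text='received %r' % seen)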

@tumb1er
Contributor Author

tumb1er commented Jan 30, 2015

  1. The current post() splits a request handler into "before payload reading has started" and "after the payload has been completely read and parsed". So for streaming multipart it is indeed a new method. But for such unusual handler behavior there is no need to provide a new interface; the handler should just parse request.content separately in its own mysterious way.
  2. But! If I do so, a DoS attack is still possible against every view that does yield from request.post().
  3. GC - OK, but it should be commented in the source code.

@asvetlov
Member

@tumb1er would you make a Pull Request?

@tumb1er
Contributor Author

tumb1er commented Jan 30, 2015

I'm not sure about a production-ready async form parser.
Found some candidates:

  1. https://github.com/andrew-d/python-multipart has one confirmed bug (I know how to fix it) and no author activity since 2013.
  2. https://github.com/defnull/multipart - five years without author activity, and it makes synchronous calls to inputstream.read, which might be a problem.
  3. My own async payload parser based on aiohttp.StreamParser. It works on correct requests, but has no tests and no error handling.

Options for a pull request:

  1. Add a new aiohttp dependency on a "supported" python-multipart, which is just a fork of the first library with the current bugfixes. The fastest way, but a new requirement for aiohttp.
  2. Import python-multipart as a part of aiohttp. Also fast, but there is some redundancy between aiohttp's parsers and this library.
  3. Develop my own implementation until it passes all the tests from the published libraries and submit it as a pull request to aiohttp. The slowest, but the correct way :)

So, any suggestions from the aiohttp authors?

@kxepal
Member

kxepal commented Jan 30, 2015

@tumb1er I think I'll be ready to make a PR with my own multipart module this weekend. I'll take a look at the test cases of the projects you mentioned in case I've missed something.

@tumb1er
Contributor Author

tumb1er commented Jan 30, 2015

@kxepal it is not only multipart; all those projects also parse x-www-form-urlencoded data and octet-stream.

@kxepal
Member

kxepal commented Jan 30, 2015

These MIME types describe different data and different ways of handling it. That's a separate issue.

@tumb1er
Contributor Author

tumb1er commented Jan 30, 2015

@kxepal but the current aiohttp version has a single parser for all of these, and it is vulnerable to a DoS attack for the urlencoded content type as well.
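
A hedged mitigation sketch for the urlencoded case (the limit, handler, and behavior are illustrative, not something aiohttp did at the time): check Content-Length before calling request.post() and reject oversized or unsized bodies.

import asyncio
import aiohttp.web

MAX_FORM_SIZE = 1024 * 1024  # 1 MiB, an illustrative cap

@asyncio.coroutine
def handle_small_form(request):
    length = request.headers.get('CONTENT-LENGTH')
    # conservatively reject unsized (e.g. chunked) bodies as well as oversized ones
    if length is None or int(length) > MAX_FORM_SIZE:
        return aiohttp.web.Response(status=413, text='payload too large')
    data = yield from request.post()   # safe to buffer: the size is bounded
    return aiohttp.web.Response(text='got %d fields' % len(data))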

@kxepal
Member

kxepal commented Jan 30, 2015

@tumb1er one single parser for everything is not a good thing. If I ask for x-www-form-urlencoded data, I don't want it to accidentally mutate into a multipart thing. Let's keep things separate. And this discussion is for another issue (:

tumb1er added commits to tumb1er/aiohttp that referenced this issue Feb 2, 2015
@rdbhost

rdbhost commented Apr 15, 2015

I need a streaming upload parser, too, and will create a pull request for it.

How about this API?

A new method added to aiohttp.Request:

def next_post_var(self):
    """Returns a named tuple (name, mime_type, stream), where name and mime_type
    are strings and stream is an asyncio.StreamReader.

    Raises an exception if the previous part's stream has not been exhausted first.
    """

Usage example:

nextPostVar = yield from req.next_post_var()
while nextPostVar:
    if nextPostVar.name in useful_vars:
        yield from do_something(nextPostVar.name, nextPostVar.stream)
    else:
        yield from nextPostVar.stream.read()   # waste unwanted var
    nextPostVar = yield from req.next_post_var()

@fafhrd91
Member

Is it for the client or for the server?

@kxepal
Member

kxepal commented Apr 15, 2015

@rdbhost is it about multipart? It already supports streaming (and only streaming) parsing and uploads.

@rdbhost

rdbhost commented Apr 16, 2015

It is for the server, and would handle 'multipart/form-data' at least, and possibly 'url-encoded' and others.

@fafhrd91
Member

@rdbhost you should look at the client multipart support; I'd prefer to extend it.

@rdbhost

rdbhost commented Apr 16, 2015

@fafhrd91

Yeah, I agree.

Until this morning I was unaware of the multipart module, but looking it over today, it looks pretty extensive and would be at least a good start for server-side streaming body parsing.
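
A hedged sketch of driving that multipart module on the server side, assuming a reader built from the request headers and the request.content stream (names follow aiohttp.multipart, but the exact API available at the time may differ):

import asyncio
import tempfile
import aiohttp.web
from aiohttp.multipart import MultipartReader

@asyncio.coroutine
def handle(request):
    reader = MultipartReader(request.headers, request.content)
    fields = {}
    while True:
        part = yield from reader.next()   # None when the body is exhausted
        if part is None:
            break
        if part.filename:                 # file field: spool to disk chunk by chunk
            with tempfile.NamedTemporaryFile(delete=False) as dst:
                while True:
                    chunk = yield from part.read_chunk(8192)
                    if not chunk:
                        break
                    dst.write(chunk)
            fields[part.name] = dst.name  # path of the temporary file
        else:                             # plain field: small, read it whole
            fields[part.name] = yield from part.text()
    return aiohttp.web.Response(text='parsed fields: %r' % fields)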

tumb1er closed this as completed Jan 25, 2016
tumb1er added commits to tumb1er/aiohttp that referenced this issue Jan 27, 2016
@lock

lock bot commented Oct 29, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

If you feel there are important points made in this discussion, please include those excerpts in that new issue.

lock bot added the outdated label Oct 29, 2019
lock bot locked as resolved and limited conversation to collaborators Oct 29, 2019