Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use of splitlines can lead to data loss on events with Unicode contents #9

Closed
davedoesdev opened this issue May 16, 2017 · 3 comments
Closed
Labels

Comments

@davedoesdev
Copy link
Contributor

davedoesdev commented May 16, 2017

The SSE spec (https://www.w3.org/TR/eventsource/), section 6, says:

stream        = [ bom ] *event
event         = *( comment / field ) end-of-line
comment       = colon *any-char end-of-line
field         = 1*name-char [ colon [ space ] *any-char ] end-of-line
end-of-line   = ( cr lf / cr / lf )

; characters
lf            = %x000A ; U+000A LINE FEED (LF)
cr            = %x000D ; U+000D CARRIAGE RETURN (CR)
space         = %x0020 ; U+0020 SPACE
colon         = %x003A ; U+003A COLON (:)
bom           = %xFEFF ; U+FEFF BYTE ORDER MARK
name-char     = %x0000-0009 / %x000B-000C / %x000E-0039 / %x003B-10FFFF
                ; a Unicode character other than U+000A LINE FEED (LF), U+000D CARRIAGE RETURN (CR), or U+003A COLON (:)
any-char      = %x0000-0009 / %x000B-000C / %x000E-10FFFF
                ; a Unicode character other than U+000A LINE FEED (LF) or U+000D CARRIAGE RETURN (CR)

i.e. lines are delimited by line feed or carriage return.

However, since _read returns Unicode strings, when you call splitlines on each chunk, it will be split using a much greater set of delimiters than just line feed and carriage return:

https://docs.python.org/3/library/stdtypes.html#str.splitlines

This can lead to data loss.

@davedoesdev
Copy link
Contributor Author

davedoesdev commented May 16, 2017

Another option, given #8, is to make _read return a byte string so that splitlines will only split on newline and carriage return. Then do the decoding when yielding event.

@mpetazzoni
Copy link
Owner

You're right; nice catch! I may not have time to look into it this week. Do you think that's something you could submit a PR for?

@mpetazzoni mpetazzoni added the bug label May 16, 2017
@mpetazzoni mpetazzoni changed the title splitlines doesn't confirm to spec Use of splitlines can lead to data loss on events with Unicode contents May 16, 2017
mpetazzoni added a commit that referenced this issue Jun 11, 2017
Issue #9 Decode later, to avoid decoding in the middle of utf8 sequence
@mpetazzoni
Copy link
Owner

Fixed by #10

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants