Double-byte (or larger) UTF-8 strings are encoded with the wrong size. #8

Closed
cforger opened this issue Jan 15, 2014 · 3 comments

Comments

@cforger

cforger commented Jan 15, 2014

Hello,

Thanks for your work on this, it's been most useful for me.

I have encountered one error and fixed it, so I'm passing it on to see if you agree it's an error, and if it needs inclusion as a revision.

Currently I'm moving contact data between programs, and it's crashing on the decode of some French names.

The problem is in _pack_string. It's calculating the length of the string before it's encoded to UTF-8.

I think you must encode the string before you find the length of it, as some characters need to encode as double-byte or longer.

An example would be the French name Allagbé, or the French word précédent, where the 'e' carries an acute accent (http://www.fileformat.info/info/unicode/char/e9/index.htm)

Python's encoder makes this b'Allagb\xc3\xa9', which is 8 bytes: one byte longer than the original 7-character string.

u-msgpack encodes this as b'\xa7Allagb\xc3\xa9' - notice how the \xa7 header declares the string as only 7 bytes long, so the trailing \xa9 byte is trimmed off when the msgpack is unpacked.

When you feed this trimmed string through a .decode('utf-8') method, you'll crash with a Python error: 'utf-8' codec can't decode byte 0xc3 in position 6: unexpected end of data
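The mismatch can be reproduced without the library. This is only a minimal sketch, assuming Python 3; the variable names are illustrative and the "buggy" line imitates the behaviour described above rather than quoting the library's source:

```python
import struct

name = "Allagb\u00e9"            # "Allagbé", 7 characters
encoded = name.encode("utf-8")   # the accented 'e' becomes two bytes

char_count = len(name)
byte_count = len(encoded)

# Imitation of the bug: the header uses the character count,
# but the payload is the full UTF-8 byte string.
buggy = struct.pack("B", 0xa0 | len(name)) + encoded

# A decoder trusts the header and reads only that many payload bytes.
declared = buggy[0] & 0x1f
payload = buggy[1:1 + declared]

error = None
try:
    payload.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc                  # the truncated sequence cannot be decoded
```

Here `char_count` and `byte_count` differ by one, and the payload the decoder sees ends in a lone 0xc3 byte, which is what triggers the error.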

The solution is to encode to UTF-8 before calculating the string length, as detailed below:

    def _pack_string(x):
        x = x.encode('utf-8')
        if len(x) <= 31:
            return struct.pack("B", 0xa0 | len(x)) + x
        elif len(x) <= 2**8-1:
            return b"\xd9" + struct.pack("B", len(x)) + x
        elif len(x) <= 2**16-1:
            return b"\xda" + struct.pack(">H", len(x)) + x
        elif len(x) <= 2**32-1:
            return b"\xdb" + struct.pack(">I", len(x)) + x
        else:
            raise UnsupportedTypeException("huge string")

With this patch in place, I am able to pass all French names in u-msgpack without error.
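For anyone who wants to check the idea in isolation, here is a rough standalone sketch, assuming Python 3. The names `pack_str` and `unpack_str` are hypothetical helpers written for this illustration, not part of u-msgpack, and `ValueError` stands in for the library's own exception. The unpacker handles only the four msgpack string headers:

```python
import struct

def pack_str(text):
    """Pack a str as a msgpack string, sizing the header from the UTF-8 bytes."""
    data = text.encode("utf-8")   # encode first, then measure
    n = len(data)
    if n <= 31:
        return struct.pack("B", 0xa0 | n) + data       # fixstr
    elif n <= 2**8 - 1:
        return b"\xd9" + struct.pack("B", n) + data    # str 8
    elif n <= 2**16 - 1:
        return b"\xda" + struct.pack(">H", n) + data   # str 16
    elif n <= 2**32 - 1:
        return b"\xdb" + struct.pack(">I", n) + data   # str 32
    raise ValueError("huge string")

def unpack_str(packed):
    """Hypothetical minimal decoder for the four string headers above."""
    first = packed[0]
    if first & 0xe0 == 0xa0:
        n, offset = first & 0x1f, 1
    elif first == 0xd9:
        n, offset = packed[1], 2
    elif first == 0xda:
        n, offset = struct.unpack(">H", packed[1:3])[0], 3
    elif first == 0xdb:
        n, offset = struct.unpack(">I", packed[1:5])[0], 5
    else:
        raise ValueError("not a msgpack string")
    return packed[offset:offset + n].decode("utf-8")

packed_name = pack_str("Allagb\u00e9")    # "Allagbé"
round_trip = unpack_str(packed_name)
```

One side effect worth noting: because the byte length decides the header, a string of 16 accented characters (32 bytes) no longer fits the short form and moves to the \xd9 header, even though it has fewer than 32 characters.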

-end-

@vsergeev
Owner

Hi cforger,

Thank you for this detailed report and this fix! You are completely correct about the bug, and it slipped past my test cases. I have added "Allagbé" to the unit tests and your corresponding fix to address it. I'll be releasing 1.6 later today primarily with this fix, but also with some other improvements (module docstrings, module version tuple). Thanks again!

vsergeev added a commit that referenced this issue Jan 17, 2014
Fix from cforger in GitHub issue #8.
@vsergeev
Owner

Fixed with 3a0aa1b. New release under tag v1.6 and on PyPI (https://pypi.python.org/pypi/u-msgpack-python).

@cforger
Author

cforger commented Jan 17, 2014

Thanks for your prompt attention.
