Double-byte (or larger) UTF-8 strings are encoded with the wrong size. #8

cforger · 2014-01-15T00:43:52Z

Hello,

Thanks for your work on this, it's been most useful for me.

I have encountered one error and fixed it, so I'm passing it on to see if you agree it's an error, and if it needs inclusion as a revision.

Currently I'm moving contact data between programs, and it's crashing on the decode of some French names.

The problem is in _pack_string. It's calculating the length of the string before it's encoded to UTF-8.

I think you must encode the string before you find the length of it, as some characters need to encode as double-byte or longer.

An example would be the French name Allagbé, or the French word précédent, where the 'é' is an e with acute accent (U+00E9, http://www.fileformat.info/info/unicode/char/e9/index.htm).

Python's encoder makes this b'Allagb\xc3\xa9', which is one byte longer than the original string.
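The mismatch is easy to see in an interpreter. This is a minimal sketch, assuming Python 3 (where str holds Unicode text); the variable names are only for illustration and do not come from u-msgpack:

```python
# "Allagbé", written with an escape so the example does not depend on source encoding
name = "Allagb\u00e9"
encoded = name.encode("utf-8")

char_count = len(name)      # number of Unicode code points in the text
byte_count = len(encoded)   # number of bytes that actually go on the wire

print(char_count, byte_count)
print(encoded)
```

The msgpack header has to carry the second number, not the first.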

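u-msgpack encodes this as b'\xa7Allagb\xc3\xa9'. Notice that the header byte \xa7 declares a length of only 7 bytes, while the UTF-8 payload is 8 bytes long, so the trailing \xa9 is trimmed from the string when the msgpack is read back.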

When you feed this trimmed string through a .decode('utf-8') method, you'll crash with a Python error: 'utf-8' codec can't decode byte 0xc3 in position 6: unexpected end of data
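The failure can be reproduced without the library. The sketch below assumes Python 3 and mimics what a decoder does with a fixstr header (read the length from the low 5 bits, then take that many bytes); it is not u-msgpack's actual unpacking code:

```python
# Bytes as produced by the buggy packer: header 0xa7 says "7 bytes follow"
packed = b"\xa7Allagb\xc3\xa9"

declared_len = packed[0] & 0x1f          # low 5 bits of a fixstr header hold the length
payload = packed[1:1 + declared_len]     # decoder takes only the declared number of bytes

try:
    payload.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    # The two-byte sequence for 'é' was cut after its first byte (0xc3)
    error = exc

print(payload, error)
```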

The solution is to encode to UTF-8 before calculating the string length, as detailed below:

import struct

def _pack_string(x):
    # Encode first, so every length below counts UTF-8 bytes, not characters
    x = x.encode('utf-8')
    if len(x) <= 31:
        return struct.pack("B", 0xa0 | len(x)) + x
    elif len(x) <= 2**8-1:
        return b"\xd9" + struct.pack("B", len(x)) + x
    elif len(x) <= 2**16-1:
        return b"\xda" + struct.pack(">H", len(x)) + x
    elif len(x) <= 2**32-1:
        return b"\xdb" + struct.pack(">I", len(x)) + x
    else:
        raise UnsupportedTypeException("huge string")

With this patch in place, I am able to pass all French names in u-msgpack without error.
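A quick round-trip check of the idea, as a self-contained sketch assuming Python 3. Here pack_string mirrors the patched function but raises ValueError instead of the library's exception, and unpack_fixstr is a hypothetical helper that only handles the short (up to 31 bytes) form; neither is u-msgpack's real API:

```python
import struct

def pack_string(text):
    """Pack a str as a msgpack string, sizing the header from the UTF-8 byte count."""
    data = text.encode("utf-8")
    n = len(data)
    if n <= 31:
        return struct.pack("B", 0xa0 | n) + data
    elif n <= 2**8 - 1:
        return b"\xd9" + struct.pack("B", n) + data
    elif n <= 2**16 - 1:
        return b"\xda" + struct.pack(">H", n) + data
    elif n <= 2**32 - 1:
        return b"\xdb" + struct.pack(">I", n) + data
    raise ValueError("string too long for msgpack")

def unpack_fixstr(packed):
    """Hypothetical helper: decode only the fixstr form (payload of at most 31 bytes)."""
    n = packed[0] & 0x1f
    return packed[1:1 + n].decode("utf-8")

# Both words from the report survive a pack/unpack round trip
for word in ("Allagb\u00e9", "pr\u00e9c\u00e9dent"):
    assert unpack_fixstr(pack_string(word)) == word
```

With the length taken after encoding, the header for "Allagbé" becomes \xa8 (8 bytes) rather than \xa7.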

-end-

vsergeev · 2014-01-16T17:47:33Z

Hi cforger,

Thank you for this detailed report and this fix! You are completely correct about the bug, and it slipped past my test cases. I have added "Allagbé" to the unit tests and your corresponding fix to address it. I'll be releasing 1.6 later today primarily with this fix, but also with some other improvements (module docstrings, module version tuple). Thanks again!

Fix from cforger in GitHub issue #8.

vsergeev · 2014-01-17T09:04:23Z

Fixed with 3a0aa1b. New release under tag v1.6 and on PyPI (https://pypi.python.org/pypi/u-msgpack-python).

cforger · 2014-01-17T22:26:24Z

Thanks for your prompt attention.

vsergeev added a commit that referenced this issue Jan 17, 2014

fix wide char unicode string serialization

3a0aa1b

Fix from cforger in GitHub issue #8.

vsergeev closed this as completed Jan 17, 2014

pyup-bot mentioned this issue Dec 29, 2017

Pin msgpack-python to latest version 0.4.8 fake-name/ReadableWebProxy#24

Merged
