source-mysql: Fix replication of text columns with non-Unicode character sets #1951

willdonnelly · 2024-09-16T23:40:58Z

When a text column uses a non-Unicode charset (such as the latin1 default used by MySQL 5.7 and below), values are written into the binlog using that encoding and it's the responsibility of the reader to decode them appropriately. Currently our replication value decoding doesn't do that and we just assume the bytes are a valid UTF-8 string.

(For what it's worth, backfill queries appear to handle text with non-Unicode character sets just fine. The bug here is solely in how we handle text values in binlog events.)

We get away with this most of the time because the default character set in modern MySQL / MariaDB versions is a multi-byte UTF-8 encoding, and because in practice most text is ASCII. But when both of those assumptions are violated at the same time, even in a relatively central case like a MySQL 5.7 database with the most basic of accented Latinate characters (as in French/Spanish/etc text), we're pretty blatantly mangling the data.

Note that the capture of these columns isn't exactly "incorrect" in the usual way we use the word, as the text we capture is still a consistent representation of the original source data. The strings will just have all non-ASCII characters replaced with the U+FFFD REPLACEMENT CHARACTER and the ASCII code points captured correctly. The problem here is that this is basically never what the user actually wants and we have all the information we need to faithfully translate the non-ASCII code points to the appropriate Unicode equivalents.

We should improve on this by:

1. Obtaining the column character set or collation name for each text column during discovery.
2. Using this information to do a charset-aware []byte -> string conversion at the point where we currently just assume they're valid UTF-8 and cast to a string.
So long as we don't support the complete set of all MySQL charset / collation names, we should throw an error somewhere along the line when trying to capture a table with an unsupported collation rather than mistranslating the text.
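A rough sketch of what that conversion could look like, using only the standard library. The function name `decodeText` and the set of handled charsets are hypothetical, not the connector's actual code. One caveat is baked in as an assumption: MySQL's `latin1` follows cp1252 rather than ISO-8859-1 for bytes 0x80 through 0x9F, so this sketch refuses those bytes instead of guessing, and only maps the ranges where the byte value equals the Unicode code point.

```go
package main

import "fmt"

// decodeText converts raw binlog bytes into a Go string based on the
// column's character set name. Hypothetical helper for illustration only.
func decodeText(charset string, raw []byte) (string, error) {
	switch charset {
	case "utf8", "utf8mb3", "utf8mb4":
		// Already UTF-8 on the wire, so the cast is appropriate.
		return string(raw), nil
	case "latin1":
		runes := make([]rune, 0, len(raw))
		for _, b := range raw {
			// Bytes 0x80-0x9F need a cp1252 lookup table, which this
			// sketch omits; fail rather than mistranslate.
			if b >= 0x80 && b <= 0x9F {
				return "", fmt.Errorf("latin1 byte 0x%02X not handled by this sketch", b)
			}
			// For ASCII and 0xA0-0xFF the byte value is the code point.
			runes = append(runes, rune(b))
		}
		return string(runes), nil
	default:
		// Unsupported charsets are an error, never a silent best effort.
		return "", fmt.Errorf("unsupported character set %q", charset)
	}
}

func main() {
	s, err := decodeText("latin1", []byte{'c', 'a', 'f', 0xE9})
	fmt.Println(s, err) // café <nil>

	_, err = decodeText("sjis", []byte{0x82, 0xA0})
	fmt.Println(err) // unsupported character set "sjis"
}
```

Returning an error from the default branch matches the last point above: a table with an unsupported charset fails loudly instead of being captured with mangled text.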


jgraettinger assigned willdonnelly Sep 17, 2024

willdonnelly added the change:planned label Sep 17, 2024

willdonnelly linked a pull request Sep 24, 2024 that will close this issue

source-mysql: Decode text correctly in non-UTF8 character sets #1979

Open
