UCS2 to UTF8MB4 Encoding error #1438

dwash96 · 2024-07-24T13:33:48Z

Not sure if this is expected for fixed-length to variable length charset conversions but for rows read from the binlog into the gho table during the migration, the utility does not do an equivalent character conversion, it just uses the bytes given with the new charset. I could see why this behavior might make sense (charsets may not necessarily have truly equivalent representations of the same displayed character?) and is a skill issue in my part but in case this is not expected behavior, the example I saw during my migration was as follows:

"M" in UCS2: 0x004D
"M" in UTF8MB4: 0x4D

I am in the process of fixing it in my DB by running UPDATE statements in the format:

{column} = IF(CAST({column} AS BINARY) LIKE CONCAT(0x00, '%'), CONVERT(CONVERT(CAST({column} AS BINARY) USING UCS2), USING UTF8MB4), {column})

With some appropriate filters to not have to attempt to update every row of the new table of course. Is this something that gh-ost should handle itself? I think the format above allows specifically UCS2 to UTF8, if it were ran against the gho table rows upon insertion/update from the binlog but it's probably not portable outside this specific case.

Either way, this is a fantastic tool and I really appreciate that you all made this. Thanks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

UCS2 to UTF8MB4 Encoding error #1438

UCS2 to UTF8MB4 Encoding error #1438

dwash96 commented Jul 24, 2024

UCS2 to UTF8MB4 Encoding error #1438

UCS2 to UTF8MB4 Encoding error #1438

Comments

dwash96 commented Jul 24, 2024