Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM migrates table with incorrect collation #3420

Closed
dveeden opened this issue Nov 11, 2021 · 3 comments · Fixed by #3575, #3769, pingcap/tidb#30531, pingcap/tidb#31114 or #4121
Closed

DM migrates table with incorrect collation #3420

dveeden opened this issue Nov 11, 2021 · 3 comments · Fixed by #3575, #3769, pingcap/tidb#30531, pingcap/tidb#31114 or #4121
Assignees
Labels
affects-5.3 area/dm Issues or PRs related to DM. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. severity/major type/bug The issue is confirmed as a bug.

Comments

@dveeden
Copy link
Contributor

dveeden commented Nov 11, 2021

What did you do?

On MySQL 5.7.36:

mysql-5.7.36-log [test] > show create table t1\G
*************************** 1. row ***************************
       Table: t1
Create Table: CREATE TABLE `t1` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `c1` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `id` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=utf8mb4
1 row in set (0.00 sec)

mysql-5.7.36-log [test] > SELECT id,c1,c1='test🔥🔥🔥',c1='Test🔥🔥🔥' FROM t1;
+----+------------------+--------------+--------------+
| id | c1               | c1='test???' | c1='Test???' |
+----+------------------+--------------+--------------+
|  1 | test🔥🔥🔥             |            1 |            1 |
+----+------------------+--------------+--------------+
1 row in set (0.00 sec)

mysql-5.7.36-log [test] > SHOW SESSION VARIABLES LIKE 'character\_set\_%';
+--------------------------+---------+
| Variable_name            | Value   |
+--------------------------+---------+
| character_set_client     | utf8mb4 |
| character_set_connection | utf8mb4 |
| character_set_database   | utf8mb4 |
| character_set_filesystem | binary  |
| character_set_results    | utf8mb4 |
| character_set_server     | utf8mb4 |
| character_set_system     | utf8    |
+--------------------------+---------+
7 rows in set (0.00 sec)

mysql-5.7.36-log [test] > SHOW SESSION VARIABLES LIKE 'collation_%';
+----------------------+--------------------+
| Variable_name        | Value              |
+----------------------+--------------------+
| collation_connection | utf8mb4_general_ci |
| collation_database   | utf8mb4_general_ci |
| collation_server     | utf8mb4_general_ci |
+----------------------+--------------------+
3 rows in set (0.00 sec)

On TiDB 5.2.1:

tidb-5.7.25-TiDB-v5.2.1 [test] > show create table t1\G
*************************** 1. row ***************************
       Table: t1
Create Table: CREATE TABLE `t1` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `c1` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`id`) /*T![clustered_index] CLUSTERED */,
  UNIQUE KEY `id` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin AUTO_INCREMENT=30002
1 row in set (0.17 sec)

tidb-5.7.25-TiDB-v5.2.1 [test] > SELECT id,c1,c1='test🔥🔥🔥',c1='Test🔥🔥🔥' FROM t1;
+----+------------------+-----------------------+-----------------------+
| id | c1               | c1='test🔥🔥🔥'             | c1='Test🔥🔥🔥'             |
+----+------------------+-----------------------+-----------------------+
|  1 | test🔥🔥🔥             |                     1 |                     0 |
+----+------------------+-----------------------+-----------------------+
1 row in set (0.17 sec)

tidb-5.7.25-TiDB-v5.2.1 [test] > SHOW SESSION VARIABLES LIKE 'character\_set\_%';
+--------------------------+---------+
| Variable_name            | Value   |
+--------------------------+---------+
| character_set_client     | utf8mb4 |
| character_set_connection | utf8mb4 |
| character_set_database   | utf8mb4 |
| character_set_filesystem | binary  |
| character_set_results    | utf8mb4 |
| character_set_server     | utf8mb4 |
| character_set_system     | utf8    |
+--------------------------+---------+
7 rows in set (0.18 sec)

tidb-5.7.25-TiDB-v5.2.1 [test] > SHOW SESSION VARIABLES LIKE 'collation_%';
+----------------------+-------------+
| Variable_name        | Value       |
+----------------------+-------------+
| collation_connection | utf8mb4_bin |
| collation_database   | utf8mb4_bin |
| collation_server     | utf8mb4_bin |
+----------------------+-------------+
3 rows in set (0.18 sec)

What did you expect to see?

I would expect the collation for the c1 column on the target (TiDB) to match the source (MySQL 5.7).

mysql-5.7.36-log [test] > SELECT COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME FROM information_schema.columns WHERE TABLE_SCHEMA='test' AND TABLE_NAME='t1';
+-------------+--------------------+--------------------+
| COLUMN_NAME | CHARACTER_SET_NAME | COLLATION_NAME     |
+-------------+--------------------+--------------------+
| id          | NULL               | NULL               |
| c1          | utf8mb4            | utf8mb4_general_ci |
+-------------+--------------------+--------------------+
2 rows in set (0.00 sec)
tidb-5.7.25-TiDB-v5.2.1 [test] > SELECT COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME FROM information_schema.columns WHERE TABLE_SCHEMA='test' AND TABLE_NAME='t1';
+-------------+--------------------+----------------+
| COLUMN_NAME | CHARACTER_SET_NAME | COLLATION_NAME |
+-------------+--------------------+----------------+
| id          | NULL               | NULL           |
| c1          | utf8mb4            | utf8mb4_bin    |
+-------------+--------------------+----------------+
2 rows in set (0.19 sec)

tidb-5.7.25-TiDB-v5.2.1 [test] > SHOW COLLATION WHERE Charset='utf8mb4';
+--------------------+---------+------+---------+----------+---------+
| Collation          | Charset | Id   | Default | Compiled | Sortlen |
+--------------------+---------+------+---------+----------+---------+
| utf8mb4_bin        | utf8mb4 |   46 | Yes     | Yes      |       1 |
| utf8mb4_general_ci | utf8mb4 |   45 |         | Yes      |       1 |
| utf8mb4_unicode_ci | utf8mb4 |  224 |         | Yes      |       1 |
+--------------------+---------+------+---------+----------+---------+
3 rows in set (0.17 sec)

tidb-5.7.25-TiDB-v5.2.1 [test] > SELECT VARIABLE_VALUE FROM mysql.tidb WHERE VARIABLE_NAME='new_collation_enabled';
+----------------+
| VARIABLE_VALUE |
+----------------+
| True           |
+----------------+
1 row in set (0.16 sec)

What did you see instead?

The default collation for utf8mb4.

Versions of the cluster

DM version (run dmctl -V or dm-worker -V or dm-master -V):

[dvaneeden@dve-carbon ~]$ /home/dvaneeden/dev/pingcap/dm/bin/dm-worker -V
Release Version: v2.0.0-beta.2-562-g63f57436
Git Commit Hash: 63f574362d0aa1452e43eca3feab4a9ca2d32ee2
Git Branch: master
UTC Build Time: 2021-11-11 10:46:55
Go Version: go version go1.16.8 linux/amd64

Upstream MySQL/MariaDB server version:

mysql-5.7.36-log [test] > SELECT VERSION();
+------------+
| VERSION()  |
+------------+
| 5.7.36-log |
+------------+
1 row in set (0.00 sec)

Downstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):

tidb-5.7.25-TiDB-v5.2.1 [test] > SELECT tidb_version()\G
*************************** 1. row ***************************
tidb_version(): Release Version: v5.2.1
Edition: Community
Git Commit Hash: cd8fb24c5f7ebd9d479ed228bb41848bd5e97445
Git Branch: heads/refs/tags/v5.2.1
UTC Build Time: 2021-09-08 02:32:56
GoVersion: go1.16.4
Race Enabled: false
TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306
Check Table Before Drop: false
1 row in set (0.17 sec)

How did you deploy DM: tiup or manually?

manually

Other interesting information (system version, hardware config, etc):

  • MySQL is not using GTID

current status of DM cluster (execute query-status <task-name> in dmctl)

[dvaneeden@dve-carbon dm_charset]$ tiup dmctl --master-addr=127.0.0.1:8261 query-status testmig
Starting component `dmctl`: /home/dvaneeden/.tiup/components/dmctl/v2.0.7/dmctl/dmctl --master-addr=127.0.0.1:8261 query-status testmig
{
    "result": true,
    "msg": "",
    "sources": [
        {
            "result": true,
            "msg": "",
            "sourceStatus": {
                "source": "mysql57",
                "worker": "worker1",
                "result": null,
                "relayStatus": null
            },
            "subTaskStatus": [
                {
                    "name": "testmig",
                    "stage": "Running",
                    "unit": "Sync",
                    "result": null,
                    "unresolvedDDLLockID": "",
                    "sync": {
                        "totalEvents": "249",
                        "totalTps": "0",
                        "recentTps": "0",
                        "masterBinlog": "(dve-carbon-bin.000001, 2748)",
                        "masterBinlogGtid": "",
                        "syncerBinlog": "(dve-carbon-bin.000001, 2748)",
                        "syncerBinlogGtid": "",
                        "blockingDDLs": [
                        ],
                        "unresolvedGroups": [
                        ],
                        "synced": true,
                        "binlogType": "remote",
                        "secondsBehindMaster": "0"
                    }
                }
            ]
        }
    ]
}
@dveeden dveeden added type/bug The issue is confirmed as a bug. area/dm Issues or PRs related to DM. labels Nov 11, 2021
@dveeden
Copy link
Contributor Author

dveeden commented Nov 11, 2021

This is how I manually fixed this table. But with more tables and/or a constant stream of new tables this isn't a good workaround.

tidb-5.7.25-TiDB-v5.2.1 [test] > ALTER TABLE t1 MODIFY COLUMN c1 varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci DEFAULT NULL;
Query OK, 0 rows affected (0.27 sec)

tidb-5.7.25-TiDB-v5.2.1 [test] > SELECT id,c1,c1='test🔥🔥🔥',c1='Test🔥🔥🔥' FROM t1;
+----+------------------+-----------------------+-----------------------+
| id | c1               | c1='test🔥🔥🔥'             | c1='Test🔥🔥🔥'             |
+----+------------------+-----------------------+-----------------------+
|  1 | test🔥🔥🔥             |                     1 |                     1 |
+----+------------------+-----------------------+-----------------------+
1 row in set (0.17 sec)

tidb-5.7.25-TiDB-v5.2.1 [test] > SHOW CREATE TABLE t1\G
*************************** 1. row ***************************
       Table: t1
Create Table: CREATE TABLE `t1` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `c1` varchar(255) COLLATE utf8mb4_general_ci DEFAULT NULL,
  PRIMARY KEY (`id`) /*T![clustered_index] CLUSTERED */,
  UNIQUE KEY `id` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin AUTO_INCREMENT=30002
1 row in set (0.18 sec)

tidb-5.7.25-TiDB-v5.2.1 [test] > SELECT COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME FROM information_schema.columns WHERE TABLE_SCHEMA='test' AND TABLE_NAME='t1';
+-------------+--------------------+--------------------+
| COLUMN_NAME | CHARACTER_SET_NAME | COLLATION_NAME     |
+-------------+--------------------+--------------------+
| id          | NULL               | NULL               |
| c1          | utf8mb4            | utf8mb4_general_ci |
+-------------+--------------------+--------------------+
2 rows in set (0.18 sec)

@dveeden
Copy link
Contributor Author

dveeden commented Nov 11, 2021

Two small differences do work correctly:

mysql-5.7.36-log [test] > SHOW CREATE TABLE t2\G
*************************** 1. row ***************************
       Table: t2
Create Table: CREATE TABLE `t2` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `c1` varchar(255) CHARACTER SET utf8mb4 DEFAULT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `id` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
1 row in set (0.00 sec)

mysql-5.7.36-log [test] > SHOW CREATE TABLE t3\G
*************************** 1. row ***************************
       Table: t3
Create Table: CREATE TABLE `t3` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `c1` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `id` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
1 row in set (0.00 sec)

mysql-5.7.36-log [test] > SELECT COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME FROM information_schema.columns WHERE TABLE_SCHEMA='test' AND TABLE_NAME IN ('t2','t3');
+-------------+--------------------+--------------------+
| COLUMN_NAME | CHARACTER_SET_NAME | COLLATION_NAME     |
+-------------+--------------------+--------------------+
| id          | NULL               | NULL               |
| c1          | utf8mb4            | utf8mb4_general_ci |
| id          | NULL               | NULL               |
| c1          | utf8mb4            | utf8mb4_unicode_ci |
+-------------+--------------------+--------------------+
4 rows in set (0.00 sec)
tidb-5.7.25-TiDB-v5.2.1 [test] > SHOW CREATE TABLE t2\G
*************************** 1. row ***************************
       Table: t2
Create Table: CREATE TABLE `t2` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `c1` varchar(255) COLLATE utf8mb4_general_ci DEFAULT NULL,
  PRIMARY KEY (`id`) /*T![clustered_index] CLUSTERED */,
  UNIQUE KEY `id` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci AUTO_INCREMENT=30002
1 row in set (0.17 sec)

tidb-5.7.25-TiDB-v5.2.1 [test] > SHOW CREATE TABLE t3\G
*************************** 1. row ***************************
       Table: t3
Create Table: CREATE TABLE `t3` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `c1` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  PRIMARY KEY (`id`) /*T![clustered_index] CLUSTERED */,
  UNIQUE KEY `id` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci AUTO_INCREMENT=30002
1 row in set (0.17 sec)

tidb-5.7.25-TiDB-v5.2.1 [test] > SELECT COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME FROM information_schema.columns WHERE TABLE_SCHEMA='test' AND TABLE_NAME IN ('t2','t3');
+-------------+--------------------+--------------------+
| COLUMN_NAME | CHARACTER_SET_NAME | COLLATION_NAME     |
+-------------+--------------------+--------------------+
| id          | NULL               | NULL               |
| c1          | utf8mb4            | utf8mb4_general_ci |
| id          | NULL               | NULL               |
| c1          | utf8mb4            | utf8mb4_unicode_ci |
+-------------+--------------------+--------------------+
4 rows in set (0.19 sec)

MySQL doesn't show character set and collation in the output of SHOW CREATE TABLE and might at some places rely on the default collation for the character set, which differs between MySQL and TiDB.

Related items:

@lance6716
Copy link
Contributor

seems this is related to "The default collations in TiDB (binary collations, with the suffix _bin) are different than the default collations in MySQL (typically general collations, with the suffix _general_ci)"

https://docs.pingcap.com/tidb/stable/character-set-and-collation

so DM should try to explicit set collation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment