tidb: add description about GB18030 #18662

Open. Wants to merge 10 commits into base branch master.

CbcWestwolf (Member) commented Sep 19, 2024

What is changed, added or deleted? (Required)

Add description about GB18030

Which TiDB version(s) do your changes apply to? (Required)

Tips for choosing the affected version(s):

By default, CHOOSE MASTER ONLY so your changes will be applied to the next TiDB major or minor releases. If your PR involves a product feature behavior change or a compatibility change, CHOOSE THE AFFECTED RELEASE BRANCH(ES) AND MASTER.

For details, see tips for choosing the affected versions (in Chinese).

  • master (the latest development version)
  • v9.0 (TiDB 9.0 versions)
  • v8.5 (TiDB 8.5 versions)
  • v8.4 (TiDB 8.4 versions)
  • v8.3 (TiDB 8.3 versions)
  • v8.2 (TiDB 8.2 versions)
  • v8.1 (TiDB 8.1 versions)
  • v7.5 (TiDB 7.5 versions)
  • v7.1 (TiDB 7.1 versions)
  • v6.5 (TiDB 6.5 versions)
  • v6.1 (TiDB 6.1 versions)
  • v5.4 (TiDB 5.4 versions)
  • v5.3 (TiDB 5.3 versions)

What is the related PR or file link(s)?

Do your changes match any of the following descriptions?

  • Delete files
  • Change aliases
  • Need modification after applied to another branch
  • Might cause conflicts after applied to another branch

ti-chi-bot bot added the missing-translation-status and size/XL labels on Sep 19, 2024
lilin90 self-assigned this on Sep 20, 2024
lilin90 added the v8.4 and translation/doing labels on Sep 20, 2024
ti-chi-bot bot removed the missing-translation-status label on Sep 20, 2024
lilin90 requested review from lilin90 and hfxsd on September 23, 2024 03:52
lilin90 added the do-not-merge/hold label on Sep 23, 2024
ti-chi-bot bot added the needs-1-more-lgtm label on Sep 23, 2024

ti-chi-bot bot commented Sep 23, 2024

[LGTM Timeline notifier]

Timeline:

  • 2024-09-23 13:32:44.88699319 +0000 UTC m=+1486434.627417141: ☑️ agreed by hfxsd.

Co-authored-by: xixirangrang <hfxsd@hotmail.com>
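If the new collation framework is enabled, then in addition to the binary collations, two case-insensitive and accent-insensitive collations, `utf8_general_ci` and `utf8mb4_general_ci`, are also supported
If the new collation framework is enabled, then in addition to the binary collations, several case-insensitive and accent-insensitive collations are also supported

Member commented:

Could "several" here be made more specific?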

Member Author commented:

Fixed by 63942e5

CbcWestwolf and others added 2 commits September 25, 2024 10:31
Co-authored-by: Lilian Lee <lilin@pingcap.com>
lilin90 removed the v8.4 label on Oct 8, 2024

shawn0915 (Contributor) commented:

@CbcWestwolf Does GB18030 here refer to GB 18030-2022?

CbcWestwolf (Member Author) commented Oct 30, 2024

> Does GB18030 here refer to GB 18030-2022?

Yes, @shawn0915.

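shawn0915 (Contributor) commented:

> > Does GB18030 here refer to GB 18030-2022?
>
> Yes, @shawn0915.

Is this a feature that was previously supported in the Enterprise Edition? Should the docs state explicitly that this is GB 18030-2022? After all, it differs from the older national standard.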

lilin90 added the v9.0 label on Dec 30, 2024
lilin90 requested a review from Benjamin2037 on December 30, 2024 06:46

lilin90 (Member) commented Dec 30, 2024

@CbcWestwolf Would you please update the corresponding descriptions in this PR from v8.4 to v9.0, and add a description that addresses #18662 (comment)?

lilin90 requested a review from hfxsd on December 30, 2024 09:40

CbcWestwolf (Member Author) commented:

@lilin90 Sure, let me update it after merging the code.

lilin90 assigned hfxsd and unassigned lilin90 on Jan 6, 2025

ti-chi-bot bot commented Jan 6, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from hfxsd, ensuring that each of them provides their approval before proceeding. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


When any of `utf8_general_ci`, `utf8mb4_general_ci`, `utf8_unicode_ci`, `utf8mb4_unicode_ci`, `utf8mb4_0900_ai_ci`, and `gbk_chinese_ci` is used, string comparison is case-insensitive and accent-insensitive. At the same time, TiDB also corrects the `PADDING` behavior of the collations:
When any of `utf8_general_ci`, `utf8mb4_general_ci`, `utf8_unicode_ci`, `utf8mb4_unicode_ci`, `utf8mb4_0900_ai_ci`, `gbk_chinese_ci`, and `gb18030_chinese_ci` is used, string comparison is case-insensitive and accent-insensitive. At the same time, TiDB also corrects the `PADDING` behavior of the collations:

Collaborator commented:

Should `gbk_chinese_ci` and `gb18030_chinese_ci` be added to the priority ordering at L580?

* When the system variables [`character_set_client`](/system-variables.md#character_set_client) and [`character_set_connection`](/system-variables.md#character_set_connection) are not both set to `gb18030`, TiDB handles illegal characters in the same way as MySQL.
* When `character_set_client` and `character_set_connection` are both set to `gb18030`, TiDB handles illegal characters differently from MySQL in the following ways:

- MySQL handles illegal GB18030 characters differently for read operations and write operations.

hfxsd (Collaborator) commented Jan 10, 2025

Should we describe how each case is handled? Or provide a link to the relevant MySQL content.

Co-authored-by: xixirangrang <hfxsd@hotmail.com>
hfxsd requested a review from Frank945946 on January 13, 2025 01:54

TiDB and MySQL differ in the default collation of the `gb18030` character set, as follows:

- The default collation of the `gb18030` character set in TiDB is `gb18030_bin`.

Member Author commented:

TODO: Update the description of the default value

+-------------+---------+-----+---------+----------+---------+---------------+
| Collation   | Charset | Id  | Default | Compiled | Sortlen | Pad_attribute |
+-------------+---------+-----+---------+----------+---------+---------------+
| gb18030_bin | gb18030 | 249 | Yes     | Yes      |       1 | PAD SPACE     |
+-------------+---------+-----+---------+----------+---------+---------------+

Collaborator commented:

@CbcWestwolf Is the `gb18030_chinese_ci` collation missing from this result?

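- The default collation of the `gb18030` character set in MySQL is `gb18030_chinese_ci`.
- The `gb18030_bin` collation supported by TiDB is implemented differently from the `gb18030_bin` collation supported by MySQL: TiDB converts `gb18030` data to `utf8mb4` and then performs a binary sort.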
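
To illustrate why converting to `utf8mb4` before a binary sort can change the resulting order, here is a minimal Python sketch. It is not TiDB's or MySQL's implementation. It assumes Python's built-in `gb18030` codec (which may not track GB 18030-2022), and the byte-wise GB18030 comparison is only a hypothetical point of contrast. The helper names `gb18030_binary_key` and `utf8mb4_binary_key` are made up for this example.

```python
# Illustrative sketch only: neither TiDB's nor MySQL's actual implementation.
# Assumption: Python's built-in "gb18030" codec is close enough for a demo.


def gb18030_binary_key(text: str) -> bytes:
    """Hypothetical sort key: the raw GB18030 bytes of the string."""
    return text.encode("gb18030")


def utf8mb4_binary_key(text: str) -> bytes:
    """Sort key mimicking 'convert to utf8mb4, then compare bytes'."""
    return text.encode("utf-8")


# Two common Chinese characters: U+4E00 and U+4E01.
words = ["一", "丁"]

by_gb18030_bytes = sorted(words, key=gb18030_binary_key)
by_utf8mb4_bytes = sorted(words, key=utf8mb4_binary_key)

print("byte order on GB18030 encoding:", by_gb18030_bytes)
print("byte order on UTF-8 encoding:  ", by_utf8mb4_bytes)
```

For this pair the two keys produce opposite orders, because the legacy GB2312 region of GB18030 is arranged by pinyin while UTF-8 byte order follows Unicode code points. Any application that relies on a specific binary order should verify it against the target database rather than against this sketch.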

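To make TiDB compatible with the collations of the MySQL GB18030 character set, you need to set the TiDB configuration item [`new_collations_enabled_on_first_bootstrap`](/tidb-configuration-file.md#new_collations_enabled_on_first_bootstrap) to `true` when you first initialize the TiDB cluster, which enables the [new collation framework](/character-set-and-collation.md#新框架下的排序规则支持). After the new collation framework is enabled, checking the collations of the GB18030 character set shows that the default TiDB GB18030 collation has switched to `gb18030_chinese_ci`.

Collaborator commented:

@CbcWestwolf @hfxsd As far as I can see, the default value is already `true`, so new clusters will not have this problem. Do we only need to emphasize that if an earlier version was set to `false`, this configuration item must be manually set to `true` after upgrading to a later version?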

Comment on lines +44 to +45
- The default collation of the `gb18030` character set in TiDB is `gb18030_bin`.
- The default collation of the `gb18030` character set in MySQL is `gb18030_chinese_ci`.

Collaborator commented:

@CbcWestwolf Why is this not kept consistent with MySQL? If a user's default character set in MySQL is gb18030 and no collation is explicitly specified, the collation becomes gb18030_bin after migrating to TiDB. Would that cause problems? Can the behavior be kept consistent to avoid issues like this?

Comment on lines +20 to +24
+-----------+---------+----+---------+----------+---------+---------------+
| Collation | Charset | Id | Default | Compiled | Sortlen | Pad_attribute |
+-----------+---------+----+---------+----------+---------+---------------+
| gbk_bin   | gbk     | 87 | Yes     | Yes      |       1 | PAD SPACE     |
+-----------+---------+----+---------+----------+---------+---------------+

Collaborator commented:

Is the `gbk_chinese_ci` collation also missing from the returned result here?

Labels

  • do-not-merge/hold: Indicates that a PR should not merge because someone has issued a /hold command.
  • needs-1-more-lgtm: Indicates a PR needs 1 more LGTM.
  • size/XL: Denotes a PR that changes 500-999 lines, ignoring generated files.
  • translation/doing: This PR’s assignee is translating this PR.
  • v9.0

5 participants