docs/design: add proposal for full collations support #14574

bb7133 · 2020-01-24T15:01:30Z

What problem does this PR solve?

This PR adds a proposal for full collations support in TiDB.

An issue has been created for the discussions on this proposal:

#14573

What is changed and how it works?

NA

Check List

NA

Release Note

No release note

wwar · 2020-01-24T16:08:29Z

docs/design/2020-01-24-collations.md

+const (
+    FlagIgnoreCase uint64=1 << iota
+    FlagIgnoreAccent
+    FlagPadding


If you don't ignore accent, what is the defined order of the accents - byte order? The sorting order may be defined (for example Canadian French and French sort accents in the opposite order.. but there is likely a more useful example where this matters).

UCA implements this by having a Character Weight, Accent Weight, Case Weight, and Punctuation Weight.

If you don't ignore accent, what is the defined order of the accents - byte order?

I think the order can be defined by the raw bytes/codepoint/accent weight of UCA, it depends on the demands.

docs/design/2020-01-24-collations.md

wwar · 2020-01-24T16:15:01Z

docs/design/2020-01-24-collations.md

+      Value: indexedColumnsValue
+	```
+
+Pros: Extra table lookup is avoided and no change to the TIDB optimizer.


This also has the pro that it is able to scan just the index in order to upgrade the on-disk storage to fix bugs in collations, because it has the sortkey and the value in the index. But this is not the method that MySQL uses (Option 3)

Got your point, thanks!

But this is not the method that MySQL uses (Option 3)

For MySQL, the on-disk storage is about the engine. So far as I know, InnoDB stores collations in metadata only since data is organized in pages, while MyRocks face the similar encoding issues like TiDB, and they have limitations on collations of index:

To support memcomparable format, MyRocks restricts collations on indexed columns -- only binary, latin1_bin and utf8_bin collations are supported.

docs/design/2020-01-24-collations.md

wwar · 2020-01-24T16:23:13Z

docs/design/2020-01-24-collations.md

+  3. Users of TiDB can distinguish the tables with old collation and new collations easily, if they both exist in a TiDB cluster.
+  4. All replications/backups/recovers through TiDB tools like Binlog-related mechanism, CDC and BR should not encounter unexpected errors caused from the changes of collations in TiDB. For example, 
+    - When using BR, if backup data from old cluster is applied to the new cluster, the new cluster should know that the data is binary collated without padding, or otherwise 'Duplicate entry' error may be reported for primary key/unique key columns.
+    - If we don't allow both old collations and new collations exist in a TiDB cluster(the Option 4, see below), trying to make replications/backup & recovery between old/new clusters will break this prerequisite.


I think there should be a requirement that you can fix bugs in a collation. Because indexed values affect sort order, MySQL is not able to fix a collation ever. They instead need to create a new collation, and migration is painful. You should expect that there are bugs in the implementation, and have a plan to be able to scan through and upgrade incorrect values.

Thank you, I think this can be resolved by marking an internal version in TiDB's collation metadata.

I will try to add some descriptions about this.

Plus 1 on this.

I think this can be resolved by marking an internal version in TiDB's collation metadata.

The real difficulty here is how to cope with index, should we need to regenerate index?

Of course, index reorgnization seems to be the only way to fix it.

If we can fix bug by this, why we still introduce concept 0900? We can almost use the lastest UCA(now is 10.0), and take the upgradation as a big bug fix. For example, we can use utf8mb4_ai_ci and don't expose the version to user.

For example, we can use utf8mb4_ai_ci and don't expose the version to user.

The key part of this is the upgrading: when we upgrade the internal version of utf8mb4_ai_ci, the existing on-disk data must be updated too. I've updated the RFC on the implementation of this, PTAL

@bb7133 Then what‘s the problem with utf8mb4_ai_ci without the version?

I just mean that we may have bugs in the implementation and write the wrong data on disk.

wwar · 2020-01-24T16:34:25Z

docs/design/2020-01-24-collations.md

+All strings in MySQL are with implicit collations, if they're not explicitly specified, so we may need to check/update all codes related to strings comparison/lookup:
+
+  - Encoding/Decoding: update codec related functions in `codec`/`tablecodec` package.
+  - SQL Runtime: update like/regex/string comparison related functions in `expression` package.
+  - SQL Optimizer: If option 1 for encodings is chosen, the optimizer should build the extra  table lookup plans.
+  - DDL/Schema: check if the collation is old/new from the versions of tables/columns.
+  - Misc
+      + update string comparison related functions in `util` package.
+      + check if range functions defined in table partitions work with collations.


It would be nice if this explicitly described which collation features will be supported by the framework. For example:

Primary Weight i.e. character (yes)

Secondary Weight i.e. accent (current proposal says a shortcut will be implemented, as a boolean for sensitivity)

Tertiary Weight, i.e. case (current proposal says a shortcut will be implemented, as a boolean for sensitivity)

Quaternary Weight, i.e. punctuation weight (not supported)

PAD / NOPAD (maybe)

Reorder and inheritance rules (??)

The MySQL blog posts give a good overview of what these features are:
http://mysqlserverteam.com/new-collations-in-mysql-8-0-0/
http://mysqlserverteam.com/mysql-8-0-1-accent-and-case-sensitive-collations-for-utf8mb4/
http://mysqlserverteam.com/mysql-8-0-collations-the-devil-is-in-the-details/

Thank you, I will try to add those features in this proposal.

PAD / NOPAD

This will be supported, we've some issues related to it already, like #13664

Reorder and inheritance rules (??)

Currently, we don't have a plan for this.

wwar · 2020-01-24T16:55:19Z

docs/design/2020-01-24-collations.md

+In MySQL 8, the default collation of `utf8mb4` has been changed to `utf8mb4_900_ai_ci`, in which the Unicode version is upgraded to 9.0, with full UCA implementation.
+
+Should TiDB aim to support all collation of MySQL?
+  * If TiDB supports the latest collations only(for example, support `utf8mb4_900_ai_ci` only and treat `utf8mb4_general_ci` as one of its alias). This seems to be 'standard-correct' but may lead to potential compatibility issues.


I would recommend this path. Keep in mind why MySQL can not do this:

The collation affects the on-disk format. It can't just change meta data to upgrade users to a new version, it needs to upgrade the tables (collations are effectively a data type on indexed values).

The upgrade can not be guaranteed to be unattended. Due to both bugs, and differences unique key violations may occur.

These will make migrations of edge-cases harder, but applications rarely depend on the behaviour of the edge-cases. Bug-for-bug compatibility in datatypes is really hardwork, so I suspect even with best intentions (and all collations implemented) this case will still exist where unique constraints errors will be triggered during migrations incorrectly.

An automated way for the DM framwork to handle this, would be to defer creating unique indexes, and first scan and report any violations. They are usually small enough that an operator can fix them - and may want to in the source database.

Thanks for the recommendations, especially the idea of uniqueness-detect for DM framework.

My only concern is that some behaviors may not be detected by the on-storage data scans. For example, the potential changes of application queries, but that may be be fine...as you mentioned.

Since many of TiDB's users expect to migrate from MySQL without any upper-level change, the basic principle is that TiDB provides as much compatibility as possible with MySQL, which is, most of the time we have to pay the price for the historical burden of MySQL(like utf8 and utf8mb4).

I don't want general_ci, and after reading the doc of Oracle I think we could do better than MySQL.
The reason that I don't want general_ci is:
https://stackoverflow.com/questions/766809/whats-the-difference-between-utf8-general-ci-and-utf8-unicode-ci
and
https://stackoverflow.com/questions/1036454/what-are-the-differences-between-utf8-general-ci-and-utf8-unicode-ci

wwar · 2020-01-24T19:20:06Z

I have an alternative proposal to full collations support, which is to first implement weight_string for utf8mb4_900_ai_ci + functional indexes.

Here is an example in MySQL 8.0.19 which has both:

CREATE TABLE pets (
 id INT NOT NULL PRIMARY KEY auto_increment,
 animaltype VARCHAR(50),
 INDEX ((WEIGHT_STRING(animaltype)))
);

INSERT INTO pets (animaltype) VALUES ('doG'), ('CAT'), ('FrOG'), ('ROCK');
SELECT * FROM pets WHERE WEIGHT_STRING(animaltype)=weight_string('dog');
EXPLAIN SELECT * FROM pets WHERE WEIGHT_STRING(animaltype)=weight_string('dog');

..

mysql8019> SELECT * FROM pets WHERE WEIGHT_STRING(animaltype)=weight_string('dog');
+----+------------+
| id | animaltype |
+----+------------+
|  1 | doG        |
+----+------------+
1 row in set (0.00 sec)

mysql8019> EXPLAIN SELECT * FROM pets WHERE WEIGHT_STRING(animaltype)=weight_string('dog');
+----+-------------+-------+------------+------+------------------+------------------+---------+-------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys    | key              | key_len | ref   | rows | filtered | Extra       |
+----+-------------+-------+------------+------+------------------+------------------+---------+-------+------+----------+-------------+
|  1 | SIMPLE      | pets  | NULL       | ref  | functional_index | functional_index | 203     | const |    1 |   100.00 | Using where |
+----+-------------+-------+------------+------+------------------+------------------+---------+-------+------+----------+-------------+
1 row in set, 1 warning (0.00 sec)

mysql8019> SHOW WARNINGS;
+-------+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Level | Code | Message                                                                                                                                                                                          |
+-------+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Note  | 1003 | /* select#1 */ select `test`.`pets`.`id` AS `id`,`test`.`pets`.`animaltype` AS `animaltype` from `test`.`pets` where (weight_string(`test`.`pets`.`animaltype`) = <cache>(weight_string('dog'))) |
+-------+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

This is not native support, but weight_string is the more correct implementation of sortKey as written in this proposal. Implementing it first will allow easier debugging, since strings can be compared against MySQL for correctness.

Ishiihara · 2020-01-27T20:28:14Z

docs/design/2020-01-24-collations.md

+    - Full UCA implementation for some Unicode collations, for example: `utf8mb4_900_ai_ci`.
+
+## Proposal
+


Before diving into the details, can we first have a summary of the changes we want to make? This doesn't have to be long, just a few sentences talking about changes we want to make and why. The intention of this comment is to make the proposal easier to pick up by community developers.

Got it, and this is the reason you advised putting the 'Implementation' chapter ahead.

Ishiihara · 2020-01-27T20:45:02Z

docs/design/2020-01-24-collations.md

+  3. Users of TiDB can distinguish the tables with old collation and new collations easily, if they both exist in a TiDB cluster.
+  4. All replications/backups/recovers through TiDB tools like Binlog-related mechanism, CDC and BR should not encounter unexpected errors caused from the changes of collations in TiDB. For example, 
+    - When using BR, if backup data from old cluster is applied to the new cluster, the new cluster should know that the data is binary collated without padding, or otherwise 'Duplicate entry' error may be reported for primary key/unique key columns.
+    - If we don't allow both old collations and new collations exist in a TiDB cluster(the Option 4, see below), trying to make replications/backup & recovery between old/new clusters will break this prerequisite.


Plus 1 on this.

Ishiihara · 2020-01-27T20:46:36Z

docs/design/2020-01-24-collations.md

+
+The encoding layout of TiDB has been described in our [previous article](https://pingcap.com/blog/2017-07-11-tidbinternal2/#map). The row format should be changed to make it memory comparable, this is important to the index lookup. Basic principle is that all keys encoded should use the `sortKeys` result from `Key()`/`KeyFromString()` function. However, most of the `sortKeys` calculations are not reversible.
+
+#### Option 1


What are some decision criteria for choosing option 1 vs option 2? Here are a few points I can think of

Long term roadmap for collation

Engineering effort

Bug fixing difficulty

Migration effort

Upgrade effort

Thanks.

I think it's about performance and engineering. Opinion 2 is a little bit easier to implement while requires a new encoding layout. Both options have potential performance regressions, but looks that the issue of Option 1 is more serious.

For the points you listed, most of them(1, 3, 4 and 5) are the same for both options.

Ishiihara · 2020-01-27T21:10:05Z

docs/design/2020-01-24-collations.md

+
+Cons: Since existing TiDB cluster can't get new collations enabled, requirement 3 is eliminated; Requirement 4 is not met.
+
+## Implementation


How about we move this section to the beginning of the proposal section?

IMHO it's better to keep the alternatives firstly, and then implementation details.

I've made the change, there will be a TODO list for progress tracking as well as the guidance for the community.

bb7133 · 2020-01-28T06:48:04Z

/run-all-tests

docs/design/2020-01-24-collations.md

wjhuang2016 · 2020-01-29T12:46:01Z

docs/design/2020-01-24-collations.md

+    StringFromKey(str []byte) string
+}
+```
+The interface is quite similar to the Go [collate package](https://godoc.org/golang.org/x/text/collate), it is possible to implement some of the MySQL collation by simply mapping it to an existing Go collation.


Note: The implement of collate in GO use UCA 620, and MySQL newest's collations use UCA 900.

So can we use go collate package?

UCA framework doesn't change between the versions, so differences can be resolved by forking the collate package and change the lookup table.

The issue is only some of the MySQL collations are implemented by standard UCA. For example, utf8mb4_general_ci is not.

UCA framework doesn't change between the versions

According to https://unicode.org/reports/tr10/#Migration：

From 620 to 630:

The IgnoreSP option for variable weighted characters has been removed. Implementers of this option may instead refer to CLDR Shifted behavior.

I think the implement may change between versions.

docs/design/2020-01-24-collations.md

wjhuang2016 · 2020-01-29T17:20:01Z

docs/design/2020-01-24-collations.md

+In MySQL 8, the default collation of `utf8mb4` has been changed to `utf8mb4_900_ai_ci`, in which the Unicode version is upgraded to 9.0, with full UCA implementation.
+
+Should TiDB aim to support all collation of MySQL?
+  * If TiDB supports the latest collations only(for example, support `utf8mb4_900_ai_ci` only and treat `utf8mb4_general_ci` as one of its alias). This seems to be 'standard-correct' but may lead to potential compatibility issues.


I don't want general_ci, and after reading the doc of Oracle I think we could do better than MySQL.
The reason that I don't want general_ci is:
https://stackoverflow.com/questions/766809/whats-the-difference-between-utf8-general-ci-and-utf8-unicode-ci
and
https://stackoverflow.com/questions/1036454/what-are-the-differences-between-utf8-general-ci-and-utf8-unicode-ci

docs/design/2020-01-24-collations.md

breezewish · 2020-02-02T15:37:13Z

docs/design/2020-01-24-collations.md

+
+Pros: Requirement 1.a, 1.b, 2 and 3 are met.
+
+Cons: Requirement 1.c and 4 are not met, the namings are also confusing: collations that are named from MySQL are different from MySQL, but the ones named different from MySQL behave the same as MySQL.


Requirement 2 is not met, since new users need to add tidb_ prefix to actually use specified collations.

Got it, thanks.

breezewish · 2020-02-02T15:45:33Z

docs/design/2020-01-24-collations.md

+  * Primary Weight i.e. character
+  * Secondary Weight i.e. accent
+  * Tertiary Weight i.e. case
+  * PAD / NOPAD


According to https://godoc.org/golang.org/x/text/collate#Option there is no option for PAD / NOPAD. Do we need to implement the padding behaviour it by our own?

Yes, I think so.

docs/design/2020-01-24-collations.md

breezewish · 2020-02-02T15:54:22Z

docs/design/2020-01-24-collations.md

+
+We have an opportunity to do a much better job than MySQL has done.
+
+A collation, logically, should not be a property of a character set. Since all character sets are subsets of Unicode, collations can apply to Unicode codepoints without needing to be tied to individual character sets.


This might not be true. For example, users may want to store data in GBK encoding while be able to compare by Chinese characters. GBK encoding lead to less storage space compared to Unicode encoding when storing Chinese characters.

wjhuang2016 · 2020-02-03T05:22:24Z

Here is my proposal for linguistic collation(corresponding to binary collation), is not completed yet:
We support collations like:
(1) utf8mb4_locale_strength_caseLevel_...., see also mongo and Oracle and DB2
Or
(2) utf8mb4_locale_ai_ci
For option 1, we can provide more powerful collation, but it's more complex to implement.
For option 2, it can use Go's collate package in TiDB, but still may need to build the corresponding in TiKV(icu-binding?).
Note:
There is not an option "version", We try to automatically upgrade it. See also Semantic Versioning.

wjhuang2016 · 2020-02-03T09:50:30Z

We have a internal meeting afternoon, and here is the meeting notes. Please feel free to comment on it.

The implementation of utf8mb4_general_ci and utf8_general_ci in TiDB (compatible with MySQL or standard UCA)?

Discussion:
The general_ ci is not a standard unicode collation implementation. In some cases that UCA may think that some are equal where general ci may give the a different result.
Migrating from MySQL to TiDB may cause applications to throw errors for users.
Needs to meet the timeline of one of TiDB user and we need to prioritize that.
The existing go collate library does not support UCA 9.0, with full UCA implementation.(Still need to investigate)

Decision:
Compatible with MySQL general_ci implementation and add full collation later.(That is, support general_ci in 4.0)

Which KV encoding options should be used?

Discussion:
Performance vs engineering difficulty.
Needs to replace Value with SortKey in the indexed key.
Option 2: keep the original index value in the value.
Option 2 might eliminate one optimization in TiKV. Doesn’t prefer to add value to key.
The encoding should support version and extensible to add more column values to support covering index. Add rowid after the values.
As for virtual generated columns, we use index to store its data so that we don’t need to compute anymore. Option 1 eliminates this optimization.
In option 2, where to place the column value? Key or Value?

Decision:
Choose option 2
Place the column value in Value.
AI:
@coocood , @bb7133 to provide detailed encoding for option2 to support extensions.

Compatibility Options

Discussion:
Not all the compatibility options can be satisfied.
Requirement 3 must be met.
We can support option 4 first. Existing customers can upgrade to 4.0 but might not support new collation.

Decision:
Choose option 4. Use a config to control whether to use the new collation.
Keep the default behavior as the old TiDB behavior.
Later we can implement option 1 or option 2 or option 3 to make the migration smooth.
Also add label “/* AS_BINARY_COLLATION */” for old tables. （In 4.0）

Resource Allocation
TiKV: @breeswish
TiDB: @wjhuang2016 , @qw4990 or someone familiar with expression.

tiancaiamao · 2020-02-06T04:04:44Z

/run-all-tests

tiancaiamao · 2020-02-06T06:11:40Z

/run-all-tests

kennytm · 2020-05-07T07:47:08Z

docs/design/2020-01-24-collations.md

+  2. Users of MySQL who requires collations can move to TiDB without modifying their scripts/applications.
+  3. Users of TiDB can distinguish the tables with old collation and new collations easily, if they both exist in a TiDB cluster.
+  4. All replications/backups/recovers through TiDB tools like Binlog-related mechanism, CDC and BR should not encounter unexpected errors caused from the changes of collations in TiDB. For example, 
+    - When using BR, if backup data from old cluster is applied to the new cluster, the new cluster should know that the data is binary collated without padding, or otherwise 'Duplicate entry' error may be reported for primary key/unique key columns.


BR works at KV level. There is no "duplicate entry" error when working at KV level.

In fact BR does not decode the keys and values beyond the table ID, so by changing the index encoding we can be sure that a table restored through BR with different new_collation_enabled setting would have broken index.

BR plans to reject restore with when new_collation_enabled is different, until the "Bug-fixing" part is implemented (ColumnInfoVersion & ALTER TABLE TIDB UPGRADE COLLATION). Is this acceptable @bb7133?

BR plans to reject restore with when new_collation_enabled is different, until the "Bug-fixing" part is implemented (ColumnInfoVersion & ALTER TABLE TIDB UPGRADE COLLATION). Is this acceptable ?

Yes, of course...

kennytm · 2020-05-07T08:07:14Z

docs/design/2020-01-24-collations.md

+    - b. After upgrade, the behavior of newly-created tables remains unchanged.
+    - c. If a TiDB cluster is replicating the data to a MySQL instance through Binlog, all  replications remain unchanged.
+  2. Users of MySQL who requires collations can move to TiDB without modifying their scripts/applications.
+  3. Users of TiDB can distinguish the tables with old collation and new collations easily, if they both exist in a TiDB cluster.


Are we planning to make it possible to have tables with "old collation" and "new collation" coexist on the same cluster?

The current implementation reads the collation info on bootstrap from the mysql.tidb table, is that value supposed to be mutable?

Are we planning to make it possible to have tables with "old collation" and "new collation" coexist on the same cluster?

Yes, but currently we don't have a detailed roadmap for it.

sre-bot · 2020-09-02T03:49:10Z

No release note, Please follow https://github.com/pingcap/community/blob/master/contributors/release-note-checker.md

wjhuang2016 · 2020-09-02T03:50:25Z

docs/design/2020-01-24-collations.md

+
+## Open issues
+
+- https://github.com/pingcap/tidb/issues/222


Need to change here.

sre-bot · 2020-10-14T04:13:49Z

No release note, Please follow https://github.com/pingcap/community/blob/master/contributors/release-note-checker.md

wjhuang2016 · 2020-10-14T04:17:06Z

docs/design/2020-01-24-collations.md

+tidb> insert into t values ('a');
+Query OK, 1 row affected // Should report error "Duplicate entry 'a'"
+```
+In the case above, the user creates a column with a case-insensitive collation `utf8mb4_general_ci`. Since the 'real' collation of TiDB is always `utf8mb4_bin`, inserting data into the primary key with "A" and "a" does not lead to an error.


Since the 'real' collation of TiDB is always utf8mb4_bin,
s/utf8mb4_bin/binary?

sre-bot · 2020-10-14T05:26:43Z

No release note, Please follow https://github.com/pingcap/community/blob/master/contributors/release-note-checker.md

wjhuang2016

LGTM

xiongjiwei · 2020-10-20T06:12:26Z

LGTM

bb7133 · 2020-10-20T06:14:57Z

/merge

bb7133 added help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. status/WIP proposal component/docs and removed help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. labels Jan 24, 2020

This was referenced Jan 24, 2020

charset: full collations support for TiDB #14573

Closed

Support Case Insensitive Collations #7519

Closed

bb7133 force-pushed the bb7133/add_collations_rfc branch from c452457 to 9506ce5 Compare January 24, 2020 15:12

wwar reviewed Jan 24, 2020

View reviewed changes

Ishiihara reviewed Jan 27, 2020

View reviewed changes

wjhuang2016 reviewed Jan 29, 2020

View reviewed changes

docs/design/2020-01-24-collations.md Outdated Show resolved Hide resolved

wjhuang2016 reviewed Jan 29, 2020

View reviewed changes

winkyao reviewed Jan 30, 2020

View reviewed changes

docs/design/2020-01-24-collations.md Outdated Show resolved Hide resolved

winkyao reviewed Jan 30, 2020

View reviewed changes

docs/design/2020-01-24-collations.md Outdated Show resolved Hide resolved

winkyao reviewed Jan 30, 2020

View reviewed changes

docs/design/2020-01-24-collations.md Outdated Show resolved Hide resolved

winkyao reviewed Jan 30, 2020

View reviewed changes

docs/design/2020-01-24-collations.md Outdated Show resolved Hide resolved

bb7133 force-pushed the bb7133/add_collations_rfc branch from 1af88f8 to 274775d Compare January 30, 2020 08:23

breezewish reviewed Feb 2, 2020

View reviewed changes

This was referenced Feb 4, 2020

charset: define collation interface in TiDB #14628

Closed

charset: implement collation interface for utf8_general_ci and utf8mb4_general_ci in TiDB #14630

Closed

This was referenced Feb 6, 2020

charset: make CompareString support Collation #14657

Closed

charset: make hashRowContainer support Collation #14659

Closed

charset: make vecGroupChecker support Collation #14660

Closed

TennyZhuang mentioned this pull request Feb 24, 2020

copr: support new index encoding with collation support tikv/tikv#6685

Merged

kennytm reviewed May 7, 2020

View reviewed changes

SunRunAway modified the milestones: v4.0.0-beta.1, v4.0.1 May 28, 2020

zz-jason removed this from the v4.0.1 milestone Jun 5, 2020

wjhuang2016 reviewed Sep 2, 2020

View reviewed changes

docs/design/2020-01-24-collations.md

## Open issues

- https://github.com/pingcap/tidb/issues/222

Copy link

Member

wjhuang2016 Sep 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to change here.

bb7133 added 5 commits October 14, 2020 12:13

docs/design: add proposal for full collations support

5da5c0b

address comments from wwar, Ishiihara and wjhuang2016

287d1e3

address winkyao's comment

875eb16

more updates

ce11a8f

update the doc according to previous discussions.

1646782

bb7133 force-pushed the bb7133/add_collations_rfc branch from 45d0969 to db3ee98 Compare October 14, 2020 04:13

wjhuang2016 reviewed Oct 14, 2020

View reviewed changes

address comments

7d015d5

bb7133 force-pushed the bb7133/add_collations_rfc branch from db3ee98 to 7d015d5 Compare October 14, 2020 05:26

wjhuang2016 reviewed Oct 15, 2020

View reviewed changes

ti-srebot added the status/LGT1 Indicates that a PR has LGTM 1. label Oct 15, 2020

Merge branch 'master' into bb7133/add_collations_rfc

7f5abcf

pingcap deleted a comment from ti-srebot Oct 20, 2020

bb7133 added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Oct 20, 2020

wjhuang2016 approved these changes Oct 20, 2020

View reviewed changes

Merge branch 'master' into bb7133/add_collations_rfc

d0d3e62

bb7133 merged commit 0522f65 into pingcap:master Oct 20, 2020

bb7133 deleted the bb7133/add_collations_rfc branch December 29, 2023 18:45

		- Full UCA implementation for some Unicode collations, for example: `utf8mb4_900_ai_ci`.

		## Proposal


		The encoding layout of TiDB has been described in our [previous article](https://pingcap.com/blog/2017-07-11-tidbinternal2/#map). The row format should be changed to make it memory comparable, this is important to the index lookup. Basic principle is that all keys encoded should use the `sortKeys` result from `Key()`/`KeyFromString()` function. However, most of the `sortKeys` calculations are not reversible.

		#### Option 1


		Cons: Since existing TiDB cluster can't get new collations enabled, requirement 3 is eliminated; Requirement 4 is not met.

		## Implementation


		Pros: Requirement 1.a, 1.b, 2 and 3 are met.

		Cons: Requirement 1.c and 4 are not met, the namings are also confusing: collations that are named from MySQL are different from MySQL, but the ones named different from MySQL behave the same as MySQL.


		We have an opportunity to do a much better job than MySQL has done.

		A collation, logically, should not be a property of a character set. Since all character sets are subsets of Unicode, collations can apply to Unicode codepoints without needing to be tied to individual character sets.

docs/design: add proposal for full collations support #14574

docs/design: add proposal for full collations support #14574

Conversation

bb7133 commented Jan 24, 2020 • edited Loading

What problem does this PR solve?

What is changed and how it works?

Check List

Release Note

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bb7133 Jan 27, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bb7133 Jan 27, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

wwar commented Jan 24, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bb7133 commented Jan 28, 2020

wjhuang2016 Jan 29, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

wjhuang2016 commented Feb 3, 2020 • edited Loading

wjhuang2016 commented Feb 3, 2020

tiancaiamao commented Feb 6, 2020

tiancaiamao commented Feb 6, 2020

kennytm May 7, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sre-bot commented Sep 2, 2020

Choose a reason for hiding this comment

sre-bot commented Oct 14, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sre-bot commented Oct 14, 2020

wjhuang2016 left a comment

Choose a reason for hiding this comment

xiongjiwei commented Oct 20, 2020

bb7133 commented Oct 20, 2020

bb7133 commented Jan 24, 2020 •

edited

Loading

bb7133 Jan 27, 2020 •

edited

Loading

bb7133 Jan 27, 2020 •

edited

Loading

wwar commented Jan 24, 2020 •

edited

Loading

wjhuang2016 Jan 29, 2020 •

edited

Loading

wjhuang2016 commented Feb 3, 2020 •

edited

Loading

kennytm May 7, 2020 •

edited

Loading