-
Notifications
You must be signed in to change notification settings - Fork 5.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
docs/design: add proposal for full collations support #14574
Conversation
c452457
to
9506ce5
Compare
docs/design/2020-01-24-collations.md
Outdated
const ( | ||
FlagIgnoreCase uint64=1 << iota | ||
FlagIgnoreAccent | ||
FlagPadding |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you don't ignore accent, what is the defined order of the accents - byte order? The sorting order may be defined (for example Canadian French and French sort accents in the opposite order.. but there is likely a more useful example where this matters).
UCA implements this by having a Character Weight, Accent Weight, Case Weight, and Punctuation Weight.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you don't ignore accent, what is the defined order of the accents - byte order?
I think the order can be defined by the raw bytes/codepoint/accent weight of UCA, it depends on the demands.
docs/design/2020-01-24-collations.md
Outdated
Value: indexedColumnsValue | ||
``` | ||
|
||
Pros: Extra table lookup is avoided and no change to the TIDB optimizer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also has the pro that it is able to scan just the index in order to upgrade the on-disk storage to fix bugs in collations, because it has the sortkey and the value in the index. But this is not the method that MySQL uses (Option 3)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got your point, thanks!
But this is not the method that MySQL uses (Option 3)
For MySQL, the on-disk storage is about the engine. So far as I know, InnoDB stores collations in metadata only since data is organized in pages, while MyRocks face the similar encoding issues like TiDB, and they have limitations on collations of index:
To support memcomparable format, MyRocks restricts collations on indexed columns -- only binary, latin1_bin and utf8_bin collations are supported.
3. Users of TiDB can distinguish the tables with old collation and new collations easily, if they both exist in a TiDB cluster. | ||
4. All replications/backups/recovers through TiDB tools like Binlog-related mechanism, CDC and BR should not encounter unexpected errors caused from the changes of collations in TiDB. For example, | ||
- When using BR, if backup data from old cluster is applied to the new cluster, the new cluster should know that the data is binary collated without padding, or otherwise 'Duplicate entry' error may be reported for primary key/unique key columns. | ||
- If we don't allow both old collations and new collations exist in a TiDB cluster(the Option 4, see below), trying to make replications/backup & recovery between old/new clusters will break this prerequisite. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there should be a requirement that you can fix bugs in a collation. Because indexed values affect sort order, MySQL is not able to fix a collation ever. They instead need to create a new collation, and migration is painful. You should expect that there are bugs in the implementation, and have a plan to be able to scan through and upgrade incorrect values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, I think this can be resolved by marking an internal version in TiDB's collation metadata.
I will try to add some descriptions about this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Plus 1 on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can be resolved by marking an internal version in TiDB's collation metadata.
The real difficulty here is how to cope with index
, should we need to regenerate index?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of course, index reorgnization seems to be the only way to fix it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we can fix bug by this, why we still introduce concept 0900
? We can almost use the lastest UCA(now is 10.0), and take the upgradation as a big bug fix. For example, we can use utf8mb4_ai_ci
and don't expose the version to user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example, we can use utf8mb4_ai_ci and don't expose the version to user.
The key part of this is the upgrading: when we upgrade the internal version of utf8mb4_ai_ci
, the existing on-disk data must be updated too. I've updated the RFC on the implementation of this, PTAL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bb7133 Then what‘s the problem with utf8mb4_ai_ci
without the version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just mean that we may have bugs in the implementation and write the wrong data on disk.
docs/design/2020-01-24-collations.md
Outdated
All strings in MySQL are with implicit collations, if they're not explicitly specified, so we may need to check/update all codes related to strings comparison/lookup: | ||
|
||
- Encoding/Decoding: update codec related functions in `codec`/`tablecodec` package. | ||
- SQL Runtime: update like/regex/string comparison related functions in `expression` package. | ||
- SQL Optimizer: If option 1 for encodings is chosen, the optimizer should build the extra table lookup plans. | ||
- DDL/Schema: check if the collation is old/new from the versions of tables/columns. | ||
- Misc | ||
+ update string comparison related functions in `util` package. | ||
+ check if range functions defined in table partitions work with collations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice if this explicitly described which collation features will be supported by the framework. For example:
- Primary Weight i.e. character (yes)
- Secondary Weight i.e. accent (current proposal says a shortcut will be implemented, as a boolean for sensitivity)
- Tertiary Weight, i.e. case (current proposal says a shortcut will be implemented, as a boolean for sensitivity)
- Quaternary Weight, i.e. punctuation weight (not supported)
- PAD / NOPAD (maybe)
- Reorder and inheritance rules (??)
The MySQL blog posts give a good overview of what these features are:
http://mysqlserverteam.com/new-collations-in-mysql-8-0-0/
http://mysqlserverteam.com/mysql-8-0-1-accent-and-case-sensitive-collations-for-utf8mb4/
http://mysqlserverteam.com/mysql-8-0-collations-the-devil-is-in-the-details/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, I will try to add those features in this proposal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PAD / NOPAD
This will be supported, we've some issues related to it already, like #13664
Reorder and inheritance rules (??)
Currently, we don't have a plan for this.
docs/design/2020-01-24-collations.md
Outdated
In MySQL 8, the default collation of `utf8mb4` has been changed to `utf8mb4_900_ai_ci`, in which the Unicode version is upgraded to 9.0, with full UCA implementation. | ||
|
||
Should TiDB aim to support all collation of MySQL? | ||
* If TiDB supports the latest collations only(for example, support `utf8mb4_900_ai_ci` only and treat `utf8mb4_general_ci` as one of its alias). This seems to be 'standard-correct' but may lead to potential compatibility issues. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would recommend this path. Keep in mind why MySQL can not do this:
- The collation affects the on-disk format. It can't just change meta data to upgrade users to a new version, it needs to upgrade the tables (collations are effectively a data type on indexed values).
- The upgrade can not be guaranteed to be unattended. Due to both bugs, and differences unique key violations may occur.
These will make migrations of edge-cases harder, but applications rarely depend on the behaviour of the edge-cases. Bug-for-bug compatibility in datatypes is really hardwork, so I suspect even with best intentions (and all collations implemented) this case will still exist where unique constraints errors will be triggered during migrations incorrectly.
An automated way for the DM framwork to handle this, would be to defer creating unique indexes, and first scan and report any violations. They are usually small enough that an operator can fix them - and may want to in the source database.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the recommendations, especially the idea of uniqueness-detect for DM framework.
My only concern is that some behaviors may not be detected by the on-storage data scans. For example, the potential changes of application queries, but that may be be fine...as you mentioned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since many of TiDB's users expect to migrate from MySQL without any upper-level change, the basic principle is that TiDB provides as much compatibility as possible with MySQL, which is, most of the time we have to pay the price for the historical burden of MySQL(like utf8
and utf8mb4
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't want general_ci
, and after reading the doc of Oracle I think we could do better than MySQL.
The reason that I don't want general_ci
is:
https://stackoverflow.com/questions/766809/whats-the-difference-between-utf8-general-ci-and-utf8-unicode-ci
and
https://stackoverflow.com/questions/1036454/what-are-the-differences-between-utf8-general-ci-and-utf8-unicode-ci
I have an alternative proposal to full collations support, which is to first implement weight_string for Here is an example in MySQL 8.0.19 which has both:
This is not native support, but |
- Full UCA implementation for some Unicode collations, for example: `utf8mb4_900_ai_ci`. | ||
|
||
## Proposal | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before diving into the details, can we first have a summary of the changes we want to make? This doesn't have to be long, just a few sentences talking about changes we want to make and why. The intention of this comment is to make the proposal easier to pick up by community developers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, and this is the reason you advised putting the 'Implementation' chapter ahead.
3. Users of TiDB can distinguish the tables with old collation and new collations easily, if they both exist in a TiDB cluster. | ||
4. All replications/backups/recovers through TiDB tools like Binlog-related mechanism, CDC and BR should not encounter unexpected errors caused from the changes of collations in TiDB. For example, | ||
- When using BR, if backup data from old cluster is applied to the new cluster, the new cluster should know that the data is binary collated without padding, or otherwise 'Duplicate entry' error may be reported for primary key/unique key columns. | ||
- If we don't allow both old collations and new collations exist in a TiDB cluster(the Option 4, see below), trying to make replications/backup & recovery between old/new clusters will break this prerequisite. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Plus 1 on this.
docs/design/2020-01-24-collations.md
Outdated
|
||
The encoding layout of TiDB has been described in our [previous article](https://pingcap.com/blog/2017-07-11-tidbinternal2/#map). The row format should be changed to make it memory comparable, this is important to the index lookup. Basic principle is that all keys encoded should use the `sortKeys` result from `Key()`/`KeyFromString()` function. However, most of the `sortKeys` calculations are not reversible. | ||
|
||
#### Option 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are some decision criteria for choosing option 1 vs option 2? Here are a few points I can think of
- Long term roadmap for collation
- Engineering effort
- Bug fixing difficulty
- Migration effort
- Upgrade effort
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
I think it's about performance and engineering. Opinion 2 is a little bit easier to implement while requires a new encoding layout. Both options have potential performance regressions, but looks that the issue of Option 1 is more serious.
For the points you listed, most of them(1, 3, 4 and 5) are the same for both options.
|
||
Cons: Since existing TiDB cluster can't get new collations enabled, requirement 3 is eliminated; Requirement 4 is not met. | ||
|
||
## Implementation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about we move this section to the beginning of the proposal section?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO it's better to keep the alternatives firstly, and then implementation details.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made the change, there will be a TODO list for progress tracking as well as the guidance for the community.
/run-all-tests |
docs/design/2020-01-24-collations.md
Outdated
StringFromKey(str []byte) string | ||
} | ||
``` | ||
The interface is quite similar to the Go [collate package](https://godoc.org/golang.org/x/text/collate), it is possible to implement some of the MySQL collation by simply mapping it to an existing Go collation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: The implement of collate
in GO use UCA 620, and MySQL newest's collations use UCA 900.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So can we use go collate package?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UCA framework doesn't change between the versions, so differences can be resolved by forking the collate
package and change the lookup table.
The issue is only some of the MySQL collations are implemented by standard UCA. For example, utf8mb4_general_ci
is not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UCA framework doesn't change between the versions
According to https://unicode.org/reports/tr10/#Migration:
From 620 to 630:
The IgnoreSP option for variable weighted characters has been removed. Implementers of this option may instead refer to CLDR Shifted behavior.
I think the implement may change between versions.
docs/design/2020-01-24-collations.md
Outdated
In MySQL 8, the default collation of `utf8mb4` has been changed to `utf8mb4_900_ai_ci`, in which the Unicode version is upgraded to 9.0, with full UCA implementation. | ||
|
||
Should TiDB aim to support all collation of MySQL? | ||
* If TiDB supports the latest collations only(for example, support `utf8mb4_900_ai_ci` only and treat `utf8mb4_general_ci` as one of its alias). This seems to be 'standard-correct' but may lead to potential compatibility issues. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't want general_ci
, and after reading the doc of Oracle I think we could do better than MySQL.
The reason that I don't want general_ci
is:
https://stackoverflow.com/questions/766809/whats-the-difference-between-utf8-general-ci-and-utf8-unicode-ci
and
https://stackoverflow.com/questions/1036454/what-are-the-differences-between-utf8-general-ci-and-utf8-unicode-ci
1af88f8
to
274775d
Compare
docs/design/2020-01-24-collations.md
Outdated
|
||
Pros: Requirement 1.a, 1.b, 2 and 3 are met. | ||
|
||
Cons: Requirement 1.c and 4 are not met, the namings are also confusing: collations that are named from MySQL are different from MySQL, but the ones named different from MySQL behave the same as MySQL. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Requirement 2 is not met, since new users need to add tidb_
prefix to actually use specified collations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, thanks.
* Primary Weight i.e. character | ||
* Secondary Weight i.e. accent | ||
* Tertiary Weight i.e. case | ||
* PAD / NOPAD |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to https://godoc.org/golang.org/x/text/collate#Option there is no option for PAD / NOPAD. Do we need to implement the padding behaviour it by our own?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think so.
|
||
We have an opportunity to do a much better job than MySQL has done. | ||
|
||
A collation, logically, should not be a property of a character set. Since all character sets are subsets of Unicode, collations can apply to Unicode codepoints without needing to be tied to individual character sets. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might not be true. For example, users may want to store data in GBK encoding while be able to compare by Chinese characters. GBK encoding lead to less storage space compared to Unicode encoding when storing Chinese characters.
Here is my proposal for linguistic collation(corresponding to binary collation), is not completed yet: |
We have a internal meeting afternoon, and here is the meeting notes. Please feel free to comment on it. The implementation of Discussion: Decision: Which KV encoding options should be used? Discussion: Decision: Compatibility Options Discussion: Decision: Resource Allocation |
/run-all-tests |
1 similar comment
/run-all-tests |
2. Users of MySQL who requires collations can move to TiDB without modifying their scripts/applications. | ||
3. Users of TiDB can distinguish the tables with old collation and new collations easily, if they both exist in a TiDB cluster. | ||
4. All replications/backups/recovers through TiDB tools like Binlog-related mechanism, CDC and BR should not encounter unexpected errors caused from the changes of collations in TiDB. For example, | ||
- When using BR, if backup data from old cluster is applied to the new cluster, the new cluster should know that the data is binary collated without padding, or otherwise 'Duplicate entry' error may be reported for primary key/unique key columns. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BR works at KV level. There is no "duplicate entry" error when working at KV level.
In fact BR does not decode the keys and values beyond the table ID, so by changing the index encoding we can be sure that a table restored through BR with different new_collation_enabled
setting would have broken index.
BR plans to reject restore with when new_collation_enabled
is different, until the "Bug-fixing" part is implemented (ColumnInfoVersion & ALTER TABLE TIDB UPGRADE COLLATION). Is this acceptable @bb7133?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BR plans to reject restore with when new_collation_enabled is different, until the "Bug-fixing" part is implemented (ColumnInfoVersion & ALTER TABLE TIDB UPGRADE COLLATION). Is this acceptable ?
Yes, of course...
- b. After upgrade, the behavior of newly-created tables remains unchanged. | ||
- c. If a TiDB cluster is replicating the data to a MySQL instance through Binlog, all replications remain unchanged. | ||
2. Users of MySQL who requires collations can move to TiDB without modifying their scripts/applications. | ||
3. Users of TiDB can distinguish the tables with old collation and new collations easily, if they both exist in a TiDB cluster. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we planning to make it possible to have tables with "old collation" and "new collation" coexist on the same cluster?
The current implementation reads the collation info on bootstrap from the mysql.tidb
table, is that value supposed to be mutable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we planning to make it possible to have tables with "old collation" and "new collation" coexist on the same cluster?
Yes, but currently we don't have a detailed roadmap for it.
No release note, Please follow https://github.com/pingcap/community/blob/master/contributors/release-note-checker.md |
|
||
## Open issues | ||
|
||
- https://github.com/pingcap/tidb/issues/222 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to change here.
45d0969
to
db3ee98
Compare
No release note, Please follow https://github.com/pingcap/community/blob/master/contributors/release-note-checker.md |
docs/design/2020-01-24-collations.md
Outdated
tidb> insert into t values ('a'); | ||
Query OK, 1 row affected // Should report error "Duplicate entry 'a'" | ||
``` | ||
In the case above, the user creates a column with a case-insensitive collation `utf8mb4_general_ci`. Since the 'real' collation of TiDB is always `utf8mb4_bin`, inserting data into the primary key with "A" and "a" does not lead to an error. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the 'real' collation of TiDB is always utf8mb4_bin
,
s/utf8mb4_bin/binary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed.
db3ee98
to
7d015d5
Compare
No release note, Please follow https://github.com/pingcap/community/blob/master/contributors/release-note-checker.md |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
LGTM |
/merge |
What problem does this PR solve?
This PR adds a proposal for full collations support in TiDB.
An issue has been created for the discussions on this proposal:
#14573
What is changed and how it works?
NA
Check List
NA
Release Note