-
Notifications
You must be signed in to change notification settings - Fork 203
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Set utf8mb4 charset and collation on mysql kernel schema and fixtures #2277
Conversation
@alongosz wdyt? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please be aware that *_unicode_520_ci
exists since MySQL 5.6. Our Doc states that we support MySQL 5.5 also. But maybe the Doc is outdated? // @andrerom
This makes me wonder how it could be achieved in a generic (cross-DBMS) way by Doctrine DBAL Schema Tool that will be used in the future.
Other found issue:
data/mysql/schema.sql
Outdated
@@ -1601,7 +1601,7 @@ CREATE TABLE `ezprest_authcode` ( | |||
`user_id` int(11) NOT NULL DEFAULT '0', | |||
PRIMARY KEY (`id`), | |||
KEY `authcode_client_id` (`client_id`) | |||
) ENGINE=InnoDB DEFAULT CHARSET=utf8; | |||
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_520_ci; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This CREATE
on my 10.1.28-MariaDB
caused:
ERROR 1071 (42000) at line 1596: Specified key was too long; max key length is 767 bytes
AFAICS error exists also for some further CREATE
s, but the first one is here.
I'm not sure what's the solution here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The 767 bytes limit is a hard limit on innodb indexes. So it seems we can't have indexes on varchars bigger than 191. (191 x 4 = 764).
https://stackoverflow.com/questions/1814532/1071-specified-key-was-too-long-max-key-length-is-767-bytes?answertab=votes#tab-top
But we can't just willy nilly truncate all indexed varchars to 191 either - many are 255 chars long. No idea why this doesn't break the tests, but anyway... dang. 😖
I confirm the error on local mysql.
It is not out of date per say, but if we merge it we will have to bump requirement to 5.6 for v2.1 (which is fine as long as we do it before stable). That said I'm unsure if we should hard code COLLATE here at all, depending on language people are planning to use they might want to use something else taken from system config. But unsure what is best here, hence the ping to you. Unsure why you get that error, did you find out more on why? |
If we don't specify collation I reckon it will default to utf8mb4_unicode_ci which is also okay(ish). If you try to use a 3-byte collation with 4-byte charset you'll get an error, so we don't have to worry about that at least. |
There are 6 cases of indexed varchars longer than 191. Last commit truncates these to 191, let's see what travis says about that. NB: I don't know if the affected REST tables can contain 4-byte chars, but the |
Doc for 5.5 doesn't contain any reference to
That would be the best approach, what needs to be checked is what is default here.
No. We can't truncate column lengths in general, it's a BC break. Workaround for that is introducing indices with specified length (we already have them). Not sure if it works for primary keys though, which is the case here also. |
Default for
The latter. If table charset is set but not collation, then the default collation for that charset is used.
Thanks! Done. It seems to be fine for primary keys too. |
ff6778b
to
b96f8a2
Compare
CREATE TABLE `ezsearch_search_phrase` ( | ||
`id` int(11) NOT NULL AUTO_INCREMENT, | ||
`phrase` varchar(250) DEFAULT NULL, | ||
`phrase_count` int(11) DEFAULT '0', | ||
`result_count` int(11) DEFAULT '0', | ||
PRIMARY KEY (`id`), | ||
UNIQUE KEY `ezsearch_search_phrase_phrase` (`phrase`), | ||
UNIQUE KEY `ezsearch_search_phrase_phrase` (`phrase` (191)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case specifically; given we introduce this in a new version where we don't so much care for BC.
I'd much rather make sure we use a size that works on the column on this case.
Why: this is search phrase, if code insert longer phrase then 191 it anyway means search on if won't really work beyond that as it won't be part of index. So unless I'm missing some thing can just as well change size of column and provide upgrade scripts for that.
Or?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
means search on if won't really work beyond that as it won't be part of index
I think you're right. It may return multiple different results, if the first 191 chars of the phrases are the same. Agree @alongosz ?
Apropos, I should provide upgrade scripts for all the changes anyway - but we should agree on the changes first :)
Regardless of whether we shorten the column or the (unique) index, there is a risk that the update script for this will fail due to duplicate entries. In that case can't really do anything but tell people to delete the extra entries. In the case of the There is precendence for changing charset: In 4.0 we added a script for changing to utf8. It does remind me that we should also change the database connection charset, ref |
Hmm, @andrerom but you mean That said, I'm all in favor of schema changes via upgrade script for now. We just need to figure out a way to ensure users execute these scripts. I was thinking about hooking something into [1] So I'm not saying we shouldn't do it now, just pointing out your previous objections in such cases :) In this particular case I'm not really sure how to handle the case when update truncates the data, not to mention what @glye have said about index collisions. |
No, I meant 7.1 or 7.2. There is a difference here: 1. that PR was made before we decided to allow for legacy to change a bit so we could start to evolve features that affected both, 2. that change is more severe, and will need less adoption on legacy.
That would be DBAL schema migration, +1 to get there soon.
@glye True, so lets keep the change here light then and only opt for index length update. What are the remaining todo's here? (Also keep in mind we can safely do this for v2.2 instead, by now that is looking safer overall) |
Reverted column shortening. Moved to 2.2, and added upgrade doc. |
This should be |
@andrerom Corrected. |
0b516fd
to
fdb8311
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks ok now besides nitpick. But I'll hold off on +1 until 7.1 is out and branched of.
doc/upgrade/7.2.md
Outdated
|
||
## MySQL/MariaDB database tables character set change | ||
|
||
The character set for MySQL/MariaDB database tables is changed from `utf8` to `utf8mb4` to support 4-byte characters. See `data/update/mysql/dbupdate-6.12.0-to-7.2.0.sql` for the SQL upgrade statements. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
data/update/mysql/dbupdate-7.0.0-to-7.2.0.sql
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, thanks. Just making sure if 7.0 is not a typo, since you said 7.1 before...?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, it should be 7.1 :)
fdb8311
to
1db9cca
Compare
Updated according to last feedback. Cleaned up commits. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MERGE: Wait unti 7.1 is branched of
1db9cca
to
6923a98
Compare
``` | ||
Also make the corresponding change in `app/config/dfs/dfs.yml`. | ||
|
||
For legacy, in `ezpublish_legacy/settings/i18n.ini`, set the following: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@emodric That Should Not Happen™ and didn't for me. Are you saying it expects the ini files to have utf8mb4
charset, and returns nothing because that charset doesn't exist in this context? Disturbing news :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not 100% sure that's the case, but it does sound like it. The only outcome that I witnessed is that reading any INI variable just returns false
.
EDIT: Once I change all INI files to declare utf8mb4
charset, then legacy templates start acting out...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@glye What's the advantage of using i18n.ini
to set the charset, when we can use site.ini/DatabaseSettings/Charset
directly, to specify the DB charset?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@emodric I did it for reasons that elude me know. It does look like you're right.
Anyway, I tried it with only DatabaseSettings/Charset too, and also doesn't work :/ |
@emodric So, like this? https://github.com/ezsystems/ezpublish-kernel/blob/fbf605375324c89bce00309f9ff5bc78e7b49d61/doc/upgrade/7.2.md |
Yep, but with that config, actual queries fail, which somehow get decoded/encoded wrong, so garbage is set to the DB. |
@emodric You have run the queries to change table charset, right? |
Yes, ofcourse :) I will test once more thorughly and report any issues. Slack is okay? |
Just checking :) Yep, I'm on slack. |
We should use the
utf8mb4
mysql charset which supports all of unicode, instead ofutf8
which does not support 4-byte characters (which includes some Chinese, emojis, and other things).Note that for collation I selectedWe will leave collation for the site admins to decide.utf8mb4_unicode_520_ci
instead ofutf8mb4_unicode_ci
, because the latter treats all 4-byte characters as the same. Ref https://core.trac.wordpress.org/ticket/32105#comment:3TODO
Separate PRs for db charset in app/config/config.yml and legacy site.iniNo can do, not compatible with non-mysql dbs, note in upgrade doc instead