Set utf8mb4 charset and collation on mysql kernel schema and fixtures #2277

glye · 2018-03-14T17:04:22Z

Question	Answer
JIRA issue	EZP-28950
Bug fix	no
New feature	kind of
Target version	7.2
BC breaks	kind of (shorter indexes)
Tests pass	yes
Doc needed	yes (for changing existing databases)

We should use the utf8mb4 mysql charset which supports all of unicode, instead of utf8 which does not support 4-byte characters (which includes some Chinese, emojis, and other things).

Note that for collation I selected utf8mb4_unicode_520_ci instead of utf8mb4_unicode_ci, because the latter treats all 4-byte characters as the same. Ref https://core.trac.wordpress.org/ticket/32105#comment:3 We will leave collation for the site admins to decide.

TODO

Go back to index length update on ezsearch_search_phrase
Add upgrade SQL
Add upgrade doc
~~Separate PRs for db charset in app/config/config.yml and legacy site.ini~~ No can do, not compatible with non-mysql dbs, note in upgrade doc instead
Do this in 2.2 instead

andrerom · 2018-03-15T07:42:14Z

@alongosz wdyt?

alongosz

Please be aware that *_unicode_520_ci exists since MySQL 5.6. Our Doc states that we support MySQL 5.5 also. But maybe the Doc is outdated? // @andrerom

This makes me wonder how it could be achieved in a generic (cross-DBMS) way by Doctrine DBAL Schema Tool that will be used in the future.

Other found issue:

alongosz · 2018-03-15T10:52:27Z

data/mysql/schema.sql

@@ -1601,7 +1601,7 @@ CREATE TABLE `ezprest_authcode` (
  `user_id` int(11) NOT NULL DEFAULT '0',
  PRIMARY KEY (`id`),
  KEY `authcode_client_id` (`client_id`)
-) ENGINE=InnoDB DEFAULT CHARSET=utf8;
+) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_520_ci;


This CREATE on my 10.1.28-MariaDB caused:

ERROR 1071 (42000) at line 1596: Specified key was too long; max key length is 767 bytes

AFAICS error exists also for some further CREATEs, but the first one is here.
I'm not sure what's the solution here.

The 767 bytes limit is a hard limit on innodb indexes. So it seems we can't have indexes on varchars bigger than 191. (191 x 4 = 764).
https://stackoverflow.com/questions/1814532/1071-specified-key-was-too-long-max-key-length-is-767-bytes?answertab=votes#tab-top

But we can't just willy nilly truncate all indexed varchars to 191 either - many are 255 chars long. No idea why this doesn't break the tests, but anyway... dang. 😖
I confirm the error on local mysql.

andrerom · 2018-03-15T11:20:04Z

Please be aware that *_unicode_520_ci exists since MySQL 5.6. Our Doc states that we support MySQL 5.5 also. But maybe the Doc is outdated? // @andrerom

It is not out of date per say, but if we merge it we will have to bump requirement to 5.6 for v2.1 (which is fine as long as we do it before stable).

That said I'm unsure if we should hard code COLLATE here at all, depending on language people are planning to use they might want to use something else taken from system config. But unsure what is best here, hence the ping to you.

Unsure why you get that error, did you find out more on why?

glye · 2018-03-15T11:49:30Z

If we don't specify collation I reckon it will default to utf8mb4_unicode_ci which is also okay(ish). If you try to use a 3-byte collation with 4-byte charset you'll get an error, so we don't have to worry about that at least.

glye · 2018-03-15T12:17:44Z

There are 6 cases of indexed varchars longer than 191. Last commit truncates these to 191, let's see what travis says about that.

NB: I don't know if the affected REST tables can contain 4-byte chars, but the ezsearch_search_phrase can, so that must be truncated at least. The others could perhaps stay on utf8 instead.

glye · 2018-03-15T12:46:24Z

@alongosz @andrerom The tests pass with truncated varchars. What do you think about this?

Are the 191-char limits on REST IDs and search phrases OK?
Do we drop the COLLATE change and leave that to site admins?
Do we bump mysql requirement to 5.5.3 for utf8mb4 or to 5.6 for utf8mb4_unicode_520_ci?

alongosz · 2018-03-15T13:20:04Z

Do we bump mysql requirement to 5.5.3 for utf8mb4 or to 5.6 for utf8mb4_unicode_520_ci?

Doc for 5.5 doesn't contain any reference to *_unicode_520_ci, so even if it exists there it's not guaranteed and we should recommend 5.6.

Do we drop the COLLATE change and leave that to site admins?

That would be the best approach, what needs to be checked is what is default here.
If it is inherited from database collation then this is fine.
If it is inherited from charset collation then it's not the best, but the idea of leaving it to db/site admins still works for me.

Are the 191-char limits on REST IDs and search phrases OK?

No. We can't truncate column lengths in general, it's a BC break. Workaround for that is introducing indices with specified length (we already have them). Not sure if it works for primary keys though, which is the case here also.

glye · 2018-03-15T13:38:15Z

Doc for 5.5 doesn't contain any reference to *_unicode_520_ci, so even if it exists there it's not guaranteed and we should recommend 5.6.

unicode_520_ci is 5.6+ only. But if we don't specify that, we can bump just from 5.5 to 5.5.3, which is the minimum for 4-byte.

That would be the best approach, what needs to be checked is what is default here.

Default for utf8mb4 is utf8mb4_general_ci which is fine for all except emoticon collation, afaik.

If it is inherited from database collation then this is fine.
If it is inherited from charset collation then it's not the best, but the idea of leaving it to db/site admins still works for me.

The latter. If table charset is set but not collation, then the default collation for that charset is used.
So we leave it to admins, then.

No. We can't truncate column lengths in general, it's a BC break. Workaround for that is introducing indices with specified length (we already have them). Not sure if it works for primary keys though, which is the case here also.

Thanks! Done. It seems to be fine for primary keys too.

glye · 2018-03-15T14:25:42Z

@alongosz @andrerom Travis accepts the shortened indexes. 🎉 Ready for re-review.

Bumping requirements: If it's unproblematic to require mysql 5.6, we should. Otherwise, we should require 5.5.3 but recommend 5.6.

andrerom · 2018-03-16T10:57:43Z

data/mysql/schema.sql

 CREATE TABLE `ezsearch_search_phrase` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `phrase` varchar(250) DEFAULT NULL,
  `phrase_count` int(11) DEFAULT '0',
  `result_count` int(11) DEFAULT '0',
  PRIMARY KEY (`id`),
-  UNIQUE KEY `ezsearch_search_phrase_phrase` (`phrase`),
+  UNIQUE KEY `ezsearch_search_phrase_phrase` (`phrase` (191)),


In this case specifically; given we introduce this in a new version where we don't so much care for BC.

I'd much rather make sure we use a size that works on the column on this case.
Why: this is search phrase, if code insert longer phrase then 191 it anyway means search on if won't really work beyond that as it won't be part of index. So unless I'm missing some thing can just as well change size of column and provide upgrade scripts for that.

Or?

means search on if won't really work beyond that as it won't be part of index

I think you're right. It may return multiple different results, if the first 191 chars of the phrases are the same. Agree @alongosz ?

Apropos, I should provide upgrade scripts for all the changes anyway - but we should agree on the changes first :)

glye · 2018-03-16T14:12:08Z

Regardless of whether we shorten the column or the (unique) index, there is a risk that the update script for this will fail due to duplicate entries. In that case can't really do anything but tell people to delete the extra entries. In the case of the ezsearch_search_phrase_phrase table they are not afaik relevant on new stack anyway, only in legacy.

There is precendence for changing charset: In 4.0 we added a script for changing to utf8.
https://github.com/ezsystems/ezpublish-legacy/blob/master/doc/features/4.0/dbcharserconversion.txt
https://github.com/ezsystems/ezpublish-legacy/blob/master/bin/php/ezconvertdbcharset.php
The doc describes some problems you may run into. Since utf8mb4 is fully backwards compatible with utf8 we don't have that problem, but index collisions can happen.

It does remind me that we should also change the database connection charset, ref app/config/config.yml and legacy site.ini.

alongosz · 2018-03-19T08:40:36Z

In this case specifically; given we introduce this in a new version where we don't so much care for BC.

Hmm, @andrerom but you mean 8.0, right? Schema change is a schema change. It doesn't matter if we add column or change existing column length. Otherwise we could also allow for #1947. [1]

That said, I'm all in favor of schema changes via upgrade script for now. We just need to figure out a way to ensure users execute these scripts. I was thinking about hooking something into composer update to check for any clues on what else needs to be updated. It could go even beyond db when we need it.

[1] So I'm not saying we shouldn't do it now, just pointing out your previous objections in such cases :)

In this particular case I'm not really sure how to handle the case when update truncates the data, not to mention what @glye have said about index collisions.

andrerom · 2018-03-19T08:52:32Z

Hmm, @andrerom but you mean 8.0, right? Schema change is a schema change. It doesn't matter if we add column or change existing column length. Otherwise we could also allow for #1947. [1]

No, I meant 7.1 or 7.2. There is a difference here: 1. that PR was made before we decided to allow for legacy to change a bit so we could start to evolve features that affected both, 2. that change is more severe, and will need less adoption on legacy.

That said, I'm all in favor of schema changes via upgrade script for now. We just need to figure out a way to ensure users execute these scripts. I was thinking about hooking something into composer update to check for any clues on what else needs to be updated. It could go even beyond db when we need it.

That would be DBAL schema migration, +1 to get there soon.

In the case of the ezsearch_search_phrase_phrase table they are not afaik relevant on new stack anyway, only in legacy.

@glye True, so lets keep the change here light then and only opt for index length update. What are the remaining todo's here? (Also keep in mind we can safely do this for v2.2 instead, by now that is looking safer overall)

glye · 2018-03-19T10:38:08Z

Reverted column shortening. Moved to 2.2, and added upgrade doc.
@andrerom No todos remaining.

andrerom · 2018-03-19T17:39:52Z

data/update/mysql/dbupdate-6.12.0-to-7.2.0.sql

This should be data/update/mysql/dbupdate-7.1.0-to-7.2.0.sql normally.

glye · 2018-03-20T08:26:03Z

@andrerom Corrected.

andrerom

Looks ok now besides nitpick. But I'll hold off on +1 until 7.1 is out and branched of.

andrerom · 2018-03-20T19:21:18Z

doc/upgrade/7.2.md

+
+## MySQL/MariaDB database tables character set change
+
+The character set for MySQL/MariaDB database tables is changed from `utf8` to `utf8mb4` to support 4-byte characters. See `data/update/mysql/dbupdate-6.12.0-to-7.2.0.sql` for the SQL upgrade statements.


data/update/mysql/dbupdate-7.0.0-to-7.2.0.sql

Right, thanks. Just making sure if 7.0 is not a typo, since you said 7.1 before...?

yes, it should be 7.1 :)

glye · 2018-03-23T10:12:55Z

Updated according to last feedback. Cleaned up commits.

andrerom

MERGE: Wait unti 7.1 is branched of

emodric · 2018-05-21T12:56:09Z

doc/upgrade/7.2.md

+```
+Also make the corresponding change in `app/config/dfs/dfs.yml`.
+
+For legacy, in `ezpublish_legacy/settings/i18n.ini`, set the following:


In my case, this results in heavy errors in legacy. I think it has to do with the fact that legacy is unable to read the INI files now, every call to $ini->variable() returns false. Any tips @glye @andrerom ?

@emodric That Should Not Happen™ and didn't for me. Are you saying it expects the ini files to have utf8mb4 charset, and returns nothing because that charset doesn't exist in this context? Disturbing news :(

I'm not 100% sure that's the case, but it does sound like it. The only outcome that I witnessed is that reading any INI variable just returns false.

EDIT: Once I change all INI files to declare utf8mb4 charset, then legacy templates start acting out...

@glye What's the advantage of using i18n.ini to set the charset, when we can use site.ini/DatabaseSettings/Charset directly, to specify the DB charset?

@emodric I did it for reasons that elude me know. It does look like you're right.

emodric · 2018-05-22T11:17:12Z

Anyway, I tried it with only DatabaseSettings/Charset too, and also doesn't work :/

glye · 2018-05-22T11:38:00Z

@emodric So, like this? https://github.com/ezsystems/ezpublish-kernel/blob/fbf605375324c89bce00309f9ff5bc78e7b49d61/doc/upgrade/7.2.md
You reverted the charset declarations in your ini files? And it fails in the same way as before?

emodric · 2018-05-22T12:38:06Z

Yep, but with that config, actual queries fail, which somehow get decoded/encoded wrong, so garbage is set to the DB.

glye · 2018-05-22T12:50:37Z

@emodric You have run the queries to change table charset, right?

emodric · 2018-05-22T12:53:35Z

Yes, ofcourse :)

I will test once more thorughly and report any issues. Slack is okay?

glye · 2018-05-22T12:54:35Z

Just checking :) Yep, I'm on slack.

glye · 2018-05-23T08:33:31Z

@emodric It seems this isn't resolved yet, but I made the PR for the doc change anyway, since I think you're right: #2337

glye requested a review from andrerom March 14, 2018 17:09

alongosz requested changes Mar 15, 2018

View reviewed changes

glye force-pushed the set_mysql_utf8mb4_schema branch from ff6778b to b96f8a2 Compare March 15, 2018 14:03

andrerom reviewed Mar 16, 2018

View reviewed changes

glye mentioned this pull request Mar 19, 2018

EZP-28950: MySQL UTF8 doesn't support 4-byte chars ezsystems/ezplatform#278

Closed

glye force-pushed the set_mysql_utf8mb4_schema branch from 0b516fd to fdb8311 Compare March 20, 2018 08:27

andrerom suggested changes Mar 20, 2018

View reviewed changes

glye force-pushed the set_mysql_utf8mb4_schema branch from fdb8311 to 1db9cca Compare March 23, 2018 10:12

andrerom approved these changes Mar 23, 2018

View reviewed changes

glye added 5 commits March 26, 2018 17:02

Set utf8mb4 charset and collation on mysql kernel schema and fixtures

17e72c4

Don't specify collation

929335e

Truncate varchar index lengths to 191 chars

83def35

Add update SQL

3e87f9d

Moved change to 2.2 / 7.2, added doc

6923a98

glye force-pushed the set_mysql_utf8mb4_schema branch from 1db9cca to 6923a98 Compare March 26, 2018 15:02

glye merged commit 0f82a08 into master Apr 3, 2018

glye deleted the set_mysql_utf8mb4_schema branch April 3, 2018 12:48

jacek-foremski mentioned this pull request May 10, 2018

EZP-29110: Changing tables charset to utf8mb4 result in "Specified key is too long" errors #2326

Merged

4 tasks

emodric reviewed May 21, 2018

View reviewed changes

glye mentioned this pull request May 23, 2018

EZP-28950 Correct legacy instructions, site.ini is the best place for DB charset #2337

Merged


		## MySQL/MariaDB database tables character set change

		The character set for MySQL/MariaDB database tables is changed from `utf8` to `utf8mb4` to support 4-byte characters. See `data/update/mysql/dbupdate-6.12.0-to-7.2.0.sql` for the SQL upgrade statements.

Set utf8mb4 charset and collation on mysql kernel schema and fixtures #2277

Set utf8mb4 charset and collation on mysql kernel schema and fixtures #2277

Conversation

glye commented Mar 14, 2018 • edited Loading

andrerom commented Mar 15, 2018

alongosz left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

glye Mar 15, 2018 • edited Loading

Choose a reason for hiding this comment

andrerom commented Mar 15, 2018 • edited Loading

glye commented Mar 15, 2018

glye commented Mar 15, 2018 • edited Loading

glye commented Mar 15, 2018

alongosz commented Mar 15, 2018

glye commented Mar 15, 2018 • edited Loading

glye commented Mar 15, 2018

andrerom Mar 16, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

glye commented Mar 16, 2018

alongosz commented Mar 19, 2018

andrerom commented Mar 19, 2018 • edited Loading

glye commented Mar 19, 2018 • edited Loading

andrerom commented Mar 19, 2018

glye commented Mar 20, 2018

andrerom left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

glye commented Mar 23, 2018

andrerom left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

emodric May 22, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

emodric commented May 22, 2018

glye commented May 22, 2018

emodric commented May 22, 2018

glye commented May 22, 2018

emodric commented May 22, 2018

glye commented May 22, 2018

glye commented May 23, 2018

glye commented Mar 14, 2018 •

edited

Loading

glye Mar 15, 2018 •

edited

Loading

andrerom commented Mar 15, 2018 •

edited

Loading

glye commented Mar 15, 2018 •

edited

Loading

glye commented Mar 15, 2018 •

edited

Loading

andrerom Mar 16, 2018 •

edited

Loading

andrerom commented Mar 19, 2018 •

edited

Loading

glye commented Mar 19, 2018 •

edited

Loading

emodric May 22, 2018 •

edited

Loading