Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set utf8mb4 charset and collation on mysql kernel schema and fixtures #2277

Merged
merged 5 commits into from
Apr 3, 2018

Conversation

glye
Copy link
Member

@glye glye commented Mar 14, 2018

Question Answer
JIRA issue EZP-28950
Bug fix no
New feature kind of
Target version 7.2
BC breaks kind of (shorter indexes)
Tests pass yes
Doc needed yes (for changing existing databases)

We should use the utf8mb4 mysql charset which supports all of unicode, instead of utf8 which does not support 4-byte characters (which includes some Chinese, emojis, and other things).

Note that for collation I selected utf8mb4_unicode_520_ci instead of utf8mb4_unicode_ci, because the latter treats all 4-byte characters as the same. Ref https://core.trac.wordpress.org/ticket/32105#comment:3 We will leave collation for the site admins to decide.

TODO

  • Go back to index length update on ezsearch_search_phrase
  • Add upgrade SQL
  • Add upgrade doc
  • Separate PRs for db charset in app/config/config.yml and legacy site.ini No can do, not compatible with non-mysql dbs, note in upgrade doc instead
  • Do this in 2.2 instead

@glye glye requested a review from andrerom March 14, 2018 17:09
@andrerom
Copy link
Contributor

@alongosz wdyt?

Copy link
Member

@alongosz alongosz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please be aware that *_unicode_520_ci exists since MySQL 5.6. Our Doc states that we support MySQL 5.5 also. But maybe the Doc is outdated? // @andrerom

This makes me wonder how it could be achieved in a generic (cross-DBMS) way by Doctrine DBAL Schema Tool that will be used in the future.

Other found issue:

@@ -1601,7 +1601,7 @@ CREATE TABLE `ezprest_authcode` (
`user_id` int(11) NOT NULL DEFAULT '0',
PRIMARY KEY (`id`),
KEY `authcode_client_id` (`client_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_520_ci;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This CREATE on my 10.1.28-MariaDB caused:

ERROR 1071 (42000) at line 1596: Specified key was too long; max key length is 767 bytes

AFAICS error exists also for some further CREATEs, but the first one is here.
I'm not sure what's the solution here.

Copy link
Member Author

@glye glye Mar 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 767 bytes limit is a hard limit on innodb indexes. So it seems we can't have indexes on varchars bigger than 191. (191 x 4 = 764).
https://stackoverflow.com/questions/1814532/1071-specified-key-was-too-long-max-key-length-is-767-bytes?answertab=votes#tab-top

But we can't just willy nilly truncate all indexed varchars to 191 either - many are 255 chars long. No idea why this doesn't break the tests, but anyway... dang. 😖
I confirm the error on local mysql.

@andrerom
Copy link
Contributor

andrerom commented Mar 15, 2018

Please be aware that *_unicode_520_ci exists since MySQL 5.6. Our Doc states that we support MySQL 5.5 also. But maybe the Doc is outdated? // @andrerom

It is not out of date per say, but if we merge it we will have to bump requirement to 5.6 for v2.1 (which is fine as long as we do it before stable).

That said I'm unsure if we should hard code COLLATE here at all, depending on language people are planning to use they might want to use something else taken from system config. But unsure what is best here, hence the ping to you.

Unsure why you get that error, did you find out more on why?

@glye
Copy link
Member Author

glye commented Mar 15, 2018

If we don't specify collation I reckon it will default to utf8mb4_unicode_ci which is also okay(ish). If you try to use a 3-byte collation with 4-byte charset you'll get an error, so we don't have to worry about that at least.

@glye
Copy link
Member Author

glye commented Mar 15, 2018

There are 6 cases of indexed varchars longer than 191. Last commit truncates these to 191, let's see what travis says about that.

NB: I don't know if the affected REST tables can contain 4-byte chars, but the ezsearch_search_phrase can, so that must be truncated at least. The others could perhaps stay on utf8 instead.

@glye
Copy link
Member Author

glye commented Mar 15, 2018

@alongosz @andrerom The tests pass with truncated varchars. What do you think about this?

  • Are the 191-char limits on REST IDs and search phrases OK?
  • Do we drop the COLLATE change and leave that to site admins?
  • Do we bump mysql requirement to 5.5.3 for utf8mb4 or to 5.6 for utf8mb4_unicode_520_ci?

@alongosz
Copy link
Member

Do we bump mysql requirement to 5.5.3 for utf8mb4 or to 5.6 for utf8mb4_unicode_520_ci?

Doc for 5.5 doesn't contain any reference to *_unicode_520_ci, so even if it exists there it's not guaranteed and we should recommend 5.6.

Do we drop the COLLATE change and leave that to site admins?

That would be the best approach, what needs to be checked is what is default here.
If it is inherited from database collation then this is fine.
If it is inherited from charset collation then it's not the best, but the idea of leaving it to db/site admins still works for me.

Are the 191-char limits on REST IDs and search phrases OK?

No. We can't truncate column lengths in general, it's a BC break. Workaround for that is introducing indices with specified length (we already have them). Not sure if it works for primary keys though, which is the case here also.

@glye
Copy link
Member Author

glye commented Mar 15, 2018

Doc for 5.5 doesn't contain any reference to *_unicode_520_ci, so even if it exists there it's not guaranteed and we should recommend 5.6.

unicode_520_ci is 5.6+ only. But if we don't specify that, we can bump just from 5.5 to 5.5.3, which is the minimum for 4-byte.

That would be the best approach, what needs to be checked is what is default here.

Default for utf8mb4 is utf8mb4_general_ci which is fine for all except emoticon collation, afaik.

If it is inherited from database collation then this is fine.
If it is inherited from charset collation then it's not the best, but the idea of leaving it to db/site admins still works for me.

The latter. If table charset is set but not collation, then the default collation for that charset is used.
So we leave it to admins, then.

No. We can't truncate column lengths in general, it's a BC break. Workaround for that is introducing indices with specified length (we already have them). Not sure if it works for primary keys though, which is the case here also.

Thanks! Done. It seems to be fine for primary keys too.

@glye glye force-pushed the set_mysql_utf8mb4_schema branch from ff6778b to b96f8a2 Compare March 15, 2018 14:03
@glye
Copy link
Member Author

glye commented Mar 15, 2018

@alongosz @andrerom Travis accepts the shortened indexes. 🎉 Ready for re-review.

Bumping requirements: If it's unproblematic to require mysql 5.6, we should. Otherwise, we should require 5.5.3 but recommend 5.6.

CREATE TABLE `ezsearch_search_phrase` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`phrase` varchar(250) DEFAULT NULL,
`phrase_count` int(11) DEFAULT '0',
`result_count` int(11) DEFAULT '0',
PRIMARY KEY (`id`),
UNIQUE KEY `ezsearch_search_phrase_phrase` (`phrase`),
UNIQUE KEY `ezsearch_search_phrase_phrase` (`phrase` (191)),
Copy link
Contributor

@andrerom andrerom Mar 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case specifically; given we introduce this in a new version where we don't so much care for BC.

I'd much rather make sure we use a size that works on the column on this case.
Why: this is search phrase, if code insert longer phrase then 191 it anyway means search on if won't really work beyond that as it won't be part of index. So unless I'm missing some thing can just as well change size of column and provide upgrade scripts for that.

Or?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

means search on if won't really work beyond that as it won't be part of index

I think you're right. It may return multiple different results, if the first 191 chars of the phrases are the same. Agree @alongosz ?

Apropos, I should provide upgrade scripts for all the changes anyway - but we should agree on the changes first :)

@glye
Copy link
Member Author

glye commented Mar 16, 2018

Regardless of whether we shorten the column or the (unique) index, there is a risk that the update script for this will fail due to duplicate entries. In that case can't really do anything but tell people to delete the extra entries. In the case of the ezsearch_search_phrase_phrase table they are not afaik relevant on new stack anyway, only in legacy.

There is precendence for changing charset: In 4.0 we added a script for changing to utf8.
https://github.com/ezsystems/ezpublish-legacy/blob/master/doc/features/4.0/dbcharserconversion.txt
https://github.com/ezsystems/ezpublish-legacy/blob/master/bin/php/ezconvertdbcharset.php
The doc describes some problems you may run into. Since utf8mb4 is fully backwards compatible with utf8 we don't have that problem, but index collisions can happen.

It does remind me that we should also change the database connection charset, ref app/config/config.yml and legacy site.ini.

@alongosz
Copy link
Member

In this case specifically; given we introduce this in a new version where we don't so much care for BC.

Hmm, @andrerom but you mean 8.0, right? Schema change is a schema change. It doesn't matter if we add column or change existing column length. Otherwise we could also allow for #1947. [1]

That said, I'm all in favor of schema changes via upgrade script for now. We just need to figure out a way to ensure users execute these scripts. I was thinking about hooking something into composer update to check for any clues on what else needs to be updated. It could go even beyond db when we need it.

[1] So I'm not saying we shouldn't do it now, just pointing out your previous objections in such cases :)

In this particular case I'm not really sure how to handle the case when update truncates the data, not to mention what @glye have said about index collisions.

@andrerom
Copy link
Contributor

andrerom commented Mar 19, 2018

Hmm, @andrerom but you mean 8.0, right? Schema change is a schema change. It doesn't matter if we add column or change existing column length. Otherwise we could also allow for #1947. [1]

No, I meant 7.1 or 7.2. There is a difference here: 1. that PR was made before we decided to allow for legacy to change a bit so we could start to evolve features that affected both, 2. that change is more severe, and will need less adoption on legacy.

That said, I'm all in favor of schema changes via upgrade script for now. We just need to figure out a way to ensure users execute these scripts. I was thinking about hooking something into composer update to check for any clues on what else needs to be updated. It could go even beyond db when we need it.

That would be DBAL schema migration, +1 to get there soon.

In the case of the ezsearch_search_phrase_phrase table they are not afaik relevant on new stack anyway, only in legacy.

@glye True, so lets keep the change here light then and only opt for index length update. What are the remaining todo's here? (Also keep in mind we can safely do this for v2.2 instead, by now that is looking safer overall)

@glye
Copy link
Member Author

glye commented Mar 19, 2018

Reverted column shortening. Moved to 2.2, and added upgrade doc.
@andrerom No todos remaining.

@andrerom
Copy link
Contributor

data/update/mysql/dbupdate-6.12.0-to-7.2.0.sql

This should be data/update/mysql/dbupdate-7.1.0-to-7.2.0.sql normally.

@glye
Copy link
Member Author

glye commented Mar 20, 2018

@andrerom Corrected.

@glye glye force-pushed the set_mysql_utf8mb4_schema branch from 0b516fd to fdb8311 Compare March 20, 2018 08:27
Copy link
Contributor

@andrerom andrerom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks ok now besides nitpick. But I'll hold off on +1 until 7.1 is out and branched of.


## MySQL/MariaDB database tables character set change

The character set for MySQL/MariaDB database tables is changed from `utf8` to `utf8mb4` to support 4-byte characters. See `data/update/mysql/dbupdate-6.12.0-to-7.2.0.sql` for the SQL upgrade statements.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

data/update/mysql/dbupdate-7.0.0-to-7.2.0.sql

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, thanks. Just making sure if 7.0 is not a typo, since you said 7.1 before...?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it should be 7.1 :)

@glye glye force-pushed the set_mysql_utf8mb4_schema branch from fdb8311 to 1db9cca Compare March 23, 2018 10:12
@glye
Copy link
Member Author

glye commented Mar 23, 2018

Updated according to last feedback. Cleaned up commits.

Copy link
Contributor

@andrerom andrerom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MERGE: Wait unti 7.1 is branched of

@glye glye force-pushed the set_mysql_utf8mb4_schema branch from 1db9cca to 6923a98 Compare March 26, 2018 15:02
@glye glye merged commit 0f82a08 into master Apr 3, 2018
@glye glye deleted the set_mysql_utf8mb4_schema branch April 3, 2018 12:48
```
Also make the corresponding change in `app/config/dfs/dfs.yml`.

For legacy, in `ezpublish_legacy/settings/i18n.ini`, set the following:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my case, this results in heavy errors in legacy. I think it has to do with the fact that legacy is unable to read the INI files now, every call to $ini->variable() returns false. Any tips @glye @andrerom ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@emodric That Should Not Happen™ and didn't for me. Are you saying it expects the ini files to have utf8mb4 charset, and returns nothing because that charset doesn't exist in this context? Disturbing news :(

Copy link
Contributor

@emodric emodric May 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not 100% sure that's the case, but it does sound like it. The only outcome that I witnessed is that reading any INI variable just returns false.

EDIT: Once I change all INI files to declare utf8mb4 charset, then legacy templates start acting out...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@glye What's the advantage of using i18n.ini to set the charset, when we can use site.ini/DatabaseSettings/Charset directly, to specify the DB charset?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@emodric I did it for reasons that elude me know. It does look like you're right.

@emodric
Copy link
Contributor

emodric commented May 22, 2018

Anyway, I tried it with only DatabaseSettings/Charset too, and also doesn't work :/

@glye
Copy link
Member Author

glye commented May 22, 2018

@emodric So, like this? https://github.com/ezsystems/ezpublish-kernel/blob/fbf605375324c89bce00309f9ff5bc78e7b49d61/doc/upgrade/7.2.md
You reverted the charset declarations in your ini files? And it fails in the same way as before?

@emodric
Copy link
Contributor

emodric commented May 22, 2018

Yep, but with that config, actual queries fail, which somehow get decoded/encoded wrong, so garbage is set to the DB.

@glye
Copy link
Member Author

glye commented May 22, 2018

@emodric You have run the queries to change table charset, right?

@emodric
Copy link
Contributor

emodric commented May 22, 2018

Yes, ofcourse :)

I will test once more thorughly and report any issues. Slack is okay?

@glye
Copy link
Member Author

glye commented May 22, 2018

Just checking :) Yep, I'm on slack.

@glye
Copy link
Member Author

glye commented May 23, 2018

@emodric It seems this isn't resolved yet, but I made the PR for the doc change anyway, since I think you're right: #2337

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

4 participants