Set table collation to match core tables #329

colemanw · 2024-02-08T16:30:28Z

Overview

Fixes inconsistent collation between extension tables (including core extensions like SearchKit) and core tables.
See discussion at https://chat.civicrm.org/civicrm/pl/g3jt55akb3rk8dzgtqcr3bjjqy

Before

All core tables are ENGINE=InnoDB DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci ROW_FORMAT=DYNAMIC;
All extension tables are ENGINE=InnoDB

After

All core and extension tables are ENGINE=InnoDB DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci ROW_FORMAT=DYNAMIC;

colemanw · 2024-02-13T13:44:21Z

@totten it seems that this humble patch has sparked a much bigger discussion full of utopian dreams and aspirations...
but meanwhile this is still causing problems in the real world. For example, this latest report on the symptoms.

So while I really do like the direction of the utopian discussions, how about we merge this in the meantime?

totten · 2024-02-13T22:41:09Z

@colemanw OK, I want this to make sense. On general principle, the thing to consider is whether this will be bouncy (fix one environment, break another). My best rationalization in favor of it is:

The current rules for character-set/collation in civicrm-core originated in 5.33 (civicrm/civicrm-core@828c691#diff-f73a612034e9487c9dcca2569aeb7f42ce8e03a8932fbf51d4a5ad90e14c388eL5).
Since civix moved to <upgrader> tag, the generated entities have required civicrm-core@5.38+.
So if you're generating ext-SQL, then you're running on 5.38+, and the collations match... right?

But looking at the change from 5.33... and looking at current upgrader/status-checks... it doesn't look like we ever fixed the character-sets/collations of databases with historical data. If you originally installed CiviCRM 4.6 or 5.0 or 5.10 (or anything up to 5.33), then all your tables were initialized with utf8/utf8_unicode_ci, and nothing ever prompted you to change. So if you install an extension that forces utf8mb4/utf8mb4_unicode_ci, then the schema will be mismatched. Right?

Doesn't this just shift the bad-schema-situation to different users?
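
To make that scenario concrete, here is a minimal sketch of how one might spot a mixed schema. It assumes the table-to-collation pairs have already been read from somewhere like information_schema.TABLES (TABLE_NAME, TABLE_COLLATION); the function names and the sample table civicrm_example_ext are hypothetical and not part of civix or civicrm-core.

```python
from collections import defaultdict


def group_by_collation(table_collations):
    """Group table names by collation.

    `table_collations` maps table name -> collation name.
    Returns a dict of collation -> sorted list of table names.
    """
    groups = defaultdict(list)
    for table, collation in table_collations.items():
        groups[collation].append(table)
    return {collation: sorted(tables) for collation, tables in groups.items()}


def has_mixed_collations(table_collations):
    """True when the tables do not all share a single collation."""
    return len(set(table_collations.values())) > 1


# Illustrative data only: a site first installed before 5.33 (utf8 tables)
# that later installs an extension whose SQL forces utf8mb4.
legacy_site = {
    "civicrm_contact": "utf8_unicode_ci",
    "civicrm_activity": "utf8_unicode_ci",
    "civicrm_example_ext": "utf8mb4_unicode_ci",  # hypothetical extension table
}

for collation, tables in group_by_collation(legacy_site).items():
    print(collation, tables)
print("mixed:", has_mixed_collations(legacy_site))
```

On the sample data the extension table ends up alone in its own collation group, which is the mismatch being asked about.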

(cc @seamuslee001 b/c I think you all followed the history of utf8(mb4) more)

Aside: I think I have a better way to maintain/hook-into the CRM_*_Upgrader and provide auto-generation. Pushing on that more today...

colemanw · 2024-02-14T00:12:31Z

> Doesn't this just shift the bad-schema-situation to different users?

Well, I think that ship might have already sailed. Because if we wanted to avoid forcing the collation when we add new tables, then why did we do this?

https://github.com/civicrm/civicrm-core/blob/b11418429c627738ffaa9f2face1d6c68af3bd61/CRM/Upgrade/Incremental/sql/5.47.alpha1.mysql.tpl#L11

And this?

https://github.com/civicrm/civicrm-core/blob/3a3d7c89e1f4d9d45fb21a6a9bf769c4d4a361da/CRM/Upgrade/Incremental/sql/5.50.alpha1.mysql.tpl#L20

mattwire · 2024-02-14T10:56:19Z

@colemanw @totten I remember a time when we had specified UTF8 in lots of extensions and it caused major problems when switching systems to utf8mb4 (because the SQL forced the extension to UTF8 instead of using the database default which was set to utf8mb4). So the decision at the time was to specifically exclude the database collation from the SQL and rely on you setting the default collation on the database once only. That allows for an easy future change of collation and also makes it easier if someone decides they need to run with another collation. I think there may have been some issues with forcing ROW_FORMAT as well. @artfulrobot I seem to remember you having some thoughts around this.

colemanw · 2024-02-14T12:47:49Z

@mattwire you're correct about the history, but that has left us in a place of sometimes relying on db default and sometimes not. After a long discussion about it on ~dev this is what we've come to:

Civi core always forces collation when creating new tables, even when adding them to existing sites, and civix-generated extensions do not. For some sites this inconsistency causes serious problems. So 3 steps toward the ideal fix of both consistency and configurability would be:

1. Stop the bleeding (this patch).
2. Consistently generate install SQL from a centralized place in both core and extensions.
3. Stop generating SQL files on disk. Generate & run SQL in-memory during install/upgrade/uninstall so we never hard-code collations. This opens the door for a configurable setting for db encoding/collation (which could be a specific value like core does, or "none" to respect db default like extensions do). See this project which is making good progress toward that goal.
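
As a rough illustration of step 3, the sketch below builds the table-options clause at run time from a single setting. It is only a sketch under stated assumptions: the function names, the setting values, and the sample table are hypothetical, and the real tooling is PHP, not Python. The only things taken from this PR are the two option strings (the explicit core one and the bare ENGINE=InnoDB one).

```python
def table_options(collation_setting):
    """Return the options clause that follows the column list in CREATE TABLE.

    `collation_setting` is either a collation name (e.g. "utf8mb4_unicode_ci")
    or the string "none", meaning: leave charset/collation to the db default.
    """
    if collation_setting == "none":
        return "ENGINE=InnoDB"
    # MySQL collation names start with the name of their character set.
    charset = collation_setting.split("_", 1)[0]
    return (
        f"ENGINE=InnoDB DEFAULT CHARACTER SET {charset} "
        f"COLLATE {collation_setting} ROW_FORMAT=DYNAMIC"
    )


def create_table_sql(name, column_defs, collation_setting):
    """Assemble a CREATE TABLE statement in memory (nothing written to disk)."""
    columns = ",\n  ".join(column_defs)
    return f"CREATE TABLE `{name}` (\n  {columns}\n) {table_options(collation_setting)};"


# Hypothetical extension table, generated once per policy.
columns = ["`id` int unsigned NOT NULL AUTO_INCREMENT", "PRIMARY KEY (`id`)"]
print(create_table_sql("civicrm_example_ext", columns, "utf8mb4_unicode_ci"))
print(create_table_sql("civicrm_example_ext", columns, "none"))
```

With "utf8mb4_unicode_ci" the clause matches what core uses today; with "none" it matches what civix-generated extensions produced before this patch.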

mattwire · 2024-02-14T14:26:28Z

I'm all in favour of 2 and 3. Just not sure that 1 is a good idea. I have over 100 extensions on my local dev environment and none of them specify database charset/collation in their SQL, except some really old ones that specify utf8. I'd rather not go through a round of updating SQL files only to have them replaced by 2 and 3, potentially requiring another round of updates.
I agree this is sometimes a problem, and it's one of the checks I always do when taking on an existing CiviCRM site: I check and update all tables to utf8mb4_unicode_ci and make sure that is the database default. Note I've had problems before with a mixture of utf8mb4_general_ci and utf8mb4_unicode_ci, as Drupal seems to prefer the first one.
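
A check-and-update pass like that could be sketched as follows. This is an assumption-laden illustration, not an existing tool: the helper names and sample tables are hypothetical, and it only generates MySQL statements (ALTER TABLE ... CONVERT TO CHARACTER SET and ALTER DATABASE ... CHARACTER SET) for a human to review. Converting rewrites table data, so the output would need a backup and a careful look before being run.

```python
TARGET_CHARSET = "utf8mb4"
TARGET_COLLATION = "utf8mb4_unicode_ci"


def conversion_statements(table_collations, charset=TARGET_CHARSET,
                          collation=TARGET_COLLATION):
    """Return ALTER TABLE statements for tables not already on `collation`.

    `table_collations` maps table name -> current collation.
    """
    return [
        f"ALTER TABLE `{table}` CONVERT TO CHARACTER SET {charset} COLLATE {collation};"
        for table, current in sorted(table_collations.items())
        if current != collation
    ]


def database_default_statement(db_name, charset=TARGET_CHARSET,
                               collation=TARGET_COLLATION):
    """Return the statement that sets the database-level default."""
    return f"ALTER DATABASE `{db_name}` CHARACTER SET {charset} COLLATE {collation};"


# Illustrative data only; table and database names are made up.
site_tables = {
    "civicrm_contact": "utf8_unicode_ci",
    "civicrm_example_ext": "utf8mb4_unicode_ci",
    "example_cms_node": "utf8mb4_general_ci",
}

print(database_default_statement("example_db"))
for statement in conversion_statements(site_tables):
    print(statement)
```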

mattwire · 2024-02-14T14:31:51Z

An aside to this, and probably more important as it's certainly tripped me up a few times, is when extensions have tables that don't start with civicrm_, e.g. cividiscount_ and civirule_, as these tend to get ignored by things like System.utf8conversion and detailed logging.

colemanw · 2024-02-14T14:35:16Z

Cool.

I guess I didn't see this PR so much as "require a bunch of extra work from ext maintainers" as "prevent extra work or regressions". At this point we've patched all core extensions to be consistent with core, but if we don't merge this update and someone reruns civix generate:entity-boilerplate, then those fixes will be automatically undone.

I think it's otherwise fine for you to sit on your 100 extensions and not update them until the full fix is ready.

artfulrobot · 2024-02-14T14:39:32Z

I have this note to self:

Note never use utf8mb4_general_ci, prefer utf8mb4_unicode_ci. If there's one utf8mb4_0900_ai_ci then even better because it is (a) a UNICODE standard algorithm, and (b) it includes accent insensitivity, which is useful in the same way that case insensitivity is. (But I say this as an English speaker who rarely uses accents, can't find them on the keyboard sometimes, so this might be a narrow viewpoint).

As for ROW_FORMAT, I'm OK with that as a default, as long as we're allowed to change it. I save megagigapetaexa bytes by choosing COMPRESSED, for example, on tables like Activities, which has rare updates and lots of additions with similar data. It makes searching the Activities table much faster too, as the db can load more rows per page. I discovered/developed this change when a particularly busy site crashed the db through an extension that was creating excessive activities - a disk-full crash on a database is not a nice thing to recover from.

colemanw · 2024-02-14T14:43:18Z

> when extensions have tables that don't start with civicrm_, e.g. cividiscount_ and civirule_, as these tend to get ignored

Gawd why do we bother maintaining a complete list of all dao tables when stuff like that just ignores it in favor of string-fu?

colemanw · 2024-02-14T14:49:46Z

@artfulrobot

> I save megagigapetaexa bytes

Wow that's a lot! By my calculations that's 10^6 * 10^9 * 10^15 * 10^18! 🤓

But yea we should take into account that some super-smart admins might want to change row format and not be too restrictive about that. Or if we were super-smart maybe we could come up with some better defaults for that setting per-table.

mattwire · 2024-02-14T14:53:41Z

> > when extensions have tables that don't start with civicrm_, e.g. cividiscount_ and civirule_, as these tend to get ignored
>
> Gawd why do we bother maintaining a complete list of all dao tables when stuff like that just ignores it in favor of string-fu?

Yeah.. then we have things like CRM_Core_DAO::getTableNames() that just returns every table that starts with civicrm_ and I think that AllCoreTables only returns tables linked to an entity - so possibly including non standard names?
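
The gap between the two approaches can be shown with a small sketch. It does not reproduce CRM_Core_DAO::getTableNames() or AllCoreTables; it only contrasts a prefix filter with an explicit entity-table list, using hypothetical helper names and illustrative table names.

```python
def tables_by_prefix(all_tables, prefix="civicrm_"):
    """String matching: keep only tables whose name starts with `prefix`."""
    return sorted(t for t in all_tables if t.startswith(prefix))


def missed_by_prefix(entity_tables, prefix="civicrm_"):
    """Tables that are linked to an entity but that a prefix filter would skip."""
    return sorted(t for t in entity_tables if not t.startswith(prefix))


# Illustrative table names only.
all_db_tables = [
    "civicrm_contact",
    "civicrm_activity",
    "cividiscount_item",
    "civirule_rule",
    "unrelated_cms_table",
]
entity_tables = [
    "civicrm_contact",
    "civicrm_activity",
    "cividiscount_item",
    "civirule_rule",
]

print(tables_by_prefix(all_db_tables))
print(missed_by_prefix(entity_tables))
```

Anything that walks tables by prefix alone, such as a charset conversion or a logging setup, would silently skip the second list.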

artfulrobot · 2024-02-14T15:05:21Z

> Wow that's a lot! By my calculations that's 10^6 * 10^9 * 10^15 * 10^18! 🤓

And yet it's still a tiny number compared to the possible moves in a game of Go - although to be fair, that number is calculated to exceed the number of atoms in the universe, so... 😆

totten · 2024-09-25T23:58:36Z

Previously replaced by #331

Set table collation to match core tables

cd14b80

colemanw mentioned this pull request Feb 13, 2024

api\v4\EckEntity\EckEntityTest::testTwoEntityTypes() fails with DB Error systopia/de.systopia.eck#115

Closed

colemanw added a commit to colemanw/civicrm-core that referenced this pull request Feb 14, 2024

Ext - Specify COLLATE when creating tables

5cc8b1c

This uses the new syntax generated by totten/civix#329

colemanw mentioned this pull request Feb 14, 2024

Ext - Specify COLLATE when creating tables civicrm/civicrm-core#29384

Merged

colemanw added a commit to civicrm/org.civicrm.contactlayout that referenced this pull request Feb 14, 2024

Fix install collation

3d9b954

See totten/civix#329

colemanw mentioned this pull request Feb 14, 2024

Fix install collation civicrm/org.civicrm.contactlayout#141

Merged

colemanw added a commit to colemanw/uk.co.vedaconsulting.mosaico that referenced this pull request Feb 14, 2024

Fix install collation

69ad2fe

See totten/civix#329

colemanw mentioned this pull request Feb 14, 2024

Fix install collation veda-consulting-company/uk.co.vedaconsulting.mosaico#634

Closed

eileenmcnaughton pushed a commit to eileenmcnaughton/civicrm-core that referenced this pull request Feb 16, 2024

Ext - Specify COLLATE when creating tables

299bf5e

This uses the new syntax generated by totten/civix#329

totten mentioned this pull request Feb 22, 2024

(dev/core#4999) Support Entity Framework v2 for extensions #331

Merged

mlutfy pushed a commit to coopsymbiotic/coop.symbiotic.timetrack that referenced this pull request Aug 30, 2024

Fix install collation

c929d17

See totten/civix#329

totten closed this Sep 25, 2024
