
[REF] Refactor Smart Group Cache population code to be less intensive #15588

Merged

Conversation

seamuslee001
Contributor

Overview

Refactor Smart Group Cache population to be less intensive. AUG has deployed this on our system and it appears to be working fine. This is from @mattwire: https://github.com/mattwire/civicrm-groupaclcache/blob/master/patches/master...mattwire:refactor_groupaclcache.diff#L108

Before

Cache building can be intensive

After

Cache building less intensive

ping @mattwire @eileenmcnaughton @monishdeb @JoeMurray

@civibot

civibot bot commented Oct 23, 2019

(Standard links)

@civibot civibot bot added the master label Oct 23, 2019
@@ -480,30 +480,15 @@ public static function load(&$group, $force = FALSE) {
return;
}

// grab a lock so other processes don't compete and do the same query
$lock = Civi::lockManager()->acquire("data.core.group.{$groupID}");
Contributor

Why are we removing this - seems like a good thing?

Contributor Author

@eileenmcnaughton it's moved further down

Contributor

Yes, but why? It seems like a good thing where it was.

Contributor Author

We should only lock the database for as short a time as we need, and that is when we are doing the actual operations on civicrm_group_contact_cache. We're using a temp table much more than the previous code did.

Contributor

We're not locking the database, though - we are specifically telling other processes that want to rebuild the cache for this specific group not to, because a process is already doing that.

Contributor

It's as @seamuslee001 says. We're telling other processes, via a static self::$_alreadyLoaded[$groupID] = 1; at the start of the function, that we're already building. But the building can take a significant (in CPU/database terms) amount of time, and we only want the MySQL lock (which is what Civi::lockManager() gives us) for the shortest amount of time, which is why it's moved down.

Contributor

@mfb Oct 23, 2019

I still think the broken lock system should be gotten rid of, rather than allowing it to be configured to work on certain versions of MySQL. It causes weird side effects, like this desire to minimize the time that locks are held. A simple database table for scoped locks works fine.

Contributor

I don't think it IS broken if you have the define set - which @mattwire proposes to do later on.

@mattwire I think you are saying 'we only really need to lock it when we are moving data from the temp table to the live table'. So that means we have 2 processes

Process one

  • starts potentially slow query to fill temp table
  • grabs lock
  • moves data from temp table to live table
  • releases lock

Process two

  • starts potentially slow query to fill temp table
  • tries to grab lock & fails
  • aborts

Doing the potentially slow query is only useful if it will then be able to do something with the results. The alternative is

Process one

  • starts potentially-slow query to fill temp table
  • grabs lock
  • moves data from temp table to live table
  • releases lock

Process two

  • tries to grab lock & fails
  • aborts without running potentially-slow query to fill temp table

That seems better to me - especially since it might not be process 2 running the potentially slow query but processes 2,3,4,5,6,7,8,9,10.
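The second ordering above could be sketched roughly like this. This is an illustrative sketch only, not the actual CRM_Contact_BAO_GroupContactCache code: fillTempTable() and moveTempToLive() are hypothetical placeholder names.

```php
<?php
// Sketch of "grab the lock before the slow query" as described above.
// fillTempTable() / moveTempToLive() are hypothetical placeholders,
// not real CiviCRM functions.
function rebuildGroupCache(int $groupID): void {
  // Tell competing processes a rebuild is already underway for this group.
  $lock = Civi::lockManager()->acquire("data.core.group.{$groupID}");
  if (!$lock->isAcquired()) {
    // Processes 2..n bail out here, before running the potentially
    // slow query to fill the temp table.
    return;
  }
  try {
    fillTempTable($groupID);   // potentially slow SELECT into a temp table
    moveTempToLive($groupID);  // quick DELETE + INSERT on the live cache table
  }
  finally {
    $lock->release();
  }
}
```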

// Don't call clearGroupContactCache as we don't want to clear the cache dates
// They will get updated by updateCacheTime() below, and not clearing the dates reduces
// the chance that loadAll() will try and rebuild at the same time.
$clearCacheQuery = "
Contributor

Why is this back in here?

Contributor Author

I'm unsure why, but maybe @mattwire can comment. However, we only want the contacts that match the criteria in the cache, and this seems to do it.

Contributor

Hmm, we should only ever be calling 'load' when the cache has already been emptied - i.e. check on demand if the cache is valid; if not, purge (if required) and then call load.

Contributor

The idea here is that actively clearing the cache is expensive and we don't need to do it in advance. Setting the cache to expired is enough because that will trigger a rebuild next time it is required. But obviously we do need to clear it before re-populating so the DELETE query + INSERT query are next to each other to be as atomic as possible. As soon as the cache table is actually up to date we set the cache time so that any process accessing the cache knows that it is up to date and can be used.
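A condensed sketch of that flow. All helper names here (isCacheExpired(), fillTempTable(), updateCacheTime() with this signature, my_temp_table) are illustrative only; the real logic lives in CRM_Contact_BAO_GroupContactCache::load().

```php
<?php
// Sketch of "mark expired, rebuild on demand" as described above.
// Helper names and the temp-table name are illustrative, not the
// actual CiviCRM API.
function loadGroupCache(int $groupID): void {
  if (!isCacheExpired($groupID)) {
    return;  // cache time says the entries are still valid
  }
  fillTempTable($groupID);  // the expensive part, done before any deletion
  // Keep the DELETE + INSERT adjacent so the window in which the live
  // table is empty/inconsistent is as short as possible.
  CRM_Core_DAO::executeQuery(
    'DELETE FROM civicrm_group_contact_cache WHERE group_id = %1',
    [1 => [$groupID, 'Positive']]
  );
  CRM_Core_DAO::executeQuery(
    'INSERT IGNORE INTO civicrm_group_contact_cache (contact_id, group_id)
     SELECT contact_id, group_id FROM my_temp_table'
  );
  updateCacheTime($groupID);  // mark the cache fresh so readers trust it
}
```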

@eileenmcnaughton
Contributor

@seamuslee001 my big-picture thoughts here are that the PR template needs a lot more detail about the technical changes made and why, and also that we'll need to get some more sites to test this 'in the wild' before merging to core

@eileenmcnaughton
Contributor

@bjendres @davejenx @mfb @lucianov88 @pfigel have an interest in performance

CRM_Core_DAO::executeQuery($clearCacheQuery, $params);

CRM_Core_DAO::executeQuery(
"INSERT IGNORE INTO civicrm_group_contact_cache (contact_id, group_id)
Contributor

We shouldn't need INSERT IGNORE here

Contributor

No, we shouldn't, but we do! Without INSERT IGNORE I was still seeing occasional database exceptions due to duplicate entries (every few hours or so). That suggests there is still a hidden contention issue somewhere deep down, but I couldn't find where. It would be good to add a comment to that effect here to make clear why we have INSERT IGNORE, but solving that problem shouldn't hold up this PR and could be a follow-up once we've sorted out the major deadlocks via this PR. It's worth noting that INSERT IGNORE is significantly slower than a plain INSERT, so it would be nice to resolve at some point!
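For context, an illustration of the kind of statement under discussion (the temp-table name here is made up; this assumes the unique index on (contact_id, group_id) that the duplicate-entry errors imply):

```sql
-- A plain INSERT would raise a duplicate-entry error if another process
-- had already inserted the same (contact_id, group_id) row; IGNORE
-- silently skips such rows instead.
INSERT IGNORE INTO civicrm_group_contact_cache (contact_id, group_id)
SELECT contact_id, group_id
FROM civicrm_tmp_group_contacts;  -- illustrative temp-table name
```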

Contributor

Actually, just looking - this is a copy of the code in self::clearGroupContactCache($groupID), which is called a few lines earlier - so this isn't a behaviour change, just a duplication

@monishdeb
Member

I am happy to get this patch merged because I reviewed this code earlier and deployed it on a large prod site. It's now day 11 without any glitch or crash in the cache rebuild process.

Comment on lines +650 to +651
$groupContactsTempTable->drop();
self::updateCacheTime([$groupID], TRUE);
Contributor

I guess it would be nice to swap these lines over, as updating the cache time is more "urgent" than cleaning up.

@mattwire
Contributor

@seamuslee001 @monishdeb Thanks so much for picking this up! @eileenmcnaughton Thanks for reviewing.

@mattwire
Contributor

It would be worth a parallel PR to remove the CIVICRM_SUPPORT_MULTIPLE_LOCKS define and make it the default, as it's been available for quite some time now, and without it enabled your group-contact/ACL caches die horribly if you send a mail. FWIW I use it on all my client sites.

@eileenmcnaughton
Contributor

@mattwire we are running that define in production - if you are too, then I think that is enough to pull it out (I always thought it overly cautious, TBH)

@eileenmcnaughton
Contributor

I still think we need a really good PR template on this one. My take on it is that the main change is inserting into the temp table first, so we lock the main table for a much shorter time period

@mattwire
Contributor

mattwire commented Nov 1, 2019

I'm going to merge this on the basis that:

  • It's been in use for over a year on some of my client sites which previously exhibited serious MySQL deadlocks, and there have been no deadlock issues.
  • @JoeMurray has this on at least one production site and it's resolved issues for them.
  • @seamuslee001 has this on a very large production site and it's resolved issues for them.

One of the things this PR does is make it much clearer what the caching code is doing - we get away from things like $sqlA and $sqlB. There are almost certainly multiple ways this could be improved further, but we can build on it later.

@mattwire mattwire merged commit be2d4a0 into civicrm:master Nov 1, 2019
@davejenx
Contributor

davejenx commented Nov 1, 2019

Good work! I'd hoped to be able to test sooner; a new person in place at the relevant site makes that more feasible now.

Is the PR expected to improve performance both with the older & newer locking mechanisms (MySQL >= 5.7)?

@mattwire
Contributor

mattwire commented Nov 1, 2019

@davejenx The multiple-locks mechanism is enabled by default now if available (GET_LOCK() in MySQL). But it will always be broken on older versions of MySQL (< 5.7.5) or MariaDB (< 10.0.2), so you will need to make sure you have at least those versions.
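One way to sanity-check your server: support for holding multiple simultaneous GET_LOCK() locks per connection arrived in MySQL 5.7.5 / MariaDB 10.0.2; before that, acquiring a second named lock silently released the first. The lock names below are arbitrary test values.

```sql
SELECT VERSION();                        -- must be >= 5.7.5 / 10.0.2
SELECT GET_LOCK('civicrm_test_a', 0);    -- returns 1 on success
SELECT GET_LOCK('civicrm_test_b', 0);
-- On new-enough servers this is still non-NULL, i.e. lock 'a' survived
-- acquiring lock 'b'; on older servers it returns NULL.
SELECT IS_USED_LOCK('civicrm_test_a');
SELECT RELEASE_LOCK('civicrm_test_a'), RELEASE_LOCK('civicrm_test_b');
```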

@monishdeb
Member

monishdeb commented Nov 1, 2019

Congrats everyone, especially @mattwire and @seamuslee001, on getting this PR merged. It's a big leap in resolving the intermittent deadlock situation that occurs during smart group cache rebuilds.

@davejenx
Contributor

davejenx commented Nov 1, 2019

@mattwire Thanks, so to clarify, the PR will only improve performance when using the newer locking mechanism (MySQL >= 5.7.5) and won't benefit sites on older MySQL versions?

@mattwire
Contributor

mattwire commented Nov 1, 2019

@davejenx Correct
