[REF] Refactor Smart Group Cache population code to be less intensive #15588
Conversation
@@ -480,30 +480,15 @@ public static function load(&$group, $force = FALSE) {
return;
}

// grab a lock so other processes don't compete and do the same query
$lock = Civi::lockManager()->acquire("data.core.group.{$groupID}");
Why are we removing this? It seems like a good thing.
@eileenmcnaughton it's moved further down
Yes, but why? It seems like a good thing where it was.
We should only lock the database for as short a time as we need, and that is when we are doing the actual operations on civicrm_group_contact_cache. We're using a temp table much more than the previous code did.
We're not locking the database though - we are specifically telling other processes that want to rebuild the cache for this specific group not to, because a process is already doing that.
It's as @seamuslee001 says. We're telling other processes via a static self::$_alreadyLoaded[$groupID] = 1;
at the start of the function that we're already building. But the building can take a significant (in CPU/database terms) amount of time, and we only want the MySQL lock (which is what Civi::lockManager() gives us) for the shortest amount of time, which is why it's moved down.
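To make the ordering concrete, here is a rough sketch of the flow being described; fillTempTable() and copyTempTableToCache() are placeholder names, not the real methods:

```php
public static function load(&$group, $force = FALSE) {
  $groupID = (int) $group->id;
  if (!empty(self::$_alreadyLoaded[$groupID]) && !$force) {
    return;
  }
  // Tell later calls in this process that a rebuild is already under way.
  self::$_alreadyLoaded[$groupID] = 1;

  // Potentially slow: evaluate the smart group criteria into a temp table.
  // fillTempTable() stands in for the real temp-table population code.
  $groupContactsTempTable = self::fillTempTable($groupID);

  // Only now take the MySQL lock, for the short window where we actually
  // touch civicrm_group_contact_cache.
  $lock = Civi::lockManager()->acquire("data.core.group.{$groupID}");
  if ($lock->isAcquired()) {
    self::copyTempTableToCache($groupContactsTempTable, $groupID); // placeholder
    self::updateCacheTime([$groupID], TRUE);
    $lock->release();
  }
  $groupContactsTempTable->drop();
}
```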
I still think the broken lock system should be gotten rid of, rather than allowing it to be configured to work on certain versions of MySQL. It causes weird side effects like this desire to minimize the time that locks are held. A simple database table for scoped locks works fine.
I don't think it IS broken if you have the define set - which @mattwire proposes to do later on.
@mattwire I think you are saying 'we only really need to lock it when we are moving data from the temp table to the live table'. So that means we have two processes:
Process one
- starts potentially slow query to fill temp table
- grabs lock
- moves data from temp table to live table
- releases lock
Process two
- starts potentially slow query to fill temp table
- tries to grab lock & fails
- aborts
Doing the potentially slow query is only useful if it will then be able to do something with the results. The alternative is:
Process one
- starts potentially-slow query to fill temp table
- grabs lock
- moves data from temp table to live table
- releases lock
Process two
- tries to grab lock & fails
- aborts without running potentially-slow query to fill temp table
That seems better to me - especially since it might not be process 2 running the potentially slow query but processes 2,3,4,5,6,7,8,9,10.
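Sketched out (same placeholder method names as in the earlier sketch), that alternative looks like:

```php
// Try the lock *before* running the potentially-slow query, so processes
// 2..n bail out without doing any expensive work.
$lock = Civi::lockManager()->acquire("data.core.group.{$groupID}");
if (!$lock->isAcquired()) {
  // Another process is already rebuilding this group's cache - abort.
  return;
}
$groupContactsTempTable = self::fillTempTable($groupID);        // potentially slow
self::copyTempTableToCache($groupContactsTempTable, $groupID);  // the locked work
self::updateCacheTime([$groupID], TRUE);
$lock->release();
$groupContactsTempTable->drop();
```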
// Don't call clearGroupContactCache as we don't want to clear the cache dates.
// They will get updated by updateCacheTime() below, and not clearing the dates reduces
// the chance that loadAll() will try and rebuild at the same time.
$clearCacheQuery = "
Why is this back in here?
I'm unsure why, but maybe @mattwire can comment. However, we only want the contacts that match the criteria in the cache, so this seems to do it.
Hmm, we should only ever be calling 'load' when the cache has already been emptied - i.e. on demand, check if the cache is valid; if not, purge (if required) and then call load.
The idea here is that actively clearing the cache is expensive and we don't need to do it in advance. Setting the cache to expired is enough, because that will trigger a rebuild the next time it is required. But obviously we do need to clear it before re-populating, so the DELETE query and INSERT query sit next to each other to be as atomic as possible. As soon as the cache table is actually up to date we set the cache time, so that any process accessing the cache knows it is up to date and can be used.
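A sketch of that lifecycle as described (the parameter style follows CRM_Core_DAO::executeQuery, but the "expire" statement, temp-table name and selected columns are illustrative rather than the exact code):

```php
// Invalidation: just mark the group's cache as expired rather than paying for
// a DELETE up front; the expiry triggers a rebuild the next time it's needed.
CRM_Core_DAO::executeQuery(
  "UPDATE civicrm_group SET cache_date = NULL WHERE id = %1",
  [1 => [$groupID, 'Positive']]
);

// Rebuild (inside load()): DELETE and INSERT are kept adjacent so the window
// in which the cache rows for this group are missing is as small as possible.
CRM_Core_DAO::executeQuery(
  "DELETE FROM civicrm_group_contact_cache WHERE group_id = %1",
  [1 => [$groupID, 'Positive']]
);
CRM_Core_DAO::executeQuery(
  "INSERT IGNORE INTO civicrm_group_contact_cache (contact_id, group_id)
   SELECT contact_id, %1 FROM temp_group_contacts",  // placeholder temp table
  [1 => [$groupID, 'Positive']]
);

// Stamp the cache as fresh so other processes know it can be used.
self::updateCacheTime([$groupID], TRUE);
```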
@seamuslee001 my big-picture thoughts here are that the PR template needs a lot more detail about the technical changes made here and why, and also that we'll need to get some more sites to test this 'in the wild' before merging to core.
@bjendres @davejenx @mfb @lucianov88 @pfigel have an interest in performance.
CRM_Core_DAO::executeQuery($clearCacheQuery, $params);

CRM_Core_DAO::executeQuery(
"INSERT IGNORE INTO civicrm_group_contact_cache (contact_id, group_id)
We shouldn't need INSERT IGNORE here
No, we shouldn't, but we do! Without INSERT IGNORE I was still seeing occasional database exceptions because of duplicate entries (every few hours or so). It suggests there is still a hidden contention issue somewhere deep down, but I couldn't find where. It would be good to add a comment to that effect here to be clear why we have INSERT IGNORE, but solving that problem shouldn't hold up this PR and could be a follow-up once we've sorted out the major deadlocks via this PR. It is worth noting that INSERT IGNORE is significantly slower than plain INSERT, so it would be nice to resolve at some point!
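For instance, something like this above the query (the wording is just a suggestion and the SELECT body is illustrative):

```php
// NOTE: INSERT IGNORE (rather than plain INSERT) is deliberate. Even with the
// per-group lock we still saw occasional duplicate-key exceptions in
// production, which suggests a residual contention issue that hasn't been
// traced yet. INSERT IGNORE is noticeably slower than INSERT, so revisit this
// once the major deadlocks are resolved.
CRM_Core_DAO::executeQuery(
  "INSERT IGNORE INTO civicrm_group_contact_cache (contact_id, group_id)
   SELECT contact_id, group_id FROM {$groupContactsTempTable->getName()}"
);
```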
Actually, just looking - this is a copy of the code in self::clearGroupContactCache($groupID), which is called a few lines earlier - so this isn't a behaviour change, just a duplication.
I am happy to get this patch merged because I reviewed this code earlier and deployed it on a large production site. And it's day 11, without any glitch or crash in the cache rebuild process.
$groupContactsTempTable->drop();
self::updateCacheTime([$groupID], TRUE);
I guess it would be nice to swap these lines over, as updating the cache time is more "urgent" than cleaning up.
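i.e. just reordering the two lines from the diff:

```php
// Stamp the cache as current first - that is what other processes care about -
// then clean up the temp table.
self::updateCacheTime([$groupID], TRUE);
$groupContactsTempTable->drop();
```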
@seamuslee001 @monishdeb Thanks so much for picking this up! @eileenmcnaughton Thanks for reviewing.
It would be worth a parallel PR to remove the define.
@mattwire we are running that define in production - if you are too, then I think that is enough to pull it out (I always thought it overly cautious, TBH).
I still think we need a really good PR template on this one. My take on it is that the main change is inserting into the temp table first, so we lock the main table for a much shorter time period.
I'm going to merge this on the basis that:
One of the things this PR does is make it much clearer what the caching code is doing - we get away from things like
Good work! I hoped to be able to test sooner; a new person in place at the relevant site makes that more feasible now. Is the PR expected to improve performance both with the older and newer locking mechanisms (MySQL >= 5.7)?
@davejenx Multiple locks are enabled by default if available now (GET_LOCK() in MySQL). But it will always be broken on older versions of MySQL (< 5.7.5) or MariaDB (< 10.0.2), so you will need to make sure you have at least those versions.
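For anyone checking their setup, the difference is roughly this (the second lock name is an arbitrary example; the calls mirror the Civi::lockManager() usage in the patch):

```php
// On MySQL >= 5.7.5 / MariaDB >= 10.0.2 a single connection can hold several
// named GET_LOCK() locks at once, so a per-group cache lock no longer knocks
// out any other lock the process already holds. On older versions, acquiring a
// second GET_LOCK() silently releases the first, which is why the old
// behaviour is considered broken.
$cacheLock = Civi::lockManager()->acquire("data.core.group.{$groupID}");
$otherLock = Civi::lockManager()->acquire('data.core.example.other'); // example name
if ($cacheLock->isAcquired() && $otherLock->isAcquired()) {
  // ... do work while both locks are held ...
}
$otherLock->release();
$cacheLock->release();
```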
Congrats everyone, especially @mattwire and @seamuslee001, on getting this PR merged. It's a big leap in resolving the intermittent deadlock situation that occurs during smart group cache rebuilds.
@mattwire Thanks, so to clarify, the PR will only improve performance when using the newer locking mechanism (MySQL >= 5.7.5) and won't benefit sites on older MySQL versions?
@davejenx Correct
Overview
Refactor Smart Group Cache population to be less intensive. AUG have deployed this on our system and it appears to be working fine. This is from @mattwire: https://github.com/mattwire/civicrm-groupaclcache/blob/master/patches/master...mattwire:refactor_groupaclcache.diff#L108
Before
Cache building can be intensive.
After
Cache building is less intensive.
ping @mattwire @eileenmcnaughton @monishdeb @JoeMurray