-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix State.SyncChanges mutex Lock/Unlock #6619
Fix State.SyncChanges mutex Lock/Unlock #6619
Conversation
If I'm not mistaken this introduces a race for both service and checks. When deleting a service :
When updating a service :
I'm not sure what the best way to fix this would be, but one way would be to maintain a state version number : the version is incremented on each state change and is stored in the ServiceState/CheckState structs. The Additionally, |
@Aestek I see the complexity behind this problem. Let me have an other run with it. |
@Aestek I think it get fixed. So I'm checking the {ServiceState,CheckState}.Deleted, because in case of AddService these will be false, so after the RPC response arrives and the Deleted still true, it should mean there is no new service under the same id. |
@PumpkinSeed this fixes the "delete" scenario, but not the "update" |
@Aestek yes, it's a bit tricky. I would like to have a bit more time to investigate about. |
@Aestek I tried it with your suggestion of versioning. |
I just ran into this same issue (actually ran into it because of disk space issues on the master nodes which caused API slowdown -- #866). While debugging this I set up a repro environment locally (master + agent with tc adding 3s latency in the middle). I have pulled down the PR branch and am running this build and it seems to resolve the issue. I've also read through the PR and it seems like a reasonable fix to the problem (versioning works since its single-source data -- from the agent -- that needs to eventually be pushed to the masters). So I am very interested in getting this merged in, so if there is anything I can do to help this along let me know :) |
This looks good to me as well |
May I ask what is the status of the PR? I have seen there is conflict. I will check it out tonight. |
…ul into I6616-sync-changes-lock
Codecov Report
@@ Coverage Diff @@
## master #6619 +/- ##
=========================================
Coverage ? 65.62%
=========================================
Files ? 443
Lines ? 53359
Branches ? 0
=========================================
Hits ? 35017
Misses ? 14111
Partials ? 4231 Continue to review full report at Codecov.
|
@freddygv any update on this PR? This fixes a huge liability for users of consul agent services. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your work. @freddygv and I left a bunch of comments, I hope it helps.
We also discovered a problem with how semantics have changed for SyncChanges
:
- ServiceA is updated and RPC is sent
- ServiceB is updated and RPC is sent (because the response for ServiceA is not back, this time
SyncChanges
sends out two RPCs: ServiceA and ServiceB) - ServiceC is updated and RPC is sent (because the response for ServiceA and ServiceB are not back, this time
SyncChanges
sends out three RPCs: ServiceA, ServiceB and ServiceC) - and so on
We are looking to hear what you think. We think this is an important issue, but it is also pretty tricky to get right. Thanks again.
var checkKey structs.CheckID | ||
for _, check := range checks { | ||
checkKey.Init(check.CheckID, &check.EnterpriseMeta) | ||
l.checks[checkKey].InSync = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't know it is in sync, because we did not check the version. Checks are all send along with Service registrations and if we want to mark them as in sync, we need to version them all as well. And check the version here.
I fixed nearly all of the suggested things. But I already lost my domain knowledge since I created the PR and also it's started to get a hell of changes and I can't oversee it properly.
Then when I oversee the problem again properly we can get back to the bigger issue what you told me. Or if you don't want to have this restructure I can get back my domain on the current implementation. But I recommend that, because it's really hard to manage it right now I guess. Let me know what you think. |
@PumpkinSeed, after having a team-wide discussion today, we would like to suggest that our team makes the remaining changes to this PR. Since the anti-entropy sync is so fundamental in the operation of Consul, we need to spend a significant amount of additional time making sure that:
Due to the time commitment needed we also need to decide how to prioritize this work relative to other features on our roadmap. So the enhancement could end up on hold again because of that. Lastly, please note that we would like to preserve the changes that you have made, and if we do, we will ensure that you are credited in the git history. Thanks again for your contribution. We really appreciate the time and effort you have put into this problem. |
@freddygv |
@freddygv any updated ETA on this fix? I've bee hit by this issue a couple of times and am anxious to get a fix in place. |
Hi @jacksontj, we picked this back up earlier this week. It should make its way into the next release as long as we feel comfortable with the changes. |
@freddygv Just checking in, any update on this issue? |
@freddygv any update on this issue? |
Hi @jacksontj we have been working on and off on the PR, but didn't have much time due to the 1.8.0 release. Going to take another pass at it next week and should have a more concrete update then. |
@freddygv just checking in again here, really excited to get this fix in :) |
@jacksontj we're still working on an implementation on a new branch. I'll post here when we have PRs out. |
@freddygv checking in now thats it been another 8 months :) |
@jacksontj thanks for reaching out! We worked on this change quite a bit but don't have anything to show so far. This is a really challenging task because it changes Consul behaviour significantly. And while we still think this is a great idea, we no longer think it can be easily done. |
Thank you @PumpkinSeed for your work on this issue! Also thank you to everyone else who has helped or expressed interest in getting this fixed. This is a difficult issue to solve and will likely require a substantial change to the We have an issue (#6616) which describes the problem, and #8778 which is a potential first step for solving it. Please subscribe to those to be notified of any updates. |
This PR wants to resolve the #6616
I try to accomplish as minimal lists about the necessary fields as I could, so I created two function scope map.