
Fix State.SyncChanges mutex Lock/Unlock #6619

Closed

Conversation

@PumpkinSeed (Contributor) commented Oct 13, 2019

This PR aims to resolve #6616.

I tried to keep the lists of necessary fields as minimal as I could, so I created two function-scoped maps.

@hashicorp-cla commented Oct 13, 2019

CLA assistant check
All committers have signed the CLA.

@ShimmerGlass (Contributor)

If I'm not mistaken this introduces a race for both services and checks.

When deleting a service:

  1. SyncChanges() is called, the diff is computed, lock is released
  2. deleteService() is called and makes a Catalog.Deregister RPC
  3. A service with the same id is registered
  4. The RPC returns and the service is removed from the map when it should not have been.

When updating a service:

  1. SyncChanges() is called, the diff is computed, lock is released
  2. syncService() is called and makes a Catalog.Register RPC
  3. The service is updated
  4. The RPC returns and the service is marked as "InSync" when it's in fact not, and the update is delayed until the next anti-entropy run.

I'm not sure what the best way to fix this would be, but one way would be to maintain a state version number: the version is incremented on each state change and is stored in the ServiceState/CheckState structs. The {sync,delete}{Service,Check} funcs are called with the version their respective entities had when the diff was computed (and the lock was held). If, when the RPCs return, the entity version is not what was passed as an argument, the entity is neither marked as in sync nor deleted from state.
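
A minimal, self-contained sketch of that version-guard idea, using toy types only (serviceState, state, and registerRPC are illustrative stand-ins, not the real agent/local API):

    package main

    import (
        "fmt"
        "sync"
    )

    // Simplified stand-ins for the agent's local state; not the real agent/local types.
    type serviceState struct {
        Name    string
        InSync  bool
        Deleted bool
        Version uint64 // bumped on every local change to this entry
    }

    type state struct {
        sync.Mutex
        services map[string]*serviceState
    }

    // registerRPC stands in for the Catalog.Register call; it runs without the lock held.
    func registerRPC(name string) error {
        fmt.Println("registering", name)
        return nil
    }

    func (s *state) syncService(id string) error {
        s.Lock()
        svc, ok := s.services[id]
        if !ok {
            s.Unlock()
            return nil
        }
        version := svc.Version // the version the diff was computed against
        name := svc.Name
        s.Unlock()

        if err := registerRPC(name); err != nil {
            return err
        }

        s.Lock()
        defer s.Unlock()
        // Only mark the entry in sync if nothing changed while the RPC was in flight.
        if cur, ok := s.services[id]; ok && cur.Version == version {
            cur.InSync = true
        }
        return nil
    }

    func main() {
        st := &state{services: map[string]*serviceState{"web": {Name: "web"}}}
        _ = st.syncService("web")
        fmt.Println("in sync:", st.services["web"].InSync)
    }

A concurrent local update would bump Version between the two locked sections, so a stale RPC result simply leaves the entry out of sync for the next anti-entropy pass.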

Additionally, delete{Service,Check} currently expect the state lock to be held when called; calling them without it introduces a potential crash due to concurrent writes to State.{services,checks}.

@PumpkinSeed (Contributor, Author)

@Aestek I see the complexity behind this problem. Let me take another run at it.

@PumpkinSeed (Contributor, Author)

@Aestek I think I got it fixed. I'm now checking {ServiceState,CheckState}.Deleted: in the case of AddService these will be false, so if Deleted is still true after the RPC response arrives, it means there is no new service under the same id.
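
Roughly, and reusing the toy types from the sketch above (deregisterRPC is another stand-in, not the real API), the delete path would look something like:

    // deregisterRPC stands in for the Catalog.Deregister call; it runs without the lock held.
    func deregisterRPC(id string) error { return nil }

    func (s *state) deleteService(id string) error {
        s.Lock()
        svc, ok := s.services[id]
        if !ok || !svc.Deleted {
            s.Unlock()
            return nil
        }
        s.Unlock()

        if err := deregisterRPC(id); err != nil {
            return err
        }

        s.Lock()
        defer s.Unlock()
        // Re-registering the same ID resets Deleted to false, so if Deleted is still
        // true here, no new service appeared under this ID while the RPC was in flight.
        if cur, ok := s.services[id]; ok && cur.Deleted {
            delete(s.services, id)
        }
        return nil
    }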

@ShimmerGlass (Contributor)

@PumpkinSeed this fixes the "delete" scenario, but not the "update" one.

@PumpkinSeed (Contributor, Author)

@Aestek yes, it's a bit tricky. I would like a bit more time to investigate it.

@PumpkinSeed (Contributor, Author)

@Aestek I tried it with your suggestion of versioning.

@freddygv freddygv self-assigned this Oct 22, 2019
@jacksontj

I just ran into this same issue (in my case triggered by disk space issues on the master nodes, which caused an API slowdown -- #866). While debugging it I set up a local repro environment (master + agent, with tc adding 3s of latency in between). I pulled down the PR branch, and running that build seems to resolve the issue.

I've also read through the PR and it seems like a reasonable fix to the problem (versioning works since it's single-source data -- from the agent -- that needs to eventually be pushed to the masters).

I'm very interested in getting this merged, so if there is anything I can do to help this along, let me know :)

@ShimmerGlass (Contributor)

This looks good to me as well

@PumpkinSeed (Contributor, Author)

May I ask what the status of this PR is? I've seen there is a conflict; I'll check it out tonight.

@codecov (bot) commented Dec 13, 2019

Codecov Report

No coverage uploaded for the pull request base (master@35a6279).
The diff coverage is n/a.

@@            Coverage Diff            @@
##             master    #6619   +/-   ##
=========================================
  Coverage          ?   65.62%           
=========================================
  Files             ?      443           
  Lines             ?    53359           
  Branches          ?        0           
=========================================
  Hits              ?    35017           
  Misses            ?    14111           
  Partials          ?     4231


@jacksontj commented Jan 6, 2020

@freddygv any update on this PR? This fixes a huge liability for users of consul agent services.

@hanshasselberg (Member) left a review:

Thank you for your work. @freddygv and I left a bunch of comments, I hope it helps.

We also discovered a problem with how the semantics of SyncChanges have changed:

  • ServiceA is updated and an RPC is sent
  • ServiceB is updated and an RPC is sent (because the response for ServiceA is not back yet, this time SyncChanges sends out two RPCs: ServiceA and ServiceB)
  • ServiceC is updated and an RPC is sent (because the responses for ServiceA and ServiceB are not back yet, this time SyncChanges sends out three RPCs: ServiceA, ServiceB and ServiceC)
  • and so on

We are looking forward to hearing what you think. We think this is an important issue, but it is also pretty tricky to get right. Thanks again.
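
To make the pile-up concrete, here is a simplified sketch using the toy types from the earlier sketches (not the real agent/local code, and not necessarily how the PR implements SyncChanges): because InSync is only set once an RPC returns, any SyncChanges call that runs while earlier RPCs are still in flight collects and re-sends those same registrations.

    func (s *state) SyncChanges() error {
        s.Lock()
        var pending []string
        for id, svc := range s.services {
            if !svc.InSync && !svc.Deleted {
                pending = append(pending, id)
            }
        }
        s.Unlock()

        // Every entry collected here may duplicate an RPC that an earlier SyncChanges
        // call already has in flight, because InSync is only set once that RPC returns.
        for _, id := range pending {
            if err := s.syncService(id); err != nil {
                return err
            }
        }
        return nil
    }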

(Three inline review threads on agent/local/state.go were marked outdated and resolved.)

Review comment on the following hunk in agent/local/state.go:
    // Mark every check sent with this registration as in sync.
    var checkKey structs.CheckID
    for _, check := range checks {
        checkKey.Init(check.CheckID, &check.EnterpriseMeta)
        l.checks[checkKey].InSync = true
    }
A Member commented:

We don't know that it is in sync, because we did not check the version. Checks are all sent along with service registrations, so if we want to mark them as in sync we need to version them all as well, and check the version here.
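
One way to read that suggestion, as a hedged fragment only (the Version field on the check state and the capturedVersions map recorded when the diff was computed are hypothetical, not part of the actual change):

    var checkKey structs.CheckID
    for _, check := range checks {
        checkKey.Init(check.CheckID, &check.EnterpriseMeta)
        cs, ok := l.checks[checkKey]
        if !ok {
            continue // the check was removed while the RPC was in flight
        }
        // Only mark the check in sync if it is unchanged since the diff was
        // computed under the lock.
        if cs.Version == capturedVersions[checkKey] {
            cs.InSync = true
        }
    }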

(Six more inline review threads on agent/local/state.go were marked outdated and resolved.)
@hanshasselberg added the waiting-reply label Jan 15, 2020
@stale (bot) removed the waiting-reply label Jan 15, 2020
@PumpkinSeed (Contributor, Author)

I fixed nearly all of the suggested things, but I have lost most of my domain knowledge since I created the PR, and it has accumulated so many changes that I can no longer oversee it properly.
So what do you think about a bit of restructuring? @i0rek @freddygv
Let me explain my suggestion:

  • ServiceState and CheckState should get their own files, service_state.go and check_state.go
  • They should have methods for changing their InSync flag, because there are several places where we want to do that, and I think it makes sense to have one general method that checks everything properly before flipping the flag (a rough sketch follows after this comment)
  • I would prefer to move the related State struct methods along with their main struct, e.g. func (l *State) CheckState(id structs.CheckID) *CheckState goes to check_state.go

Once I have a proper overview of the problem again, we can get back to the bigger issue you described. If you would rather not restructure, I can rebuild my understanding of the current implementation instead, but I recommend the restructuring because the code is getting really hard to manage right now.

Let me know what you think.
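
For illustration, a rough sketch of such a helper using the toy types from the earlier sketches (the real ServiceState/CheckState API may look quite different):

    // setServiceInSync marks a service in sync only if it still matches what the
    // caller observed when the diff was computed; expectedVersion is that observed version.
    func (s *state) setServiceInSync(id string, expectedVersion uint64) bool {
        s.Lock()
        defer s.Unlock()
        svc, ok := s.services[id]
        if !ok || svc.Deleted || svc.Version != expectedVersion {
            return false // the entry changed or vanished since the diff was computed
        }
        svc.InSync = true
        return true
    }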

@freddygv (Contributor) commented Feb 6, 2020

@PumpkinSeed, after having a team-wide discussion today, we would like to suggest that our team makes the remaining changes to this PR.

Since the anti-entropy sync is so fundamental in the operation of Consul, we need to spend a significant amount of additional time making sure that:

  • Remaining corner-cases are addressed
  • Everything works as expected
  • There is appropriate test coverage

Due to the time commitment needed we also need to decide how to prioritize this work relative to other features on our roadmap. So the enhancement could end up on hold again because of that.

Lastly, please note that we would like to preserve the changes that you have made, and if we do, we will ensure that you are credited in the git history.

Thanks again for your contribution. We really appreciate the time and effort you have put into this problem.

@PumpkinSeed (Contributor, Author) commented Feb 14, 2020

@freddygv This is probably the best for all of us, because I'm totally out of context by now. Thank you for the help.

@jacksontj

@freddygv any updated ETA on this fix? I've been hit by this issue a couple of times and am anxious to get a fix in place.

@freddygv (Contributor)

Hi @jacksontj, we picked this back up earlier this week. It should make its way into the next release as long as we feel comfortable with the changes.

@jacksontj

@freddygv Just checking in, any update on this issue?

@jacksontj

@freddygv any update on this issue?

@freddygv (Contributor)

Hi @jacksontj, we have been working on the PR on and off, but haven't had much time due to the 1.8.0 release. We're going to take another pass at it next week and should have a more concrete update then.

@jacksontj

@freddygv just checking in again here, really excited to get this fix in :)

@freddygv (Contributor) commented Aug 11, 2020

@jacksontj we're still working on an implementation on a new branch. I'll post here when we have PRs out.

@jacksontj

@freddygv checking in now that it's been another 8 months :)

@hanshasselberg (Member)

@jacksontj thanks for reaching out! We worked on this change quite a bit but don't have anything to show so far. This is a really challenging task because it changes Consul's behaviour significantly, and while we still think this is a great idea, we no longer think it can be done easily.
We decided to stop working on this until we have the time to properly think through the implications of the change in advance and then decide how to implement it so that it doesn't break the current behaviour.

@dnephin (Contributor) commented May 10, 2021

Thank you @PumpkinSeed for your work on this issue! Also thank you to everyone else who has helped or expressed interest in getting this fixed.

This is a difficult issue to solve and will likely require a substantial change to the agent/local package. It seems like this PR won't be able to merge, so I'm going to close it.

We have an issue (#6616) which describes the problem, and #8778 which is a potential first step for solving it. Please subscribe to those to be notified of any updates.
