DRA: fix scheduler/resource claim controller race #124931

Merged 2 commits on Jun 28, 2024

pohly
Contributor

@pohly pohly commented May 17, 2024

What type of PR is this?

/kind bug

What this PR does / why we need it:

There was a race caused by having to update the claim's finalizer and status in two separate operations:

  • The resource claim controller removes the allocation but has not yet removed the finalizer.
  • The scheduler prepares an allocation without adding the finalizer, because the finalizer is still present.
  • The controller removes the finalizer.
  • The scheduler adds the allocation.

The claim is then allocated without a finalizer, which is an invalid state. Automatic checking found this during the execution of the "with translated parameters on single node.*supports sharing a claim sequentially" E2E test, but only when that test ran stand-alone. When tests ran in parallel (as in the CI), the bad outcome of the race did not occur.

Special notes for your reviewer:

The fix is to check that the finalizer is still set when adding the allocation. This can be done with a complicated JSON patch (see first commit, but only if you are really, really curious!), but a local retry loop with Update calls is simpler.

The resource claim controller doesn't need this: it can do a normal update, which implicitly checks ResourceVersion.

Does this PR introduce a user-facing change?

DRA: using structured parameters with a claim that gets reused between pods may have led to a claim with an invalid state (allocated without a finalizer) which then caused scheduling of pods using the claim to stop.

@k8s-ci-robot k8s-ci-robot added the release-note, size/M, kind/bug, cncf-cla: yes, do-not-merge/needs-sig, needs-triage, and needs-priority labels May 17, 2024
@k8s-ci-robot k8s-ci-robot requested review from bart0sh and klueska May 17, 2024 14:45
@k8s-ci-robot k8s-ci-robot added the sig/node and sig/scheduling labels and removed the do-not-merge/needs-sig label May 17, 2024
// JSON patch can only append to a non-empty array. An empty reservedFor gets
// omitted and even if it didn't, it would be null and not an empty array.
// Therefore we have to test and add if it's currently empty.
reservedForEntry := fmt.Sprintf(`{"resource": "pods", "name": %q, "uid": %q}`, pod.Name, pod.UID)
Contributor Author

While it was fun playing around with JSON patch, I think this is taking it too far...

A simpler, more obvious approach would be to add a retry loop which uses normal Update calls and gets the latest claim on a conflict.

Contributor Author

Implemented, ready for review again.

@pohly pohly changed the title DRA: fix scheduler/resource claim controller race WIP: DRA: fix scheduler/resource claim controller race May 21, 2024
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress and size/L labels and removed the size/M label May 21, 2024
@pohly pohly changed the title WIP: DRA: fix scheduler/resource claim controller race DRA: fix scheduler/resource claim controller race May 27, 2024
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress label May 27, 2024
@pohly pohly force-pushed the dra-scheduler-prebind-fix branch from 843dca1 to 434e786 Compare May 28, 2024 07:49
@pohly
Contributor Author

pohly commented May 28, 2024

/retest

@bart0sh
Contributor

bart0sh commented May 28, 2024

/triage accepted
/priority important-soon

@k8s-ci-robot k8s-ci-robot added the triage/accepted and priority/important-soon labels and removed the needs-triage and needs-priority labels May 28, 2024
@pohly
Contributor Author

pohly commented Jun 13, 2024

/assign @kerthcet

Can you help review?

@pohly pohly mentioned this pull request Jun 13, 2024
9 tasks
@pohly
Contributor Author

pohly commented Jun 17, 2024

/test pull-kubernetes-node-e2e-containerd-1-7-dra

Testing kubernetes/test-infra#32774

@pohly
Contributor Author

pohly commented Jun 17, 2024

/test pull-kubernetes-node-e2e-containerd-1-7-dra

Testing another PR: kubernetes/test-infra#32776

@pohly
Contributor Author

pohly commented Jun 17, 2024

/test pull-kubernetes-node-e2e-containerd-1-7-dra

@kerthcet
Member

will finish the review tomorrow.

claim.Finalizers = append(claim.Finalizers, resourcev1alpha2.Finalizer)
updatedClaim, err := pl.clientset.ResourceV1alpha2().ResourceClaims(claim.Namespace).Update(ctx, claim, metav1.UpdateOptions{})
if err != nil {
	if apierrors.IsConflict(err) {
Member

Would retryOnConflict help here? Then we could remove the for loop, which seems risky to me.

Contributor Author

Thanks for the hint. It makes the code a bit simpler and adds exponential backoff.

I'm taking extra care to not do a GET in the first iteration because most of the time, the claim will be recent enough.
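
The pattern described here (use the claim that is already at hand first, and only fetch a fresh one after a conflict) can be sketched roughly as follows. This is not the code from the PR: `retryOnConflict` below is a local stand-in that only imitates the general shape of client-go's `retry.RetryOnConflict`, and the `server`, `claim`, and `allocate` names are hypothetical. The in-memory `server` counts GET calls so the effect is visible.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errConflict = errors.New("conflict")

// retryOnConflict is a local stand-in: it retries fn on conflicts with an
// exponentially growing delay and returns any other result immediately.
func retryOnConflict(steps int, initial time.Duration, fn func() error) error {
	delay := initial
	var err error
	for i := 0; i < steps; i++ {
		if err = fn(); err == nil || !errors.Is(err, errConflict) {
			return err
		}
		if i < steps-1 {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return err
}

// claim is a simplified stand-in for a ResourceClaim.
type claim struct {
	resourceVersion int
	allocated       bool
}

// server is an in-memory stand-in for the apiserver that counts GET calls.
type server struct {
	current claim
	gets    int
}

func (s *server) get() claim {
	s.gets++
	return s.current
}

// update rejects writes that are based on a stale resource version.
func (s *server) update(c claim) (claim, error) {
	if c.resourceVersion != s.current.resourceVersion {
		return claim{}, errConflict
	}
	c.resourceVersion++
	s.current = c
	return c, nil
}

// allocate tries the cached claim first and only fetches a fresh copy
// after the previous attempt failed with a conflict.
func allocate(s *server, cached claim) (claim, error) {
	c := cached
	refresh := false
	var result claim
	err := retryOnConflict(5, time.Millisecond, func() error {
		if refresh {
			c = s.get()
		}
		refresh = true // every later attempt follows a conflict
		c.allocated = true
		updated, err := s.update(c)
		if err != nil {
			return err
		}
		result = updated
		return nil
	})
	return result, err
}

func main() {
	s := &server{current: claim{resourceVersion: 2}}
	c, err := allocate(s, claim{resourceVersion: 2})
	fmt.Println("recent cache:", c, err, "GETs:", s.gets)

	s = &server{current: claim{resourceVersion: 5}}
	c, err = allocate(s, claim{resourceVersion: 4})
	fmt.Println("stale cache: ", c, err, "GETs:", s.gets)
}
```

With a cached claim that is recent enough, the sketch performs no GET at all; with a stale one it pays for exactly one GET and one short backoff before succeeding.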

Contributor Author

Pushed.

I verified again that the retry loop works: when running `_output/bin/ginkgo -v --focus="with translated parameters on single node.*supports sharing a claim sequentially" ./test/e2e`, the scheduler had to retry once, as indicated by the log message that it emits when it gets the newer claim.

}

// The finalizer needs to be added in a normal update.
// If we were interrupted in the past, it might already be set and we simply continue.
Member

Should we get the latest claim to check whether the finalizer has been removed?

Contributor Author

If it has been removed and the claim instance is stale, then the UpdateStatus below will fail and we retry with a fresh copy of the claim.
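
A small simulation of that exact case, under stated assumptions: `apiserver`, `claim`, and `bindClaim` are hypothetical stand-ins, with `update` modelling the metadata write (finalizer) and `updateStatus` the status write (allocation), both rejecting stale resource versions. The scheduler starts with a stale copy that still shows the finalizer.

```go
package main

import (
	"errors"
	"fmt"
)

var errConflict = errors.New("conflict")

// claim is a simplified stand-in for a ResourceClaim.
type claim struct {
	resourceVersion int
	hasFinalizer    bool
	allocated       bool
}

// apiserver is an in-memory stand-in with optimistic concurrency.
type apiserver struct{ current claim }

func (a *apiserver) get() claim { return a.current }

// update persists the finalizer if the caller's copy is current.
func (a *apiserver) update(c claim) (claim, error) {
	if c.resourceVersion != a.current.resourceVersion {
		return claim{}, errConflict
	}
	a.current.hasFinalizer = c.hasFinalizer
	a.current.resourceVersion++
	return a.current, nil
}

// updateStatus persists the allocation if the caller's copy is current.
func (a *apiserver) updateStatus(c claim) (claim, error) {
	if c.resourceVersion != a.current.resourceVersion {
		return claim{}, errConflict
	}
	a.current.allocated = c.allocated
	a.current.resourceVersion++
	return a.current, nil
}

// bindClaim adds the finalizer if it is missing, then sets the allocation.
// On a conflict it starts over with the latest claim.
func bindClaim(a *apiserver, c claim) (claim, int, error) {
	const maxAttempts = 5
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		var err error
		if !c.hasFinalizer {
			c.hasFinalizer = true
			c, err = a.update(c)
		}
		if err == nil {
			c.allocated = true
			c, err = a.updateStatus(c)
		}
		if err == nil {
			return c, attempt, nil
		}
		if !errors.Is(err, errConflict) {
			return claim{}, attempt, err
		}
		c = a.get() // our copy was stale: retry with a fresh one
	}
	return claim{}, maxAttempts, errors.New("gave up after repeated conflicts")
}

func main() {
	// The controller already removed the finalizer (version 3), while the
	// scheduler still holds version 2 with the finalizer set.
	a := &apiserver{current: claim{resourceVersion: 3}}
	stale := claim{resourceVersion: 2, hasFinalizer: true}
	c, attempts, err := bindClaim(a, stale)
	fmt.Println(c, "attempts:", attempts, "err:", err)
}
```

In this model the first attempt skips the finalizer update, fails in `updateStatus` because the copy is stale, and the second attempt sees the missing finalizer on the fresh copy and adds it before allocating.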

@pohly pohly force-pushed the dra-scheduler-prebind-fix branch from 434e786 to 952e6aa Compare June 27, 2024 12:56
@k8s-ci-robot k8s-ci-robot added the needs-rebase label Jun 27, 2024
@pohly pohly force-pushed the dra-scheduler-prebind-fix branch from 952e6aa to caf544a Compare June 27, 2024 13:01
pohly added 2 commits June 27, 2024 15:03
There was a race caused by having to update claim finalizer and status in two
different operations:
- Resource claim controller removes allocation, does not yet
  get to remove the finalizer.
- Scheduler prepares an allocation, without adding the finalizer
  because it's there.
- Controller removes finalizer.
- Scheduler adds allocation.

This is an invalid state. Automatic checking found this during the execution of
the "with translated parameters on single node.*supports sharing a claim
sequentially" E2E test, but only when run stand-alone. When running in
parallel (as in the CI), the bad outcome of the race did not occur.

The fix is to check that the finalizer is still set when adding the
allocation. The apiserver doesn't check that because it doesn't know which
finalizer goes with the allocation result. It could check for "some finalizer",
but that is not guaranteed to be correct (could be some unrelated one).

Checking the finalizer can only be done with a JSON patch. Despite the
complications, having the ability to add multiple pods concurrently to
ReservedFor seems worth it (avoids expensive rescheduling or a local retry
loop).

The resource claim controller doesn't need this, it can do a normal update
which implicitly checks ResourceVersion.

The JSON patch approach works, but it is complex. A retry loop is easier to
understand (detect conflict, get new claim, try again). There is one additional
API call (the get), but in practice this scenario is unlikely.
@pohly pohly force-pushed the dra-scheduler-prebind-fix branch from caf544a to 4bddebc Compare June 27, 2024 13:04
@k8s-ci-robot k8s-ci-robot removed the needs-rebase label Jun 27, 2024
@kerthcet
Member

/lgtm
/approve

The two commits seem mutually exclusive; will you squash them?
/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold and lgtm labels Jun 28, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 8a6a4b040255d3614b0addbe44eaafa77377c41f

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kerthcet, pohly

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved label Jun 28, 2024
@pohly
Contributor Author

pohly commented Jun 28, 2024

I'd prefer to keep the two commits: the first one can serve as a reference for how this would have looked with JSON patching.

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold label Jun 28, 2024
@k8s-ci-robot k8s-ci-robot merged commit eb66365 into kubernetes:master Jun 28, 2024
18 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.31 milestone Jun 28, 2024
Labels
approved, cncf-cla: yes, kind/bug, lgtm, priority/important-soon, release-note, sig/node, sig/scheduling, size/L, triage/accepted
4 participants