
update inFlight cache to avoid race condition on volume operation #924

Merged

Conversation

AndyXiangLi (Contributor)

Is this a bug fix or adding new feature?
Fixes #918
What is this PR about? / Why do we need it?
Per the CSI spec, there should be no more than one call "in-flight" per volume at a given time.
This adds another layer of protection in the driver to make sure the APIs are idempotent.
What testing is done?
e2e test on k8s 1.19
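The in-flight guard this PR relies on can be sketched as a mutex-protected set. This is a minimal hypothetical version for illustration; the driver's actual inFlight type differs in detail:

```go
package main

import (
	"fmt"
	"sync"
)

// InFlight tracks keys for requests that are currently being processed,
// so a second call with the same key can be rejected (the CSI spec allows
// at most one in-flight call per volume at a given time).
type InFlight struct {
	mu   sync.Mutex
	keys map[string]struct{}
}

func NewInFlight() *InFlight {
	return &InFlight{keys: make(map[string]struct{})}
}

// Insert returns false if the key is already in flight.
func (i *InFlight) Insert(key string) bool {
	i.mu.Lock()
	defer i.mu.Unlock()
	if _, ok := i.keys[key]; ok {
		return false
	}
	i.keys[key] = struct{}{}
	return true
}

// Delete must be called (usually via defer) when the operation finishes,
// otherwise later calls for the same volume are rejected forever.
func (i *InFlight) Delete(key string) {
	i.mu.Lock()
	defer i.mu.Unlock()
	delete(i.keys, key)
}

func main() {
	f := NewInFlight()
	fmt.Println(f.Insert("vol-123")) // true: first call wins
	fmt.Println(f.Insert("vol-123")) // false: duplicate rejected
	f.Delete("vol-123")
	fmt.Println(f.Insert("vol-123")) // true again after completion
}
```

The handler pattern is then: Insert at the top of the RPC, return codes.Aborted if it fails, and defer Delete.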

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 8, 2021
@k8s-ci-robot k8s-ci-robot requested review from d-nishi and ddebroy June 8, 2021 00:33
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 8, 2021
@coveralls

coveralls commented Jun 8, 2021

Coverage Status

Coverage increased (+0.5%) to 79.496% when pulling 383299e on AndyXiangLi:create-volume-idempotent into 454fdb4 on kubernetes-sigs:master.

@@ -307,6 +314,14 @@ func (d *controllerService) ControllerPublishVolume(ctx context.Context, req *cs
return nil, status.Error(codes.InvalidArgument, errString)
}

// check if a request is already in-flight; VolumeId should be enough at this moment because multi-node writer mode is not supported in the driver.
// However, use the request hash as the key because eventually we need to support publishing the same volume onto different nodes concurrently, and I don't see any harm here.
if ok := d.inFlight.Insert(req.String()); !ok {
AndyXiangLi (Contributor Author)

As we only support single-writer mode now, I would say it is OK to just use volumeId as the key here. However, I don't see any harm in using the combination of volumeId and nodeId. If a customer tries to attach the same volume to a different node now, the request will be rejected in other steps anyway. Advice?

@AndyXiangLi AndyXiangLi Jun 8, 2021 (Contributor Author)

I removed this block in ControllerPublishVolume and ControllerUnpublishVolume, and the e2e test is failing because of it. I think we need to dig deeper into the publish/unpublish volume scenario. It's not related to this issue though; how about we implement this in a separate PR?

Contributor

which test? I agree, let's leave ControllerPublish/Attach for another PR, it's more complicated for sure

Contributor

the spec says: This operation MUST be idempotent. If the volume corresponding to the volume_id has already been published at the node corresponding to the node_id, and is compatible with the specified volume_capability and readonly flag, the Plugin MUST reply 0 OK.

So you are right, we should use volume_id and node_id.

If the attacher requests the same volume_id and different node_id (and volume is not multiattach, which we don't support yet anyway) the request will fail/get rejected anyway, I agree that's fine.

Contributor

it seems like detach just took too long? or somehow the lock/inflight didn't get released, although the code is simple so I don't know how that would be possible.

 I0608 00:51:24.912672       1 cloud.go:594] Waiting for volume "vol-059485ebcfd5f7282" state: actual=detaching, desired=detached
I0608 00:51:26.006673       1 cloud.go:594] Waiting for volume "vol-059485ebcfd5f7282" state: actual=detaching, desired=detached
I0608 00:51:27.918180       1 cloud.go:594] Waiting for volume "vol-059485ebcfd5f7282" state: actual=detaching, desired=detached
W0608 00:53:00.254490       1 cloud.go:535] Ignoring error from describe volume for volume "vol-0944244f08b5badd0"; will retry: "RequestCanceled: request context canceled\ncaused by: context deadline exceeded"
E0608 00:53:28.688453       1 driver.go:119] GRPC error: rpc error: code = Aborted desc = ControllerUnpublishVolume for Volume vol-0944244f08b5badd0 and Node i-071ebf53811520106 is already in progress
E0608 00:54:18.254315       1 driver.go:119] GRPC error: rpc error: code = Aborted desc = ControllerUnpublishVolume for Volume vol-0944244f08b5badd0 and Node i-071ebf53811520106 is already in progress 

Contributor

I wonder if it's possible for attach/detach to deadlock, since in the Kubernetes world it's quite common to have one replica Terminating on one node with its volume trying to detach at the same time as one replica Pending on another node with its volume trying to attach. This would mean we would need to index by ControllerPublish+volume_id+node_id or ControllerUnpublish+volume_id+node_id.
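A composite key along those lines could look like the following. This is a hypothetical sketch of the suggestion, not the PR's final code:

```go
package main

import "fmt"

// lockKey builds an in-flight key from the RPC name plus the identifiers
// the CSI spec uses to define idempotency for publish/unpublish calls.
// Keying attach and detach separately avoids the deadlock described above,
// where a detach on one node and an attach on another could block each other.
func lockKey(op, volumeID, nodeID string) string {
	return op + "/" + volumeID + "/" + nodeID
}

func main() {
	attach := lockKey("ControllerPublishVolume", "vol-1", "i-a")
	detach := lockKey("ControllerUnpublishVolume", "vol-1", "i-b")
	fmt.Println(attach != detach) // true: the two operations don't collide
}
```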

@@ -121,6 +121,13 @@ func (d *controllerService) CreateVolume(ctx context.Context, req *csi.CreateVol
return nil, status.Error(codes.InvalidArgument, errString)
}

// check if a request is already in-flight; moved to the beginning of the request handler to protect the idempotency of this API
if ok := d.inFlight.Insert(req.String()); !ok {
@wongma7 wongma7 Jun 8, 2021 (Contributor)

what if we move this all the way to the beginning of the func, even before validating the request?

Also I noticed it's inconsistent how some functions index with volumeID but others with the request... I think they should all use the request?

AndyXiangLi (Contributor Author)

When I looked into the spec, it says there should be no more than one call "in-flight" per volume at a given time. I struggled to decide between using the volumeId or the whole request. If we use the request, we can guarantee no more than one request per operation for a volume, but some cases like DeleteVolume and CreateSnapshot on the same volumeId would both be allowed. Just wondering whether we should care about those cases; otherwise using the whole request makes sense to me.

Contributor

let's consult the spec and think about it rigorously for each operation, I am scared of races :D

CreateVolume: We should use name: https://github.com/container-storage-interface/spec/blob/master/spec.md#controller-service-rpc
DeleteVolume: We should use volume_id: https://github.com/container-storage-interface/spec/blob/master/spec.md#deletevolume
CreateSnapshot: We should use name: https://github.com/container-storage-interface/spec/blob/master/spec.md#createsnapshot
DeleteSnapshot: We should use snapshot_id: https://github.com/container-storage-interface/spec/blob/master/spec.md#deletesnapshot

obviously, for CreateVolume and CreateSnapshot you don't know the ID yet, so you can't use it as an idempotency token; that's why the spec requires name.
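The per-RPC mapping above could be sketched as a small helper. This is hypothetical illustration of the scheme being discussed, not the driver's actual code:

```go
package main

import "fmt"

// idempotencyKey returns the field the CSI spec designates as the
// idempotency token for each controller RPC, per the list above:
// create calls key by name (the ID does not exist yet), delete calls
// key by ID. The RPC name is included so that, e.g., DeleteVolume and
// CreateSnapshot on the same volume stay distinguishable if desired.
func idempotencyKey(rpc, nameOrID string) string {
	switch rpc {
	case "CreateVolume", "CreateSnapshot":
		return rpc + ":name=" + nameOrID
	case "DeleteVolume":
		return rpc + ":volume_id=" + nameOrID
	case "DeleteSnapshot":
		return rpc + ":snapshot_id=" + nameOrID
	}
	return rpc + ":" + nameOrID
}

func main() {
	fmt.Println(idempotencyKey("CreateVolume", "pvc-1234")) // CreateVolume:name=pvc-1234
	fmt.Println(idempotencyKey("DeleteVolume", "vol-abcd")) // DeleteVolume:volume_id=vol-abcd
}
```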

@AndyXiangLi AndyXiangLi force-pushed the create-volume-idempotent branch 3 times, most recently from 86e49de to d7bdf0a Compare June 8, 2021 19:07
@wongma7

wongma7 commented Jun 8, 2021

for ControllerUnpublish, we get stuck in this loop:

func (c *cloud) WaitForAttachmentState(ctx context.Context, volumeID, expectedState string, expectedInstance string, expectedDevice string, alreadyAssigned bool) (*ec2.VolumeAttachment, error) {

The context gets cancelled after just 15 seconds by external-attacher (https://github.com/kubernetes-csi/external-attacher/blob/e9f6477657cf1616e63405a9ad29aaf99ebfad70/pkg/controller/csi_handler.go#L574, https://github.com/kubernetes-csi/external-attacher/blob/e9f6477657cf1616e63405a9ad29aaf99ebfad70/cmd/csi-attacher/main.go#L59), but DescribeVolumesWithContext keeps getting retried with the cancelled context:

response, err := c.ec2.DescribeVolumesWithContext(ctx, request)

I think the fix is to check ctx.Err() on every loop iteration; if it has been cancelled we should exit the loop. External-attacher will then call ControllerUnpublish again and start another loop.

@wongma7

wongma7 commented Jun 8, 2021

The alternative is to increase the external-attacher context timeout to something much higher, like 45 minutes.

// Most attach/detach operations on AWS finish within 1-4 seconds.

In that case, our own retry logic will dictate the frequency we call DescribeVolumes instead of external-attacher.

@AndyXiangLi AndyXiangLi closed this Jun 8, 2021
@AndyXiangLi AndyXiangLi reopened this Jun 8, 2021
@AndyXiangLi AndyXiangLi force-pushed the create-volume-idempotent branch from d7bdf0a to 38663cb Compare June 8, 2021 21:36
@AndyXiangLi (Contributor Author)

/test pull-aws-ebs-csi-driver-external-test

@@ -516,6 +516,9 @@ func (c *cloud) WaitForAttachmentState(ctx context.Context, volumeID, expectedSt
},
}

if ctx.Err() != nil && ctx.Err() == context.Canceled {
Contributor

I don't think we need the && to check whether the err is Canceled; if there's any err, exit.

@AndyXiangLi AndyXiangLi force-pushed the create-volume-idempotent branch 2 times, most recently from 72b6c04 to 311f2f1 Compare June 8, 2021 23:43
@AndyXiangLi AndyXiangLi force-pushed the create-volume-idempotent branch from 311f2f1 to 383299e Compare June 9, 2021 00:45
@wongma7

wongma7 commented Jun 9, 2021

/lgtm
/approve
tyvm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 9, 2021
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: AndyXiangLi, wongma7

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [AndyXiangLi,wongma7]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ialidzhikov (Contributor)

Thank you @AndyXiangLi and @wongma7! Is it possible to backport the fix to the affected versions that are still maintained/patched?

@ialidzhikov

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Jun 11, 2021
@ialidzhikov

ping @AndyXiangLi @wongma7

@wongma7

wongma7 commented Jun 11, 2021

yes we can do a 1.1.1, if not today then next week. There is also an xfs bugfix that I think deserves release

@ialidzhikov

> yes we can do a 1.1.1, if not today then next week. There is also an xfs bugfix that I think deserves release

@wongma7 @AndyXiangLi, is there any update wrt v1.1.1? We are still looking forward to it :)
