AWS CSI driver deleted the volume for a PV but has not updated the PV spec, and it keeps trying to attach this deleted volume to the node #918
Comments
I am cc-ing the folks that were involved with #771 (as this is the PR that introduced the race condition).
The inFlight requests map is intended to insulate us from races like this. Haven't had a chance to dig deep yet, but I will be checking that first.
If by the inFlight requests map you mean the logic in pkg/driver/controller.go#L220-L225 (at f678406), then this logic is executed after the "early exit" for an already existing volume (pkg/driver/controller.go#L212-L218, also at f678406). So that's why I think the inFlight requests map is not helping here. I guess one potential fix could be in the "early exit" for an already existing volume - when the volume is not …
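A compilable, simplified sketch of that ordering (hypothetical names and signatures, not the actual controller.go code) may make the window clearer: the early exit returns an existing volume before the request is ever registered in the inFlight map, so the map cannot abort the racing call.

```go
package sketch

import (
	"context"
	"errors"
)

// cloudClient and inFlightMap are stand-ins for the driver's cloud client
// and in-flight request set; the names here are illustrative only.
type cloudClient interface {
	GetDiskByName(ctx context.Context, name string) (string, error) // "" if absent
	CreateDisk(ctx context.Context, name string) (string, error)
}

type inFlightMap interface {
	Insert(key string) bool // false if the key is already in flight
	Delete(key string)
}

var errAborted = errors.New("request already in flight")

type driver struct {
	cloud    cloudClient
	inFlight inFlightMap
}

func (d *driver) CreateVolume(ctx context.Context, name string) (string, error) {
	// Early exit: the volume already exists, so return its id. Nothing here
	// prevents a concurrent CreateVolume for the same name from failing a
	// moment later and deleting the very volume we just returned.
	if id, err := d.cloud.GetDiskByName(ctx, name); err == nil && id != "" {
		return id, nil
	}

	// The inFlight guard only kicks in below, after the early exit, so it
	// never protects the path above.
	if ok := d.inFlight.Insert(name); !ok {
		return "", errAborted
	}
	defer d.inFlight.Delete(name)

	return d.cloud.CreateDisk(ctx, name)
}
```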
/priority important-soon
@wongma7 I think either …
in fact it's probably safer to do the same for ALL functions that, according to the spec, MUST be idempotent, i.e. wrap DeleteVolume, Create/DeleteSnapshot, etc. Otherwise it's too hard for us to avoid all potential race conditions when multiple calls are in flight. We cannot trust kubelet/external-provisioner/external-attacher to keep track of multiple calls; they can restart at any time and lose track, and the spec only says they "SHOULD ensure that there are no other calls", so the responsibility falls on the driver to keep track.
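A minimal sketch of that suggestion, with assumed names of our own (the real driver's in-flight set lives in its internal package and may differ): every MUST-be-idempotent RPC takes the in-flight lock for its key before doing any work, so even the "already exists" early exit is serialized against a racing duplicate.

```go
package sketch

import (
	"fmt"
	"sync"
)

// inFlight is a mutex-protected set of request keys currently being served.
type inFlight struct {
	mu   sync.Mutex
	keys map[string]struct{}
}

func newInFlight() *inFlight {
	return &inFlight{keys: make(map[string]struct{})}
}

// Insert reports whether the key was free; false means a duplicate call.
func (f *inFlight) Insert(key string) bool {
	f.mu.Lock()
	defer f.mu.Unlock()
	if _, dup := f.keys[key]; dup {
		return false
	}
	f.keys[key] = struct{}{}
	return true
}

func (f *inFlight) Delete(key string) {
	f.mu.Lock()
	defer f.mu.Unlock()
	delete(f.keys, key)
}

// guard wraps the body of CreateVolume, DeleteVolume, Create/DeleteSnapshot,
// etc. The lock is taken first, before any early exit or cloud call, so a
// concurrent call for the same key is rejected rather than racing the first.
// In the real driver the rejection would be a gRPC Aborted status, prompting
// the sidecar to retry later.
func guard(f *inFlight, key string, body func() (string, error)) (string, error) {
	if !f.Insert(key) {
		return "", fmt.Errorf("request for %q already in flight", key)
	}
	defer f.Delete(key)
	return body()
}
```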
/kind bug
What happened?
Together with @ialidzhikov, we observed in our cluster that multiple StatefulSet apps cannot start because their volumes fail to attach.
It turns out that the volume for the PV was actually deleted by the driver during creation.
1. The external-provisioner issues a CreateVolume request for the volume.
2. A second CreateVolume request for the same volume is issued while the first one is still in flight.
3. The second CreateVolume request has seen the volume and returned the volume id.
4. The first CreateVolume request now fails and sends a delete request (introduced with "delete leaked volume if driver don't know the volume status" #771) for the volume.
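The interleaving above can be forced deterministically with a toy model (purely illustrative code; the names and the channel-based sequencing are ours, not the driver's):

```go
package main

import (
	"fmt"
	"sync"
)

// fakeCloud stands in for EC2: a named volume either exists or it doesn't.
type fakeCloud struct {
	mu      sync.Mutex
	volumes map[string]string // volume name -> volume id
}

func (c *fakeCloud) lookup(name string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	id, ok := c.volumes[name]
	return id, ok
}

func (c *fakeCloud) create(name, id string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.volumes[name] = id
}

func (c *fakeCloud) delete(name string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.volumes, name)
}

func main() {
	cloud := &fakeCloud{volumes: map[string]string{}}
	created := make(chan struct{}) // call 1 has created the volume
	proceed := make(chan struct{}) // call 1 may now fail and clean up
	done := make(chan struct{})

	// Call 1: creates the volume, then fails and deletes it (the #771 cleanup).
	go func() {
		defer close(done)
		cloud.create("pv-1", "vol-123")
		close(created)
		<-proceed
		cloud.delete("pv-1") // "delete leaked volume": removes vol-123
		fmt.Println("call 1: failed, deleted vol-123")
	}()

	// Call 2: arrives after the volume exists and early-exits with its id.
	<-created
	if id, ok := cloud.lookup("pv-1"); ok {
		fmt.Println("call 2: returned", id) // this id gets recorded in the PV
	}

	close(proceed)
	<-done

	_, ok := cloud.lookup("pv-1")
	fmt.Println("volume still exists:", ok) // false: the PV references a deleted volume
}
```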
Here are the logs from the CSI components (external-attacher, aws-ebs-csi-driver, external-provisioner).

What you expected to happen?
The PV should not be left in a broken state where the referenced volume is deleted.
How to reproduce it (as minimally and precisely as possible)?
Not applicable, but see the steps above that describe the race condition.
Anything else we need to know?:
Environment
Kubernetes version (use kubectl version): v1.18.16