Node service vs idempotency #307

bertinatto · 2019-06-11T08:58:50Z

/kind bug

I think our inFlight object isn't providing idempotency for the node service. For example:

1. CO calls NodeStageVolume for a 500 GB volume.
2. The driver starts formatting the volume.
3. CO calls NodeStageVolume again for the same volume.
4. The driver returns an error [0] because the first NodeStageVolume call hasn't returned yet (the driver is still formatting the volume).

Since the call is supposed to be idempotent, we shouldn't return an error in step 4.

I believe the best option is to have a per-volume lock, so that operations (stage, publish, resize, etc.) on a given volume are executed one at a time.
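A per-volume lock along these lines could look roughly like the sketch below. It is only an illustration, not code from the driver: `VolumeLocks`, `refMutex` and `maxConcurrent` are hypothetical names, and the "formatting" step is a placeholder comment.

```go
package main

import (
	"fmt"
	"sync"
)

// refMutex is a mutex plus a count of goroutines holding or waiting for it.
type refMutex struct {
	mu   sync.Mutex
	refs int
}

// VolumeLocks (hypothetical) hands out one mutex per volume ID.
type VolumeLocks struct {
	mu    sync.Mutex
	locks map[string]*refMutex
}

func NewVolumeLocks() *VolumeLocks {
	return &VolumeLocks{locks: make(map[string]*refMutex)}
}

// Lock blocks until the caller owns the lock for volumeID and
// returns the function that releases it.
func (v *VolumeLocks) Lock(volumeID string) (unlock func()) {
	v.mu.Lock()
	l, ok := v.locks[volumeID]
	if !ok {
		l = &refMutex{}
		v.locks[volumeID] = l
	}
	l.refs++
	v.mu.Unlock()

	l.mu.Lock()
	return func() {
		l.mu.Unlock()
		v.mu.Lock()
		l.refs--
		if l.refs == 0 {
			delete(v.locks, volumeID) // nobody holds or waits: drop the entry
		}
		v.mu.Unlock()
	}
}

// maxConcurrent starts n concurrent operations on one volume and reports
// how many were ever inside the locked section at the same time.
func maxConcurrent(locks *VolumeLocks, volumeID string, n int) int {
	var wg sync.WaitGroup
	var mu sync.Mutex
	active, peak := 0, 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			unlock := locks.Lock(volumeID)
			defer unlock()
			mu.Lock()
			active++
			if active > peak {
				peak = active
			}
			mu.Unlock()
			// The stage/publish/resize work would happen here.
			mu.Lock()
			active--
			mu.Unlock()
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	fmt.Println(maxConcurrent(NewVolumeLocks(), "vol-1", 8)) // prints 1
}
```

With this shape a second NodeStageVolume call for the same volume would wait for the first one instead of failing, while calls for other volumes proceed in parallel.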

[0] https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/v0.3.0/pkg/driver/node.go#L84-L91

CC @leakingtapan @dkoshkin


leakingtapan · 2019-06-11T18:29:31Z

The inFlight object locks around the NodeStageVolumeRequest, which contains the volume ID along with other fields like GetStagingTargetPath. What if there is a node request for the same volume but a different path?

bertinatto · 2019-06-12T08:22:55Z

> The inFlight object locks around the NodeStageVolumeRequest which contains the volume ID with other fields like GetStagingTargetPath

I think the inFlight object only locks when the key (which is a request's hash containing the volume ID and other fields, like you mentioned above) is being inserted and deleted:

aws-ebs-csi-driver/pkg/driver/internal/inflight.go

Lines 48 to 61 in 2ea44a9

    
    func (db *InFlight) Insert(entry Idempotent) bool {
    	db.mux.Lock()
    	defer db.mux.Unlock()

    	hash := entry.String()

    	_, ok := db.inFlight[hash]
    	if ok {
    		return false
    	}

    	db.inFlight[hash] = true
    	return true
    }

It doesn't hold the lock until the first NodeStageVolume call completes and then start processing the second one; instead, it returns an error for the second call. Note that the responses to these two calls (which are addressed to the same volume, path, etc.) will be different, so there is no idempotency.
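To make the behavior concrete, here is a small self-contained reproduction. `Insert` follows the snippet quoted above; the `Delete` method, the `NewInFlight` constructor and the `stageRequest` type are assumptions added for illustration and may not match the driver's actual code.

```go
package main

import (
	"fmt"
	"sync"
)

// Idempotent is anything that can be reduced to a string key.
type Idempotent interface {
	String() string
}

type InFlight struct {
	mux      sync.Mutex
	inFlight map[string]bool
}

func NewInFlight() *InFlight {
	return &InFlight{inFlight: make(map[string]bool)}
}

// Insert mirrors the quoted code: it only guards the map update and
// reports false if the same key is already being processed.
func (db *InFlight) Insert(entry Idempotent) bool {
	db.mux.Lock()
	defer db.mux.Unlock()

	hash := entry.String()
	if _, ok := db.inFlight[hash]; ok {
		return false
	}
	db.inFlight[hash] = true
	return true
}

// Delete (assumed) removes the key once the request has finished.
func (db *InFlight) Delete(entry Idempotent) {
	db.mux.Lock()
	defer db.mux.Unlock()
	delete(db.inFlight, entry.String())
}

// stageRequest is a hypothetical stand-in for NodeStageVolumeRequest.
type stageRequest struct {
	volumeID    string
	stagingPath string
}

func (r stageRequest) String() string {
	return r.volumeID + "|" + r.stagingPath
}

func main() {
	db := NewInFlight()
	req := stageRequest{"vol-1", "/staging/a"}

	fmt.Println(db.Insert(req))                                 // true: first call starts
	fmt.Println(db.Insert(req))                                 // false: duplicate is rejected, not queued
	fmt.Println(db.Insert(stageRequest{"vol-1", "/staging/b"})) // true: same volume, different path
	db.Delete(req)
	fmt.Println(db.Insert(req)) // true: accepted again after the first call finished
}
```

The second `Insert` returning false is the point where the driver turns a duplicate request into an error.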

> what if there is an node request for the same volume but different path?

Currently the driver will process both requests at the same time, which is OK in this case. However, if the request is for the same volume and same path, it'll return an error, which is not correct.

Note that the problem is not about the key being used, but instead the error returned by the driver. According to the specs, the key should contain the volume ID, the path and the capabilities. We are mostly OK here, but we can't return an error if the same key is being processed:

> This operation MUST be idempotent. If the volume corresponding to the volume_id is already staged to the staging_target_path, and is identical to the specified volume_capability the Plugin MUST reply 0 OK.
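A minimal sketch of what that requirement implies for the handler, under simplifying assumptions: `Stager`, `stageKey` and the string `capability` are hypothetical, a single mutex serializes everything (a per-volume lock would be finer-grained), and the counter stands in for the real format and mount work.

```go
package main

import (
	"fmt"
	"sync"
)

// stageKey identifies a staging operation by volume and target path.
type stageKey struct {
	volumeID    string
	stagingPath string
}

// Stager (hypothetical) remembers what has been staged so that
// repeated identical requests succeed without redoing the work.
type Stager struct {
	mu      sync.Mutex
	staged  map[stageKey]string // value: capability used when staging
	formats int                 // how many times the expensive work ran
}

func NewStager() *Stager {
	return &Stager{staged: make(map[stageKey]string)}
}

// Stage returns nil when the volume is already staged identically and
// an error only when the existing staging is incompatible.
func (s *Stager) Stage(volumeID, stagingPath, capability string) error {
	s.mu.Lock() // a concurrent duplicate waits here instead of failing
	defer s.mu.Unlock()

	key := stageKey{volumeID, stagingPath}
	if got, ok := s.staged[key]; ok {
		if got == capability {
			return nil // already staged: reply OK
		}
		return fmt.Errorf("volume %s is staged at %s with capability %q, requested %q",
			volumeID, stagingPath, got, capability)
	}

	s.formats++ // placeholder for formatting and mounting the device
	s.staged[key] = capability
	return nil
}

func main() {
	s := NewStager()
	fmt.Println(s.Stage("vol-1", "/staging/a", "ext4")) // <nil>
	fmt.Println(s.Stage("vol-1", "/staging/a", "ext4")) // <nil>
	fmt.Println(s.formats)                              // 1
}
```

The repeated call gets the same successful response as the first one, and the expensive work runs only once.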

fejta-bot · 2019-09-10T09:15:08Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.