Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CephFS: prevent hanging NodeGetVolumeStats on stat() syscall when an MDS is slow #4200

Merged
merged 4 commits into from
Nov 3, 2023

Conversation

nixpanic
Copy link
Member

@nixpanic nixpanic commented Oct 17, 2023

When an MDS is slow, the NodeGetVolumeStats CSI procedure may hang for a while. In case the MDS does not respond at all (or the reply is lost), the stat() syscall may hang indefinitely.

Because Kubelet retries the NodeGetVolumeStats CSI procedure regularly, the number of hanging stat() syscalls can pike up. Each hanging call will consume a thread (go routine), and this may eventually starve the processing capabilities of the CephFS CSI-nodeplugin container.

The commits in this PR come from #4125, which is planned to have a little more features (like a separate data/ directory for the contents of the volume).

Manual Testing

  1. create a PVC and start a Pod that uses it
  2. let it run for a few minutes, confirm that NodeGetVolumeStats calls are in the CephFS node-plugin logs
  3. scale down the MDS(s), to simulate a failure
  4. check the CephFS node-plugin logs and see if there are unhealthy messages for NodeGetVolumeStats calls

Results (see GRPC response):

I1024 12:31:13.033160       1 utils.go:195] ID: 11 GRPC call: /csi.v1.Node/NodeGetCapabilities
I1024 12:31:13.033195       1 utils.go:206] ID: 11 GRPC request: {}
I1024 12:31:13.033274       1 utils.go:212] ID: 11 GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":2}}},{"Type":{"Rpc":{"type":4}}},{"Type":{"Rpc":{"type":5}}}]}
I1024 12:31:13.034119       1 utils.go:195] ID: 12 GRPC call: /csi.v1.Node/NodeGetVolumeStats
I1024 12:31:13.034261       1 utils.go:206] ID: 12 GRPC request: {"volume_id":"0001-0011-openshift-storage-0000000000000001-af951451-f063-4a67-bf7c-8e788bdc587c","volume_path":"/var/lib/kubelet/pods/b2a6ae50-16a9-4326-83d5-58aafcb70f90/volumes/kubernetes.io~csi/pvc-66c3bfe6-1046-438b-8070-60b070b71149/mount"}
I1024 12:31:13.035323       1 utils.go:212] ID: 12 GRPC response: {"usage":[{"available":1073741824,"total":1073741824,"unit":1}],"volume_condition":{"message":"volume is in a healthy condition"}}

I1024 12:32:33.878002       1 utils.go:195] ID: 13 GRPC call: /csi.v1.Node/NodeGetCapabilities
I1024 12:32:33.878281       1 utils.go:206] ID: 13 GRPC request: {}
I1024 12:32:33.878379       1 utils.go:212] ID: 13 GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":2}}},{"Type":{"Rpc":{"type":4}}},{"Type":{"Rpc":{"type":5}}}]}
I1024 12:32:33.879443       1 utils.go:195] ID: 14 GRPC call: /csi.v1.Node/NodeGetVolumeStats
I1024 12:32:33.879502       1 utils.go:206] ID: 14 GRPC request: {"volume_id":"0001-0011-openshift-storage-0000000000000001-af951451-f063-4a67-bf7c-8e788bdc587c","volume_path":"/var/lib/kubelet/pods/b2a6ae50-16a9-4326-83d5-58aafcb70f90/volumes/kubernetes.io~csi/pvc-66c3bfe6-1046-438b-8070-60b070b71149/mount"}
I1024 12:32:33.879549       1 utils.go:212] ID: 14 GRPC response: {"volume_condition":{"abnormal":true,"message":"health-check has not responded for 119.215949 seconds"}}
I1024 12:34:17.335423       1 utils.go:195] ID: 15 GRPC call: /csi.v1.Node/NodeGetCapabilities
I1024 12:34:17.335461       1 utils.go:206] ID: 15 GRPC request: {}
I1024 12:34:17.335542       1 utils.go:212] ID: 15 GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":2}}},{"Type":{"Rpc":{"type":4}}},{"Type":{"Rpc":{"type":5}}}]}
I1024 12:34:17.336271       1 utils.go:195] ID: 16 GRPC call: /csi.v1.Node/NodeGetVolumeStats
I1024 12:34:17.336350       1 utils.go:206] ID: 16 GRPC request: {"volume_id":"0001-0011-openshift-storage-0000000000000001-af951451-f063-4a67-bf7c-8e788bdc587c","volume_path":"/var/lib/kubelet/pods/b2a6ae50-16a9-4326-83d5-58aafcb70f90/volumes/kubernetes.io~csi/pvc-66c3bfe6-1046-438b-8070-60b070b71149/mount"}
I1024 12:34:17.336442       1 utils.go:212] ID: 16 GRPC response: {"volume_condition":{"abnormal":true,"message":"health-check has not responded for 222.672824 seconds"}}
I1024 12:35:55.149537       1 utils.go:195] ID: 17 GRPC call: /csi.v1.Node/NodeGetCapabilities
I1024 12:35:55.149583       1 utils.go:206] ID: 17 GRPC request: {}
I1024 12:35:55.149667       1 utils.go:212] ID: 17 GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":2}}},{"Type":{"Rpc":{"type":4}}},{"Type":{"Rpc":{"type":5}}}]}
I1024 12:35:55.150294       1 utils.go:195] ID: 18 GRPC call: /csi.v1.Node/NodeGetVolumeStats
I1024 12:35:55.150331       1 utils.go:206] ID: 18 GRPC request: {"volume_id":"0001-0011-openshift-storage-0000000000000001-af951451-f063-4a67-bf7c-8e788bdc587c","volume_path":"/var/lib/kubelet/pods/b2a6ae50-16a9-4326-83d5-58aafcb70f90/volumes/kubernetes.io~csi/pvc-66c3bfe6-1046-438b-8070-60b070b71149/mount"}
I1024 12:35:55.150431       1 utils.go:212] ID: 18 GRPC response: {"volume_condition":{"abnormal":true,"message":"health-check has not responded for 320.486780 seconds"}}

Scale up the MDS(s) again, see that the volume becomes healthy again:

I1024 12:39:03.241203       1 utils.go:195] ID: 21 GRPC call: /csi.v1.Node/NodeGetCapabilities
I1024 12:39:03.241236       1 utils.go:206] ID: 21 GRPC request: {}
I1024 12:39:03.241308       1 utils.go:212] ID: 21 GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":2}}},{"Type":{"Rpc":{"type":4}}},{"Type":{"Rpc":{"type":5}}}]}
I1024 12:39:03.242094       1 utils.go:195] ID: 22 GRPC call: /csi.v1.Node/NodeGetVolumeStats
I1024 12:39:03.242132       1 utils.go:206] ID: 22 GRPC request: {"volume_id":"0001-0011-openshift-storage-0000000000000001-af951451-f063-4a67-bf7c-8e788bdc587c","volume_path":"/var/lib/kubelet/pods/b2a6ae50-16a9-4326-83d5-58aafcb70f90/volumes/kubernetes.io~csi/pvc-66c3bfe6-1046-438b-8070-60b070b71149/mount"}
I1024 12:39:03.242220       1 utils.go:212] ID: 22 GRPC response: {"volume_condition":{"abnormal":true,"message":"health-check has not responded for 448.545782 seconds"}}
I1024 12:40:40.223909       1 utils.go:195] ID: 23 GRPC call: /csi.v1.Node/NodeGetCapabilities
I1024 12:40:40.223962       1 utils.go:206] ID: 23 GRPC request: {}
I1024 12:40:40.224044       1 utils.go:212] ID: 23 GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":2}}},{"Type":{"Rpc":{"type":4}}},{"Type":{"Rpc":{"type":5}}}]}
I1024 12:40:40.224870       1 utils.go:195] ID: 24 GRPC call: /csi.v1.Node/NodeGetVolumeStats
I1024 12:40:40.224989       1 utils.go:206] ID: 24 GRPC request: {"volume_id":"0001-0011-openshift-storage-0000000000000001-af951451-f063-4a67-bf7c-8e788bdc587c","volume_path":"/var/lib/kubelet/pods/b2a6ae50-16a9-4326-83d5-58aafcb70f90/volumes/kubernetes.io~csi/pvc-66c3bfe6-1046-438b-8070-60b070b71149/mount"}
I1024 12:40:40.225807       1 utils.go:212] ID: 24 GRPC response: {"usage":[{"available":1073741824,"total":1073741824,"unit":1}],"volume_condition":{"message":"volume is in a healthy condition"}}

Show available bot commands

These commands are normally not required, but in case of issues, leave any of
the following bot commands in an otherwise empty comment in this PR:

  • /retest ci/centos/<job-name>: retest the <job-name> after unrelated
    failure (please report the failure too!)

@nixpanic nixpanic force-pushed the bug/2244562 branch 2 times, most recently from bbeced9 to 4e8007a Compare October 20, 2023 08:02
@nixpanic nixpanic marked this pull request as draft October 20, 2023 08:02
@nixpanic nixpanic added the component/cephfs Issues related to CephFS label Oct 24, 2023
@nixpanic nixpanic marked this pull request as ready for review October 24, 2023 14:25
@nixpanic nixpanic requested a review from a team October 24, 2023 14:25
Comment on lines 275 to 278
err = ns.healthChecker.StartChecker(req.GetVolumeId(), stagingTargetPath)
log.WarningLog(ctx, "failed to start healtchecker: %v", err)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the volume is already mounted (case where csi plugin just restarted)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, there it no health-checker.

Possibly NodeGetVolumeStats can be extended to start the checker to catch this.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need this , can we have a tracker for this one?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, makes sense. #4219 has been created for this.

@@ -599,6 +606,9 @@ func (ns *NodeServer) NodeUnstageVolume(
}

volID := req.GetVolumeId()

ns.healthChecker.StopChecker(volID)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should happen right before umount but after all other operations right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be done early, yes. Unmounting may not be possible if the checker writes/reads from the mountpoint at the same time.

@@ -694,6 +711,18 @@ func (ns *NodeServer) NodeGetVolumeStats(
return nil, status.Error(codes.InvalidArgument, err.Error())
}

// health check first, return without stats if unhealthy
healthy, msg := ns.healthChecker.IsHealthy(req.GetVolumeId())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happens if the NodeGetVolumeStats is called right after the NodeStageVolume?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before running the 1st check, the volume is regarded healthy. It is the assumption that when mounting works, the volume is still healthy shortly after.

Comment on lines 744 to 748
if err != nil {
return nil, err
}

return res, nil
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we dont need this check and can have return csicommon.FilesystemNodeGetVolumeStats(ctx, ns.Mounter, targetPath, false) as it was before

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, yes. It was modified in an earlier version to set res.VolumeCondition = ... after getting the stats. ... that wasn't my smartest move, as getting the stats got hung, so the VolumeCondition was never set and returned 🤦‍♂️

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 27 to 34
// command is what is sent through the channel to terminate the go routine.
type command string

const (
// stopCommand is sent through the channel to stop checking.
stopCommand = command("STOP")
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can make use of empty struct to signal between go routines?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is possible, but what is the advantage?

I've chosen this way to make it extendable if needed, it does not matter at the moment what is sent over the channel.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In most cases empty structs are preferred to signal in the go routines, that's why the suggestion

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that a recommendation that is documented anywhere? I have seen several examples with integers and booleans, or structs that contain one or more members...

Comment on lines 156 to 150
fc.healthy = false
fc.err = fmt.Errorf("health-check has not responded for %f seconds", delay.Seconds())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will there be any problem with multiple goroutines trying to update the same variables?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really. There is a very small race in case one go-routine does not enter this if-statement, and an other does + updates the fc.err before the 1st go-routines exits the function. I don't think it is a realistic failure scenario to account for.

A same way can happen with this function and concurrent runChecker(). But it also looks quite an edge case.

If you insist, I can add some locking around updating/reading fc.healthy and fc.err.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Locking has been added.

return fmt.Errorf("failed to created workdir %q for health-checker: %w", workdir, err)
}

cc := newFileChecker(workdir)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This always starts a FileChecker what about BlockChecker for future?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When a BlockChecker is introduced, the Manager interface will need some modifications in any case.

Comment on lines +83 to +155
} else {
// 'cc' was stored, start it only once
cc.start()
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can happen only for Static PVC? because the NodeStageVolume will be called only once per volumeID

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NodeStageVolume might have failed/timedout after the checker was started. Calling NodeStageVolume again should be a no-op in that case, this just ensures some idempotency.


cc := newFileChecker(workdir)

// load the 'old' ConditionChecker if it exists, otherwuse store 'cc'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

otherwuse to otherwise

internal/health-checker/manager.go Outdated Show resolved Hide resolved
@nixpanic
Copy link
Member Author

@Madhu-1 and @riya-singhal31 , I have manually tested this again (by scaling MDS's down/up) and it still works as before. Could you review again and mark the addressed comments as resolved? Many thanks!

@@ -270,6 +272,9 @@ func (ns *NodeServer) NodeStageVolume(
}
}

err = ns.healthChecker.StartChecker(req.GetVolumeId(), stagingTargetPath)
log.WarningLog(ctx, "failed to start healtchecker: %v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
log.WarningLog(ctx, "failed to start healtchecker: %v", err)
log.WarningLog(ctx, "failed to start healthchecker: %v", err)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Collaborator

@Madhu-1 Madhu-1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some questions, Mostly LGTM

}

func (hcm *healthCheckManager) StartChecker(volumeID, path string) error {
workdir := filepath.Join(path, ".csi")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if the PVC already has this path or the application doesn't expect any folders inside the mounted directory?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other filesystems also maintain their own metadata in the root, like lost+found on ext4. Applications that (do not) expect certain filesystem internal metadata should be quite rare, as they are not very portable.

Would you prefer to have an option per volume to use a different directory, or disable it?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree on that one, this is something we are introducing, but not all the application owners/developers will be aware of it. IMHO we should not write or touch anything in the path that is provided to the application pods, I would prefer the below

  • Mount the subvolume to the stagingPath and create a new directory as in the design doc and mount that directory to the targetPath
  • Do health check only inside the stagingPath which is not available to the application pod

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For legacy volumes, we could also do a minimal stat() on the root of the volume, instead of writing/reading a timestamp. It might not be able to detect all failures scenarios, but it would be a non-intrusive check.

Newly created volumes can have their own data subdirectory that is mapped for use inside of the container, leaving the root of the volume for contents that is not visible to applications. That can be done in PR #4125 once this one is accepted.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For legacy volumes, we could also do a minimal stat() on the root of the volume, instead of writing/reading a timestamp. It might not be able to detect all failures scenarios, but it would be a non-intrusive check.

It's a temporary problem, once the node gets rebooted or the application gets restagted we will move to the new format. We can live with it.

Newly created volumes can have their own data subdirectory that is mapped for use inside of the container, leaving the root of the volume for contents that is not visible to applications. That can be done in PR #4125 once this one is accepted.

Sounds good 👍🏻

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added statChecker and will use that in the nodeserver for now. fileChecker is still included so that the refactoring with a base checker makes sense.

@@ -270,6 +272,9 @@ func (ns *NodeServer) NodeStageVolume(
}
}

err = ns.healthChecker.StartChecker(req.GetVolumeId(), stagingTargetPath)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same check needs to be done at line 233 where we are returning success if the directory is already mounted

Comment on lines 275 to 278
err = ns.healthChecker.StartChecker(req.GetVolumeId(), stagingTargetPath)
log.WarningLog(ctx, "failed to start healtchecker: %v", err)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need this , can we have a tracker for this one?

@@ -270,6 +275,9 @@ func (ns *NodeServer) NodeStageVolume(
}
}

err = ns.healthChecker.StartChecker(req.GetVolumeId(), stagingTargetPath)
log.WarningLog(ctx, "failed to start healthchecker: %v", err)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be done inside if err !=nil check

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done!

@@ -228,6 +230,9 @@ func (ns *NodeServer) NodeStageVolume(
return nil, status.Error(codes.Internal, err.Error())
}

err = ns.healthChecker.StartChecker(req.GetVolumeId(), stagingTargetPath)
log.WarningLog(ctx, "failed to start healthchecker: %v", err)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be done inside if err !=nil check

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

// load the 'old' ConditionChecker if it exists
old, ok := hcm.checkers.Load(volumeID)
if !ok {
return true, fmt.Errorf("no ConditionChecker for volume-id: %s", volumeID)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error will not checked as we are returning volume as Healthy and also the error message

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a check for that now, will start a checker in that case too.

// 'old' was loaded, cast it to ConditionChecker
cc, ok := old.(ConditionChecker)
if !ok {
return true, fmt.Errorf("failed to cast cc to ConditionChecker for volume-id %q", volumeID)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error will not checked as we are returning volume as Healthy and also the error message

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is an error in the internals of the health-checker, it is assumed that the volume is healthy. Only when there really is a problem with the volume, the status is reported as unhealthy.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we log and also document this one? we might need to have better logging to check whether the health checker is running and what was the latest run details (by looking at the logs), maybe we can do it as an enhancement later.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added more docs about it now.

@nixpanic
Copy link
Member Author

nixpanic commented Nov 2, 2023

Lots of manual testing done. including the following combinations:

pod uses PVC and

  1. is healthy
  2. csi-cephfsplugin pods restarted, volume stays healthy
  3. MDS' are stopped, volume becomes unhealthy
  4. MDS' restarted, volume becomes healthy again
  5. csi-cephfsplugin pods restarted when MDS' are unavailable, volume stays unhealthy
  6. csi-cephfsplugin restarted while unhealthy, MDS' restarted, volume becomes healthy again

riya-singhal31
riya-singhal31 previously approved these changes Nov 3, 2023
Copy link
Contributor

@riya-singhal31 riya-singhal31 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Niels, LGTM

Copy link
Collaborator

@Madhu-1 Madhu-1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, left a question and small nit

// volume is already staged and published.
if healthy && msg != nil {
// Start a StatChecker for the mounted targetPath, yhis prevents
// writing a file in the user-visible location. Ideally a (shared)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yhis to this

// writing a file in the user-visible location. Ideally a (shared)
// FileChecker is started with the stagingTargetPath, but we can't
// get the stagingPath from the request.
err = ns.healthChecker.StartChecker(req.GetVolumeId(), targetPath, hc.StatCheckerType)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use something like below to get the stagingPath from targetPath

func (ns *NodeServer) getStagingPath(volPath string) (string, error) {
mounts, err := ns.Mounter.GetMountRefs(volPath)
if err != nil {
return "", err
}
for _, mount := range mounts {
// strip the last directory from the staging path
stp := strings.Split(mount, "/")
stagingTargetPath := strings.Join(stp[:len(stp)-1], "/")
if checkRBDImageMetadataStashExists(stagingTargetPath) {
return stagingTargetPath, nil
}
}
return "", fmt.Errorf("failed to get staging path for volume %s", volPath)
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like this could be an enhancement. I am not sure if it is worth the effort and additional complexity though.

When a FileChecker is used, it would make more sense as that could detect the existence of the file holding the timestamp.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a TODO comment with reference to rbd.getStagingPath()

// writing a file in the user-visible location. Ideally a (shared)
// FileChecker is started with the stagingTargetPath, but we can't
// get the stagingPath from the request.
err = ns.healthChecker.StartChecker(req.GetVolumeId(), targetPath, hc.StatCheckerType)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i also see a very small race for the below case

  • During NodeUnpublish cephcsi stopped the health checker
  • Before umount operation we got a NodeGetVolumeStats RPC call which inturn started a health checker
  • Now we have a stale health check which will be cleanedup only during NodeUnstage (i think)

Is it possible to have such a case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not expect the CO to request NodeUnpublish and then still use the same publishTargetPath in a later NodeGetVolumeStats. If a CO does that, then yes, there would be a stale health checker (if it was started in a NodeGetVolumeStats call, and not in NodeStageVolume).

@nixpanic
Copy link
Member Author

nixpanic commented Nov 3, 2023

@Madhu-1 I have addressed your latest concerns as well. Please check again, thanks!

@mergify mergify bot dismissed riya-singhal31’s stale review November 3, 2023 08:35

Pull request has been modified.

Copy link
Collaborator

@Madhu-1 Madhu-1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, Thanks @nixpanic

@nixpanic
Copy link
Member Author

nixpanic commented Nov 3, 2023

@Mergifyio rebase

Signed-off-by: Niels de Vos <ndevos@ibm.com>
Signed-off-by: Niels de Vos <ndevos@ibm.com>
The HealthChecker is configured to use the Staging path pf the volume,
with a `.csi/` subdirectory. In the future this directory could be a
directory that is not under the Published directory.

Fixes: ceph#4219
Signed-off-by: Niels de Vos <ndevos@ibm.com>
When FilesystemNodeGetVolumeStats() succeeds, the volume must be
healthy. This can be included in the VolumeCondition CSI message by
default.

Checks that detect an abnormal VolumeCondition should prevent calling
FilesystemNodeGetVolumeStats() as it is possible that the function will
hang.

Signed-off-by: Niels de Vos <ndevos@ibm.com>
Copy link
Contributor

mergify bot commented Nov 3, 2023

rebase

✅ Branch has been successfully rebased

@nixpanic
Copy link
Member Author

nixpanic commented Nov 3, 2023

@Mergifyio queue

Copy link
Contributor

mergify bot commented Nov 3, 2023

queue

✅ The pull request has been merged automatically

The pull request has been merged automatically at 4d3b1fc

@mergify mergify bot added the ok-to-test Label to trigger E2E tests label Nov 3, 2023
@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/upgrade-tests-cephfs

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/upgrade-tests-rbd

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/k8s-e2e-external-storage/1.27

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/k8s-e2e-external-storage/1.26

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/mini-e2e-helm/k8s-1.27

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/k8s-e2e-external-storage/1.28

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/mini-e2e/k8s-1.27

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/mini-e2e-helm/k8s-1.26

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/mini-e2e-helm/k8s-1.28

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/mini-e2e/k8s-1.28

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/mini-e2e/k8s-1.26

@ceph-csi-bot ceph-csi-bot removed the ok-to-test Label to trigger E2E tests label Nov 3, 2023
@mergify mergify bot merged commit 4d3b1fc into ceph:devel Nov 3, 2023
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/cephfs Issues related to CephFS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants