-
Notifications
You must be signed in to change notification settings - Fork 217
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feature: support external-health-monitor #210
Conversation
/assign |
This is a new feature. Please add release notes. |
The implementation of support external-health-monitor in host-path-driver has been finished. I have tested this PR in my personal env. But I'm not sure If I need to change the orchestration file in deploy directory to add related argument of deploying external-health-monitor-controller and agent. @xing-yang After you reviewed the implementation, please tell me if i need to do it |
Add a release note. |
Can you take a look at the CI failures? Did you do |
pkg/hostpath/healthcheck.go
Outdated
return false, "The source path of the volume doesn't exist" | ||
} | ||
|
||
mpExist, err := checkMountPointExist(volumeHandle) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think only this mountpoint check is relevant for NodeGetVolumeStats.
Other checks are for ListVolumes() and GetVolume() from the controller side.
pkg/hostpath/nodeserver.go
Outdated
return nil, status.Error(codes.NotFound, "The volume not found") | ||
} | ||
|
||
healthy, msg := doHealthCheck(in.GetVolumeId()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like you are calling the same doHealthCheck
for NodeGetVolumeStats, ListVolumes, and GetVolume. That will produce duplicate events on PVCs and Pods. Health check from controller side should be separate from node side so there should not be duplicate events.
For NodeGetVolumeStats, we should only need to check things like the mount condition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly, it’s hard to separate health check cases in the host-path driver. Because host-path-driver runs all of CSI components on the same node.
But, to avoid duplicate the event, it's one of the reasons we need to separate them.
/assign @NickrenREN |
hostpath unit task still fails:
|
@NickrenREN I already fixed it in this PR kubernetes-csi/csi-release-tools#128. We need to merge kubernetes-csi/csi-release-tools#128 first. And submit a new PR to host-path repo to update prow.sh by git subtree command. Finally, we can rebase this PR to resolve all of test failure |
get it, thanks |
thanks for this. /lgtm |
Under the section "Special notes for your reviewer" in the PR description, can you add more details on what are checked and what are reported as abnormal volume condition in the health monitor controller and agent respectively? |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: fengzixu, xing-yang The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/retest |
3 similar comments
/retest |
/retest |
/retest |
@fengzixu: The following tests failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
@@ -4,6 +4,6 @@ LABEL description="HostPath Driver" | |||
ARG binary=./bin/hostpathplugin | |||
|
|||
# Add util-linux to get a new version of losetup. | |||
RUN apk add util-linux | |||
RUN apk add util-linux && apk update && apk upgrade |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is this related to "support external-health-monitor"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I implement this feature, I found I need run findmnt command with json
argument. It was supported in newer version. So, I update it
fmt.Printf("Failed to run driver: %s", err.Error()) | ||
os.Exit(1) | ||
|
||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here? This looks like an unrelated enhancement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An error may be returned from that function, just checking it right?
Updating E2E testing in Kubernetes from csi-driver-host-path 1.4.0 to 1.6.2 causes jobs to fail and the code from this PR seems to be involved, see kubernetes/kubernetes#100637 (comment). |
Do we have any test for the |
for _, pv := range mountInfosOfPod.ContainerFileSystem { | ||
if !strings.Contains(pv.Target, csiSignOfVolumeTargetPath) { | ||
continue | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this code distinguish between mounts from some other CSI driver and the mounts of this hostpath driver instance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it doesn't.
Both podVolumeTargetPath = /var/lib/kubelet/pods
and csiSignOfVolumeTargetPath = kubernetes.io~csi/pvc
are shared with all other CSI driver instances.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Besides the obvious failure of adding volumes that aren't actually managed by this driver, we also get a race condition that leads to the "failed to get capacity info: no such file or directory" error from kubernetes/kubernetes#100637 (comment)
- one CSI driver starts listing mounts
- another CSI driver concurrently unmounts
- the first CSI driver tries to get capacity information which then fails
Does that make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is discoveryExistingVolumes
related to volume health checks?
I understand the desire to support container restarts. I just don't understand how it is related to health checks.
Regarding container restarts: the information restored is incomplete (for example, the more recently added nominal capacity of a volume cannot be discovered). Instead of making this code more complex, can we simply rip it out and go back to the state from the hostpath 1.4 release, i.e. the driver must not be restarted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same with discovering snapshots?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding the implementation of this volume discovery: wouldn't it be better to use a JSON file in the dataDir
to store the in-memory list each time changes are made to it? Then we can restore all information.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When testing upgrade for moving snapshot from v1beta1 to v1, for example, driver restart will happen. So we need to handle driver restarts.
I'm open to different ways of implementing this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So you needed that for some other purpose. Now it makes more sense to me.
I started implementing this, but won't have time to finish it. If someone else wants to take a stab at it, feel free.
IMHO the code which Volume and Snaphshot structs and the corresponding maps should be in its own package, with get/add/update/delete functions that have to be used to make changes. Currently it's in the same package and I already found code which directly manipulated the maps instead of using the existing helper functions.
Then in that package, we can dump the entire struct into one file to store volumes and snapshots. That removes a whole lot of custom code, including the one which we discuss here. Combine that with a configurable data dir and we can add unit tests for this...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fengzixu can you take a stab at this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Let me try
b54c1ba Merge pull request kubernetes-csi#246 from xing-yang/go_1.21 5436c81 Change go version to 1.21.5 267b40e Merge pull request kubernetes-csi#244 from carlory/sig-storage b42e5a2 nominate self (carlory) as kubernetes-csi reviewer a17f536 Merge pull request kubernetes-csi#210 from sunnylovestiramisu/sidecar 011033d Use set -x instead of die 5deaf66 Add wrapper script for sidecar release git-subtree-dir: release-tools git-subtree-split: b54c1ba
Add wrapper script for sidecar release
What type of PR is this?
/kind feature
What this PR does / why we need it:
Implementing the
ListVolumes
,GetVolume
andNodeStats
interfaces to support external-health-monitor projectWhich issue(s) this PR fixes:
Fixes kubernetes-csi/external-health-monitor#21
Special notes for your reviewer:
Does this PR introduce a user-facing change?: