feature: support external-health-monitor #210

fengzixu · 2020-10-18T23:49:58Z

What type of PR is this?

/kind feature

What this PR does / why we need it:

Implements the ListVolumes, GetVolume, and NodeGetVolumeStats interfaces to support the external-health-monitor project.

Which issue(s) this PR fixes:

Fixes kubernetes-csi/external-health-monitor#21

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Support the external-health-monitor feature in the hostpath driver:
- Check whether the volume's mount path in the pod exists
- Check whether the filesystem that the volume relies on is running out of capacity
- Check whether the volume is almost full
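
As a rough illustration of the first and third checks listed above, here is a minimal Go sketch. It is not the code from this PR: the helper names `pathExists` and `almostFull` and the 90% threshold are assumptions chosen for the example.

```go
package main

import (
	"fmt"
	"os"
)

// usageThreshold is a hypothetical cutoff; the PR does not state the value it uses.
const usageThreshold = 0.9

// pathExists reports whether path exists on the local filesystem.
// A "not exist" error is treated as a normal negative answer.
func pathExists(path string) (bool, error) {
	_, err := os.Stat(path)
	if err == nil {
		return true, nil
	}
	if os.IsNotExist(err) {
		return false, nil
	}
	return false, err
}

// almostFull reports whether usedBytes/capacityBytes reaches threshold.
// A non-positive capacity is treated as abnormal in this sketch.
func almostFull(usedBytes, capacityBytes int64, threshold float64) bool {
	if capacityBytes <= 0 {
		return true
	}
	return float64(usedBytes)/float64(capacityBytes) >= threshold
}

func main() {
	ok, err := pathExists(os.TempDir())
	fmt.Println("temp dir exists:", ok, "error:", err)
	fmt.Println("95 of 100 bytes used, almost full:", almostFull(95, 100, usageThreshold))
}
```

Keeping the usage check a pure function of two numbers makes it easy to unit test without touching a real filesystem.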

xing-yang · 2020-10-19T02:48:50Z

/assign

xing-yang · 2020-10-19T02:49:03Z

This is a new feature. Please add release notes.

fengzixu · 2020-11-07T02:35:49Z

The implementation of external-health-monitor support in the host-path driver is finished, and I have tested this PR in my personal environment. However, I'm not sure whether I need to change the orchestration files in the deploy directory to add the arguments for deploying the external-health-monitor controller and agent.

@xing-yang After you have reviewed the implementation, please tell me whether I need to do that.

xing-yang · 2020-11-07T03:40:54Z

Add a release note.

xing-yang · 2020-11-07T03:42:15Z

Can you take a look at the CI failures? Did you run `go mod vendor` and `go mod tidy`?

deploy/kubernetes-1.17/hostpath/csi-hostpath-driverinfo.yaml

cmd/hostpathplugin/main.go

pkg/hostpath/controllerserver.go

pkg/hostpath/healthcheck.go

pkg/hostpath/hostpath.go

go.mod

xing-yang · 2020-11-15T15:02:01Z

pkg/hostpath/healthcheck.go

+		return false, "The source path of the volume doesn't exist"
+	}
+
+	mpExist, err := checkMountPointExist(volumeHandle)


I think only this mountpoint check is relevant for NodeGetVolumeStats.
Other checks are for ListVolumes() and GetVolume() from the controller side.

xing-yang · 2020-11-15T15:04:30Z

pkg/hostpath/nodeserver.go

+		return nil, status.Error(codes.NotFound, "The volume not found")
+	}
+
+	healthy, msg := doHealthCheck(in.GetVolumeId())


Looks like you are calling the same doHealthCheck for NodeGetVolumeStats, ListVolumes, and GetVolume. That will produce duplicate events on PVCs and Pods. Health check from controller side should be separate from node side so there should not be duplicate events.
For NodeGetVolumeStats, we should only need to check things like the mount condition.

Honestly, it's hard to separate the health check cases in the host-path driver, because it runs all of the CSI components on the same node.

But avoiding duplicate events is one of the reasons we need to separate them.
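
The split suggested in this review thread could look roughly like the following Go sketch, where the controller-side and node-side checks are separate functions that report different conditions. The `volumeState` type, both function names, and every message except the source-path one quoted from the diff are hypothetical.

```go
package main

import "fmt"

// volumeState is a hypothetical snapshot of what the driver knows about a volume.
type volumeState struct {
	SourcePathExists bool
	MountPointExists bool
	UsedBytes        int64
	CapacityBytes    int64
}

// controllerCheck covers the conditions reported through ListVolumes and GetVolume.
func controllerCheck(v volumeState) (bool, string) {
	if !v.SourcePathExists {
		return false, "The source path of the volume doesn't exist"
	}
	if v.CapacityBytes > 0 && v.UsedBytes >= v.CapacityBytes {
		return false, "The volume is out of capacity" // hypothetical message
	}
	return true, ""
}

// nodeCheck covers only the mount condition, for NodeGetVolumeStats.
func nodeCheck(v volumeState) (bool, string) {
	if !v.MountPointExists {
		return false, "The volume is not mounted" // hypothetical message
	}
	return true, ""
}

func main() {
	v := volumeState{SourcePathExists: false, MountPointExists: true}
	healthy, msg := controllerCheck(v)
	fmt.Println("controller:", healthy, msg)
	healthy, msg = nodeCheck(v)
	fmt.Println("node:", healthy, msg)
}
```

With this shape, a missing source path is reported only by the controller side and a missing mount point only by the node side, so the same condition is not reported twice.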

xing-yang · 2020-11-15T15:06:20Z

/assign @NickrenREN

NickrenREN · 2021-01-20T02:12:12Z

hostpath unit task still fails:

make: *** [release-tools/build.make:259: test-shellcheck] Error 1
make: Target 'test' not remade because of errors.
ERROR: 'make test' failed
WARNING: 'make test' failed, proceeding anyway

fengzixu · 2021-01-20T03:12:05Z

@NickrenREN I already fixed it in kubernetes-csi/csi-release-tools#128.
Actually, we cannot modify prow.sh directly in this PR; the change here is only for checking whether the other e2e tests pass.

We need to merge kubernetes-csi/csi-release-tools#128 first, then submit a new PR to the host-path repo that updates prow.sh with the git subtree command. Finally, we can rebase this PR to resolve all of the test failures.

NickrenREN · 2021-01-20T05:39:31Z

Got it, thanks.

NickrenREN · 2021-01-22T01:55:48Z

thanks for this.

/lgtm

xing-yang · 2021-01-25T17:40:01Z

Under the section "Special notes for your reviewer" in the PR description, can you add more details on what is checked and what is reported as an abnormal volume condition by the health monitor controller and agent, respectively?

xing-yang · 2021-01-25T18:37:54Z

/lgtm
/approve

k8s-ci-robot · 2021-01-25T18:38:02Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fengzixu, xing-yang

Needs approval from an approver in each of these files:

~~OWNERS~~ [xing-yang]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

xing-yang · 2021-01-25T18:46:21Z

/retest

xing-yang · 2021-01-25T18:55:33Z

/retest

xing-yang · 2021-01-26T01:49:59Z

/retest

xing-yang · 2021-01-26T02:55:58Z

/retest

k8s-ci-robot · 2021-01-26T02:58:04Z

@fengzixu: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| pull-kubernetes-csi-csi-driver-host-path-1-17-on-kubernetes-1-17 | `9e7831d` | link | `/test pull-kubernetes-csi-csi-driver-host-path-1-17-on-kubernetes-1-17` |
| pull-kubernetes-csi-csi-driver-host-path-1-20-on-kubernetes-1-20 | `9e7831d` | link | `/test pull-kubernetes-csi-csi-driver-host-path-1-20-on-kubernetes-1-20` |

pohly · 2021-01-25T18:25:28Z

Dockerfile

@@ -4,6 +4,6 @@ LABEL description="HostPath Driver"
 ARG binary=./bin/hostpathplugin

 # Add util-linux to get a new version of losetup.
-RUN apk add util-linux
+RUN apk add util-linux && apk update && apk upgrade


How is this related to "support external-health-monitor"?

While implementing this feature, I found that I need to run the findmnt command with the json argument, which is only supported in newer versions, so I updated it.
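
For context, decoding findmnt's JSON output in Go could look like the sketch below. It assumes the `--json` output format of util-linux's findmnt, with a top-level `filesystems` array; the struct names, the `parseFindmnt` helper, and the sample document are hand-written for illustration and are not taken from this PR. Real output may also nest child mounts, which this sketch ignores.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// fileSystem holds the fields of one findmnt entry that this sketch cares about.
type fileSystem struct {
	Target  string `json:"target"`
	Source  string `json:"source"`
	FsType  string `json:"fstype"`
	Options string `json:"options"`
}

// findmntOutput mirrors the top-level object of the JSON document.
type findmntOutput struct {
	FileSystems []fileSystem `json:"filesystems"`
}

// parseFindmnt decodes a findmnt JSON document into its filesystem entries.
func parseFindmnt(data []byte) ([]fileSystem, error) {
	var out findmntOutput
	if err := json.Unmarshal(data, &out); err != nil {
		return nil, fmt.Errorf("parse findmnt output: %w", err)
	}
	return out.FileSystems, nil
}

// sample is a hand-written document in the assumed format, not captured output.
const sample = `{"filesystems": [
  {"target": "/mnt/data", "source": "/dev/sda1", "fstype": "ext4", "options": "rw,relatime"}
]}`

func main() {
	mounts, err := parseFindmnt([]byte(sample))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, m := range mounts {
		fmt.Println(m.Target, m.Source, m.FsType)
	}
}
```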

pohly · 2021-01-25T18:25:50Z

cmd/hostpathplugin/main.go

+		fmt.Printf("Failed to run driver: %s", err.Error())
+		os.Exit(1)
+
+	}


Same here? This looks like an unrelated enhancement.

That function may return an error, so this just checks it, right?

pohly · 2021-04-01T10:41:13Z

Updating E2E testing in Kubernetes from csi-driver-host-path 1.4.0 to 1.6.2 causes jobs to fail and the code from this PR seems to be involved, see kubernetes/kubernetes#100637 (comment).

pohly · 2021-04-01T10:43:08Z

Do we have any test for the discoveryExistingVolumes function?

pohly · 2021-04-01T11:50:07Z

pkg/hostpath/hostpath.go

+	for _, pv := range mountInfosOfPod.ContainerFileSystem {
+		if !strings.Contains(pv.Target, csiSignOfVolumeTargetPath) {
+			continue
+		}


How does this code distinguish between mounts from some other CSI driver and the mounts of this hostpath driver instance?

I think it doesn't.

Both podVolumeTargetPath = /var/lib/kubelet/pods and csiSignOfVolumeTargetPath = kubernetes.io~csi/pvc are shared with all other CSI driver instances.

Besides the obvious failure of adding volumes that aren't actually managed by this driver, we also get a race condition that leads to the "failed to get capacity info: no such file or directory" error from kubernetes/kubernetes#100637 (comment):

- one CSI driver starts listing mounts
- another CSI driver concurrently unmounts
- the first CSI driver tries to get capacity information, which then fails

Does that make sense?
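
To make the concern concrete, the Go sketch below shows that a substring match on `csiSignOfVolumeTargetPath` accepts mounts from any CSI driver. The `mountEntry` type, the `ownedBy` filter, the simplified `Source` field, and the example paths are all hypothetical; the stricter filter is only one possible direction, not the fix that was eventually adopted.

```go
package main

import (
	"fmt"
	"strings"
)

// csiSignOfVolumeTargetPath is the value quoted in the review comment.
// Every CSI driver on the node uses target paths that contain it.
const csiSignOfVolumeTargetPath = "kubernetes.io~csi/pvc"

// mountEntry is a hypothetical, simplified mount record.
type mountEntry struct {
	Source string // path backing the mount (simplified for illustration)
	Target string // path where the volume is mounted for the pod
}

// looksLikeCSIVolume mirrors the strings.Contains check from the diff.
func looksLikeCSIVolume(m mountEntry) bool {
	return strings.Contains(m.Target, csiSignOfVolumeTargetPath)
}

// ownedBy is a hypothetical stricter filter that also requires the mount
// source to live under this driver instance's data directory.
func ownedBy(m mountEntry, dataDir string) bool {
	prefix := strings.TrimSuffix(dataDir, "/") + "/"
	return looksLikeCSIVolume(m) && strings.HasPrefix(m.Source, prefix)
}

func main() {
	mounts := []mountEntry{
		{Source: "/csi-data-dir/vol-1", Target: "/var/lib/kubelet/pods/pod-a/volumes/kubernetes.io~csi/pvc-1/mount"},
		{Source: "/other-driver/vol-2", Target: "/var/lib/kubelet/pods/pod-b/volumes/kubernetes.io~csi/pvc-2/mount"},
	}
	for _, m := range mounts {
		fmt.Println(m.Source, "csi:", looksLikeCSIVolume(m), "owned:", ownedBy(m, "/csi-data-dir"))
	}
}
```

Even with a stricter filter, a mount can still disappear between listing and inspection, so the caller would also need to tolerate "not found" errors.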

How is discoveryExistingVolumes related to volume health checks?

I understand the desire to support container restarts. I just don't understand how it is related to health checks.

Regarding container restarts: the information restored is incomplete (for example, the more recently added nominal capacity of a volume cannot be discovered). Instead of making this code more complex, can we simply rip it out and go back to the state from the hostpath 1.4 release, i.e. the driver must not be restarted?

Same with discovering snapshots?

Regarding the implementation of this volume discovery: wouldn't it be better to use a JSON file in the dataDir to store the in-memory list each time changes are made to it? Then we can restore all information.

For example, when testing the upgrade that moves snapshots from v1beta1 to v1, the driver will be restarted. So we need to handle driver restarts.

I'm open to different ways of implementing this.

So you needed that for some other purpose. Now it makes more sense to me.

I started implementing this, but won't have time to finish it. If someone else wants to take a stab at it, feel free.

IMHO the code which defines the Volume and Snapshot structs and the corresponding maps should be in its own package, with get/add/update/delete functions that have to be used to make changes. Currently it's in the same package, and I already found code which directly manipulates the maps instead of using the existing helper functions.

Then in that package, we can dump the entire struct into one file to store volumes and snapshots. That removes a whole lot of custom code, including the one which we discuss here. Combine that with a configurable data dir and we can add unit tests for this...

@fengzixu can you take a stab at this?

Sure. Let me try

b54c1ba Merge pull request kubernetes-csi#246 from xing-yang/go_1.21
5436c81 Change go version to 1.21.5
267b40e Merge pull request kubernetes-csi#244 from carlory/sig-storage
b42e5a2 nominate self (carlory) as kubernetes-csi reviewer
a17f536 Merge pull request kubernetes-csi#210 from sunnylovestiramisu/sidecar
011033d Use set -x instead of die
5deaf66 Add wrapper script for sidecar release

git-subtree-dir: release-tools
git-subtree-split: b54c1ba

Add wrapper script for sidecar release

k8s-ci-robot added the do-not-merge/work-in-progress, release-note-none, and kind/feature labels Oct 18, 2020

k8s-ci-robot requested review from jsafrane and saad-ali October 18, 2020 23:50

k8s-ci-robot added the needs-rebase, cncf-cla: yes, and size/XXL labels Oct 18, 2020

fengzixu force-pushed the master branch from 3463581 to 53c6c3e Compare October 18, 2020 23:52

k8s-ci-robot removed the needs-rebase label Oct 18, 2020

k8s-ci-robot assigned xing-yang Oct 19, 2020

irbekrm mentioned this pull request Oct 19, 2020

Add e2e tests for external health monitor controller kubernetes-csi/external-health-monitor#34

Closed

fengzixu force-pushed the master branch from 53c6c3e to c91dc14 Compare November 7, 2020 02:33

fengzixu changed the title ~~[WIP]feature: support external-heath-monitor~~ feature: support external-heath-monitor Nov 7, 2020

k8s-ci-robot removed the do-not-merge/work-in-progress label Nov 7, 2020

xing-yang reviewed Nov 7, 2020

k8s-ci-robot added the needs-rebase and release-note labels and removed the release-note-none label Nov 15, 2020

fengzixu changed the title ~~feature: support external-heath-monitor~~ feature: support external-health-monitor Nov 15, 2020

fengzixu force-pushed the master branch from efd6457 to 40b9b25 Compare November 15, 2020 08:32

k8s-ci-robot removed the needs-rebase label Nov 15, 2020

fengzixu force-pushed the master branch from 4022ae5 to f8512c0 Compare November 15, 2020 08:49

xing-yang reviewed Nov 15, 2020

pohly mentioned this pull request Jan 11, 2021

Update release tool #234

Merged

fengzixu mentioned this pull request Jan 20, 2021

fix: fix a bug of building csi-sanity kubernetes-csi/csi-release-tools#128

Merged

feature: support external-health-monitor

9e7831d

fengzixu force-pushed the master branch from 5e4fef5 to 9e7831d Compare January 20, 2021 11:25

k8s-ci-robot added the lgtm label Jan 22, 2021

k8s-ci-robot added the approved label Jan 25, 2021

k8s-ci-robot merged commit 774a959 into kubernetes-csi:master Jan 26, 2021

pohly mentioned this pull request Jan 27, 2021

update images, add Kubernetes 1.20, prepare for v1.5.0 release #238

Merged

pohly mentioned this pull request Apr 1, 2021

storage e2e: automate hostpath YAML updates, update sidecars but not driver kubernetes/kubernetes#100637

Merged

pohly reviewed Apr 1, 2021

pohly mentioned this pull request Apr 7, 2021

Data lost after reboot #251

Closed

pohly mentioned this pull request Apr 20, 2021

dump and restore internal state #277

Merged

pohly mentioned this pull request Jun 4, 2021

Failing tests: [sig-storage] CSI Volumes [Driver: csi-hostpath] * kubernetes/kubernetes#102452

Closed

TerryHowe pushed a commit to TerryHowe/csi-driver-host-path that referenced this pull request Oct 17, 2024

Merge pull request kubernetes-csi#210 from sunnylovestiramisu/sidecar

a17f536

Add wrapper script for sidecar release
