[WIP] Deploy CSI driver as a deployment and support multi master & leader e… #72

Closed

wants to merge 1 commit into from

Conversation

@chethanv28 (Collaborator) commented Oct 2, 2019

What this PR does / why we need it:
This change enables leader election for the syncer by using the implementation provided by csi-lib-utils (https://github.com/kubernetes-csi/external-provisioner/blob/master//vendor/github.com/kubernetes-csi/csi-lib-utils/leaderelection/leader_election.go).
It also includes RBAC and deployment YAML changes to enable leader election for vsphere-syncer, external-provisioner, and external-attacher.
With this change, HA is enabled for the CSI driver.
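
For context, here is a minimal sketch of how the syncer's main function wires into that library, based on the diff hunks discussed in the review below; the lock name, imports, and client construction shown here are assumptions rather than the exact PR code:

package main

import (
	"context"
	"os"

	"github.com/kubernetes-csi/csi-lib-utils/leaderelection"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/klog"
)

func main() {
	// run starts the metadata syncer; the leader-election library calls it
	// once this replica acquires the lock.
	run := func(ctx context.Context) {
		// ... start the syncer ...
	}

	// The PR uses the project's own k8s.NewClient() helper; a plain in-cluster
	// client-go client is shown here instead.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		klog.Errorf("Failed to get in-cluster config: %v", err)
		os.Exit(1)
	}
	k8sClient := kubernetes.NewForConfigOrDie(cfg)

	// "vsphere-syncer" is an assumed lock name; the diff refers to it as lockName.
	le := leaderelection.NewLeaderElection(k8sClient, "vsphere-syncer", run)
	if err := le.Run(); err != nil {
		klog.Fatalf("Error initializing leader election: %v", err)
	}
}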

Special notes for your reviewer:

Release note:

Enable leader election and HA for the vSphere CSI driver

Test Results (Sanity):

$ make test-e2e
hack/run-e2e-test.sh
go: finding github.com/onsi/ginkgo/ginkgo latest
Oct  2 12:32:14.586: INFO: The --provider flag is not set. Continuing as if --provider=skeleton had been used.
Running Suite: CNS CSI Driver End-to-End Tests
==============================================
Random Seed: 1570044718 - Will randomize all specs
Will run 2 of 42 specs

SSSSSSSSSSSSS
------------------------------
[csi-block-e2e-1] Datastore Based Volume Provisioning With No Storage Policy 
  Verify dynamic provisioning of PV fails with user specified non-shared datastore and no storage policy specified in the storage class
  /Users/chethanv/Downloads/csi-external/vsphere-csi-driver/tests/e2e/vsphere_shared_datastore.go:82
[BeforeEach] [csi-block-e2e-1] Datastore Based Volume Provisioning With No Storage Policy
  /Users/chethanv/go/pkg/mod/k8s.io/kubernetes@v1.14.2/test/e2e/framework/framework.go:149
STEP: Creating a kubernetes client
Oct  2 12:32:14.587: INFO: >>> kubeConfig: /Users/chethanv/.kube/config
STEP: Building a namespace api object, basename e2e-vsphere-volume-provisioning-no-storage-policy
Oct  2 12:32:14.677: INFO: Found PodSecurityPolicies; assuming PodSecurityPolicy is enabled.
Oct  2 12:32:14.832: INFO: Found ClusterRoles; assuming RBAC is enabled.
STEP: Binding the e2e-test-privileged-psp PodSecurityPolicy to the default service account in e2e-vsphere-volume-provisioning-no-storage-policy-6359
STEP: Waiting for a default service account to be provisioned in namespace
[BeforeEach] [csi-block-e2e-1] Datastore Based Volume Provisioning With No Storage Policy
  /Users/chethanv/Downloads/csi-external/vsphere-csi-driver/tests/e2e/vsphere_shared_datastore.go:56
[It] Verify dynamic provisioning of PV fails with user specified non-shared datastore and no storage policy specified in the storage class
  /Users/chethanv/Downloads/csi-external/vsphere-csi-driver/tests/e2e/vsphere_shared_datastore.go:82
STEP: Invoking Test for user specified non-shared Datastore in storage class for volume provisioning
STEP: Creating StorageClass With scParameters: map[DatastoreURL:ds:///vmfs/volumes/5d8d2c33-aa1809eb-c8cd-0200027eba1c/] and allowedTopologies: []
STEP: Creating StorageClass With scParameters: map[DatastoreURL:ds:///vmfs/volumes/5d8d2c33-aa1809eb-c8cd-0200027eba1c/] and allowedTopologies: [] and ReclaimPolicy: 
STEP: Creating PVC using the Storage Class sc-ss95g with disk size  and labels: map[]
STEP: Expect claim to fail provisioning volume on non shared datastore
Oct  2 12:32:15.073: INFO: Waiting up to 30s for PersistentVolumeClaims [pvc-b4vzt] to have phase Bound
Oct  2 12:32:15.100: INFO: PersistentVolumeClaim pvc-b4vzt found but phase is Pending instead of Bound.
Oct  2 12:32:17.109: INFO: PersistentVolumeClaim pvc-b4vzt found but phase is Pending instead of Bound.
Oct  2 12:32:19.122: INFO: PersistentVolumeClaim pvc-b4vzt found but phase is Pending instead of Bound.
Oct  2 12:32:21.146: INFO: PersistentVolumeClaim pvc-b4vzt found but phase is Pending instead of Bound.
Oct  2 12:32:23.157: INFO: PersistentVolumeClaim pvc-b4vzt found but phase is Pending instead of Bound.
Oct  2 12:32:25.166: INFO: PersistentVolumeClaim pvc-b4vzt found but phase is Pending instead of Bound.
Oct  2 12:32:27.181: INFO: PersistentVolumeClaim pvc-b4vzt found but phase is Pending instead of Bound.
Oct  2 12:32:29.199: INFO: PersistentVolumeClaim pvc-b4vzt found but phase is Pending instead of Bound.
Oct  2 12:32:31.217: INFO: PersistentVolumeClaim pvc-b4vzt found but phase is Pending instead of Bound.
Oct  2 12:32:33.235: INFO: PersistentVolumeClaim pvc-b4vzt found but phase is Pending instead of Bound.
Oct  2 12:32:35.255: INFO: PersistentVolumeClaim pvc-b4vzt found but phase is Pending instead of Bound.
Oct  2 12:32:37.266: INFO: PersistentVolumeClaim pvc-b4vzt found but phase is Pending instead of Bound.
Oct  2 12:32:39.281: INFO: PersistentVolumeClaim pvc-b4vzt found but phase is Pending instead of Bound.
Oct  2 12:32:41.297: INFO: PersistentVolumeClaim pvc-b4vzt found but phase is Pending instead of Bound.
Oct  2 12:32:43.308: INFO: PersistentVolumeClaim pvc-b4vzt found but phase is Pending instead of Bound.
Actual failure message: "failed to provision volume with StorageClass \"sc-ss95g\": rpc error: code = Internal desc = Failed to create volume. Error: DatastoreURL: ds:///vmfs/volumes/5d8d2c33-aa1809eb-c8cd-0200027eba1c/ specified in the storage class is not found."
Expected failure message: "failed to provision volume with StorageClass \"sc-ss95g\""
Oct  2 12:32:45.327: INFO: Deleting PersistentVolumeClaim "pvc-b4vzt"
[AfterEach] [csi-block-e2e-1] Datastore Based Volume Provisioning With No Storage Policy
  /Users/chethanv/go/pkg/mod/k8s.io/kubernetes@v1.14.2/test/e2e/framework/framework.go:150
Oct  2 12:32:45.376: INFO: Waiting up to 3m0s for all (but 0) nodes to be ready
STEP: Destroying namespace "e2e-vsphere-volume-provisioning-no-storage-policy-6359" for this suite.
Oct  2 12:32:51.423: INFO: Waiting up to 30s for server preferred namespaced resources to be successfully discovered
Oct  2 12:32:51.679: INFO: namespace e2e-vsphere-volume-provisioning-no-storage-policy-6359 deletion completed in 6.287813192s

• [SLOW TEST:37.092 seconds]
[csi-block-e2e-1] Datastore Based Volume Provisioning With No Storage Policy
/Users/chethanv/Downloads/csi-external/vsphere-csi-driver/tests/e2e/vsphere_shared_datastore.go:47
  Verify dynamic provisioning of PV fails with user specified non-shared datastore and no storage policy specified in the storage class
  /Users/chethanv/Downloads/csi-external/vsphere-csi-driver/tests/e2e/vsphere_shared_datastore.go:82
------------------------------
SSS
------------------------------
[csi-block-e2e-1] Datastore Based Volume Provisioning With No Storage Policy 
  Verify dynamic provisioning of PV passes with user specified shared datastore and no storage policy specified in the storage class
  /Users/chethanv/Downloads/csi-external/vsphere-csi-driver/tests/e2e/vsphere_shared_datastore.go:67
[BeforeEach] [csi-block-e2e-1] Datastore Based Volume Provisioning With No Storage Policy
  /Users/chethanv/go/pkg/mod/k8s.io/kubernetes@v1.14.2/test/e2e/framework/framework.go:149
STEP: Creating a kubernetes client
Oct  2 12:32:51.680: INFO: >>> kubeConfig: /Users/chethanv/.kube/config
STEP: Building a namespace api object, basename e2e-vsphere-volume-provisioning-no-storage-policy
STEP: Binding the e2e-test-privileged-psp PodSecurityPolicy to the default service account in e2e-vsphere-volume-provisioning-no-storage-policy-6733
STEP: Waiting for a default service account to be provisioned in namespace
[BeforeEach] [csi-block-e2e-1] Datastore Based Volume Provisioning With No Storage Policy
  /Users/chethanv/Downloads/csi-external/vsphere-csi-driver/tests/e2e/vsphere_shared_datastore.go:56
[It] Verify dynamic provisioning of PV passes with user specified shared datastore and no storage policy specified in the storage class
  /Users/chethanv/Downloads/csi-external/vsphere-csi-driver/tests/e2e/vsphere_shared_datastore.go:67
STEP: Invoking Test for user specified Shared Datastore in Storage class for volume provisioning
STEP: Creating StorageClass With scParameters: map[DatastoreURL:ds:///vmfs/volumes/vsan:520b365f4aaebc4b-87e27fcb38c60f51/] and allowedTopologies: []
STEP: Creating StorageClass With scParameters: map[DatastoreURL:ds:///vmfs/volumes/vsan:520b365f4aaebc4b-87e27fcb38c60f51/] and allowedTopologies: [] and ReclaimPolicy: 
STEP: Creating PVC using the Storage Class sc-5nxdp with disk size  and labels: map[]
STEP: Expect claim to pass provisioning volume as shared datastore
Oct  2 12:32:51.878: INFO: Waiting up to 1m0s for PersistentVolumeClaims [pvc-jvhdf] to have phase Bound
Oct  2 12:32:51.892: INFO: PersistentVolumeClaim pvc-jvhdf found but phase is Pending instead of Bound.
Oct  2 12:32:53.902: INFO: PersistentVolumeClaim pvc-jvhdf found but phase is Pending instead of Bound.
Oct  2 12:32:55.918: INFO: PersistentVolumeClaim pvc-jvhdf found but phase is Pending instead of Bound.
Oct  2 12:32:57.929: INFO: PersistentVolumeClaim pvc-jvhdf found and phase=Bound (6.050973955s)
Oct  2 12:32:57.929: INFO: Deleting PersistentVolumeClaim "pvc-jvhdf"
[AfterEach] [csi-block-e2e-1] Datastore Based Volume Provisioning With No Storage Policy
  /Users/chethanv/go/pkg/mod/k8s.io/kubernetes@v1.14.2/test/e2e/framework/framework.go:150
Oct  2 12:32:57.969: INFO: Waiting up to 3m0s for all (but 0) nodes to be ready
STEP: Destroying namespace "e2e-vsphere-volume-provisioning-no-storage-policy-6733" for this suite.
Oct  2 12:33:04.015: INFO: Waiting up to 30s for server preferred namespaced resources to be successfully discovered
Oct  2 12:33:04.341: INFO: namespace e2e-vsphere-volume-provisioning-no-storage-policy-6733 deletion completed in 6.359017404s

• [SLOW TEST:12.661 seconds]
[csi-block-e2e-1] Datastore Based Volume Provisioning With No Storage Policy
/Users/chethanv/Downloads/csi-external/vsphere-csi-driver/tests/e2e/vsphere_shared_datastore.go:47
  Verify dynamic provisioning of PV passes with user specified shared datastore and no storage policy specified in the storage class
  /Users/chethanv/Downloads/csi-external/vsphere-csi-driver/tests/e2e/vsphere_shared_datastore.go:67
------------------------------
SSSSSSSSSSSSSSSSSSSSSSSS
Ran 2 of 42 Specs in 49.754 seconds
SUCCESS! -- 2 Passed | 0 Failed | 0 Pending | 40 Skipped
PASS

Ginkgo ran 1 suite in 1m5.832595331s
Test Suite Passed

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Oct 2, 2019
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: chethanv28
To complete the pull request process, please assign frapposelli
You can assign the PR to them by writing /assign @frapposelli in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 2, 2019
@k8s-ci-robot (Contributor)

Hi @chethanv28. Thanks for your PR.

I'm waiting for a kubernetes-sigs or kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Oct 2, 2019
@chethanv28 chethanv28 changed the title Deploy CSI driver as a deployment and support multi master & leader e… [WIP] Deploy CSI driver as a deployment and support multi master & leader e… Oct 2, 2019
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 2, 2019
@@ -29,9 +26,11 @@ spec:
- name: csi-attacher
image: quay.io/k8scsi/csi-attacher:v1.1.1
args:
- "--v=4"
- "--v=5"
Member

Log level 4 should be sufficient for the attacher and provisioner containers.

replicas: 1
updateStrategy:
Member

updateStrategy -> RollingUpdate will be useful later for upgrade. Can you keep that?

@chethanv28 chethanv28 changed the title [WIP] Deploy CSI driver as a deployment and support multi master & leader e… Deploy CSI driver as a deployment and support multi master & leader e… Oct 2, 2019
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 2, 2019
@chethanv28 chethanv28 changed the title Deploy CSI driver as a deployment and support multi master & leader e… [WIP] Deploy CSI driver as a deployment and support multi master & leader e… Oct 3, 2019
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 3, 2019
@RaunakShah (Contributor)

Shouldn't we add new files at manifests/1.15/ instead of replacing manifests/1.14?

}

if !*enableLeaderElection {
run(context.TODO())

In main() it is generally expected that you will use context.Background() since there is no higher level to provide one. context.TODO() is intended for places where a context should be passed in, but the plumbing hasn't been done yet.
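
A minimal sketch of the suggested change (the surrounding flag handling is assumed):

if !*enableLeaderElection {
	// main() is the root of the call tree, so context.Background() is the
	// natural root context; context.TODO() would signal unfinished plumbing.
	run(context.Background())
}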

} else {
k8sClient, err := k8s.NewClient()
if err != nil {
klog.Errorf("Creating Kubernetes client failed. Err: %v", err)

suggestion: you can use klog.Exitf() instead of Errorf followed by Exit
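
Applied to the hunk above, the suggestion would look roughly like:

k8sClient, err := k8s.NewClient()
if err != nil {
	// klog.Exitf logs at error severity and then exits the process,
	// collapsing the separate Errorf and os.Exit calls into one.
	klog.Exitf("Creating Kubernetes client failed. Err: %v", err)
}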

le := leaderelection.NewLeaderElection(k8sClient, lockName, run)

if err := le.Run(); err != nil {
klog.Fatalf("Error initializing leader election: %v", err)

suggestion: Fatalf() prints a stack trace, and Exitf() does not. Exitf() may be more appropriate here.

klog.Errorf("Error initializing Metadata Syncer")
os.Exit(1)
run := func(ctx context.Context) {
if err := metadataSyncer.Init(); err != nil {

Is metadataSyncer.Init() a blocking call that runs the syncer forever?

Comment on lines +40 to +45
run := func(ctx context.Context) {
if err := metadataSyncer.Init(); err != nil {
klog.Errorf("Error initializing Metadata Syncer")
os.Exit(1)
}
}

The context that gets passed into your run func will be canceled when this process loses leadership. It looks like you don't pass that into metadataSyncer.Init so we will have no way to tear things down if we lose leadership.

In most core k8s components, we react to loss of leadership with os.Exit() so that the container gets restarted cleanly.
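
A hedged sketch of that pattern, assuming metadataSyncer.Init() blocks for the life of the syncer; watching ctx.Done() in a goroutine is one way to react to losing the lock:

run := func(ctx context.Context) {
	go func() {
		// ctx is canceled when this replica loses leadership; exit so the
		// container restarts cleanly instead of running without the lock.
		<-ctx.Done()
		klog.Error("Lost leadership, exiting")
		os.Exit(1)
	}()
	if err := metadataSyncer.Init(); err != nil {
		klog.Errorf("Error initializing Metadata Syncer")
		os.Exit(1)
	}
}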

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 23, 2020
@k8s-ci-robot (Contributor)

@chethanv28: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 23, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 24, 2020
@divyenpatel (Member)

The required change to enable leader election and HA for the vSphere CSI driver is included in #151.

@divyenpatel divyenpatel closed this Apr 7, 2020
Labels
cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
do-not-merge/work-in-progress - Indicates that a PR should not merge because it is a work in progress.
lifecycle/rotten - Denotes an issue or PR that has aged beyond stale and will be auto-closed.
needs-ok-to-test - Indicates a PR that requires an org member to verify it is safe to test.
needs-rebase - Indicates a PR cannot be merged because it has merge conflicts with HEAD.
size/M - Denotes a PR that changes 30-99 lines, ignoring generated files.
6 participants