WaitForFirstConsumer PD stuck in pending #241

Closed
gregwebs opened this issue Dec 18, 2018 · 11 comments · Fixed by #248
Labels
type/bug Something isn't working

Comments

@gregwebs
Contributor

In gke-storage.yaml I add:

volumeBindingMode: WaitForFirstConsumer
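
For context, a sketch of what the full StorageClass might look like with this line added. Apart from the added volumeBindingMode and the class name pd-ssd (visible in the PVC output below), the field values here are assumptions, not the actual contents of gke-storage.yaml:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: pd-ssd                  # matches the STORAGECLASS shown in the PVC output below
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd
volumeBindingMode: WaitForFirstConsumer   # the added line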

I create a new Kubernetes cluster and deploy TiDB.

$ kubectl get pod -n tidb1

NAME                              READY   STATUS      RESTARTS   AGE
demo-monitor-5988ddc86c-mf2nz     2/2     Running     0          8m
demo-monitor-configurator-xgrd2   0/1     Completed   0          8m
demo-pd-0                         0/1     Pending     0          8m
demo-tidb-initializer-qfbgq       1/1     Running     0          8m
$ kubectl get pvc -n tidb1
NAME           STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
pd-demo-pd-0   Pending                                      pd-ssd         8m

There are no PVs. The pod shows a warning:

$ kubectl describe -n tidb1 pod demo-pd-0
...
Events:
  Type     Reason            Age                   From            Message
  ----     ------            ----                  ----            -------
  Warning  FailedScheduling  2m26s (x37 over 12m)  tidb-scheduler  0/3 nodes are available: 3 node(s) didn't find available persistent volumes to bind.
@gregwebs
Contributor Author

I think this issue was pointed out here but never addressed.

@tennix
Member

tennix commented Dec 19, 2018

We've removed the WaitForFirstConsumer volume binding mode for GKE pd-ssd in #130, and it works for the tutorial.

@gregwebs
Contributor Author

The tutorial is single-AZ. This issue is about a multi-AZ deployment, which doesn't seem to work.

@tennix
Member

tennix commented Dec 20, 2018

For GCE persistent disks, the volume binding mode should not be set to WaitForFirstConsumer; otherwise the PVC stays in Pending. This applies to both single-AZ and multi-AZ. I've tested that after removing the WaitForFirstConsumer volume binding mode, the PV is created and the pod is scheduled correctly. I also noted that for a multi-AZ deployment, the StatefulSet schedules the pods across all the AZs.
So I think this is not a problem for multi-AZ deployment. You should remove the WaitForFirstConsumer binding mode from the storage class.

@gregwebs
Contributor Author

gregwebs commented Dec 20, 2018

Without WaitForFirstConsumer:

$ kubectl describe pod -n tidb1 demo-pd-1 | grep Warning
  Warning  FailedScheduling    2m14s (x3 over 2m17s)  tidb-scheduler                                pod has unbound PersistentVolumeClaims (repeated 2 times)
  Warning  FailedAttachVolume  55s (x8 over 2m9s)     attachdetach-controller                       AttachVolume.Attach failed for volume "pvc-730383e6-0480-11e9-a0f7-42010a8a0098" : GCE persistent disk not found: diskName="gke-beta-3f873039-dyna-pvc-730383e6-0480-11e9-a0f7-42010a8a0098" zone="us-west1-a"
  Warning  FailedMount         7s                     kubelet, gke-beta-default-pool-1363b1c3-vh39  Unable to mount volumes for pod "demo-pd-1_tidb1(730532bc-0480-11e9-a0f7-42010a8a0098)": timeout expired waiting for volumes to attach or mount for pod "tidb1"/"demo-pd-1". list of unmounted volumes=[pd]. list of unattached volumes=[pd annotations config startup-script default-token-pktgf]

$ gcloud compute disks list | grep gke-beta-3f873039-dyna-pvc-730383e6-0480-11e9-a0f7-42010a8a0098
gke-beta-3f873039-dyna-pvc-730383e6-0480-11e9-a0f7-42010a8a0098  us-west1-c  2        pd-ssd       READY

So here we see the disk is in us-west1-c, but the attach is attempted in us-west1-a, where the pod was scheduled.

@tennix
Member

tennix commented Dec 21, 2018

Oh, this seems to be the same issue as #180. It is fixed in Kubernetes 1.12.

@tennix
Member

tennix commented Dec 21, 2018

The latest GKE cluster version is Kubernetes 1.11.5. To confirm it's fixed in 1.12, we should bring up a 1.12 cluster with the kube-up script and test the multi-AZ deployment.

@weekface
Contributor

https://kubernetes.io/docs/setup/multiple-zones/#volume-limitations

The following limitations are addressed with topology-aware volume binding.

  • StatefulSet volume zone spreading when using dynamic provisioning is currently not compatible with pod affinity or anti-affinity policies.
  • If the name of the StatefulSet contains dashes (“-”), volume zone spreading may not provide a uniform distribution of storage across zones.
  • When specifying multiple PVCs in a Deployment or Pod spec, the StorageClass needs to be configured for a specific single zone, or the PVs need to be statically provisioned in a specific zone. Another workaround is to use a StatefulSet, which will ensure that all the volumes for a replica are provisioned in the same zone.

There are many other limitations with StatefulSet and PV.

@weekface
Contributor

My mistake, these limitations are addressed.

The following limitations are addressed with topology-aware volume binding.

@gregwebs gregwebs added the type/bug Something isn't working label Dec 21, 2018
@tennix
Member

tennix commented Dec 22, 2018

@gregwebs I've confirmed that this is a bug in our scheduler extender. On GKE the latest Kubernetes version right now is 1.11.5, and the WaitForFirstConsumer volume binding mode cannot work correctly, as the documentation says.
However, with the Immediate volume binding mode, normal pods can be scheduled correctly. tidb-operator uses a scheduler extender for the pd/tikv/tidb pods, and with it both the Immediate and WaitForFirstConsumer binding modes fail to schedule the pods. After changing the scheduler to default-scheduler, the pods are scheduled correctly.
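
For reference, which scheduler handles a pod is chosen via spec.schedulerName in the pod template, so the diagnostic above amounts to switching that field back to default-scheduler. A minimal sketch; the names, labels, and image below are illustrative, not tidb-operator's actual manifests:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: demo-pd                 # illustrative name
spec:
  serviceName: demo-pd
  replicas: 3
  selector:
    matchLabels:
      app: pd
  template:
    metadata:
      labels:
        app: pd
    spec:
      schedulerName: default-scheduler   # switched back from the tidb-scheduler extender to isolate it
      containers:
      - name: pd
        image: pingcap/pd               # illustrative image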

@weekface
Contributor

Our extended scheduler's kube-scheduler policy config lacks a predicate: NoVolumeZoneConflict.

It works when I add it to the list.

The kube-scheduler default config is:

Creating scheduler with fit predicates 'map[NoVolumeZoneConflict:{} MaxEBSVolumeCount:{} MaxAzureDiskVolumeCount:{} NoDiskConflict:{} GeneralPredicates:{} PodToleratesNodeTaints:{} CheckVolumeBinding:{} MaxGCEPDVolumeCount:{} MatchInterPodAffinity:{} CheckNodeMemoryPressure:{} CheckNodeDiskPressure:{} CheckNodePIDPressure:{} CheckNodeCondition:{}]' and priority functions 'map[SelectorSpreadPriority:{} InterPodAffinityPriority:{} LeastRequestedPriority:{} BalancedResourceAllocation:{} NodePreferAvoidPodsPriority:{} NodeAffinityPriority:{} TaintTolerationPriority:{}]'

We should change our policy config to this.
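
A sketch of what the updated kube-scheduler Policy file could look like, with NoVolumeZoneConflict included and the predicate/priority names mirroring the defaults logged above. The priority weights are assumptions (the usual defaults as far as I recall), and tidb-operator's real policy config also wires in the tidb-scheduler extender, which is omitted here:

{
  "kind": "Policy",
  "apiVersion": "v1",
  "predicates": [
    {"name": "NoVolumeZoneConflict"},
    {"name": "MaxEBSVolumeCount"},
    {"name": "MaxAzureDiskVolumeCount"},
    {"name": "MaxGCEPDVolumeCount"},
    {"name": "NoDiskConflict"},
    {"name": "GeneralPredicates"},
    {"name": "PodToleratesNodeTaints"},
    {"name": "CheckVolumeBinding"},
    {"name": "MatchInterPodAffinity"},
    {"name": "CheckNodeMemoryPressure"},
    {"name": "CheckNodeDiskPressure"},
    {"name": "CheckNodePIDPressure"},
    {"name": "CheckNodeCondition"}
  ],
  "priorities": [
    {"name": "SelectorSpreadPriority", "weight": 1},
    {"name": "InterPodAffinityPriority", "weight": 1},
    {"name": "LeastRequestedPriority", "weight": 1},
    {"name": "BalancedResourceAllocation", "weight": 1},
    {"name": "NodePreferAvoidPodsPriority", "weight": 10000},
    {"name": "NodeAffinityPriority", "weight": 1},
    {"name": "TaintTolerationPriority", "weight": 1}
  ]
}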
