Windows CSI Node DaemonSet is not Running/ErrImagePull #1001
Comments
Sorry, we checked in the node daemonset without actually releasing the image for it. I added a disclaimer here that it's in a pre-release state: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/tree/master/examples/kubernetes/windows#windows but ideally the chart shouldn't offer the option to install something that doesn't even exist yet. #957 should fix it; after that I will make a release for the 1.2.1 manifest list that contains the Windows image. /assign |
Hi, thanks for this info. We saw the option in the Helm chart and didn't see anything in the README.md saying that it is not finished yet. Now it's clear, and we will wait for the new release. |
@wongma7 Hi, you released a new container image that supports Windows and Linux (#957), but does it really make sense to put Linux and Windows layers in a single container image that supports multiple OSes? |
@dschunack The container runtime (Docker) should only pull the image that corresponds to the OS your container is running on. Even though ECR or Docker Hub says the image is 2+ GB, your Linux nodes won't bother to pull the 2 GB Windows images, so there's no need to worry. Basically, the tag now refers to a manifest list or "fat manifest", which links to multiple images. Docker knows how to pull the image for your OS and ignore the other ones. https://docs.docker.com/registry/spec/manifest-v2-2/ For reference: https://github.com/moby/moby/blob/7b9275c0da707b030e62c96b679a976f31f929d3/distribution/pull_v2_windows.go#L68 |
@wongma7 It looks like 1.2.1 was released, but Windows nodes still can't pull the image. |
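To make the manifest list point concrete, here is a minimal Python sketch of the selection step a runtime performs. It is not Docker's actual code: the manifest list below only follows the general shape described in the manifest v2.2 spec linked above, the digests and sizes are made up, and `select_manifest` is a hypothetical helper name.

```python
import json

# Hypothetical manifest list ("fat manifest"). The field names follow the
# Docker image manifest v2.2 spec; digests and sizes are invented.
MANIFEST_LIST = json.loads("""
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.docker.distribution.manifest.list.v2+json",
  "manifests": [
    {"mediaType": "application/vnd.docker.distribution.manifest.v2+json",
     "size": 1234, "digest": "sha256:aaaa",
     "platform": {"architecture": "amd64", "os": "linux"}},
    {"mediaType": "application/vnd.docker.distribution.manifest.v2+json",
     "size": 5678, "digest": "sha256:bbbb",
     "platform": {"architecture": "amd64", "os": "windows"}}
  ]
}
""")


def select_manifest(manifest_list, os_name, arch):
    """Return the entry whose platform matches the node, or None if absent."""
    for entry in manifest_list["manifests"]:
        platform = entry["platform"]
        if platform["os"] == os_name and platform["architecture"] == arch:
            return entry
    return None


# A Linux node only resolves (and pulls) the Linux image.
linux_entry = select_manifest(MANIFEST_LIST, "linux", "amd64")
print(linux_entry["digest"])
```

In this sketch a `None` result stands for a tag that has no entry for the node's platform, which would plausibly surface as an image pull failure on that node.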
v1.3.0 in GCR (but not ECR) should have a windows image now. |
@wongma7 I updated my release to Helm chart 2.3.0, and I can see the Windows image was pulled successfully on the node:
But the Pod is in Error/CrashLoopBackOff state. The ebs-plugin container fails:
|
@iusergii Is csi-proxy installed on the node? https://github.com/kubernetes-csi/csi-proxy Also check out https://github.com/kubernetes-sigs/aws-ebs-csi-driver/tree/master/examples/kubernetes/windows |
@wongma7 I haven't. Sorry, my bad. |
@wongma7 I was facing same issue, updated driver to
|
@vmasule Can you share the logs from the CSI driver on the same node as the example, like this:
kubectl get po -n kube-system --field-selector spec.nodeName=ip-192-168-47-235.us-west-2.compute.internal --selector app=ebs-csi-node
kubectl logs ebs-csi-node-windows-vdvrl -n kube-system ebs-plugin |
@wongma7 Thanks for the speedy response. The following is what I see in the logs. I hope this helps.
|
@wongma7 One thing I forgot to mention is that I followed all the steps except building Could that version of CSI Proxy be the cause of the above error? |
Yes, it could be an issue: versions older than v1.0.0 had some bugs in discovering devices/volumes on EC2 instances. What version is it? If possible, check the logs from csi-proxy itself. For example, in the installation instructions here https://github.com/kubernetes-csi/csi-proxy#installation the logs are emitted to "\etc\kubernetes\logs\csi-proxy.log" |
BTW, let's try to figure out why the build isn't working in https://github.com/kubernetes-csi/csi-proxy/issues. Since the binary isn't being distributed yet (kubernetes-csi/csi-proxy#83), the build needs to work for everyone (go build with GOARCH and GOOS should work even if you are on an ARM Mac). |
@wongma7 Sorry for the confusion. The CSI-Proxy version is actually not old; my colleague already compiled and installed the latest version. Below are the logs from the EC2 instance where the Pod was scheduled. They also show one error that is related to the previous error from the EBS driver.
Any idea what may be wrong here? Please provide guidance. |
Does the driver have any earlier logs? If possible, could you bump its verbosity to at least --v=5 and restart it? https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/charts/aws-ebs-csi-driver/templates/node-windows.yaml#L68 This error from csi-proxy "failed GetVolumeIDFromTargetPath: error getting the volume for the mount" is expected under certain conditions. This happens when the driver checks whether something has already been mounted at |
OK, I suspect this is a recently introduced bug: we now check for the existence of a mount at the mount point ("c:\var\lib\kubelet\pods\d38a9777-330f-49ea-9316-a27cc295140a\volumes\kubernetes.io~csi\pvc-671cb275-d9b8-4682-84dd-34212d7a6997\mount"), but if it does not exist (and on Windows it is basically guaranteed not to exist, since "mount" means creating a symlink), we treat that as an error instead of ignoring it. I am now working to reproduce this and will get a bugfix out if this is the reason. |
Thanks @wongma7. For your information, I am using the Helm chart below. Repo: https://kubernetes-sigs.github.io/aws-ebs-csi-driver I hope this helps. Apart from this, I have also seen, and just now reproduced, the intermittent error below:
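The suspected bug pattern can be sketched in Python as follows. This is only an illustration of "treat a missing mount point as not mounted instead of as an error"; it is not the driver's Go code, and both function names are hypothetical. A symlink stands in for a Windows "mount" here, as described in the comment above.

```python
import os
import stat


def is_mounted_strict(target):
    """Buggy pattern: a missing path raises instead of meaning 'not mounted'."""
    # os.lstat raises FileNotFoundError when the path does not exist.
    return stat.S_ISLNK(os.lstat(target).st_mode)


def is_mounted_tolerant(target):
    """Fixed pattern: a missing path simply means nothing is mounted yet."""
    try:
        mode = os.lstat(target).st_mode
    except FileNotFoundError:
        return False
    return stat.S_ISLNK(mode)
```

With the tolerant variant, the first publish of a volume proceeds to create the link, while the strict variant fails before the link could ever exist.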
When I describe the node I can see those topology labels already present on that node. |
The node topology label issue sounds similar to #1030. We have not solved that issue, but we can continue the discussion there. Thanks for the information; I will comment again when I have the results of my repro attempt. |
OK, I reproduced the issue and submitted a fix here #1081. I intend to release v1.3.2 after it merges. Apologies for the trouble, I will comment again when the release is done. |
Thank you @wongma7, I must say the speed with which the fix was provided really surprised me for an open source community. I will be watching for the release notification. |
Hi @wongma7 I used the 2 month old docker image tag(
And below is the csi_proxy log on that windows node.
|
It looks like a different issue. Would it be possible to retrieve what the partitions on that volume look like? For example, if you can access PowerShell on the node that the csi-proxy logs are from:
Get-Disk -Number 7 | Get-Partition
1 17408 15.98 MB Reserved
The workflow is slightly different between the first time the volume is used and the second: the first time, the driver must format the volume, but the second time, the driver must discover the partitions on the volume. So it is possible there is another bug in the discovery, depending on how your disk looks. |
Actually, I terminated the node a while ago, but I will reproduce it tomorrow and update here. |
Hi @wongma7, I reproduced it as I promised yesterday; this is what it looks like.
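The two workflows can be sketched roughly as below. This is an assumption-laden illustration, not the driver's logic: it assumes a volume counts as "already used" once it has any partition other than a Reserved one, and the dictionary fields and the `plan_volume_action` name are hypothetical.

```python
def plan_volume_action(partitions):
    """Decide between the first-use and reuse workflows for a volume.

    `partitions` is a list of dicts with a "type" key, loosely modeled on
    the rows printed by Get-Partition (hypothetical structure).
    """
    # Assumption: Reserved partitions hold no user data, so ignore them.
    data_partitions = [p for p in partitions if p["type"] != "Reserved"]
    if not data_partitions:
        return "format"    # first use: the volume must be formatted
    return "discover"      # reuse: existing partitions must be discovered


# A disk that only shows the Reserved row from the example output.
only_reserved = [{"number": 1, "offset": 17408, "type": "Reserved"}]
print(plan_volume_action(only_reserved))
```

A bug in the "discover" branch would only show up on the second and later uses of a volume, which matches the kind of failure being asked about here.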
|
@vmasule do you have access to corresponding logs from the csi driver on that node? |
I think it was the same as what I shared yesterday, except for the disk number, which is now 8 instead of 7. |
OK, I thought there were 2 different errors, so I would expect the logs to be different, but we can focus on the original one because that is from the latest version of the driver.
Error 1, from driver v1.3.0: "file does not exist"
For this error, I think it is fixed by #1081. If you are brave, you may try the unstable/development version of the driver containing this fix at
Error 2, from driver 8c6c7e0: "volume id empty"
For this error, I may need the CSI node logs to debug further, as volume 8 looks fine to me from the above output, so I am not sure why the driver was not able to find it. |
@wongma7, I can confirm that Error 1 above is now fixed. I used the public docker repo instead of
But I have reproduced Error 2 again, and the logs are below.
Events from the application Pod:
Logs and partition details from CSI Node
[Edit] Logs from CSI Driver on windows node(after changing log level to
How to reproduce? Let me know if you need anything else. This is really failing our Single AZ SLA for the application. Your quick opinion on this will be highly appreciated. Thanks. |
@vmasule Thank you for the logs. Unfortunately, I couldn't reproduce the issue. For reference, I am using this AMI: amazon/Windows_Server-2019-English-Full-EKS_Optimized-1.20-2021.09.16. Are you draining the node before you terminate the EC2 instance? Also, if possible, could you get the output of a command like this:
It is erroring while trying to get the volumes here: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/pkg/mounter/safe_mounter_windows.go#L263 BTW, I will close this issue and open another since we'll release v1.4.0 to address |
@wongma7, I am definitely not draining the node before termination. I terminated the node manually from the AWS Console, and in certain scenarios it is quite possible for a node to be terminated at any time without enough time left to drain it. Sure, you can close this ticket. When should we expect the release for Error 1? I will provide the logs for Error 2 tomorrow and will post them in the new ticket. Thanks for your help. |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /close |
@k8s-triage-robot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/kind bug
What happened?
The Windows CSI Node DaemonSet is not running because the image is wrong: it has no support for windows/amd64.
It looks to me like it was not tested on EKS Windows nodes.
What you expected to happen?
Running Windows CSI Node DaemonSet
How to reproduce it (as minimally and precisely as possible)?
Activate windows support in helm chart and spin up a Windows Node.
https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/charts/aws-ebs-csi-driver/values.yaml#L133
Anything else we need to know?:
Environment
Kubernetes version (kubectl version): 1.20