add Docker Swarm support #376
Conversation
Currently hit a wall: volumes are not created. This might be fixed by moby/swarmkit#3116.
just for linking purposes: this PR relates to #374
@apricote I think this is ready for review. The only thing missing right now is the build pipeline. For this I was wondering whether we should simply reuse the Makefile but wrap it in a GitHub Action. What do you think?
Done :) Thanks for all the work you are putting into this!
That sounds good to me, you need to add it to the workflows.
Co-authored-by: Julian Tölle <julian.toelle97@gmail.com>
@s4ke I will test this tomorrow and if everything looks good I will merge it :)
Awesome. At some point I guess it would make sense to add e2e tests for Swarm as well, but I guess this would be another PR :)
```diff
 func (s *NodeService) NodeStageVolume(ctx context.Context, req *proto.NodeStageVolumeRequest) (*proto.NodeStageVolumeResponse, error) {
-    return nil, status.Error(codes.Unimplemented, "not supported")
+    // while we dont do anything here, Swarm 23.0.1 might require this
+    return &proto.NodeStageVolumeResponse{}, nil
 }

 func (s *NodeService) NodeUnstageVolume(ctx context.Context, req *proto.NodeUnstageVolumeRequest) (*proto.NodeUnstageVolumeResponse, error) {
-    return nil, status.Error(codes.Unimplemented, "not supported")
+    // while we dont do anything here, Swarm 23.0.1 might require this
+    return &proto.NodeUnstageVolumeResponse{}, nil
 }
```
I would avoid implementing a hack for a bug in Moby/SwarmKit; moby/swarmkit#3116 will address this.
@neersighted assuming that PR will take some time to be merged and released, I would still like to merge this in the meantime, if @apricote is fine with it. This way we can get some real-world experience with the driver even though it is not perfect yet. The capabilities are hidden behind a feature flag that does not impact anyone besides new Swarm users.
Once the PR is merged, I will happily create another PR to remove the hack. I totally understand if we don't want to go ahead with this though, as this is not my decision to make :)
I am unable to successfully mount a volume to a service in my testing.

Reproduction Steps
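Presumably the sequence was along these lines, reconstructed from the commands that appear later in this thread (the volume name hcloud-debug1 and location hel1 come from a later comment; the service name csi-debug is made up for illustration, so this is not necessarily the exact set of commands used):

```sh
# Reconstructed sketch, not necessarily the exact commands used here.
docker plugin install --disable --alias hetznercloud/hcloud-csi-driver \
  --grant-all-permissions apricote/hetznercloud_hcloud-csi-driver:dev-swarm
docker plugin set hetznercloud/hcloud-csi-driver HCLOUD_TOKEN=<token>
docker plugin enable hetznercloud/hcloud-csi-driver

docker volume create --driver hetznercloud/hcloud-csi-driver \
  --required-bytes 50G --type mount --sharing onewriter --scope single \
  hcloud-debug1 --topology-required csi.hetzner.cloud/location=hel1

# this is the step that hangs
docker service create --name csi-debug \
  --mount type=cluster,src=hcloud-debug1,dst=/srv/www nginx:alpine
```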
The last step hangs indefinitely. The log output (
Further Info

Node ID was correctly retrieved from metadata.
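A quick way to cross-check this by hand is to query the Hetzner Cloud metadata service directly from the node (assuming the standard /hetzner/v1/metadata endpoint); the returned server ID should match the NodeID in the CSIInfo section of the node inspect output below:

```sh
# Read the server ID from the Hetzner Cloud metadata service
# (only reachable from the server itself); it should match the
# CSIInfo NodeID reported by `docker node inspect`.
curl -s http://169.254.169.254/hetzner/v1/metadata/instance-id
```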
Node has correct info set:

```
root@csi-test-swarm:~# docker node inspect o1hvbglvshwwcstm6zzujcyyb
[
    {
        "ID": "o1hvbglvshwwcstm6zzujcyyb",
        ...
        "CreatedAt": "2023-02-14T11:28:42.604637918Z",
        "UpdatedAt": "2023-02-14T11:28:43.209672806Z",
        ...
        "Description": {
            ...,
            "Engine": {
                "EngineVersion": "23.0.1",
                "Plugins": [
                    ...
                    {
                        "Type": "csicontroller",
                        "Name": "hetznercloud/hcloud-csi-driver:latest"
                    },
                    {
                        "Type": "csinode",
                        "Name": "hetznercloud/hcloud-csi-driver:latest"
                    }
                ]
            },
            ...
            "CSIInfo": [
                {
                    "PluginName": "hetznercloud/hcloud-csi-driver:latest",
                    "NodeID": "28867925",
                    "MaxVolumesPerNode": 16
                }
            ]
        },
        ...
    }
]
```

Volume is in state
@apricote thanks for the detailed reproduction steps. I will check what is wrong.
Here are some notes from debugging on my end. There seems to be some issue with CSI scheduling that occurs whenever you get into a broken state by creating a volume in a different location than the one your service runs in. This then somehow affects all volumes, and we have to clean up the state by force-removing all volumes and starting from scratch. In any case, this does not seem to be an error in the CSI plugin.

```
# working plugin, force removed everything, didn't restart docker
docker plugin install --disable --alias hetznercloud/hcloud-csi-driver --grant-all-permissions apricote/hetznercloud_hcloud-csi-driver:dev-swarm
docker plugin set hetznercloud/hcloud-csi-driver HCLOUD_TOKEN=<token>
docker plugin enable hetznercloud/hcloud-csi-driver

docker volume create --driver hetznercloud/hcloud-csi-driver --required-bytes 11G --type mount --sharing onewriter --scope single hcloud-nbg1 --topology-required csi.hetzner.cloud/location=nbg1
docker volume create --driver hetznercloud/hcloud-csi-driver --required-bytes 11G --type mount --sharing onewriter --scope single hcloud-hel1 --topology-required csi.hetzner.cloud/location=hel1

# pending creation (had plugin installed before, have to check with Docker team if this is expected behaviour)
martinb@csi-dev02:~/csi-driver$ docker volume ls --cluster
VOLUME NAME   GROUP   DRIVER                           AVAILABILITY   STATUS
hcloud-hel1           hetznercloud/hcloud-csi-driver   active         pending creation
hcloud-nbg1           hetznercloud/hcloud-csi-driver   active         pending creation

# restarting docker
# as root:
service docker restart

# creation worked
martinb@csi-dev02:~/csi-driver$ docker volume ls --cluster
VOLUME NAME   GROUP   DRIVER                           AVAILABILITY   STATUS
hcloud-hel1           hetznercloud/hcloud-csi-driver   active         created
hcloud-nbg1           hetznercloud/hcloud-csi-driver   active         created

docker service create --name hcloud-hel1-serv --mount type=cluster,src=hcloud-hel1,dst=/srv/www nginx:alpine
docker service create --name hcloud-nbg1-serv --mount type=cluster,src=hcloud-nbg1,dst=/srv/www nginx:alpine

# hel1 serv will never work, server is in nbg1
# nbg1 is not getting created due to hel1 being stuck

# remove service hel1
docker service rm hcloud-hel1-serv

# logs still show this:
Feb 14 13:35:21 csi-dev02 dockerd[6111]: time="2023-02-14T13:35:21.732310152Z" level=info msg="error handling volume" attempt=9 error="error publishing or unpublishing to some nodes: [t7v2vpe1q0ykbjw4pmgahfnxg]" module=csi/manager node.id=t7v2vpe1q0ykbjw4pmgahfnxg volume.id=w9bokbszki1emyhxiurvyy1db

# volume inspect output for hcloud-hel1 (stuck in pending-publish):
[
    {
        "ClusterVolume": {
            "ID": "w9bokbszki1emyhxiurvyy1db",
            "Version": {
                "Index": 3130
            },
            "CreatedAt": "2023-02-14T13:29:53.099835784Z",
            "UpdatedAt": "2023-02-14T13:35:21.730974867Z",
            "Spec": {
                "AccessMode": {
                    "Scope": "single",
                    "Sharing": "onewriter",
                    "MountVolume": {}
                },
                "AccessibilityRequirements": {
                    "Requisite": [
                        {
                            "Segments": {
                                "csi.hetzner.cloud/location": "hel1"
                            }
                        }
                    ]
                },
                "CapacityRange": {
                    "RequiredBytes": 11811160064,
                    "LimitBytes": 0
                },
                "Availability": "active"
            },
            "PublishStatus": [
                {
                    "NodeID": "t7v2vpe1q0ykbjw4pmgahfnxg",
                    "State": "pending-publish"
                }
            ],
            "Info": {
                "CapacityBytes": 11811160064,
                "VolumeID": "28101771",
                "AccessibleTopology": [
                    {
                        "Segments": {
                            "csi.hetzner.cloud/location": "hel1"
                        }
                    }
                ]
            }
        },
        "CreatedAt": "2023-02-14 13:29:53.099835784 +0000 UTC",
        "Driver": "hetznercloud/hcloud-csi-driver",
        "Labels": null,
        "Mountpoint": "",
        "Name": "hcloud-hel1",
        "Options": null,
        "Scope": "global"
    }
]

docker volume rm -f hcloud-hel1

# nbg service still does not schedule

# restarting docker daemon as root:
service docker restart

# scale down service and scale it up again
docker service scale hcloud-nbg1-serv=0
docker service scale hcloud-nbg1-serv=1

# remove service and add it again (doesn't work)
docker service rm hcloud-nbg1-serv
docker service create --name hcloud-nbg1-serv --mount type=cluster,src=hcloud-nbg1,dst=/srv/www nginx:alpine

# error log:
Feb 14 13:43:03 csi-dev02 dockerd[773]: time="2023-02-14T13:43:03Z" level=info msg="level=debug ts=2023-02-14T13:43:03.03999663Z component=grpc-server msg=\"handling request\" method=/csi.v1.Node/NodeStageVolume req=\"volume_id:\\\"28101769\\\" staging_target_path:\\\"/data/staged/tzbqvonl31mz1ulvtqbi1toq5\\\" volume_capability:<mount:<> access_mode:<mode:SINGLE_NODE_WRITER > > \"" plugin=45550d3efcd697d97d4e9e39d32b7a94d2e964cdef4ff0410433f0484c3f6350
Feb 14 13:43:03 csi-dev02 dockerd[773]: time="2023-02-14T13:43:03Z" level=info msg="level=debug ts=2023-02-14T13:43:03.040164719Z component=grpc-server msg=\"finished handling request\" method=/csi.v1.Node/NodeStageVolume err=null" plugin=45550d3efcd697d97d4e9e39d32b7a94d2e964cdef4ff0410433f0484c3f6350
Feb 14 13:43:03 csi-dev02 dockerd[773]: time="2023-02-14T13:43:03.040652761Z" level=info msg="volume staged to path /data/staged/tzbqvonl31mz1ulvtqbi1toq5" attempt=8 module=node/agent/csi volume.id=tzbqvonl31mz1ulvtqbi1toq5
Feb 14 13:43:03 csi-dev02 dockerd[773]: time="2023-02-14T13:43:03.040760708Z" level=debug msg="staging volume succeeded, attempting to publish volume" attempt=8 module=node/agent/csi volume.id=tzbqvonl31mz1ulvtqbi1toq5
Feb 14 13:43:03 csi-dev02 dockerd[773]: time="2023-02-14T13:43:03Z" level=info msg="level=debug ts=2023-02-14T13:43:03.041211408Z component=grpc-server msg=\"handling request\" method=/csi.v1.Node/NodePublishVolume req=\"volume_id:\\\"28101769\\\" staging_target_path:\\\"/data/staged/tzbqvonl31mz1ulvtqbi1toq5\\\" target_path:\\\"/data/published/tzbqvonl31mz1ulvtqbi1toq5\\\" volume_capability:<mount:<> access_mode:<mode:SINGLE_NODE_WRITER > > \"" plugin=45550d3efcd697d97d4e9e39d32b7a94d2e964cdef4ff0410433f0484c3f6350
Feb 14 13:43:03 csi-dev02 dockerd[773]: time="2023-02-14T13:43:03Z" level=info msg="level=error ts=2023-02-14T13:43:03.041281311Z component=grpc-server msg=\"handler failed\" err=\"rpc error: code = InvalidArgument desc = missing device path\"" plugin=45550d3efcd697d97d4e9e39d32b7a94d2e964cdef4ff0410433f0484c3f6350

# force remove everything that is left
docker service rm hcloud-nbg1-serv
docker volume rm -f hcloud-nbg1

# add nbg volume again (idempotency will use same volume from before)
docker volume create --driver hetznercloud/hcloud-csi-driver --required-bytes 11G --type mount --sharing onewriter --scope single hcloud-nbg1 --topology-required csi.hetzner.cloud/location=nbg1

# add service again
docker service create --name hcloud-nbg1-serv --mount type=cluster,src=hcloud-nbg1,dst=/srv/www nginx:alpine

# works
```
@apricote could it be the case that you had a similar situation, where a "bad" volume caused the proper one not to be scheduled?
I don't think so. I only ever created one volume in the cluster in
Can you try restarting the docker daemon (if you still have this node running)?
After restarting, the volume is now marked as
Also, I have multiple "names" for the plugin in my config:
I will try this whole process again tomorrow, just to make sure that I did not accidentally do something other than what I documented.
Thanks! I really appreciate your help with this!
I tried it again today and it did not work. I did manage to get it working by changing the driver reference to hetznercloud/hcloud-csi-driver:latest:

```diff
 docker volume create \
-  --driver hetznercloud/hcloud-csi-driver \
+  --driver hetznercloud/hcloud-csi-driver:latest \
   --required-bytes 50G \
   --type mount \
   --sharing onewriter \
   --scope single hcloud-debug1 \
   --topology-required csi.hetzner.cloud/location=hel1
```

No idea why docker starts referring to the plugin under another name than the one that I specified.
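For anyone running into the same naming confusion, one way to see which name the engine has actually registered for the plugin (just a suggested check, not something from this thread) is:

```sh
# List installed plugins with the exact names/tags Docker registered them under.
docker plugin ls

# Inspect the plugin by the alias that was set during install; the Name field
# shows the reference (including tag) that cluster volumes need as --driver.
docker plugin inspect --format '{{ .Name }}' hetznercloud/hcloud-csi-driver
```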
Ah great. I guess this is then an issue for moby/moby. I will amend the examples in README.md to use the tag again (for now) and leave a link to https://github.com/olljanat/csi-plugins-for-docker-swarm for everyone to be aware of the current limits of docker's CSI implementation. Then, we merge?
With the merge of moby/swarmkit#3116, the mock staging/unstaging is not needed anymore. It was introduced into our csi-driver in commit 619fa5c, which is part of a large squashed PR that added experimental support for Docker Swarm. See:
- #376
- #382
ready for review
closes #374