add Docker Swarm support #376

Merged
merged 26 commits into from
Feb 17, 2023

Conversation

s4ke
Contributor

@s4ke s4ke commented Feb 7, 2023

ready for review

closes #374

@apricote apricote self-requested a review February 7, 2023 09:51
@s4ke
Contributor Author

s4ke commented Feb 7, 2023

Currently hit a wall: volumes are not created. This might be fixed by moby/swarmkit#3116.

@s4ke
Contributor Author

s4ke commented Feb 7, 2023

just for linking purposes: this PR relates to #374

@s4ke s4ke changed the title [DRAFT] add Docker Swarm support add Docker Swarm support Feb 13, 2023
@s4ke
Contributor Author

s4ke commented Feb 13, 2023

@apricote I think this is ready for review. The only thing missing right now is the build pipeline. For this I was wondering whether we should simply reuse the Makefile but wrap it in a GitHub action. What do you think?

deploy/docker-swarm/pkg/config.json
driver/node.go Outdated
deploy/docker-swarm/README.md
deploy/docker-swarm/README.md Outdated
deploy/docker-swarm/README.md Outdated
deploy/docker-swarm/README.md Outdated
deploy/docker-swarm/pkg/Makefile Outdated
cmd/aio/main.go Outdated
@apricote
Member

I think this is ready for review.

Done :) Thanks for all the work you are putting into this!

The only thing missing right now is the build pipeline. For this I was wondering whether we should simply reuse the Makefile but wrap it in a GitHub action. What do you think?

That sounds good to me. You need to add it to the workflows publish_on_master.yml and publish_on_tag.yml.

@s4ke s4ke requested a review from apricote February 13, 2023 08:58
.github/workflows/publish_on_master.yml Outdated
.github/workflows/publish_on_tag.yml Outdated
driver/node.go Outdated
deploy/docker-swarm/pkg/Makefile Outdated
deploy/docker-swarm/pkg/Dockerfile Outdated
deploy/docker-swarm/pkg/Makefile Outdated
Co-authored-by: Julian Tölle <julian.toelle97@gmail.com>
@s4ke s4ke requested a review from apricote February 13, 2023 16:24
@apricote
Member

@s4ke I will test this tomorrow and if everything looks good I will merge it :)

@s4ke
Contributor Author

s4ke commented Feb 13, 2023

Awesome. At some point I guess it would make sense to add e2e tests for Swarm as well, but I guess this would be another PR :)

Comment on lines 51 to 59
 func (s *NodeService) NodeStageVolume(ctx context.Context, req *proto.NodeStageVolumeRequest) (*proto.NodeStageVolumeResponse, error) {
-	return nil, status.Error(codes.Unimplemented, "not supported")
+	// while we dont do anything here, Swarm 23.0.1 might require this
+	return &proto.NodeStageVolumeResponse{}, nil
 }

 func (s *NodeService) NodeUnstageVolume(ctx context.Context, req *proto.NodeUnstageVolumeRequest) (*proto.NodeUnstageVolumeResponse, error) {
-	return nil, status.Error(codes.Unimplemented, "not supported")
+	// while we dont do anything here, Swarm 23.0.1 might require this
+	return &proto.NodeUnstageVolumeResponse{}, nil
 }

I would avoid implementing a hack for a bug in Moby/SwarmKit; moby/swarmkit#3116 will address this.
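
For context on the review comment above: under the CSI specification, a container orchestrator is expected to call NodeStageVolume only when the node plugin advertises the STAGE_UNSTAGE_VOLUME capability. The following is a minimal sketch of that gating, not SwarmKit's actual code. The `nodePlugin` interface, `mountVolume`, and `noStagePlugin` are hypothetical names introduced only for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// nodeCapability is a local stand-in for the CSI node capability enum.
type nodeCapability string

const stageUnstageVolume nodeCapability = "STAGE_UNSTAGE_VOLUME"

// nodePlugin is a hypothetical, minimal view of a CSI node plugin.
type nodePlugin interface {
	Capabilities() []nodeCapability
	Stage(volumeID string) error
	Publish(volumeID string) error
}

// supportsStaging reports whether the plugin advertises STAGE_UNSTAGE_VOLUME.
func supportsStaging(p nodePlugin) bool {
	for _, c := range p.Capabilities() {
		if c == stageUnstageVolume {
			return true
		}
	}
	return false
}

// mountVolume stages only when the capability is advertised, then publishes.
func mountVolume(p nodePlugin, volumeID string) error {
	if supportsStaging(p) {
		if err := p.Stage(volumeID); err != nil {
			return fmt.Errorf("stage %s: %w", volumeID, err)
		}
	}
	return p.Publish(volumeID)
}

// noStagePlugin advertises no capabilities and rejects staging calls.
type noStagePlugin struct{ published []string }

func (p *noStagePlugin) Capabilities() []nodeCapability { return nil }
func (p *noStagePlugin) Stage(string) error             { return errors.New("not supported") }
func (p *noStagePlugin) Publish(id string) error {
	p.published = append(p.published, id)
	return nil
}

func main() {
	p := &noStagePlugin{}
	// Staging is skipped, so the "not supported" error is never hit.
	fmt.Println(mountVolume(p, "28096177"), p.published)
}
```

With gating like this on the orchestrator side, a plugin would not need to return a fake success from NodeStageVolume.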

Contributor Author

@s4ke s4ke Feb 14, 2023

@neersighted assuming it will take some time for that PR to be merged and for a new release to ship, and if @apricote is fine with it, I would still like to merge this, tbh. This way we can get some real-world experience with the driver even though it is not perfect yet. The capabilities are hidden behind a feature flag, so they do not affect anyone besides new Swarm users.

Once the PR is merged, I will happily create another PR to remove the hack. I totally understand if we don't want to go ahead with this though, as this is not my decision to make :)

@apricote
Member

apricote commented Feb 14, 2023

I am unable to successfully mount a volume to a service in my testing.

Reproduction Steps

  1. Check out PR
  2. Build and push image
    cd deploy/docker-swarm/pkg
    make push PLUGIN_NAME=apricote/hetznercloud_hcloud-csi-driver
  3. Set up server and swarm
    ## prepare server
    hcloud server create --name csi-test-swarm --image ubuntu-22.04 --ssh-key julian.toelle --type cpx21
    hcloud server ssh csi-test-swarm
    # on server
    ## init swarm
    curl -fsSL https://get.docker.com -o get-docker.sh
    chmod +x get-docker.sh
    sh get-docker.sh
    docker swarm init
  4. Set up plugin (according to docs)
    docker plugin install --disable --alias hetznercloud/hcloud-csi-driver --grant-all-permissions apricote/hetznercloud_hcloud-csi-driver:dev-swarm
    docker plugin set hetznercloud/hcloud-csi-driver HCLOUD_TOKEN=<my token>
    docker plugin enable hetznercloud/hcloud-csi-driver
  5. Create Volume
    docker volume create --driver hetznercloud/hcloud-csi-driver --required-bytes 50G --type mount --sharing onewriter --scope single hcloud-debug1 --topology-required csi.hetzner.cloud/location=hel1
  6. Create Service
    docker service create --name hcloud-debug-serv1   --mount type=cluster,src=hcloud-debug1,dst=/srv/www   nginx:alpine

The last step hangs indefinitely. The log output (journalctl -u docker.service -f) shows that the call to ControllerPublishVolume fails:

Feb 14 11:31:03 csi-test-swarm dockerd[2056]: time="2023-02-14T11:31:03Z" level=info msg="level=debug ts=2023-02-14T11:31:03.343496798Z component=grpc-server msg=\"handling request\" method=/csi.v1.Controller/ControllerPublishVolume req=\"volume_id:\\\"28096177\\\" volume_capability:<mount:<> access_mode:<mode:SINGLE_NODE_WRITER > > \"" plugin=7cd4cf53fdbf4b60ffbfc2933311183f6bff5164492e40039330156033b6159e
Feb 14 11:31:03 csi-test-swarm dockerd[2056]: time="2023-02-14T11:31:03Z" level=info msg="level=error ts=2023-02-14T11:31:03.343618005Z component=grpc-server msg=\"handler failed\" err=\"rpc error: code = InvalidArgument desc = missing node id\"" plugin=7cd4cf53fdbf4b60ffbfc2933311183f6bff5164492e40039330156033b6159e
Feb 14 11:31:03 csi-test-swarm dockerd[2056]: time="2023-02-14T11:31:03.344369531Z" level=info msg="error handling volume" attempt=0 error="error publishing or unpublishing to some nodes: [o1hvbglvshwwcstm6zzujcyyb]" module=csi/manager node.id=o1hvbglvshwwcstm6zzujcyyb volume.id=kv91gibx6xhtntntq7cqs9gjp
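
The logged request carries a volume_id and a volume_capability but no node_id, which is why the handler rejects it. A minimal sketch of that kind of validation follows. It is not the driver's actual code: `publishRequest`, `validatePublishRequest`, and the "missing volume id" message are hypothetical, and only the "missing node id" message is taken from the log.

```go
package main

import (
	"errors"
	"fmt"
)

// publishRequest is a hypothetical stand-in for the two request fields of interest.
type publishRequest struct {
	VolumeID string
	NodeID   string
}

var (
	// errMissingVolumeID uses an assumed message; it does not appear in the log.
	errMissingVolumeID = errors.New("missing volume id")
	// errMissingNodeID uses the message seen in the log above.
	errMissingNodeID = errors.New("missing node id")
)

// validatePublishRequest rejects requests that lack a required identifier.
func validatePublishRequest(req publishRequest) error {
	if req.VolumeID == "" {
		return errMissingVolumeID
	}
	if req.NodeID == "" {
		return errMissingNodeID
	}
	return nil
}

func main() {
	// Mirrors the logged request: volume_id is set, node_id is absent.
	err := validatePublishRequest(publishRequest{VolumeID: "28096177"})
	fmt.Println(err)
}
```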

Further Info

Node ID was correctly retrieved from metadata:

Feb 14 11:28:06 csi-test-swarm dockerd[2056]: time="2023-02-14T11:28:06Z" level=info msg="level=info ts=2023-02-14T11:28:06.607759961Z msg=\"Fetched data from metadata service\" id=28867925 location=hel1" plugin=7cd4cf53fdbf4b60ffbfc2933311183f6bff5164492e40039330156033b6159e

Node has correct info set:

root@csi-test-swarm:~# docker node inspect o1hvbglvshwwcstm6zzujcyyb
[
    {
        "ID": "o1hvbglvshwwcstm6zzujcyyb",
        ...
        "CreatedAt": "2023-02-14T11:28:42.604637918Z",
        "UpdatedAt": "2023-02-14T11:28:43.209672806Z",
        ...
        "Description": {
            ...,
            "Engine": {
                "EngineVersion": "23.0.1",
                "Plugins": [
                    ...
                    {
                        "Type": "csicontroller",
                        "Name": "hetznercloud/hcloud-csi-driver:latest"
                    },
                    {
                        "Type": "csinode",
                        "Name": "hetznercloud/hcloud-csi-driver:latest"
                    }
                ]
            },
            ...
            "CSIInfo": [
                {
                    "PluginName": "hetznercloud/hcloud-csi-driver:latest",
                    "NodeID": "28867925",
                    "MaxVolumesPerNode": 16
                }
            ]
        },
        ...
    }
]

Volume is in state pending-publish

root@csi-test-swarm:~# docker volume inspect kv91gibx6xhtntntq7cqs9gjp
[
    {
        "ClusterVolume": {
            "ID": "kv91gibx6xhtntntq7cqs9gjp",
            "Version": {
                "Index": 35
            },
            "CreatedAt": "2023-02-14T11:28:48.7175303Z",
            "UpdatedAt": "2023-02-14T11:54:42.381566553Z",
            "Spec": {
                "AccessMode": {
                    "Scope": "single",
                    "Sharing": "onewriter",
                    "MountVolume": {}
                },
                "AccessibilityRequirements": {
                    "Requisite": [
                        {
                            "Segments": {
                                "csi.hetzner.cloud/location": "hel1"
                            }
                        }
                    ]
                },
                "CapacityRange": {
                    "RequiredBytes": 53687091200,
                    "LimitBytes": 0
                },
                "Availability": "active"
            },
            "PublishStatus": [
                {
                    "NodeID": "o1hvbglvshwwcstm6zzujcyyb",
                    "State": "pending-publish"
                }
            ],
            "Info": {
                "CapacityBytes": 53687091200,
                "VolumeID": "28096177",
                "AccessibleTopology": [
                    {
                        "Segments": {
                            "csi.hetzner.cloud/location": "hel1"
                        }
                    }
                ]
            }
        },
        "CreatedAt": "2023-02-14 11:28:48.7175303 +0000 UTC",
        "Driver": "hetznercloud/hcloud-csi-driver",
        "Labels": null,
        "Mountpoint": "",
        "Name": "hcloud-debug1",
        "Options": null,
        "Scope": "global"
    }
]
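
As a side note on the numbers above: the RequiredBytes value of 53687091200 matches `--required-bytes 50G` read with 1024-based units (50 × 2^30). The sketch below only illustrates that arithmetic. `parseBinarySize` is a hypothetical helper, not Docker's size parser.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseBinarySize converts strings such as "50G" into bytes using 1024-based units.
// A value without a suffix is treated as a plain byte count.
func parseBinarySize(s string) (int64, error) {
	units := map[byte]int64{'K': 1 << 10, 'M': 1 << 20, 'G': 1 << 30, 'T': 1 << 40}
	s = strings.TrimSpace(s)
	if s == "" {
		return 0, fmt.Errorf("empty size")
	}
	mult := int64(1)
	if m, ok := units[s[len(s)-1]]; ok {
		mult = m
		s = s[:len(s)-1]
	}
	n, err := strconv.ParseInt(s, 10, 64)
	if err != nil {
		return 0, fmt.Errorf("parse size %q: %w", s, err)
	}
	return n * mult, nil
}

func main() {
	for _, in := range []string{"50G", "11G"} {
		n, err := parseBinarySize(in)
		fmt.Println(in, n, err)
	}
}
```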

@s4ke
Contributor Author

s4ke commented Feb 14, 2023

@apricote thanks for the detailed reproduction steps. I will check what is wrong.

@s4ke
Contributor Author

s4ke commented Feb 14, 2023

Here are some notes from debugging on my end. There seems to be an issue with CSI scheduling that appears once you get into a broken state by creating a volume in a different location than the one your service runs in. This then somehow affects all volumes, and we have to clean up the state by force-removing all volumes and starting from scratch.

In any case, this does not seem to be an error of the CSI plugin.
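
To make the location mismatch concrete: in CSI terms, a volume is reachable from a node only if the node's topology segments satisfy one of the volume's accessible topologies. The sketch below illustrates that check under this assumption. It is not the Swarm scheduler's code, and `segments`, `matches`, and `accessibleFrom` are hypothetical names.

```go
package main

import "fmt"

// segments maps topology keys to values, e.g. "csi.hetzner.cloud/location": "nbg1".
type segments map[string]string

// matches reports whether every key/value pair in want is present in have.
func matches(have, want segments) bool {
	for k, v := range want {
		if got, ok := have[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// accessibleFrom reports whether a node with nodeTopology can reach a volume
// whose accessible topology is volumeTopologies. An empty list means no constraint.
func accessibleFrom(nodeTopology segments, volumeTopologies []segments) bool {
	if len(volumeTopologies) == 0 {
		return true
	}
	for _, t := range volumeTopologies {
		if matches(nodeTopology, t) {
			return true
		}
	}
	return false
}

func main() {
	node := segments{"csi.hetzner.cloud/location": "nbg1"}
	hel1 := []segments{{"csi.hetzner.cloud/location": "hel1"}}
	nbg1 := []segments{{"csi.hetzner.cloud/location": "nbg1"}}
	fmt.Println("hel1 volume reachable:", accessibleFrom(node, hel1))
	fmt.Println("nbg1 volume reachable:", accessibleFrom(node, nbg1))
}
```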

# working plugin, force removed everything, didn't restart docker

docker plugin install --disable --alias hetznercloud/hcloud-csi-driver --grant-all-permissions apricote/hetznercloud_hcloud-csi-driver:dev-swarm

docker plugin set hetznercloud/hcloud-csi-driver HCLOUD_TOKEN=<token>

docker plugin enable hetznercloud/hcloud-csi-driver

docker volume create --driver hetznercloud/hcloud-csi-driver --required-bytes 11G --type mount --sharing onewriter --scope single hcloud-nbg1 --topology-required csi.hetzner.cloud/location=nbg1
docker volume create --driver hetznercloud/hcloud-csi-driver --required-bytes 11G --type mount --sharing onewriter --scope single hcloud-hel1 --topology-required csi.hetzner.cloud/location=hel1

# pending creation (had plugin installed before, have to check with Docker team if this is expected behaviour)

martinb@csi-dev02:~/csi-driver$ docker volume ls --cluster
VOLUME NAME   GROUP     DRIVER                           AVAILABILITY   STATUS
hcloud-hel1             hetznercloud/hcloud-csi-driver   active         pending creation
hcloud-nbg1             hetznercloud/hcloud-csi-driver   active         pending creation

# restarting docker
# as root:
service docker restart

# creation worked
martinb@csi-dev02:~/csi-driver$ docker volume ls --cluster
VOLUME NAME   GROUP     DRIVER                           AVAILABILITY   STATUS
hcloud-hel1             hetznercloud/hcloud-csi-driver   active         created
hcloud-nbg1             hetznercloud/hcloud-csi-driver   active         created


docker service create --name hcloud-hel1-serv   --mount type=cluster,src=hcloud-hel1,dst=/srv/www   nginx:alpine
docker service create --name hcloud-nbg1-serv   --mount type=cluster,src=hcloud-nbg1,dst=/srv/www   nginx:alpine

# hel1 serv will never work, server is in nbg1
# nbg1 is not getting created due to hel1 being stuck

# remove service hel1
docker service rm hcloud-hel1-serv

# logs still show this:
Feb 14 13:35:21 csi-dev02 dockerd[6111]: time="2023-02-14T13:35:21.732310152Z" level=info msg="error handling volume" attempt=9 error="error publishing or unpublishing to some nodes: [t7v2vpe1q0ykbjw4pmgahfnxg]" module=csi/manager node.id=t7v2vpe1q0ykbjw4pmgahfnxg volume.id=w9bokbszki1emyhxiurvyy1db
[
    {
        "ClusterVolume": {
            "ID": "w9bokbszki1emyhxiurvyy1db",
            "Version": {
                "Index": 3130
            },
            "CreatedAt": "2023-02-14T13:29:53.099835784Z",
            "UpdatedAt": "2023-02-14T13:35:21.730974867Z",
            "Spec": {
                "AccessMode": {
                    "Scope": "single",
                    "Sharing": "onewriter",
                    "MountVolume": {}
                },
                "AccessibilityRequirements": {
                    "Requisite": [
                        {
                            "Segments": {
                                "csi.hetzner.cloud/location": "hel1"
                            }
                        }
                    ]
                },
                "CapacityRange": {
                    "RequiredBytes": 11811160064,
                    "LimitBytes": 0
                },
                "Availability": "active"
            },
            "PublishStatus": [
                {
                    "NodeID": "t7v2vpe1q0ykbjw4pmgahfnxg",
                    "State": "pending-publish"
                }
            ],
            "Info": {
                "CapacityBytes": 11811160064,
                "VolumeID": "28101771",
                "AccessibleTopology": [
                    {
                        "Segments": {
                            "csi.hetzner.cloud/location": "hel1"
                        }
                    }
                ]
            }
        },
        "CreatedAt": "2023-02-14 13:29:53.099835784 +0000 UTC",
        "Driver": "hetznercloud/hcloud-csi-driver",
        "Labels": null,
        "Mountpoint": "",
        "Name": "hcloud-hel1",
        "Options": null,
        "Scope": "global"
    }
]

docker volume rm -f hcloud-hel1
# nbg service still does not schedule

# restarting docker daemon as root:
service docker restart

# scale down service and scale it up again
docker service scale hcloud-nbg1-serv=0
docker service scale hcloud-nbg1-serv=1

# remove service and add it again (doesn't work)
docker service rm hcloud-nbg1-serv
docker service create --name hcloud-nbg1-serv   --mount type=cluster,src=hcloud-nbg1,dst=/srv/www   nginx:alpine

# error log:
Feb 14 13:43:03 csi-dev02 dockerd[773]: time="2023-02-14T13:43:03Z" level=info msg="level=debug ts=2023-02-14T13:43:03.03999663Z component=grpc-server msg=\"handling request\" method=/csi.v1.Node/NodeStageVolume req=\"volume_id:\\\"28101769\\\" staging_target_path:\\\"/data/staged/tzbqvonl31mz1ulvtqbi1toq5\\\" volume_capability:<mount:<> access_mode:<mode:SINGLE_NODE_WRITER > > \"" plugin=45550d3efcd697d97d4e9e39d32b7a94d2e964cdef4ff0410433f0484c3f6350
Feb 14 13:43:03 csi-dev02 dockerd[773]: time="2023-02-14T13:43:03Z" level=info msg="level=debug ts=2023-02-14T13:43:03.040164719Z component=grpc-server msg=\"finished handling request\" method=/csi.v1.Node/NodeStageVolume err=null" plugin=45550d3efcd697d97d4e9e39d32b7a94d2e964cdef4ff0410433f0484c3f6350
Feb 14 13:43:03 csi-dev02 dockerd[773]: time="2023-02-14T13:43:03.040652761Z" level=info msg="volume staged to path /data/staged/tzbqvonl31mz1ulvtqbi1toq5" attempt=8 module=node/agent/csi volume.id=tzbqvonl31mz1ulvtqbi1toq5
Feb 14 13:43:03 csi-dev02 dockerd[773]: time="2023-02-14T13:43:03.040760708Z" level=debug msg="staging volume succeeded, attempting to publish volume" attempt=8 module=node/agent/csi volume.id=tzbqvonl31mz1ulvtqbi1toq5
Feb 14 13:43:03 csi-dev02 dockerd[773]: time="2023-02-14T13:43:03Z" level=info msg="level=debug ts=2023-02-14T13:43:03.041211408Z component=grpc-server msg=\"handling request\" method=/csi.v1.Node/NodePublishVolume req=\"volume_id:\\\"28101769\\\" staging_target_path:\\\"/data/staged/tzbqvonl31mz1ulvtqbi1toq5\\\" target_path:\\\"/data/published/tzbqvonl31mz1ulvtqbi1toq5\\\" volume_capability:<mount:<> access_mode:<mode:SINGLE_NODE_WRITER > > \"" plugin=45550d3efcd697d97d4e9e39d32b7a94d2e964cdef4ff0410433f0484c3f6350
Feb 14 13:43:03 csi-dev02 dockerd[773]: time="2023-02-14T13:43:03Z" level=info msg="level=error ts=2023-02-14T13:43:03.041281311Z component=grpc-server msg=\"handler failed\" err=\"rpc error: code = InvalidArgument desc = missing device path\"" plugin=45550d3efcd697d97d4e9e39d32b7a94d2e964cdef4ff0410433f0484c3f6350

# force remove everything that is left
docker service rm hcloud-nbg1-serv
docker volume rm -f hcloud-nbg1

# add nbg volume again (idempotency will use same volume from before)
docker volume create --driver hetznercloud/hcloud-csi-driver --required-bytes 11G --type mount --sharing onewriter --scope single hcloud-nbg1 --topology-required csi.hetzner.cloud/location=nbg1
# add service again
docker service create --name hcloud-nbg1-serv   --mount type=cluster,src=hcloud-nbg1,dst=/srv/www   nginx:alpine

# works
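
A note on the "missing device path" error in the log above: in CSI, the result of ControllerPublishVolume (the publish context) is normally passed on to the node calls, and the logged NodePublishVolume request shows no such context. The sketch below assumes the driver reads the device path from that context, which this thread does not confirm. `devicePathKey` and `devicePathFrom` are hypothetical, and only the error message comes from the log.

```go
package main

import (
	"errors"
	"fmt"
)

// devicePathKey is a hypothetical publish-context key; the driver's real key is not shown in this thread.
const devicePathKey = "devicePath"

// errMissingDevicePath uses the message seen in the log above.
var errMissingDevicePath = errors.New("missing device path")

// devicePathFrom extracts the device path that a controller publish step would hand to the node.
func devicePathFrom(publishContext map[string]string) (string, error) {
	p := publishContext[devicePathKey] // reading from a nil map is safe in Go
	if p == "" {
		return "", errMissingDevicePath
	}
	return p, nil
}

func main() {
	// No publish context at all, as in the logged request.
	_, err := devicePathFrom(nil)
	fmt.Println(err)

	// With a context present, the path is returned (the path here is a placeholder).
	p, err := devicePathFrom(map[string]string{devicePathKey: "/dev/disk/by-id/example"})
	fmt.Println(p, err)
}
```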

@s4ke
Contributor Author

s4ke commented Feb 14, 2023

@apricote could it be that you had a similar situation, where a "bad" volume prevented the proper one from being scheduled?

@apricote
Member

@apricote could it be that you had a similar situation, where a "bad" volume prevented the proper one from being scheduled?

I don't think so. I only ever created one volume in the cluster, in hel1, the same location as the server. My volume status is also "in use (1 node)", but it is still not attached properly:

root@csi-test-swarm:~# docker volume ls --cluster
VOLUME NAME     GROUP     DRIVER                           AVAILABILITY   STATUS
hcloud-debug1             hetznercloud/hcloud-csi-driver   active         in use (1 node)

@s4ke
Contributor Author

s4ke commented Feb 14, 2023

Can you try restarting the docker daemon (if you still have this node running)?

@apricote
Member

Can you try restarting the docker daemon (if you still have this node running)?

After restarting, the volume is now marked as published, but it's not attached to the server in the API.

Also I have multiple "names" for the plugin in my config:

  • Volume is managed by driver hetznercloud/hcloud-csi-driver
  • Node has CSIInfo for plugins hetznercloud/hcloud-csi-driver:latest & hetznercloud/hcloud-csi-driver
  • Plugin has name hetznercloud/hcloud-csi-driver:latest

I will try this whole process again tomorrow, just to make sure that I did not accidentally do something other than what I documented.

@s4ke
Contributor Author

s4ke commented Feb 14, 2023

Thanks! I really appreciate your help with this!

@apricote
Member

I tried it again today and it did not work.

I did manage to get it working by changing the driver reference to hetznercloud/hcloud-csi-driver:latest when creating the volume:

 docker volume create \
-  --driver hetznercloud/hcloud-csi-driver \
+  --driver hetznercloud/hcloud-csi-driver:latest \
   --required-bytes 50G \
   --type mount \
   --sharing onewriter \
   --scope single hcloud-debug1 \
   --topology-required csi.hetzner.cloud/location=hel1

No idea why Docker starts referring to the plugin under a different name from the one that I specified.
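
One plausible reading of the symptom, stated here as an assumption rather than a confirmed diagnosis: if the node's CSIInfo is looked up by exact plugin name, the untagged driver name finds no entry and therefore no node ID, while the `:latest` name does. The sketch below illustrates that mismatch. It is not Docker's or SwarmKit's code, and `csiInfo`, `lookupNodeID`, and `withDefaultTag` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// csiInfo mirrors two of the CSIInfo fields shown by `docker node inspect`.
type csiInfo struct {
	PluginName string
	NodeID     string
}

// lookupNodeID returns the CSI node ID registered under exactly pluginName.
func lookupNodeID(infos []csiInfo, pluginName string) (string, bool) {
	for _, info := range infos {
		if info.PluginName == pluginName {
			return info.NodeID, true
		}
	}
	return "", false
}

// withDefaultTag appends ":latest" when the reference carries no tag.
func withDefaultTag(ref string) string {
	// Only inspect the last path component so a registry port ("host:5000/x")
	// is not mistaken for a tag.
	last := ref[strings.LastIndex(ref, "/")+1:]
	if strings.Contains(last, ":") {
		return ref
	}
	return ref + ":latest"
}

func main() {
	infos := []csiInfo{{PluginName: "hetznercloud/hcloud-csi-driver:latest", NodeID: "28867925"}}

	// The untagged name, as used in the original volume create command.
	_, ok := lookupNodeID(infos, "hetznercloud/hcloud-csi-driver")
	fmt.Println("untagged match:", ok)

	// The same name with the default tag applied.
	id, ok := lookupNodeID(infos, withDefaultTag("hetznercloud/hcloud-csi-driver"))
	fmt.Println("tagged match:", id, ok)
}
```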

@s4ke
Contributor Author

s4ke commented Feb 15, 2023

Ah great. I guess this is then an issue for moby/moby.

I will amend the examples in README.md to use the tag again (for now) and leave a link to https://github.com/olljanat/csi-plugins-for-docker-swarm so that everyone is aware of the current limits of Docker's CSI implementation.

Then, we merge?

@s4ke s4ke requested a review from apricote February 16, 2023 16:58
@apricote apricote merged commit b492a97 into hetznercloud:main Feb 17, 2023
lukasmetzner added a commit that referenced this pull request Oct 18, 2024

With the merge of moby/swarmkit#3116 the mock
staging/unstaging is not needed anymore. It was introduced in our
csi-driver in commit 619fa5c, which is part of a large PR
(squashed commit) that included experimental support for Docker Swarm.

See:
- #376 
- #382

---------
Development

Successfully merging this pull request may close these issues.

Add support for Docker Swarm
3 participants