Fix for Swarm mode #19

Open: wants to merge 5 commits into base: develop


RomRider

An attempt to fix swarm mode. I can't test the impact on standard (non-swarm) docker.

Ugly fix in add_chain() to get access to the new network namespace when no previous containers were running on the node on the targeted $NETWORK. Not sure if this is a kernel bug or expected behaviour...
The error would be nsenter: setns(): can't reassociate to namespace 'net': Invalid argument

Fixes #18

kaysond changed the base branch from master to develop on February 12, 2025, 20:58
@kaysond
Owner

kaysond commented Feb 12, 2025

Thanks for the PR! I switched it to the develop branch to trigger the CI, which is failing because of some shellcheck issues and what look like some unbound variables. (PS - I changed the docker version back to what it is in dev; I'll bump it in a separate PR before generating a new release.)

For the network namespace issue, the swarm worker service is running in the host namespace, so it shouldn't have a problem entering the namespace. I suspect that what's happening is that the namespace it's trying to enter is either empty or doesn't exist. Can you post some logs from when that happens? My concern with the hack fix is that it does not distinguish between error types: it kills the container no matter what.
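
One way to distinguish error types is to inspect what nsenter writes to stderr and only bail out on the specific setns() failure. The following is a minimal sketch, not code from this PR: classify_nsenter_error is a hypothetical helper, the category names are made up for illustration, and the "No such file or directory" wording for a missing namespace file is an assumption.

```shell
#!/usr/bin/env bash
# Sketch only: classify nsenter's stderr text so that just the setns()
# failure reported in this PR triggers a container restart.
# classify_nsenter_error is hypothetical and does not exist in trafficjam.
classify_nsenter_error() {
    local stderr_text="$1"
    case "$stderr_text" in
        *"setns(): can't reassociate to namespace"*"Invalid argument"*)
            echo "stale_namespace" ;;    # the error quoted in this PR
        *"No such file or directory"*)
            echo "missing_namespace" ;;  # assumed wording for a missing netns file
        "")
            echo "none" ;;               # nothing was written to stderr
        *)
            echo "other" ;;              # anything else: log it, don't kill the container
    esac
}

# Possible use (assumes the caller has set $netns_path):
#   if ! err="$(nsenter --net="$netns_path" true 2>&1)"; then
#       if [ "$(classify_nsenter_error "$err")" = "stale_namespace" ]; then
#           exit 1   # let swarm restart the worker
#       fi
#   fi
```

With something along these lines, only the "Invalid argument" case would restart the worker, and every other failure could be logged and handled separately.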

@RomRider
Author

RomRider commented Feb 13, 2025

Thanks, I'll fix some of those issues.

Regarding the issue with the namespace: When the namespace is created after the container is created (that happens if there's nothing running on a specific node on the defined network at the time the trafficjam worker container is starting), it seems that it can't access it unless the container is restarted.

I've sh-ed into the worker container, and running nsenter manually on the new namespace always fails with:
nsenter: setns(): can't reassociate to namespace 'net': Invalid argument
Restarting the container and using the same command again makes it work.

The way I make it fail is:

  • drain a node
  • make the node available again
  • rm the trafficjam_DEFAULT service (might be optional, but to be sure...)
  • this node is now only running trafficjam worker (and other global services)
  • deploy something on the node on the defined $NETWORK by setting the affinity
  • watch the logs of the trafficjam worker running on this node/open a shell and run the nsenter command manually to verify
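
The steps above can be sketched as a script. This is an approximation under assumptions: the node is addressed by its hostname, "affinity" is expressed as a placement constraint, and the test service name (netns-repro) and image (nginx:alpine) are placeholders. Only trafficjam_DEFAULT comes from this thread.

```shell
#!/usr/bin/env bash
# Sketch of the reproduction steps. Set DRY_RUN=1 to print the commands
# without running them.
run() {
    # Print each command; execute it only when DRY_RUN is unset or empty.
    echo "+ $*"
    if [ -z "${DRY_RUN:-}" ]; then
        "$@"
    fi
}

reproduce() {
    local node="$1" network="$2"
    # Drain the node, then make it available again.
    run docker node update --availability drain "$node"
    run docker node update --availability active "$node"
    # Remove the trafficjam_DEFAULT service (might be optional).
    run docker service rm trafficjam_DEFAULT
    # Deploy a placeholder service on the target network, pinned to the node.
    run docker service create --name netns-repro --network "$network" --constraint "node.hostname==$node" nginx:alpine
}

# Example: DRY_RUN=1 reproduce my-node my-network
```

After the last step, the trafficjam worker logs on that node (or a manual nsenter from a shell in the worker container) should show whether the setns() error appears.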

@kaysond
Owner

kaysond commented Feb 13, 2025

Looks like CI is still failing. I changed the repo settings so hopefully it will run automatically with any commit, instead of requiring my approval.

Re: namespace - what OS and kernel version are you running docker on? Also - when the namespace gets created, and your shell command in the container fails, do you see the correct namespace listed in /var/run/netns in the -container-? I'm wondering if it's just that the bind mount isn't updating for some reason. Maybe related to this: https://forums.docker.com/t/unable-to-check-docker-overlay-network-namespace/17267
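
A rough sketch of how that information could be gathered. netns_visible is a hypothetical helper, and the default directory simply follows the /var/run/netns path mentioned above; the actual mount point inside the container may differ.

```shell
#!/usr/bin/env bash
# Sketch: report whether a namespace name shows up in a netns directory.
# netns_visible is hypothetical; the directory defaults to /var/run/netns.
netns_visible() {
    local name="$1" dir="${2:-/var/run/netns}"
    if [ -e "$dir/$name" ]; then
        echo "visible"
    else
        echo "not visible"
    fi
}

# On the docker host:
#   uname -r               # kernel version
#   cat /etc/os-release    # OS name and version
# Inside the trafficjam worker container, after the nsenter failure:
#   ls -l /var/run/netns
#   netns_visible "<namespace name>"
```

If the namespace shows as "not visible" inside the container while it exists on the host, that would point at the bind mount not updating rather than at nsenter itself.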

I'm also wondering whether you'd see the same issue without swarm if you started trafficjam before the network gets created (assuming the code didn't consider that an error).

Successfully merging this pull request may close these issues.

Docker swarm mode doesn't seem to work
2 participants