Can't start container after system upgrade due to permission error #1414

Closed
oblitum opened this issue Feb 27, 2020 · 21 comments

Comments

@oblitum

oblitum commented Feb 27, 2020

Context

Today I upgraded my host system (Arch Linux on an ARM64 RPi3). The upgrade included the kernel (now 5.5.6) and other packages. Upgrades have generally not been an issue at all, but after this one I can no longer start docker-mailserver; it exits with this error:

❯ sudo docker logs mail -f
Traceback (most recent call last):
  File "/usr/bin/supervisord", line 11, in <module>
    load_entry_point('supervisor==3.3.1', 'console_scripts', 'supervisord')()
  File "/usr/lib/python2.7/dist-packages/supervisor/supervisord.py", line 365, in main
    go(options)
  File "/usr/lib/python2.7/dist-packages/supervisor/supervisord.py", line 375, in go
    d.main()
  File "/usr/lib/python2.7/dist-packages/supervisor/supervisord.py", line 70, in main
    rlimit_messages = self.options.set_rlimits()
  File "/usr/lib/python2.7/dist-packages/supervisor/options.py", line 1372, in set_rlimits
    soft, hard = resource.getrlimit(res)
resource.error: (1, 'Operation not permitted')

I tried adding ALL, SYS_RESOURCE, etc. to cap_add in docker-compose.yml to check whether some newly required capability was missing, but it didn't help.
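One way to confirm which capabilities actually reach the container is to decode the effective capability mask from /proc/self/status inside it. A minimal sketch, assuming a Linux container; the bit numbers come from the kernel's capability list (CAP_NET_ADMIN = 12, CAP_SYS_PTRACE = 19, CAP_SYS_ADMIN = 21, CAP_SYS_RESOURCE = 24), and the helper names are hypothetical:

```python
import os

# Capability bit numbers from the Linux capability list (linux/capability.h).
CAP_NAMES = {
    12: "CAP_NET_ADMIN",
    19: "CAP_SYS_PTRACE",
    21: "CAP_SYS_ADMIN",
    24: "CAP_SYS_RESOURCE",
}


def parse_cap_eff(status_text):
    """Return the effective-capability bitmask from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("CapEff line not found")


def report(status_text):
    """Map each capability name in CAP_NAMES to True (present) or False."""
    mask = parse_cap_eff(status_text)
    return {name: bool((mask >> bit) & 1) for bit, name in CAP_NAMES.items()}


# Only read the real file when run directly on a system that has /proc.
if __name__ == "__main__" and os.path.exists("/proc/self/status"):
    with open("/proc/self/status") as f:
        for name, present in sorted(report(f.read()).items()):
            print("%-18s %s" % (name, "yes" if present else "no"))
```

The script runs under both Python 2.7 and Python 3, so it can be executed with whichever interpreter the image provides. A privileged container should report every bit as set.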

Expected Behavior

The container to start.

Actual Behavior

The container doesn't start due to a permission error from getrlimit for RLIMIT_NOFILE.

Steps to Reproduce

  1. Have host system as Arch Linux on ARM
  2. Upgrade
  3. Run docker-mailserver built from master

Your Environment

  • Amount of RAM available: 1GB
  • Mailserver version used: built from master
  • Docker version used: 1:19.03.6-1
  • Environment settings relevant to the config:
  • Any relevant stack traces ("Full trace" preferred):
@oblitum
Author

oblitum commented Feb 27, 2020

Running the container with privileged: true avoids the issue. Should this be necessary now?

@erik-wramner
Contributor

Not always, but it might be necessary depending on your options. A quick search turns up containers/podman#2123, a related report for podman, in case you are using that.

I'm strongly against running privileged unless it's needed (for example by fail2ban), so if this calls for changes I'm all for them. That said, I'm running on a modern system without privileged.

@oblitum
Author

oblitum commented Feb 29, 2020

I'm running docker.

@oblitum
Author

oblitum commented Feb 29, 2020

Not always, but it might be necessary depending on your options

These are my settings:

version: "2"
services:
  mail:
    image: docker-mailserver:latest
    container_name: mail
    ports:
      - "25:25"
      - "465:465"
      - "587:587"
      - "143:143"
      - "993:993"
    volumes:
      - maildata:/var/mail
      - mailstate:/var/mail-state
      - maillogs:/var/log/mail
      - /etc/localtime:/etc/localtime:ro
      - /etc/letsencrypt/:/etc/letsencrypt/:ro
      - ./config/:/tmp/docker-mailserver/
    environment:
      - ONE_DIR=1
      - ENABLE_FAIL2BAN=1
      - ENABLE_POSTGREY=1
      - SPOOF_PROTECTION=1
      - SSL_TYPE=letsencrypt
      - ENABLE_SPAMASSASSIN=1
      - SA_SPAM_SUBJECT=[SPAM]
      - POSTFIX_MESSAGE_SIZE_LIMIT=51200000
      - OVERRIDE_HOSTNAME=mail.mydomain.com
    cap_add:
      - NET_ADMIN
      - SYS_PTRACE
    dns:
      - "1.1.1.1"
      - "1.0.0.1"
    network_mode: "none"
    privileged: true
volumes:
  maildata:
    driver: local
  mailstate:
    driver: local
  maillogs:
    driver: local

I use network_mode: "none" because I set up a dedicated interface in the container's network namespace, but I don't think this part is relevant to the issue at hand.

@erik-wramner
Contributor

Not 100% sure, but I think you need privileged for fail2ban. That makes sense, as it has to modify the firewall rules, and those apply to the host as well. Doing that is certainly a root operation.

If you like, you can temporarily try without fail2ban and without network_mode to see whether you still need privileged.

@oblitum
Author

oblitum commented Mar 2, 2020

I've disabled fail2ban, but it has no effect. As the single stack trace in my original report shows, the failure happens when supervisord calls resource.getrlimit for RLIMIT_NOFILE. This happens early on, so I think it's unrelated to fail2ban and is a more specific problem with permission for the getrlimit syscall.
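To narrow down whether the EPERM is specific to RLIMIT_NOFILE or affects every limit, a small probe can call getrlimit for each RLIMIT_* constant the resource module exposes. A minimal sketch with a hypothetical function name; run it with the same Python version the container uses:

```python
import resource


def probe_rlimits():
    """Call getrlimit() for every RLIMIT_* constant in the resource module.

    Returns {name: (soft, hard)} for successful reads and
    {name: "error: ..."} for failed ones.
    """
    results = {}
    for name in sorted(dir(resource)):
        if not name.startswith("RLIMIT_"):
            continue
        try:
            results[name] = resource.getrlimit(getattr(resource, name))
        except (resource.error, ValueError) as exc:
            results[name] = "error: %s" % (exc,)
    return results


if __name__ == "__main__":
    for name, value in sorted(probe_rlimits().items()):
        print("%-18s %s" % (name, value))
```

If only RLIMIT_NOFILE reports an error, the problem is narrower than a blanket syscall restriction.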

FWIW, I've disabled fail2ban for the moment, as I don't think it makes sense to run it bundled in the container while the server is reached through a VPN with port forwarding.

@erik-wramner
Contributor

See TelegramMessenger/MTProxy#7. They have two better solutions than using privileged:

  • Pass --ulimit nofile=98304:98304 (or use some other numbers) to docker
  • Use --cap-add SYS_RESOURCE
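For context on the two suggestions: per setrlimit(2), an unprivileged process may move its own soft limit anywhere up to the hard limit, but raising the hard limit requires CAP_SYS_RESOURCE, while --ulimit nofile=SOFT:HARD sets both values when the container starts. A minimal sketch of the unprivileged path, using a hypothetical helper name:

```python
import resource


def raise_nofile_soft_limit(target):
    """Raise the RLIMIT_NOFILE soft limit toward `target`, capped at the hard limit.

    Moving the soft limit up to the hard limit needs no capability; going past
    the hard limit would need CAP_SYS_RESOURCE, so this helper never tries.
    Returns the (soft, hard) pair in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard == resource.RLIM_INFINITY:
        new_soft = target
    else:
        new_soft = min(target, hard)
    # Only ever raise a finite soft limit; never lower it here.
    if soft != resource.RLIM_INFINITY and soft < new_soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)
```

Note that on the affected host the very first getrlimit call is the one that fails, so neither suggestion can help if the read itself is blocked.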

Not sure why this hits you specifically; I haven't seen it. It may be related to Arch Linux or to your configuration (kernel settings). Anyway, test the changes above and see whether either one helps!

@oblitum
Author

oblitum commented Mar 6, 2020

@erik-wramner thanks for the support. Unfortunately, I had already tried SYS_RESOURCE, as reported initially, and it didn't work; I just tried it again while also setting ulimits, still no luck. Note, though, that I didn't really expect it to work, because these capabilities seem relevant to modifying limits, whereas I hit a wall even before that, on the read with getrlimit.
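The traceback shows supervisord's set_rlimits calling resource.getrlimit before it could change anything. The sketch below mirrors that read-then-raise order in a hypothetical helper (it is not supervisord's actual code): if the first call fails, no capability that governs setrlimit ever comes into play.

```python
import resource


def ensure_min_limit(res, minimum):
    """Read a limit, then raise its soft value to `minimum` if it is lower.

    Returns a short status string describing which step succeeded or failed.
    """
    try:
        soft, hard = resource.getrlimit(res)  # step 1: read (fails on the RPi host)
    except resource.error as exc:
        return "read failed: %s" % (exc,)
    if soft == resource.RLIM_INFINITY or soft >= minimum:
        return "ok: no change needed"
    if hard != resource.RLIM_INFINITY and hard < minimum:
        hard = minimum  # raising the hard limit needs CAP_SYS_RESOURCE
    try:
        resource.setrlimit(res, (minimum, hard))  # step 2: write
    except (resource.error, ValueError) as exc:
        return "raise failed: %s" % (exc,)
    return "raised soft limit to %s" % (minimum,)
```

For example, ensure_min_limit(resource.RLIMIT_NOFILE, 1024) would return "read failed: ..." on the affected host, regardless of cap_add.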

@erik-wramner
Contributor

Right. I don't know what network_mode=none does, but apart from that your setup seems fine, except that you are running on ARM. That is a bit exotic, so perhaps you have hit a weak spot there. Since I can't reproduce this, how about asking the people involved in #1092?

@oblitum
Author

oblitum commented Mar 6, 2020

I won't deny that running it on Arch Linux on 64-bit ARMv8 is quite exotic :) I may be the only one running this setup, as that thread is mostly about ARMv7 on standard Raspbian (Debian for the RPi). I'll ask there once I've dug into this a bit more. I'll try calling getrlimit in a standalone Python Docker container to see whether I can reproduce the problem reduced to a single syscall from Python, and do the same on x86_64.
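A standalone check along those lines could print the interpreter version and architecture next to the getrlimit result, so runs on the two hosts are easy to compare. A minimal sketch with a hypothetical helper name; note that inside a container platform.release() reports the host's kernel:

```python
import platform
import resource


def nofile_report():
    """Collect interpreter, architecture, kernel and the RLIMIT_NOFILE read result."""
    try:
        result = resource.getrlimit(resource.RLIMIT_NOFILE)
    except resource.error as exc:
        result = "error: %s" % (exc,)
    return {
        "python": platform.python_version(),
        "machine": platform.machine(),
        "kernel": platform.release(),
        "getrlimit(RLIMIT_NOFILE)": result,
    }


if __name__ == "__main__":
    for key, value in sorted(nofile_report().items()):
        print("%s: %s" % (key, value))
```

The same file can be run in both a Python 2.7 and a Python 3 image on each host, which separates "old interpreter" from "this architecture" as the variable.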

@oblitum
Author

oblitum commented Mar 6, 2020

OK, this is the test bench:

  • Python 2.7.13 (as in docker-mailserver/supervisord), Debian Stretch, Arch Linux x86_64 as host:
    ❯ sudo docker run -it --rm python:2.7.13-stretch python -c 'import resource; print(resource.getrlimit(resource.RLIMIT_NOFILE))'
    (1048576, 1048576)
    
  • Python 2.7.13 (as in docker-mailserver/supervisord), Debian Stretch, Arch Linux aarch64 (RPi) as host:
    ❯ sudo docker run -it --rm python:2.7.13-stretch python -c 'import resource; print(resource.getrlimit(resource.RLIMIT_NOFILE))'
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
    resource.error: (1, 'Operation not permitted')
    
  • Python 3.8.2 (newer than docker-mailserver/supervisord), Debian Buster, Arch Linux aarch64 (RPi) as host:
    ❯ sudo docker run -it --rm python:3 python -c 'import resource; print(resource.getrlimit(resource.RLIMIT_NOFILE))'
    (1048576, 1048576)
    

This gives me a strong suspicion (like 90%) that this is a bug in the old Python version in this environment, similar to this.

The three main reasons to think this is the cause are:

  • There's a permission error on read access that no capability flag can fix, not even the capabilities documented for RLIMIT access.
  • Newer Python versions making the same library call do not show any such permission error.
  • There are prior reports of the same RLIMIT syscall failing on read access on macOS.

I think the solution here is for docker-mailserver to eventually upgrade these dependencies.

@erik-wramner
Contributor

Have you checked the next branch? I guess you are building from source anyway? The next branch works but needs testing, and it uses Debian Buster. Perhaps it solves your problem?

Unfortunately I can't manage our Docker builds, so I'm stuck with latest and stable; otherwise there would be an official next image as well. There is one in my own Docker repo.

@oblitum
Author

oblitum commented Mar 8, 2020

I have not, but I want to. I'm almost certain it will avoid this issue (unless it still uses the same Python version). The only thing holding me back is whether I'd lose any of the features I rely on in the current version. Can you tell whether anything would be missing given my configuration above? Some tool that won't work?

@erik-wramner
Contributor

The only thing I know to be absent in the new version is filebeat. That should be handled by an external container (simply mount the same log folder in both). The ripole program is unfortunately gone as well, which means amavis can't handle ancient OLE2 .doc files (unless there is another mechanism for that; I'm not sure).

@oblitum
Author

oblitum commented Mar 15, 2020

Okay. I already build filebeat separately in the current version. I will try the next branch.

@oblitum
Author

oblitum commented Mar 15, 2020

Is it worth running filebeat in a separate Docker container instead of building it and running it in the same one? It was working fine for me in the same container, built from source.

@oblitum
Author

oblitum commented Mar 15, 2020

Given that the next branch is not yet stable, I'd like to suggest rewriting history to make the filebeat removal a separate commit, so that anyone who wants to keep filebeat in the same container can easily revert that change. Unfortunately, that change is currently mixed with the Buster upgrade in a single commit.

@erik-wramner
Contributor

I'd rather focus on getting the next release stable and getting it out. Or getting it out and then getting it stable (the latest branch is assumed to be less stable than the stable branch after all).

@oblitum
Author

oblitum commented Mar 15, 2020

My note was less about stability and more about helping users, in case you were open to rewriting history.

@oblitum
Author

oblitum commented Mar 15, 2020

Anyway, I've cherry-picked the Buster changes, customized, into a single commit. They're ready, but I'll build the image next week, when I'll have faster broadband; Docker builds are painful on a medium-speed connection.

@oblitum
Author

oblitum commented Mar 16, 2020

I have built my AArch64 Buster image and can confirm that this is indeed fixed there.

Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants