Segfault in node:10-alpine(3.10) when running in AWS and unprivileged only #1158

Open
tomelliff opened this issue Nov 21, 2019 · 11 comments


tomelliff commented Nov 21, 2019

I'm slightly confused by what's happening here, but we have a Docker image that was built with node:10-alpine, and when it was rebuilt after that tag moved from Alpine 3.9 to 3.10, the container entered a crash loop with a segfault during application startup.

Rolling back to the previous image, or rolling forward with the base image pinned to node:10-alpine3.9, makes this go away. More weirdly, I can't reproduce this on non-AWS instances but can reliably reproduce it on multiple AWS instances. I also noticed that the container works fine when run with --privileged.
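
Roughly what I'm comparing (a sketch; the node one-liner just stands in for our real entrypoint and may not trigger the crash on its own):

docker run --rm node:10-alpine3.10 node -e "console.log('hi')"                # segfaults on the affected AWS instances (with our real image)
docker run --rm --privileged node:10-alpine3.10 node -e "console.log('hi')"   # fine
docker run --rm node:10-alpine3.9 node -e "console.log('hi')"                 # fine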

Looking at the core dump from the non-debug build, it looks like an issue in musl, but without debug symbols I don't yet know what's triggering it:

#0  0x00007fe1375ee07e in ?? () from /lib/ld-musl-x86_64.so.1
#1  0x00007fe1375eb4b6 in ?? () from /lib/ld-musl-x86_64.so.1
#2  0x00007fe134c22b64 in ?? ()
#3  0x0000000000000000 in ?? ()
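
For reference, here's roughly how I'd try to get symbols into that trace (a sketch, assuming Alpine's gdb and musl-dbg packages and a placeholder /app/index.js entrypoint):

docker run -it --cap-add=SYS_PTRACE node:10-alpine3.10 sh
# inside the container: install gdb plus the debug symbols for ld-musl
apk add --no-cache gdb musl-dbg
# reproduce the crash under gdb and print a symbolised backtrace
gdb -ex run -ex bt --args node /app/index.js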

I'm also very confused about why it doesn't segfault like this when it's run outside of AWS, or when the container is run as privileged.

Any ideas on how I can debug this further?

@kennethaasan

I'm seeing a similar issue. Switching to node:10-alpine3.9 seemed to work.


schlippi commented Nov 21, 2019

Same issue and same mitigation as the OP, running on AWS kops 1.14 instances. Our other clusters with lower kops versions are not affected, so it seems to be something tied to the OS.

@olemarkus

@schlippi I'm curious about the kops connection there. Does kops 1.14 with k8s 1.13 also fail?


evisong commented Nov 22, 2019

Me too... I'm using node:10.17.0-alpine, which was based on Alpine 3.9 for weeks.

But it seems this commit introduced a breaking change:
c6bc44e#diff-b24491fb48497b165ae0f777c31da853

Since then, the tag 10.17.0-alpine has become an alias of 10.17.0-alpine3.10, and the old one was renamed 10.17.0-alpine3.9.

I guess this segfault in node:10-alpine is hitting a lot of people on AWS this week. Two options to fix:

@rreinurm

I would add that it's not happening only in AWS. The same issue started to appear in my on-premise environment, especially with services using the node-rdkafka library. Changing the services to use node:10-alpine3.9 fixed the issue.
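
In each service that was basically a one-line Dockerfile change plus a rebuild, something like this (a sketch; the image name is a placeholder):

# change the base image from the floating alias to the Alpine 3.9 variant,
# i.e. FROM node:10-alpine  ->  FROM node:10-alpine3.9
sed -i 's|^FROM node:10-alpine$|FROM node:10-alpine3.9|' Dockerfile
docker build -t my-service .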


schlippi commented Nov 25, 2019

@schlippi I'm curious about the kops connection there. Does kops 1.14 with k8s 1.13 also fail?

@olemarkus
Apparently the issue is related to the OS / node instances. We didn't have the problem with kops 1.10 running on m4 instances, but got it after upgrading to kops 1.11 with m5 instances, using the same Docker images.
Oh, forgot to mention that it also affects node 8.


prattmic commented Nov 27, 2019

re #1158 (comment):

The stack size issue crash from #813 (comment) is fixed in Node in nodejs/node@5539d7e (I ran a bisect), which is included in v13, but not v10. A related libuv fix (libuv/libuv@d89bd15) is also included in Node v13.

That crash reproduces on both node:10.17-alpine3.9 and node:10.17-alpine3.10 (i.e., node:10-alpine). So, while I agree that this seems to be a stack size issue, it doesn't seem to be quite identical, since it existed on the old version as well.


Tahvok commented Dec 11, 2019

We have encountered the same issue when using the Skylake CPU platform on GCP while trying to run the grpc package.


Lorenz-N commented Jan 20, 2020

Had the exact same issue, but not related to AWS. We use Docker mainly for offline installations, and it started to happen randomly on just a few machines. Build & deploy via GitLab CI (using the latest alpine) was always fine, so it was a bit hard to track the issue down. The health check error via docker inspect at least gave a small hint ... but it wasn't that meaningful either:

OCI runtime exec failed: exec failed: container_linux.go:346: starting container process caused \"process_linux.go:101: executing setns process caused \\\"exit status 1\\\"\": unknown

cannot exec in a stopped state: unknown
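
That output came from something like the following (the container name is a placeholder):

# print the most recent health-check results for a container
docker inspect --format '{{json .State.Health}}' my-service-container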

As mentioned already reverting to 3.9 works for now.

EDIT:
Not sure if it helps ... I've seen very similar errors (with just slightly different error numbers) related to the kernel and runc in the past, but mainly on CentOS. However, I assume it's hardware or CPU related, as it only happens on every second machine ... I compared different setups with clean Win 10 Pro + Docker installations.

Also, I've noticed strange CPU peaks (exceeding 100%) when running docker stats during this sort of restart loop:
[screenshot: CPU peak during the restart loop]

@Nazgolze

FWIW, I ran into this recently with node:10-alpine 3.10 and 3.11.
3.9 does not have the issue.

I can get node:10-alpine 3.10 and 3.11 to work if I copy Docker's default seccomp profile, add membarrier to the whitelisted syscalls, and use that profile instead.
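
Roughly like this (a sketch; the profile URL and the jq edit are assumptions, adapt them to the profile that ships with your Docker version):

# grab Docker's default seccomp profile
curl -fsSL -o seccomp.json https://raw.githubusercontent.com/moby/moby/master/profiles/seccomp/default.json
# add membarrier to the main allow list (assumes the first .syscalls entry is the big SCMP_ACT_ALLOW list)
jq '.syscalls[0].names += ["membarrier"]' seccomp.json > seccomp-membarrier.json
# run with the modified profile instead of the built-in default
docker run --security-opt seccomp=seccomp-membarrier.json node:10-alpine node -e 'console.log("ok")'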


dio commented Mar 14, 2020

@tomelliff could you share the specs of the AWS VM, the Docker version, etc.? I haven't been able to reproduce this. Thank you!
