Segfault in node:10-alpine(3.10) when running in AWS and unprivileged only #1158

Open
tomelliff opened this issue Nov 21, 2019 · 11 comments


tomelliff commented Nov 21, 2019

I'm slightly confused by what's happening here, but we have a Docker image that was built with node:10-alpine, and when it was rebuilt after that tag moved from Alpine 3.9 to 3.10, the container entered a crash loop with a segfault during application startup.

Rolling back to the previous image, or rolling forward with the base image pinned to node:10-alpine3.9, makes this go away. More weirdly, I can't reproduce this on non-AWS instances but can reliably reproduce it on multiple AWS instances. I also noticed that the container works fine when run with --privileged.
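
Roughly what I'm comparing (a sketch; the node one-liner just stands in for our real entrypoint and may not trigger the crash on its own):

docker run --rm node:10-alpine3.10 node -e "console.log('hi')"                # segfaults on the affected AWS instances (with our real image)
docker run --rm --privileged node:10-alpine3.10 node -e "console.log('hi')"   # fine
docker run --rm node:10-alpine3.9 node -e "console.log('hi')"                 # fine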

Looking at the core dump from the non-debug build, it looks like an issue in musl, but without debug symbols I don't yet know what's triggering it:

#0  0x00007fe1375ee07e in ?? () from /lib/ld-musl-x86_64.so.1
#1  0x00007fe1375eb4b6 in ?? () from /lib/ld-musl-x86_64.so.1
#2  0x00007fe134c22b64 in ?? ()
#3  0x0000000000000000 in ?? ()
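
For reference, here's roughly how I'd try to get symbols into that trace (a sketch, assuming Alpine's gdb and musl-dbg packages and a placeholder /app/index.js entrypoint):

docker run -it --cap-add=SYS_PTRACE node:10-alpine3.10 sh
# inside the container: install gdb plus the debug symbols for ld-musl
apk add --no-cache gdb musl-dbg
# reproduce the crash under gdb and print a symbolised backtrace
gdb -ex run -ex bt --args node /app/index.js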

I'm also very confused about why it doesn't segfault like this when it's run outside of AWS, or when the container is run as privileged.

Any ideas on how I can debug this further?

@kennethaasan

I'm seeing a similar issue. Switching to node:10-alpine3.9 seemed to work.


schlippi commented Nov 21, 2019

Same issue and same mitigation as the OP, running on AWS kops 1.14 instances. Our other clusters with lower kops versions are not affected, so it seems to be something tied to the OS.

@olemarkus

@schlippi I'm curious about the kops connection there. Does kops 1.14 with k8s 1.13 also fail?


evisong commented Nov 22, 2019

Me too... I'm using node:10.17.0-alpine, which was based on Alpine 3.9 for weeks.

But it seems this commit introduced a breaking change:
c6bc44e#diff-b24491fb48497b165ae0f777c31da853

Since then, the tag 10.17.0-alpine has become an alias of 10.17.0-alpine3.10, and the old one was renamed 10.17.0-alpine3.9.

I guess this segfault in node:10-alpine is hitting a lot of people on AWS this week. Two options to fix:

@rreinurm

I would add that it's not happening only in AWS. The same issue started to appear in my on-premise environment, especially with services using the node-rdkafka library. Changing the services to use node:10-alpine3.9 fixed the issue.
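
In each service that was basically a one-line Dockerfile change plus a rebuild, something like this (a sketch; the image name is a placeholder):

# change the base image from the floating alias to the Alpine 3.9 variant,
# i.e. FROM node:10-alpine  ->  FROM node:10-alpine3.9
sed -i 's|^FROM node:10-alpine$|FROM node:10-alpine3.9|' Dockerfile
docker build -t my-service .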


schlippi commented Nov 25, 2019

@schlippi I'm curious about the kops connection there. Does kops 1.14 with k8s 1.13 also fail?

@olemarkus
Apparently the issue is related to the OS / node instances. We didn't have the problem with kops 1.10 running on m4 instances, but got it after upgrading to kops 1.11 with m5 instances, using the same Docker images.
Oh, forgot to mention that it also affects node 8.


prattmic commented Nov 27, 2019

re #1158 (comment):

The stack size issue crash from #813 (comment) is fixed in Node in nodejs/node@5539d7e (I ran a bisect), which is included in v13, but not v10. A related libuv fix (libuv/libuv@d89bd15) is also included in Node v13.

That crash reproduces on both node:10.17-alpine3.9 and node:10.17-alpine3.10 (i.e., node:10-alpine). So, while I agree that this seems to be a stack size issue, it doesn't seem to be quite identical, since it existed on the old version as well.


Tahvok commented Dec 11, 2019

We have encountered the same issue when using the Skylake CPU platform on GCP while trying to run the grpc package.


Lorenz-N commented Jan 20, 2020

Had the exact same issue, but not related to AWS. We use Docker mainly for offline installations, and it started to happen randomly on just a few machines. Build & deploy via GitLab CI (using the latest alpine) was always fine, so it was a bit hard to track the issue down. The health check error via docker inspect at least gave a small hint ... but it wasn't that meaningful either:

OCI runtime exec failed: exec failed: container_linux.go:346: starting container process caused \"process_linux.go:101: executing setns process caused \\\"exit status 1\\\"\": unknown

cannot exec in a stopped state: unknown
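
That output came from something like the following (the container name is a placeholder):

# print the most recent health-check results for a container
docker inspect --format '{{json .State.Health}}' my-service-container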

As mentioned already reverting to 3.9 works for now.

EDIT:
Not sure if it helps ... I've seen very similar errors (with just slightly different error numbers) related to the kernel and runc in the past, but mainly on CentOS. However, I assume it's hardware or CPU related, as it only happens on every second machine ... I compared different setups with clean Win 10 Pro + Docker installations.

Also, I've noticed strange CPU peaks (exceeding 100%) when running docker stats during this sort of restart loop:
[screenshot: CPU peak during the restart loop]

@Nazgolze

FWIW, I ran into this recently with node:10-alpine 3.10 and 3.11.
3.9 does not have the issue.

I can get node:10-alpine 3.10 and 3.11 to work if I copy Docker's default seccomp profile, add membarrier to the whitelisted syscalls, and use that profile instead.
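
Roughly like this (a sketch; the profile URL and the jq edit are assumptions, adapt them to the profile that ships with your Docker version):

# grab Docker's default seccomp profile
curl -fsSL -o seccomp.json https://raw.githubusercontent.com/moby/moby/master/profiles/seccomp/default.json
# add membarrier to the main allow list (assumes the first .syscalls entry is the big SCMP_ACT_ALLOW list)
jq '.syscalls[0].names += ["membarrier"]' seccomp.json > seccomp-membarrier.json
# run with the modified profile instead of the built-in default
docker run --security-opt seccomp=seccomp-membarrier.json node:10-alpine node -e 'console.log("ok")'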


dio commented Mar 14, 2020

@tomelliff could you share the specs of the AWS VM, the Docker version, etc.? I haven't been able to reproduce this. Thank you!
