
failed to receive status: rpc error: code = Unavailable desc = error reading from server: EOF #4305

Closed
eoshea-cmt opened this issue Oct 3, 2023 · 5 comments · Fixed by #4346

Comments

@eoshea-cmt

Description

During docker (compose) builds, we occasionally see this error in our CI:
failed to receive status: rpc error: code = Unavailable desc = error reading from server: EOF

This can happen at various stages in docker builds, including:

  • importing cache manifest from ...
  • load build context
  • RUN pip install --upgrade pip

We used our instance monitoring to investigate whether there was any correlation with resource usage. We looked at network, memory, and CPU utilization, and none of these spiked in correlation with these errors.

This error can kill multiple builds running in parallel on our CI nodes, but it also happens to single builds.

Expected behaviour

docker compose builds progress to completion

Actual behaviour

docker compose builds fail

Buildx version

github.com/docker/buildx v0.11.2 9872040

Docker info

+ docker system info
Client: Docker Engine - Community
 Version:    24.0.6
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.11.2
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.21.0
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 16
  Running: 16
  Paused: 0
  Stopped: 0
 Images: 16
 Server Version: 24.0.6
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc io.containerd.runc.v2
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 8165feabfdfe38c65b599c4993d227328c231fca
 runc version: v1.1.8-0-g82f18fe
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
 Kernel Version: 5.15.0-1044-aws
 Operating System: Ubuntu 20.04.6 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 30.67GiB
 Name: ip-10-10-15-71
 ID: 8d7a5a77-4225-4887-a2c3-419a6c5ab76e
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Username: cmtlouis
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
 Default Address Pools:
   Base: 172.17.0.0/12, Size: 20
   Base: 192.168.0.0/16, Size: 24

Builders list

+ docker buildx ls
NAME/NODE DRIVER/ENDPOINT STATUS  BUILDKIT             PLATFORMS
default * docker                                       
  default default         running v0.11.6+616c3f613b54 linux/amd64, linux/amd64/v2, linux/amd64/v3, linux/386

Configuration

We are not able to consistently reproduce this issue. We are building multiple images with multiple stages using docker-compose, which may be relevant.

We also run multiple jobs on the same instances in our CI, so multiple docker compose builds happen in parallel at times. Furthermore, this error can hit multiple docker compose builds running in parallel on the same node at the same time.

Build logs

dockerlog.txt

Additional info

I previously created this ticket for buildx, before the conversation pointed to this being a buildkit issue.

It seems like it could be a similar (but different) error to microsoft/vscode-remote-release#7958 or #4157.

I'm wondering if it is some other race condition that only happens occasionally.
It does not seem correlated to resource usage.

@eoshea-cmt
Author

The error in the logs above is:
fatal error: concurrent map iteration and map write
This looks like code that is not thread-safe, similar to #4157.
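
For reference, here is a minimal Go sketch (not BuildKit code) of the pattern that produces this fatal error: one goroutine iterating a map while another writes to it without synchronization. Running it aborts almost immediately with the same runtime error.

package main

// Minimal sketch, not BuildKit code: an unsynchronized writer plus an
// unsynchronized iterator on the same map makes the Go runtime abort with
// "fatal error: concurrent map iteration and map write".
func main() {
	m := map[int]int{0: 0}

	go func() {
		for i := 1; ; i++ {
			m[i] = i // writer, no lock
		}
	}()

	for {
		for range m { // iterator, no lock
		}
	}
}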

@jedevc
Member

jedevc commented Oct 13, 2023

Is there an additional stack trace in the logs, @eoshea-cmt? Without that, there's not enough information to work out whether this is a duplicate of #4157 or a separate issue.

@eoshea-cmt
Author

@jedevc I shared the stack trace in the "Build logs" section of my issue, see here:

dockerlog.txt

@jedevc
Member

jedevc commented Oct 16, 2023

cc @tonistiigi

fatal error: concurrent map iteration and map write
goroutine 674535 [running]:
github.com/moby/buildkit/solver.(*exporter).ExportTo(0xc0048946c0, {0x55d234548fa0, 0xc007576630}, {0x7fe565d37fe8?, 0xc002479380}, {0xc001d3ec30, 0x2, {0x55d234527500, 0xc002f96a98}, 0xc004158260, ...})
        /root/build-deb/engine/vendor/github.com/moby/buildkit/solver/exporter.go:201 +0xa07
github.com/moby/buildkit/solver/llbsolver.inlineCache({0x55d234548fa0, 0xc007576270}, {0x55d23454f518?, 0xc002479380}, {0x7fe56c08f4b0, 0xc0021a3650}, {{0x55d234557840, 0x55d23581aef0}, 0x0, 0x0}, ...)
        /root/build-deb/engine/vendor/github.com/moby/buildkit/solver/llbsolver/solver.go:847 +0x585
github.com/moby/buildkit/solver/llbsolver.runInlineCacheExporter.func1({0x55d234548fa0, 0xc007576270}, {0xc000a442a0?, 0x7fe565d67100?})
        /root/build-deb/engine/vendor/github.com/moby/buildkit/solver/llbsolver/solver.go:645 +0x1c7
github.com/moby/buildkit/solver/llbsolver.inBuilderContext.func1({0x55d234548fa0, 0xc0075761b0}, {0x55d234527500, 0xc002f96a38})
        /root/build-deb/engine/vendor/github.com/moby/buildkit/solver/llbsolver/solver.go:922 +0x1d8
github.com/moby/buildkit/solver.(*Job).InContext(0xc002dca3c0, {0x55d234548fa0, 0xc00e84ca20}, 0xc002f96a20)
        /root/build-deb/engine/vendor/github.com/moby/buildkit/solver/jobs.go:610 +0x136
github.com/moby/buildkit/solver/llbsolver.inBuilderContext({0x55d234548fa0, 0xc00e84ca20}, {0x55d234540a40, 0xc002dca3c0}, {0x55d233b1a51a, 0x21}, {0xc00319a390, 0x26}, 0xc00f5ebd80)
        /root/build-deb/engine/vendor/github.com/moby/buildkit/solver/llbsolver/solver.go:918 +0x1b3
github.com/moby/buildkit/solver/llbsolver.runInlineCacheExporter({0x55d234548fa0, 0xc00e84ca20}, {0x55d23453d980?, 0xc00e84cf60}, 0xc00d839ba0, 0xc002dca3c0, 0xc0021a3e90)
        /root/build-deb/engine/vendor/github.com/moby/buildkit/solver/llbsolver/solver.go:643 +0x1cb
github.com/moby/buildkit/solver/llbsolver.(*Solver).Solve(0xc0005caa80, {0x55d234548fa0, 0xc00e84ca20}, {0xc00b4ed420, _}, {_, _}, {0x0, 0x0, {0x0, ...}, ...}, ...)
        /root/build-deb/engine/vendor/github.com/moby/buildkit/solver/llbsolver/solver.go:542 +0x16b3
github.com/moby/buildkit/control.(*Controller).Solve(0xc00053c9a0, {0x55d234548fa0, 0xc00e84ca20}, 0xc009615040)
        /root/build-deb/engine/vendor/github.com/moby/buildkit/control/control.go:433 +0x12c8
github.com/moby/buildkit/api/services/control._Control_Solve_Handler.func1({0x55d234548fa0, 0xc00e84ca20}, {0x55d2344b8260?, 0xc009615040})
        /root/build-deb/engine/vendor/github.com/moby/buildkit/api/services/control/control.pb.go:2438 +0x78
github.com/moby/buildkit/util/grpcerrors.UnaryServerInterceptor({0x55d234548fa0?, 0xc00e84ca20?}, {0x55d2344b8260?, 0xc009615040?}, 0xc0077aba50?, 0x55d23416c160?)
        /root/build-deb/engine/vendor/github.com/moby/buildkit/util/grpcerrors/intercept.go:14 +0x3d
github.com/moby/buildkit/api/services/control._Control_Solve_Handler({0x55d234436940?, 0xc00053c9a0}, {0x55d234548fa0, 0xc00e84ca20}, 0xc00da83e30, 0x55d234514a80)
        /root/build-deb/engine/vendor/github.com/moby/buildkit/api/services/control/control.pb.go:2440 +0x138
google.golang.org/grpc.(*Server).processUnaryRPC(0xc000d30d20, {0x55d234558c60, 0xc001942fc0}, 0xc010833440, 0xc0009b6ab0, 0x55d2357325d8, 0x0)
        /root/build-deb/engine/vendor/google.golang.org/grpc/server.go:1340 +0xd33
google.golang.org/grpc.(*Server).handleStream(0xc000d30d20, {0x55d234558c60, 0xc001942fc0}, 0xc010833440, 0x0)
        /root/build-deb/engine/vendor/google.golang.org/grpc/server.go:1713 +0xa36
google.golang.org/grpc.(*Server).serveStreams.func1.2()
        /root/build-deb/engine/vendor/google.golang.org/grpc/server.go:965 +0x98
created by google.golang.org/grpc.(*Server).serveStreams.func1
        /root/build-deb/engine/vendor/google.golang.org/grpc/server.go:963 +0x28a

It looks like ids should be protected by a mutex, but it only is in some scenarios. Do you know if the lack of protection in some cases is intentional or just an oversight? Seems similar to #2041 and #2458.
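
For illustration only (idSet, add, and snapshot are hypothetical names, not the actual BuildKit types), the usual way to avoid this class of crash is to guard every access to the shared map, including iteration, with the same mutex:

package main

import (
	"fmt"
	"sync"
)

// Sketch only, not BuildKit code: all reads, writes, and iteration of the
// shared map go through the same mutex, so no path touches it unprotected.
type idSet struct {
	mu  sync.Mutex
	ids map[string]struct{}
}

func (s *idSet) add(id string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.ids[id] = struct{}{}
}

func (s *idSet) snapshot() []string {
	s.mu.Lock()
	defer s.mu.Unlock()
	out := make([]string, 0, len(s.ids))
	for id := range s.ids { // iteration happens under the lock
		out = append(out, id)
	}
	return out
}

func main() {
	s := &idSet{ids: map[string]struct{}{}}
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			s.add(fmt.Sprintf("id-%d", n)) // concurrent writers
			_ = s.snapshot()               // concurrent iteration, safe under the lock
		}(i)
	}
	wg.Wait()
	fmt.Println(len(s.snapshot()), "ids")
}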

@jedevc changed the title from "failed to receive status: rpc error: code = Unavailable desc = error reading from server: EOF #2064" to "failed to receive status: rpc error: code = Unavailable desc = error reading from server: EOF" on Oct 16, 2023
@eoshea-cmt
Author

eoshea-cmt commented Oct 23, 2023

@jedevc is there someone I can reach out to about getting the version of buildkit with this fix included in docker-compose?
Similarly, I noticed the docker-compose-plugin apt package is behind the latest docker-compose.
Any insight into who to reach out to about this, and how, would be greatly appreciated.
