
Concurrent map write on cache export #2041
Open · griffous opened this issue Mar 26, 2021 · 23 comments

@griffous commented Mar 26, 2021

Overview
I am seeing consistent crashes when building many large stages in parallel with buildkit and docker buildx bake. The crashes seem to happen when buildkit attempts many image exports in parallel.

Both --load and --push outputs result in crashes, using either the built-in or the docker-container-based buildkit 0.8.2 instances.

When buildkit dies, the progress output keeps incrementing time, but nothing is actually happening. Typically I observe an export of some kind running; in the instances where I don't, I think the UI simply didn't refresh the active threads before buildkit died.

 => [<stage name here>] exporting to image
 => => pushing layers

or

 => CACHED [<top stage> <substage> 4/6] COPY 

Environment
Software: Ubuntu 20.10, Docker 20.10.5, docker buildx bake.
Hardware: 8 cores, 64GB RAM, NVMe storage.
Build info: Largish (~1200 line) multistage Dockerfile. There are around 30 stages, 11 of which are end stages.
Most of the images are quite large, in the 3-10GB range (Oracle Database and Weblogic images).
The build uses these features: inline cache export, cache-from, --mount=type=secret, and --mount=type=cache.

buildkit image (tags) which I can crash with our bake/build:

  • :v0.8.2
  • :nightly
  • :buildx-stable-1
  • :cni-v0.8.2 (this is v0.8.2 with the CNI binaries, running in CNI networking mode)

A typical bake command looks like this:

time docker buildx bake --set *.labels.com.tiatechnology.authors=xxxx --set *.labels.com.tiatechnology.buildhost=jeeves --set *.labels.com.tiatechnology.buildurl=local --set *.labels.com.tiatechnology.created=2021-03-26 13:50:34 CET --set *.labels.com.tiatechnology.dockerversions=Client/Server versions: 20.10.5/20.10.5 --set *.labels.com.tiatechnology.revision=d498421126227c7628ea99cd0c0492415989825b --set *.labels.com.tiatechnology.source=https://git.xxxxx --set *.labels.com.tiatechnology.version=monodockerfile --builder buildkit-v0.8.2 --progress=auto --set *.cache-to=type=inline --set *.secrets=id=token,src=.token --push weblogic-11g-base weblogic-12c-base weblogic-10.3.6 weblogic-12.2.1.1 weblogic-12.2.1.3 oracledb-12.2-base oracledb-19.3-base

We're using an internal docker registry, and our images rely on internal build artifacts, so the build can't be shared as a repro. If you can isolate the internal conditions which cause the crash from the logs, I might be able to construct a more generic test case - at this stage I don't understand the internal conditions leading to the crash.

Observations
When there are many targets which need building (cache misses), buildkit has fewer concurrent exports to process, which tends to result in a more stable run for the same set/count of targets.
When all layers are cached, however, 6 or more bake targets tend to crash buildkit within a few seconds of the bake & container starting.
There isn't a magic target number which triggers crashes. A run of 5 targets might crash on the first try, but after a pause and retry, it will run successfully the second time. The more targets there are, the higher the chance of a crash.

Logs
The attachment
3dba0d7f53934d47fe23b56d05ddf2d01362610d9c580e9cbedd9dc788a8f05c-json.log
is the container log for the builder created and crashed using these commands:

# Clean-up
docker buildx rm
# Builder creation
export BUILDKIT_BUILDER_NAME="buildkit-v0.8.2"
docker buildx create --name "${BUILDKIT_BUILDER_NAME}" --driver-opt image=moby/buildkit:v0.8.2 --use
# Build
docker buildx bake --set *.labels.com.tiatechnology.authors=REDACTED --set *.labels.com.tiatechnology.buildhost=jeeves --set *.labels.com.tiatechnology.buildurl=local --set *.labels.com.tiatechnology.created=2021-03-26 14:38:52 CET --set *.labels.com.tiatechnology.dockerversions=Client/Server versions: 20.10.5/20.10.5 --set *.labels.com.tiatechnology.revision=14e1287e63980f9e9fbbffb83ec5598e0e49e77d --set *.labels.com.tiatechnology.source=https://git.REDACTED --set *.labels.com.tiatechnology.version=monodockerfile --builder v0.8.2 --progress=auto --set *.cache-to=type=inline --set *.secrets=id=token,src=.token --push weblogic-11g-base weblogic-12c-base weblogic-10.3.6 weblogic-12.2.1.1 weblogic-12.2.1.3 oracledb-base oracledb-12.2-base oracledb-12.2-install oracledb-12.2-install-tiadb oracledb-12.2-install-wlsdb oracledb-12.2-tiadb oracledb-12.2-wlsdb oracledb-19.3-base oracledb-19.3-install oracledb-19.3-install-tiadb oracledb-19.3-tiadb

The second attachment ea41283f5920e3f10471b160ceed8b684eea37ec3351936dc50e74b393772b6b-json.log
was created much the same way, but with --debug used instead:

# Clean-up
docker buildx rm
# Builder creation
export BUILDKIT_BUILDER_NAME="buildkit-v0.8.2"
docker buildx create --name "${BUILDKIT_BUILDER_NAME}" --driver-opt image=moby/buildkit:v0.8.2 --buildkitd-flags '--debug' --use
# Build
docker buildx bake --set *.labels.com.tiatechnology.authors=REDACTED --set *.labels.com.tiatechnology.buildhost=jeeves --set *.labels.com.tiatechnology.buildurl=local --set *.labels.com.tiatechnology.created=2021-03-26 14:17:55 CET --set *.labels.com.tiatechnology.dockerversions=Client/Server versions: 20.10.5/20.10.5 --set *.labels.com.tiatechnology.revision=d498421126227c7628ea99cd0c0492415989825b --set *.labels.com.tiatechnology.source=https://git.REDACTED --set *.labels.com.tiatechnology.version=monodockerfile --builder v0.8.2 --progress=auto --set *.cache-to=type=inline --set *.secrets=id=token,src=.token --push weblogic-11g-base weblogic-12c-base weblogic-10.3.6 weblogic-12.2.1.1 weblogic-12.2.1.3 oracledb-12.2-base oracledb-19.3-base

On Tonis' suggestion, I will try to build a buildkitd/container with the go race detector enabled: wish me luck!

@griffous commented Mar 29, 2021

Tonis kindly created a race-detector enabled master/buildkit image for me to try.

Using this builder, I do see DATA RACE messages printed in the container log, so that's working correctly, and hopefully this will help the developers who look into this case :)
~$ grep -c 'DATA RACE' 042fba650e4943743d2dee7689228e48ef35f75de2452bbe64bbc4a593018148-json.log
35
042fba650e4943743d2dee7689228e48ef35f75de2452bbe64bbc4a593018148-json.log

With this image, and a fresh set of docker prunes, I am not seeing builder crashes; however, some stages/images which previously had cache-from (registry) hits are now rebuilding instead. And during their rebuilds they're actually failing with issues related to the contents of cache mounts (freshly downloaded files suddenly vanishing, causing install scripts to fail).

This might have more to do with the switch from 0.8.2 to master, so I will try to rebuild our entire library, with a new set of image tags, push those, and continue trying to reproduce the (export) crash with this master/race-detector builder.

Let me know if there is anything else I can add to the case that would be of help.

@griffous commented Mar 29, 2021

Update: the master/race image doesn't seem to be able to reliably 'prepare build cache for export', even for a small number of targets.
The buildkitd process sits at 100% CPU (per build), and I gave up after 20 minutes of no visible progress. This is without --load or --push (internal build cache only). I expect I'm hitting this issue: #2009

So I switched to moby/master and continued building the entire 30-stage library with as much parallelism as buildkit could manage.
This worked until reaching our OracleDB images, which fail to build because they need isolated networking, and the standard moby/buildkit images don't include the CNI binaries.

I built and switched to a master + CNI binaries/config image and was able to successfully complete the build and push of our 11 oracle images/targets.
I then reran it as a sanity check, and it almost immediately crashed the builder container. Interesting! Everything else was the same, so why is it crashing on re-run?

More digging brought me to this bug: #1984.
The mention of 400 TCP connections got me thinking. We have 11 targets, each of which has two --cache-from entries, and by this stage of the build we're already many layers deep. Altogether, I can well imagine there are a lot of TCP connections in flight here. When everything is already cached, I guess they all fire in quick succession, whereas on a first build the TCP connections were spread out over the duration of the build.

I commented out all of the cache-froms, reran the same 11 targets (cached on disk), and it successfully built and exported! So --cache-from seems to be at least one of the triggers causing my crashes. (The kind of wiring involved is sketched below.)
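
For reference, the wiring I disabled looks roughly like this in the bake file (a sketch; the registry and image names are hypothetical, but each target really does list two registry cache sources):

# docker-bake.hcl (sketch)
target "weblogic-12.2.1.3" {
  cache-from = [
    "type=registry,ref=registry.internal/weblogic-12.2.1.3:monodockerfile",
    "type=registry,ref=registry.internal/weblogic-12.2.1.3:latest",
  ]
}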

Next I tried again using the tonistiigi/buildkit:tcp-concurrency-fix image mentioned in the other bug, and that builder can build the same 11 targets with --cache-froms enabled! Excellent.

Next I tried building the full library (30 targets), and that crashed it:
3c0158ef94009af033a571af97c9a29446c8f01495ab673b96168baec20a1bde-json.log

This has been really informative though, and I think it's pointing to two issues:

  • TCP/network concurrency is clearly one part of the problem.
  • Possibly another issue during the final assembly of the image in tar format?

Is there any chance we can get an updated #1984 implementation against master, so I can try again with the latest code?

@tonistiigi (Member) commented

@griffous please test #2051

@griffous commented Mar 31, 2021

Thanks @tonistiigi! Test results:
It definitely seems much improved.
With our core set of 14 final-stage targets I don't see a crash on first or cached builds, but I can still crash the master image when using a fuller set of targets (26 in one case, 28 in the other).

On the very first build I am also seeing a strange issue in which cache volumes are not writable by the run command.
If I immediately retry the build, it works on the second run. I don't know if this is related to these changes, but I will try to create a repro for this in a separate bug.
Update: bug filed with test case: #2053

When building the second time, I was able to capture a crash on a remote host
1f1ecda717baee7fb9f6da80fd37003e255b3ecdccb6fe7be8a8d283f70b8ab8-json.log
using the following workflow:

# Build setup
docker buildx rm
docker start docker_builder_es-db_helper
docker volume prune -f && docker builder prune -af
export BUILDKIT_BUILDER_NAME="master-cni"
docker build --build-arg BUILDKIT_VERSION=master -t local/buildkit:${BUILDKIT_BUILDER_NAME} -f buildkit/Dockerfile buildkit
docker buildx create --name "${BUILDKIT_BUILDER_NAME}" --driver-opt image=local/buildkit:${BUILDKIT_BUILDER_NAME} --buildkitd-flags '--oci-worker-net=cni --oci-worker-gc-keepstorage 200000' --use

# Build wrapper command:
[opc@tic-arc-dev docker-es-library]$ ./bake.sh '*' -p
Image targets list contains es-rest, or es-adf: ensuring a helper runtimeContainer is running for this build.
TIADB_HOST=[172.17.0.2]
Building [26] image target(s) with builder master-cni: [ol7-base ol7-base-sqlplus ol7-base-jre es-print-4.1.6 oracledb-12.2-base oracledb-12.2-install oracledb-12.2-install-tiadb oracledb-12.2-install-wlsdb oracledb-12.2-tiadb oracledb-12.2-wlsdb oracledb-19.3-install oracledb-19.3-install-tiadb oracledb-19.3-tiadb weblogic-10.3.6 weblogic-12.2.1.1 weblogic-12.2.1.3 es-rest-1.0.0.0 es-rest-7.5.2.0 es-db-7.5.2.0 es-db-7.7.0.34 es-db-7.8.1.6 es-db-7.9.0.2 es-adf-7.5.2.0 es-adf-7.7.0.34 es-adf-7.8.1.6 es-adf-7.9.0.1]
Final build command:
 time docker buildx bake --set *.labels.com.tiatechnology.authors=REDACTED --set *.labels.com.tiatechnology.buildhost=tic-arc-dev.tia.local --set *.labels.com.tiatechnology.buildurl=local --set *.labels.com.tiatechnology.created=2021-03-31 11:44:39 CEST --set *.labels.com.tiatechnology.dockerversions=Client/Server versions: 20.10.5/20.10.5 --set *.labels.com.tiatechnology.revision=4f3b1f7ba677899974fbd2adcf7bb27fff600577 --set *.labels.com.tiatechnology.source=https://git.REDACTED --set *.labels.com.tiatechnology.version=monodockerfile --builder master-cni --progress=auto --set *.cache-to=type=inline --set *.secrets=id=token,src=.token --push ol7-base ol7-base-sqlplus ol7-base-jre es-print-4.1.6 oracledb-12.2-base oracledb-12.2-install oracledb-12.2-install-tiadb oracledb-12.2-install-wlsdb oracledb-12.2-tiadb oracledb-12.2-wlsdb oracledb-19.3-install oracledb-19.3-install-tiadb oracledb-19.3-tiadb weblogic-10.3.6 weblogic-12.2.1.1 weblogic-12.2.1.3 es-rest-1.0.0.0 es-rest-7.5.2.0 es-db-7.5.2.0 es-db-7.7.0.34 es-db-7.8.1.6 es-db-7.9.0.2 es-adf-7.5.2.0 es-adf-7.7.0.34 es-adf-7.8.1.6 es-adf-7.9.0.1
[+] Building 48.3s (131/514)
..
<buildkit output until hanging on step 131>

[opc@tic-arc-dev ~]$ docker exec -it buildx_buildkit_master-cni0 buildkitd --version
buildkitd github.com/moby/buildkit 3b49f99 3b49f992caeed2338de89ef580d9be98303fd075

The buildkit image was generated by the build above, using this buildkit/Dockerfile.

bake.sh is a wrapper script I use to generate the final docker buildx bake command, which is printed to screen before being run.

I also produced the crash on my local machine, here is a second example:
ed7cc655dde6bf87d935a9da399e4c7500e94786cf310d5d4c3feac32047c92b-json.log

@griffous commented Apr 6, 2021

Hi team,

Coming back to this after Easter: today I am getting this crash attempting to build 11 targets, using the same master + CNI builder.

0b74ad4ed96bfc66bf7f6bf760fc504bcb9d3e3c6fbb1c52a1ad37c2941e0f54-json.log

...
time="2021-04-06T10:22:34Z" level=warning msg="invalid image config with unaccounted layers"
time="2021-04-06T10:22:34Z" level=warning msg="failed to update distribution source for layer sha256:144c7d750c8c407810289ba67570f371a1aec35bab861cd7bd4cda7292fb6b53: content digest sha256:144c7d750c8c407810289ba67570f371a1aec35bab861cd7bd4cda7292fb6b53: not found"
<lots of messages about not found>
...
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0xc023c8]

goroutine 16213 [running]:
github.com/moby/buildkit/solver.(*sharedOp).CalcSlowCache.func2(0x1515840, 0xc0012e0b40, 0x0, 0x0, 0x0, 0x0)
	/src/solver/jobs.go:650 +0x8c8
github.com/moby/buildkit/util/flightcontrol.(*call).run(0xc000a55dd0)
	/src/util/flightcontrol/flightcontrol.go:121 +0xee
sync.(*Once).doSlow(0xc000a55e30, 0xc0012e2320)
	/usr/local/go/src/sync/once.go:66 +0xe3
sync.(*Once).Do(0xc000a55e30, 0xc0012e2320)
	/usr/local/go/src/sync/once.go:57 +0x45
created by github.com/moby/buildkit/util/flightcontrol.(*call).wait
	/src/util/flightcontrol/flightcontrol.go:156 +0x3a9

Is there any more information I can collect for you @tonistiigi?

@tonistiigi (Member) commented

#2067 handles the last error. Please provide a reproducer for how to hit it, though.

@griffous commented Apr 9, 2021

Thanks. I'm keeping an eye on Docker Hub for an updated image with this change. It doesn't look like it has hit nightly or master yet: https://hub.docker.com/r/moby/buildkit/tags

I will try to repro it with a simpler test case.

@griffous commented

Update - my recent development/builds against the master-based CNI-enabled image that we use have been proving really reliable over the past few days. I actually thought I was going to have to update this bug to say that the issue appears fixed.

And then I managed to crash it again today, while attempting to build 25 stages.

Top of the log here:

time="2021-04-23T13:42:49Z" level=warning msg="invalid image config with unaccounted layers"
fatal error: concurrent map read and map write

goroutine 403224 [running]:
runtime.throw(0x13511ca, 0x21)
	/usr/local/go/src/runtime/panic.go:774 +0x72 fp=0xc003f6b8c0 sp=0xc003f6b890 pc=0x42f892
runtime.mapaccess1_faststr(0x11423c0, 0xc004435e30, 0x133bae8, 0x11, 0xc000507a01)
	/usr/local/go/src/runtime/map_faststr.go:21 +0x44f fp=0xc003f6b930 sp=0xc003f6b8c0 pc=0x412ebf
github.com/moby/buildkit/cache/metadata.(*StorageItem).Get(0xc000e27950, 0x133bae8, 0x11, 0x0)
	/src/cache/metadata/metadata.go:267 +0x63 fp=0xc003f6b970 sp=0xc003f6b930 pc=0xc45b33
github.com/moby/buildkit/cache.getBlobChainID(0xc000e27950, 0xc00419b840, 0x31)
	/src/cache/metadata.go:132 +0x44 fp=0xc003f6b9b8 sp=0xc003f6b970 pc=0xc616a4
github.com/moby/buildkit/cache.(*immutableRef).Info(0xc003f65e60, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/src/cache/refs.go:333 +0x127 fp=0xc003f6baa8 sp=0xc003f6b9b8 pc=0xc67b27
github.com/moby/buildkit/cache.computeBlobChain.func2.1(0x15168c0, 0xc00407dd40, 0x0, 0x0, 0x0, 0x0)
	/src/cache/blobs.go:56 +0x8c fp=0xc003f6be18 sp=0xc003f6baa8 pc=0xc6c88c
github.com/moby/buildkit/util/flightcontrol.(*call).run(0xc0014c0480)
	/src/util/flightcontrol/flightcontrol.go:121 +0xee fp=0xc003f6bef8 sp=0xc003f6be18 pc=0xa6bbde
github.com/moby/buildkit/util/flightcontrol.(*call).run-fm()
	/src/util/flightcontrol/flightcontrol.go:117 +0x2a fp=0xc003f6bf10 sp=0xc003f6bef8 pc=0xa6d93a
sync.(*Once).doSlow(0xc0014c04e0, 0xc009104630)
	/usr/local/go/src/sync/once.go:66 +0xe3 fp=0xc003f6bfb0 sp=0xc003f6bf10 pc=0x46e263
sync.(*Once).Do(0xc0014c04e0, 0xc009104630)
	/usr/local/go/src/sync/once.go:57 +0x45 fp=0xc003f6bfd0 sp=0xc003f6bfb0 pc=0x46e175
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1357 +0x1 fp=0xc003f6bfd8 sp=0xc003f6bfd0 pc=0x45f2e1
created by github.com/moby/buildkit/util/flightcontrol.(*call).wait
	/src/util/flightcontrol/flightcontrol.go:156 +0x3a9

I know you're after a reproducer, and I'll keep trying to produce a simplified test case.
It's only under big load that this happens, so it's proving a challenge! In the meantime, does this log point to anything new?
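
For what it's worth, "fatal error: concurrent map read and map write" is Go's runtime diagnostic for a plain map being read and written from different goroutines without synchronization. A minimal, self-contained sketch of that failure class and the usual mutex fix - illustrative only, this is not buildkit's actual code:

package main

import "sync"

// store mimics the pattern in the trace above: one goroutine reads the map
// (like StorageItem.Get) while another writes it. Without the locks below,
// the runtime aborts with "concurrent map read and map write".
type store struct {
	mu     sync.RWMutex
	values map[string]string
}

func (s *store) Get(key string) string {
	s.mu.RLock() // shared lock for readers; removing it reintroduces the crash
	defer s.mu.RUnlock()
	return s.values[key]
}

func (s *store) Set(key, value string) {
	s.mu.Lock() // exclusive lock for writers
	defer s.mu.Unlock()
	s.values[key] = value
}

func main() {
	s := &store{values: map[string]string{}}
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(2)
		go func() { defer wg.Done(); s.Set("blob-chain-id", "sha256:example") }()
		go func() { defer wg.Done(); _ = s.Get("blob-chain-id") }()
	}
	wg.Wait()
}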

@griffous commented Apr 28, 2021

I have made a simplified reproducer: buildkit_crasher.zip
Instructions are in the test_script.txt within the zip.

It uses a containerised registry to keep everything local, and a 3-stage build fanning out as 1 => 3 => 30 images.
It's the first cached build which typically crashes my buildkit - usually on the first or second run. If it doesn't crash within 3 runs, I force a cache miss and repeat until it crashes - usually 1-2 runs does it.

To increase the chances of a crash - and I know how odd this sounds - run docker stats in the background. Somehow, this seems to make the crash more likely! The rough shape of the loop is sketched below.
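
A sketch only; the target/group names here are hypothetical and the authoritative steps are in test_script.txt:

# Local registry so cache-to/cache-from stays on this machine
docker run -d --name registry -p 5000:5000 registry:2

run=1
while docker buildx bake --push --set "*.cache-to=type=inline" fanout-targets; do
  echo "survived cached run ${run}; retrying" >&2
  run=$((run + 1))
  if [ "${run}" -gt 3 ]; then
    docker builder prune -af   # no crash in 3 runs: force cache misses and repeat
    run=1
  fi
done
echo "builder crashed" >&2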

Let me know if there is anything else I can do to help find and fix this bug.

Example crash from this reproducer:

time="2021-04-28T10:16:32Z" level=info msg="trying next host - response was http.StatusNotFound" host="172.17.0.1:5000"
time="2021-04-28T10:16:32Z" level=info msg="trying next host - response was http.StatusNotFound" host="172.17.0.1:5000"
fatal error: concurrent map iteration and map write

goroutine 100994 [running]:
runtime.throw(0x1363f59, 0x26)
	/usr/local/go/src/runtime/panic.go:774 +0x72 fp=0xc0008fe550 sp=0xc0008fe520 pc=0x42f892
runtime.mapiternext(0xc0008fe828)
	/usr/local/go/src/runtime/map.go:858 +0x579 fp=0xc0008fe5d8 sp=0xc0008fe550 pc=0x40f839
runtime.mapiterinit(0x1148d60, 0xc000e830b0, 0xc0008fe828)
	/usr/local/go/src/runtime/map.go:848 +0x1c3 fp=0xc0008fe5f8 sp=0xc0008fe5d8 pc=0x40f1d3
github.com/moby/buildkit/cache/remotecache/v1.normalizeItem(0xc00237a150, 0xc000900030, 0xc0008feb58, 0x1dd15c0, 0x7f5f4f65d000)
	/src/cache/remotecache/v1/utils.go:204 +0x498 fp=0xc0008fe928 sp=0xc0008fe5f8 pc=0xc866a8
github.com/moby/buildkit/cache/remotecache/v1.normalizeItem(0xc0002d2310, 0xc000900030, 0xc0008fee88, 0x1dd15c0, 0xffffffffffffff00)
	/src/cache/remotecache/v1/utils.go:206 +0x1f7 fp=0xc0008fec58 sp=0xc0008fe928 pc=0xc86407
github.com/moby/buildkit/cache/remotecache/v1.normalizeItem(0xc0003edc70, 0xc000900030, 0xc0008ff1b8, 0x1dd15c0, 0x0)
	/src/cache/remotecache/v1/utils.go:206 +0x1f7 fp=0xc0008fef88 sp=0xc0008fec58 pc=0xc86407
github.com/moby/buildkit/cache/remotecache/v1.normalizeItem(0xc000344380, 0xc000900030, 0xc0008ff4e8, 0x1dd15c0, 0x0)
	/src/cache/remotecache/v1/utils.go:206 +0x1f7 fp=0xc0008ff2b8 sp=0xc0008fef88 pc=0xc86407
github.com/moby/buildkit/cache/remotecache/v1.normalizeItem(0xc00071d960, 0xc000900030, 0xc0008ff818, 0x1dd15c0, 0xc001ce4000)
	/src/cache/remotecache/v1/utils.go:206 +0x1f7 fp=0xc0008ff5e8 sp=0xc0008ff2b8 pc=0xc86407
github.com/moby/buildkit/cache/remotecache/v1.normalizeItem(0xc002015500, 0xc000900030, 0xc0008ffb48, 0x1dd15c0, 0x0)
	/src/cache/remotecache/v1/utils.go:206 +0x1f7 fp=0xc0008ff918 sp=0xc0008ff5e8 pc=0xc86407
github.com/moby/buildkit/cache/remotecache/v1.normalizeItem(0xc0003697a0, 0xc000900030, 0xc0008ffe78, 0x1dd15c0, 0x456300)
	/src/cache/remotecache/v1/utils.go:206 +0x1f7 fp=0xc0008ffc48 sp=0xc0008ff918 pc=0xc86407
github.com/moby/buildkit/cache/remotecache/v1.normalizeItem(0xc00253c310, 0xc000900030, 0x41, 0xc00123e240, 0x40)
	/src/cache/remotecache/v1/utils.go:206 +0x1f7 fp=0xc0008fff78 sp=0xc0008ffc48 pc=0xc86407
github.com/moby/buildkit/cache/remotecache/v1.(*CacheChains).normalize(0xc001423860, 0x0, 0x19)
	/src/cache/remotecache/v1/chains.go:62 +0x3de fp=0xc000900340 sp=0xc0008fff78 pc=0xc8301e
github.com/moby/buildkit/cache/remotecache/v1.(*CacheChains).Marshal(0xc001423860, 0xc000fdd290, 0x0, 0x1, 0xc000fdd200)
	/src/cache/remotecache/v1/chains.go:79 +0x40 fp=0xc000900408 sp=0xc000900340 pc=0xc83260
github.com/moby/buildkit/cache/remotecache/inline.(*exporter).ExportForLayers(0xc001423880, 0xc0004205a0, 0xa, 0xa, 0xc001423880, 0xc001423b80, 0x2, 0x14f9dc0, 0xc000822260)
	/src/cache/remotecache/inline/inline.go:42 +0x47 fp=0xc000900c30 sp=0xc000900408 pc=0xc90d37
github.com/moby/buildkit/solver/llbsolver.inlineCache(0x1523720, 0xc00980b3e0, 0x1523a60, 0xc001423880, 0x1528220, 0xc005e942c0, 0x14f9dc0, 0xc000822260, 0x0, 0xc0014238a0, ...)
	/src/solver/llbsolver/solver.go:295 +0x3cf fp=0xc000900dd8 sp=0xc000900c30 pc=0xcf69df
github.com/moby/buildkit/solver/llbsolver.(*Solver).Solve(0xc0002d1200, 0x1523720, 0xc00980b3e0, 0xc002000880, 0x19, 0xc0020008c0, 0x19, 0x0, 0x0, 0xc013ac24a0, ...)
	/src/solver/llbsolver/solver.go:176 +0x15f0 fp=0xc0009012f0 sp=0xc000900dd8 pc=0xcf5c50
github.com/moby/buildkit/control.(*Controller).Solve(0xc00033a480, 0x1523720, 0xc00980b3e0, 0xc0019758c0, 0x0, 0x0, 0x0)
	/src/control/control.go:275 +0x4f3 fp=0xc000901570 sp=0xc0009012f0 pc=0xcffce3
github.com/moby/buildkit/api/services/control._Control_Solve_Handler.func1(0x1523720, 0xc00980b3e0, 0x1300140, 0xc0019758c0, 0xc001ccc300, 0x40, 0x40, 0x1110180)
	/src/api/services/control/control.pb.go:1489 +0x86 fp=0xc0009015b8 sp=0xc000901570 pc=0xb6ba06
github.com/moby/buildkit/util/grpcerrors.UnaryServerInterceptor(0x1523720, 0xc00980b3e0, 0x1300140, 0xc0019758c0, 0xc001423740, 0xc001423760, 0x40, 0x1110180, 0x1523701, 0xc001ccc300)
	/src/util/grpcerrors/intercept.go:14 +0x6f fp=0xc000901648 sp=0xc0009015b8 pc=0xa0866f
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1(0x1523720, 0xc00980b3e0, 0x1300140, 0xc0019758c0, 0x1523720, 0xc00980b3e0, 0x153f440, 0x1dcf7a0)
	/src/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25 +0x63 fp=0xc0009016a8 sp=0xc000901648 pc=0x9407a3
github.com/grpc-ecosystem/grpc-opentracing/go/otgrpc.OpenTracingServerInterceptor.func1(0x1523720, 0xc00980b3e0, 0x1300140, 0xc0019758c0, 0xc001423740, 0xc001423780, 0x0, 0x0, 0x14f9580, 0xc000162ed0)
	/src/vendor/github.com/grpc-ecosystem/grpc-opentracing/go/otgrpc/server.go:57 +0x2d3 fp=0xc000901938 sp=0xc0009016a8 pc=0x949993
main.unaryInterceptor.func1(0x1523660, 0xc001ccc2c0, 0x1300140, 0xc0019758c0, 0xc001423740, 0xc001423780, 0x0, 0x0, 0x0, 0x0)
	/src/cmd/buildkitd/main.go:542 +0x17b fp=0xc000901a28 sp=0xc000901938 pc=0x1037f2b
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1(0x1523720, 0xc00980b320, 0x1300140, 0xc0019758c0, 0xc000803e00, 0x0, 0xc0002c4ae8, 0x40cda8)
	/src/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25 +0x63 fp=0xc000901a88 sp=0xc000901a28 pc=0x9407a3
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1(0x1523720, 0xc00980b320, 0x1300140, 0xc0019758c0, 0xc001423740, 0xc001423760, 0xc0002c4b58, 0x5bec98, 0x1265d60, 0xc00980b320)
	/src/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:34 +0xd5 fp=0xc000901af8 sp=0xc000901a88 pc=0x940985
github.com/moby/buildkit/api/services/control._Control_Solve_Handler(0x1256f40, 0xc00033a480, 0x1523720, 0xc00980b320, 0xc000f1a4e0, 0xc0002df8c0, 0x1523720, 0xc00980b320, 0xc000286000, 0x111)
	/src/api/services/control/control.pb.go:1491 +0x14b fp=0xc000901b68 sp=0xc000901af8 pc=0xb5009b
google.golang.org/grpc.(*Server).processUnaryRPC(0xc00036a000, 0x1538480, 0xc001909e00, 0xc000803e00, 0xc000112d20, 0x1d9c7b8, 0x0, 0x0, 0x0)
	/src/vendor/google.golang.org/grpc/server.go:1171 +0x4fd fp=0xc000901e20 sp=0xc000901b68 pc=0x92973d
google.golang.org/grpc.(*Server).handleStream(0xc00036a000, 0x1538480, 0xc001909e00, 0xc000803e00, 0x0)
	/src/vendor/google.golang.org/grpc/server.go:1494 +0xd25 fp=0xc000901f48 sp=0xc000901e20 pc=0x92d515
google.golang.org/grpc.(*Server).serveStreams.func1.2(0xc013ac2350, 0xc00036a000, 0x1538480, 0xc001909e00, 0xc000803e00)
	/src/vendor/google.golang.org/grpc/server.go:834 +0xbb fp=0xc000901fb8 sp=0xc000901f48 pc=0x93a6eb
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1357 +0x1 fp=0xc000901fc0 sp=0xc000901fb8 pc=0x45f2e1
created by google.golang.org/grpc.(*Server).serveStreams.func1
	/src/vendor/google.golang.org/grpc/server.go:832 +0x204

...

@griffous commented

I have also crashed the race-detector version that @tonistiigi built for me around a month ago.
In the logs I see a panic and lots of WARNING: DATA RACE entries.

If my reproducer doesn't work for you locally but the race-detector logs would help troubleshooting, please build me a new race-detector-enabled image and I'll happily re-run it :)

@griffous commented May 27, 2021

I tried this again today, using the current master, and with the new max-parallelism feature from #2049 set to 10 (configured roughly as sketched below).
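
A sketch of the configuration; this assumes a buildx version that accepts a buildkitd.toml via --config, and the [worker.oci] max-parallelism key added by #2049:

cat > buildkitd.toml <<'EOF'
[worker.oci]
  max-parallelism = 10
EOF
docker buildx create --name master-cni \
  --driver-opt image=moby/buildkit:master \
  --config ./buildkitd.toml --use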

Unfortunately it's still panicking with the same error: fatal error: concurrent map iteration and map write

Is there anything that can be done to make buildkitd work more reliably on large builds?

@Antiarchitect commented Jul 27, 2021

I've just built a custom image of Docker 20.10.7 with buildkit v0.9.0 for my build farm, and I'm now facing errors similar to moby/moby#42422, but in other places:

fatal error: concurrent map writes

goroutine 110238 [running]:
runtime.throw(0x1fae448, 0x15)
        /usr/local/go/src/runtime/panic.go:1117 +0x74 fp=0xc00e087968 sp=0xc00e087938 pc=0x43a514
runtime.mapassign(0x251d1a0, 0xc00f290d80, 0xc00e087af8, 0x0)
        /usr/local/go/src/runtime/map.go:585 +0x5c5 fp=0xc00e0879e8 sp=0xc00e087968 pc=0x4120a5
github.com/docker/docker/vendor/github.com/moby/buildkit/cache/remotecache/v1.normalizeItem(0xc0032ed3b0, 0xc00e088a90, 0xc00e087f48, 0x3912ca0, 0xc003640300)
        /go/src/github.com/docker/docker/vendor/github.com/moby/buildkit/cache/remotecache/v1/utils.go:255 +0x8b7 fp=0xc00e087d18 sp=0xc00e0879e8 pc=0x1be4117
github.com/docker/docker/vendor/github.com/moby/buildkit/cache/remotecache/v1.normalizeItem(0xc0010653b0, 0xc00e088a90, 0xc00e088278, 0x3912ca0, 0xc00e088200)
        /go/src/github.com/docker/docker/vendor/github.com/moby/buildkit/cache/remotecache/v1/utils.go:206 +0x20b fp=0xc00e088048 sp=0xc00e087d18 pc=0x1be3a6b
github.com/docker/docker/vendor/github.com/moby/buildkit/cache/remotecache/v1.normalizeItem(0xc00106be30, 0xc00e088a90, 0xc00e0885a8, 0x3912ca0, 0x
fatal error: concurrent map iteration and map write

goroutine 341238 [running]:
runtime.throw(0x1fe216f, 0x26)
        /usr/local/go/src/runtime/panic.go:1117 +0x74 fp=0xc002f27638 sp=0xc002f27608 pc=0x43a514
runtime.mapiternext(0xc002f278e8)
        /usr/local/go/src/runtime/map.go:858 +0x557 fp=0xc002f276b8 sp=0xc002f27638 pc=0x412d97
github.com/docker/docker/vendor/github.com/moby/buildkit/cache/remotecache/v1.normalizeItem(0xc00107ee00, 0xc002f28a90, 0xc002f27bb8, 0xc00057ce48, 0x0)
        /go/src/github.com/docker/docker/vendor/github.com/moby/buildkit/cache/remotecache/v1/utils.go:204 +0x17d fp=0xc002f279e8 sp=0xc002f276b8 pc=0x1be39dd
github.com/docker/docker/vendor/github.com/moby/buildkit/cache/remotecache/v1.normalizeItem(0xc000e73d50, 0xc002f28a90, 0xc002f27f48, 0x3912ca0, 0xc000760a00)
        /go/src/github.com/docker/docker/vendor/github.com/moby/buildkit/cache/remotecache/v1/utils.go:206 +0x20b fp=0xc002f27d18 sp=0xc002f279e8 pc=0x1be3a6b
github.com/docker/docker/vendor/github.com/moby/buildkit/cache/remotecache/v1.normalizeItem(0xc0001de150, 0xc002f28a90, 0xc002f28278, 0x3912ca0, 0xc003858d00)
        /go/src/github.com/docker/docker/vendor/github.com/moby/buildkit/cache/remotecache/v1/utils.go:206 +0x20b fp=0xc002f28048 sp=0xc002f27d18 pc=0x1be3a6b
fatal error: concurrent map iteration and map write

goroutine 278308 [running]:
runtime.throw(0x1fe216f, 0x26)
        /usr/local/go/src/runtime/panic.go:1117 +0x74 fp=0xc016510788 sp=0xc016510758 pc=0x43a514
runtime.mapiternext(0xc016510a60)
        /usr/local/go/src/runtime/map.go:858 +0x557 fp=0xc016510808 sp=0xc016510788 pc=0x412d97
github.com/docker/docker/vendor/github.com/moby/buildkit/solver.(*exporter).ExportTo(0xc01d10eb90, 0x292e918, 0xc0041d6600, 0x7f8a2853b9a8, 0xc00c2eb728, 0xc02109f470, 0x2, 0x28db008, 0xc02109f428, 0xc019b57690, ...)
        /go/src/github.com/docker/docker/vendor/github.com/moby/buildkit/solver/exporter.go:163 +0x3ef fp=0xc016510ad0 sp=0xc016510808 pc=0x17910af
github.com/docker/docker/vendor/github.com/moby/buildkit/solver.(*exporter).ExportTo(0xc01cec54a0, 0x292e918, 0xc0041d6600, 0x7f8a2853b9a8, 0xc00c2eb728, 0xc02109f470, 0x2, 0x28db008, 0xc02109f428, 0xc01c988e38, ...)
        /go/src/github.com/docker/docker/vendor/github.com/moby/buildkit/solver/exporter.go:135 +0xc76 fp=0xc016510d98 sp=0xc016510ad0 pc=0x1791936
github.com/docker/docker/vendor/github.com/moby/buildkit/solver.(*exporter).ExportTo(0xc01cf174f0, 0x292e918, 0xc0041d6600, 0x7f8a2853b9a8, 0xc00c2eb728, 0xc02109f470, 0x2, 0x28db008, 0xc02109f428, 0xc0028698d8, ...)

@griffous commented Jul 29, 2021

Thanks for sharing, @Antiarchitect.

I am still seeing panics with our builds, even when using the master buildkit container image; I assume they are still related to this same bug.

Our way of dealing with this is to wrap our buildkit commands in retry/timeout blocks, and to feed fewer than 10 targets to buildkit at a time. Together this gives us a CI pipeline that fairly consistently "works" - the best we can do until the bug is resolved. A sketch of the wrapper follows.
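
Simplified; the real script batches targets out of the bake file, and all names here are hypothetical:

# At most ~10 targets per bake; retry each batch up to 3 times
batch="ol7-base ol7-base-sqlplus ol7-base-jre"   # ...and so on, under 10 targets
for attempt in 1 2 3; do
  if timeout 30m docker buildx bake --push ${batch}; then
    break
  fi
  echo "bake attempt ${attempt} failed; retrying" >&2
  sleep 30
done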

Here is my most recent run.

time="2021-07-29T10:22:20Z" level=info msg="trying next host - response was http.StatusNotFound"
time="2021-07-29T10:22:20Z" level=info msg="trying next host - response was http.StatusNotFound"
fatal error: concurrent map iteration and map write

goroutine 684401 [running]:
runtime.throw(0x1524096, 0x26)
        /usr/local/go/src/runtime/panic.go:1117 +0x72 fp=0xc0166e01b0 sp=0xc0166e0180 pc=0x438352
runtime.mapiternext(0xc0166e0488)
        /usr/local/go/src/runtime/map.go:858 +0x54c fp=0xc0166e0230 sp=0xc0166e01b0 pc=0x410d6c
github.com/moby/buildkit/solver.(*exporter).ExportTo(0xc010651db0, 0x16ab6a8, 0xc001633a10, 0x7f4fec762c88, 0xc0110c3020, 0xc003a5b728, 0x2, 0x1684c80, 0xc004dc3050, 0xc015bc8ef0, ...)
        /src/solver/exporter.go:163 +0x3ef fp=0xc0166e04f8 sp=0xc0166e0230 pc=0xd398cf
github.com/moby/buildkit/solver.(*exporter).ExportTo(0xc01c71ee10, 0x16ab6a8, 0xc001633a10, 0x7f4fec762c88, 0xc0110c3020, 0xc003a5b728, 0x2, 0x1684c80, 0xc004dc3050, 0xc004dc3050, ...)
        /src/solver/exporter.go:135 +0xc76 fp=0xc0166e07c0 sp=0xc0166e04f8 pc=0xd3a156
github.com/moby/buildkit/solver.(*exporter).ExportTo(0xc009574d20, 0x16ab6a8, 0xc001633a10, 0x7f4fec762c88, 0xc0110c3020, 0xc003a5b728, 0x2, 0x1684c80, 0xc004dc3050, 0x418e13, ...)
        /src/solver/exporter.go:135 +0xc76 fp=0xc0166e0a88 sp=0xc0166e07c0 pc=0xd3a156
github.com/moby/buildkit/solver.(*exporter).ExportTo(0xc00d89f220, 0x16ab6a8, 0xc001633a10, 0x7f4fec762c88, 0xc0110c3020, 0xc003a5b728, 0x2, 0x1684c80, 0xc004dc3050, 0xc0080badf0, ...)
        /src/solver/exporter.go:135 +0xc76 fp=0xc0166e0d50 sp=0xc0166e0a88 pc=0xd3a156
:

@Antiarchitect commented Jul 29, 2021

I bet the new errors appear 10 times less often than the original one (IMHO), but they are still present.

@griffous commented Jul 29, 2021

You're probably right - the race is still there; fewer threads simply trigger it less often.

I also see gracefully handled errors like this one on the more recent code. It's better than a panic, but weird all the same.

error: failed to solve: rpc error: code = Unknown desc = failed to compute cache key: failed to get state for index 0 on [oracledb-12.2-base-install 3/3] RUN "/u01/app/oracle"/oraInventory/orainstRoot.sh &&   "/u01/app/oracle/product/12.2.0.1/dbhome_1"/root.sh

@griffous commented Oct 7, 2021

Update for v0.9.1: I can still immediately crash a v0.9.1 buildkit by giving it "too many" parallel builds, so unfortunately the root cause still hasn't been fixed.

time="2021-10-07T06:30:20Z" level=info msg="trying next host - response was http.StatusNotFound"
fatal error: concurrent map iteration and map write

goroutine 142452 [running]:
runtime.throw(0x1526db5, 0x26)
	/usr/local/go/src/runtime/panic.go:1117 +0x72 fp=0xc008024e40 sp=0xc008024e10 pc=0x438352
runtime.mapiternext(0xc008025118)
	/usr/local/go/src/runtime/map.go:858 +0x54c fp=0xc008024ec0 sp=0xc008024e40 pc=0x410d6c
github.com/moby/buildkit/solver.(*exporter).ExportTo(0xc0062660a0, 0x16ae9c8, 0xc002c751a0, 0x7fb0995048e8, 0xc001c3b200, 0xc005b217b8, 0x2, 0x1687f00, 0xc005b21698, 0x10, ...)
	/src/solver/exporter.go:163 +0x3ef fp=0xc008025188 sp=0xc008024ec0 pc=0xd39c2f
...

v0.9.1-crash.txt

@tonistiigi (Member) commented

@maxlaverse please take a look at this as well. Related to #2296 (comment), but from that trace it looks like the cache key itself needs some locking (or cloning, or a lock at a higher level).
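
Illustratively, the cloning option means taking a snapshot of the map under a lock so that iteration (as in exporter.ExportTo's range loop) happens over a private copy. A sketch, not buildkit's actual code (assumes "sync" is imported):

// cacheKeys guards a shared map; snapshot copies it under the read lock so
// callers can range over the result without racing against writers.
type cacheKeys struct {
	mu   sync.RWMutex
	deps map[string][]string
}

func (c *cacheKeys) snapshot() map[string][]string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	out := make(map[string][]string, len(c.deps))
	for k, v := range c.deps {
		out[k] = v
	}
	return out
}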

@griffous commented

Thanks, I'll try it out as soon as it's available and get back to you with the results.

@griffous commented

Hi Tonis!

I just gave this a try with our "full build" (47 targets) and the latest master image (which contains the #2410 merge).

Good news, it didn't panic this time. 🥳🥳🥳
I think that means the root cause for this bug has been fixed!! Thank you so much!

Unfortunately this build is still failing for us, though :/

Run 1 failed with this error after 36 seconds:

------
error: failed to solve: no active sessions

run1.txt

Run 2 failed with this error after 8 minutes:

------
error: failed to solve: failed to compute cache key: failed to get state for index 0 on [base-oracledb-12.2-install-tiadb 2/4] COPY --chown=oracle:oinstall oracledb/install/install_instance.sh /home/oracle/install/

run2.txt

Run 3 failed with a similar error after 7 minutes, so I gather it's the same problem as run 2:

------
error: failed to solve: failed to compute cache key: failed to get state for index 0 on [base-weblogic-12.2.1.1 2/3] COPY --chown=oracle:oinstall weblogic/common/scripts/* /home/oracle/scripts/

run3.txt

And I stopped trying after 3 attempts.

Unless you feel these errors are related, I think we can consider this bug closed.
For the other two errors that we're still hitting, do you recognise either as an existing bug, or should I open new ones?

@diclophis commented

I am also running into a similar error (failed to solve: no active sessions) when attempting to run concurrent builds of a multistage Dockerfile. I am not able to post a repro case just yet, but I am interested in hearing whether there have been any updates on this issue.

@maaarghk commented Feb 3, 2023

I'm seeing a segfault in docker:dind 23.0 (using docker image sha256:29de668202ae1275e9497454093ef68146a0de63e233b613d0edd0e454d9783d for docker:dind, with digest docker@sha256:754b29ca7d5656d45d6de119fd36882720c3335341ad44e2c6eb28882339fac7) that looks like it may be related.

buildx output:

------
 > importing cache manifest from registry.gitlab.com/some-image-that-doesnt-exist:latest:
------
ERROR: failed to receive status: rpc error: code = Unavailable desc = error reading from server: EOF
Cleaning up project directory and file based variables

daemon output:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x2e19203]

goroutine 44400 [running]:
github.com/docker/docker/vendor/github.com/moby/buildkit/cache.computeBlobChain.func4.1({0xbdb078, 0xc00c387940})
	/go/src/github.com/docker/docker/vendor/github.com/moby/buildkit/cache/blobs.go:224 +0x10e3
github.com/docker/docker/vendor/github.com/moby/buildkit/util/flightcontrol.(*call).run(0xc0011570e0)
	/go/src/github.com/docker/docker/vendor/github.com/moby/buildkit/util/flightcontrol/flightcontrol.go:121 +0x5e
sync.(*Once).doSlow(0x1d3c485?, 0x2c841ea?)
	/usr/local/go/src/sync/once.go:74 +0xc2
sync.(*Once).Do(0xc002f77dd0?, 0xc002f77de8?)
	/usr/local/go/src/sync/once.go:65 +0x1f
created by github.com/docker/docker/vendor/github.com/moby/buildkit/util/flightcontrol.(*call).wait
	/go/src/github.com/docker/docker/vendor/github.com/moby/buildkit/util/flightcontrol/flightcontrol.go:164 +0x44a

@Antiarchitect commented Feb 3, 2023

@maaarghk I've also hit a problem similar to yours on 23.0.0: moby/moby#44918 (comment)

@skokhanovskiy commented

I have also faced this issue.

We found a way to reproduce it with near-100% probability:

https://gist.github.com/skokhanovskiy/a8f13d3ffe411ff0ece5fde8af9c26f1

The key is to create multiple similar images with a set of common layers, push them to a registry, and pass the list of these images to the cache_from parameter (sketched below).
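
Concretely, the trigger looks something like this (a buildx equivalent of the compose setup in the gist; the names and registry address are hypothetical):

# Build and push several images that share common base layers
for app in a b c; do
  docker buildx build -t 172.17.0.1:5000/app-${app}:latest --push ./app-${app}
done

# Rebuild with all of them as cache sources - this is what destabilizes the daemon
docker buildx build \
  --cache-from 172.17.0.1:5000/app-a:latest \
  --cache-from 172.17.0.1:5000/app-b:latest \
  --cache-from 172.17.0.1:5000/app-c:latest \
  -t 172.17.0.1:5000/app-a:latest --push ./app-a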

After that, building the images with compose becomes unstable, because the docker daemon crashes with:

fatal error: concurrent map iteration and map write

In this case, to mitigate the issue, just avoid passing images with a common set of layers to cache_from.

Hope this helps get it fixed soon.
