Bad file descriptor when running with Github Actions #2593

Closed
3 tasks done
Jufik opened this issue Jul 12, 2024 · 12 comments · Fixed by #2629
Labels: area/buildkit, kind/bug

Comments

@Jufik

Jufik commented Jul 12, 2024

Contributing guidelines

I've found a bug and checked that ...

  • ... the documentation does not mention anything about my problem
  • ... there are no open or closed issues that are related to my problem

Description

The command:

docker buildx build \
--cache-from type=local,compression-level=2,src=/var/lib/docker/actions/$image \
--cache-to type=local,dest=/var/lib/docker/actions/$image,mode=max \
--file ./Dockerfile \
--tag hello:world \
.

throws a bad file descriptor error.

Expected behaviour

The Docker image should be built, pushed, and cached.
Pinning buildx to 0.15.2 in docker/setup-buildx-action resolves the issue with the exact same action configuration.

Actual behaviour

Caching fails and throws a bad file descriptor error:
ERROR: could not lock /var/lib/docker/actions/$image/index.json.lock: bad file descriptor

Looking at the /var/lib/docker/actions/$image path on the runner, index.json.lock exists and carries the runner's permissions.

Buildx version

v0.16.0 10c9ff9

Docker info

NA

Builders list

NA

Configuration

FROM nginx
COPY ./index.html /usr/share/nginx/html

Build logs

#12 pushing manifest for [...] 0.6s done
#12 DONE 25.1s

#14 exporting cache to client directory
#14 preparing build cache for export
#14 writing layer sha256:[...]
#14 writing layer sha256:[...] 1.0s done
#14 writing layer sha256:[...]
#14 writing layer sha256:[...] 0.2s done
#14 writing layer sha256:[...]
#14 writing layer sha256:[...] 0.2s done
#14 writing layer sha256:[...]
#14 writing layer sha256:[...] 0.2s done
#14 writing layer sha256:[...]
#14 writing layer sha256:[...] 0.6s done
#14 writing layer sha256:[...]
#14 writing layer sha256:[...] 1.9s done
#14 writing layer sha256:[...]
#14 writing layer sha256:[...] 0.1s done
#14 writing layer sha256:[...]
#14 writing layer sha256:[...] 0.1s done
#14 writing layer sha256:[...]
#14 writing layer sha256:[...] 0.2s done
#14 writing layer sha256:[...]
#14 writing layer sha256:[...] 0.2s done
#14 writing config sha256:[...] 0.1s done
#14 writing cache manifest sha256:[...]
#14 preparing build cache for export 4.9s done
#14 writing cache manifest sha256:[...] 0.1s done
#14 DONE 4.9s
ERROR: could not lock /var/lib/docker/actions/$image/index.json.lock: bad file descriptor
Error: buildx failed with: ERROR: could not lock /var/lib/docker/actions/$image/index.json.lock: bad file descriptor

Additional info

Context:

  • The buildx command runs within GitHub Actions runners in K8s.
  • Buildx is installed through the docker/setup-buildx-action GitHub Action without the version param, so it uses the latest Buildx version with the docker-container driver.
  • The image is built and pushed through the docker/build-push-action GitHub Action.

@crazy-max
Member

Is it a self-hosted runner? Can you show the output of docker info and docker buildx ls?

@ozydingo

I have this exact issue. Pinning buildx to 0.15.1 resolves the immediate issue (0.15.2 failed with "could not find version")

@ozydingo

ozydingo commented Jul 16, 2024

Self-hosted runner, using cache settings:

      build_cache_from: |
        type=local,src=/home/runner/work/shared/main/${{ matrix.image.name }}
        type=local,src=/home/runner/work/shared/${{ needs.plan.outputs.branch_name }}/${{ matrix.image.name }}
      build_cache_to: 'type=local,mode=max,compression=zstd,compression-level=4,dest=/home/runner/work/shared/${{ needs.plan.outputs.branch_name }}/${{ matrix.image.name }}'

where /home/runner/work/shared is a mounted EFS volume in order to share cache between different workflow runs.

The issue persisted when I used a completely new cache location and when I removed all build_cache_from options, indicating that the issue is solely in the cache writing step regardless of any current cache.

Removing build_cache_to while keeping build_cache_from resolved the error (but obviously isn't a viable long-term solution).

@ozydingo

On 0.16 / latest:

docker info

WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled
Client:
 Version:    24.0.7
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.16.0
    Path:     /home/runner/.docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.23.0
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 1
  Running: 1
  Paused: 0
  Stopped: 0
 Images: 2
 Server Version: 24.0.7
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 091922f03c2762540fd057fba91260237ff86acb
 runc version: v1.1.9-0-gccaecfc
 init version: de40ad0
 Security Options:
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 6.1.92
 Operating System: Ubuntu 22.04.4 LTS (containerized)
 OSType: linux
 Architecture: x86_64
 CPUs: 16
 Total Memory: 30.41GiB
 Name: app-runner-wsz5t-62w5d
 ID: c9fcc801-c6ae-4879-9419-da9d7638033d
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Username: ***
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
 Product License: Community Engine

docker buildx ls

NAME/NODE                                           DRIVER/ENDPOINT      STATUS    BUILDKIT               PLATFORMS
builder-45d3ecc3-2ecb-4fd7-887b-99a6733fd6e0*       docker-container                                      
 \_ builder-45d3ecc3-2ecb-4fd7-887b-99a6733fd6e00    \_ buildx-context   running   v0.14.1                linux/amd64, linux/amd64/v2, linux/amd64/v3, linux/amd64/v4, linux/arm64, linux/riscv64, linux/ppc64le, linux/s390x, linux/386, linux/mips64le, linux/mips64, linux/arm/v7, linux/arm/v6
buildx-context                                      docker                                                
 \_ buildx-context                                   \_ buildx-context   running   v0.11.7+d3e6c1360f6e   linux/amd64, linux/amd64/v2, linux/amd64/v3, linux/amd64/v4, linux/arm64, linux/riscv64, linux/ppc64le, linux/s390x, linux/386, linux/mips64le, linux/mips64, linux/arm/v7, linux/arm/v6
default                                             docker                                                
 \_ default                                          \_ default          running   v0.11.7+d3e6c1360f6e   linux/amd64, linux/amd64/v2, linux/amd64/v3, linux/amd64/v4, linux/arm64, linux/riscv64, linux/ppc64le, linux/s390x, linux/386, linux/mips64le, linux/mips64, linux/arm/v7, linux/arm/v6

@SaschaSchwarze0

I reproduced this with pure buildkit. Started to happen with buildkit v0.15.0. I patched v0.15.0 and only reverted github.com/gofrs/flock back to v0.8.1. This resolves it.

In my environment, the cache directory is a volume mount in a pod which is a NFS-based persistent volume.

@crazy-max
Member

@SaschaSchwarze0 Thanks for your repro, could you post BuildKit logs in debug please?

@SaschaSchwarze0

> @SaschaSchwarze0 Thanks for your repro, could you post BuildKit logs in debug please?

One quick clarification in addition to what I wrote above. The patch (= the revert of github.com/gofrs/flock) is necessary for buildctl, not for buildkitd.

With debugging enabled on buildctl, the following stack trace is shown:

error: could not lock /tmp/buildkit-cache/index.json.lock: bad file descriptor
122 v0.15.0 buildctl --debug build --trace=/tmp/buildkit-cache/trace.log --progress=plain --frontend=dockerfile.v0 --opt=filename=Dockerfile --opt=platform=linux/amd64,linux/arm64 --local=context=/workspace/source --local=dockerfile=/workspace/source --output=type=oci,tar=false,dest=/workspace/output-image --export-cache=type=local,mode=max,dest=/tmp/buildkit-cache --import-cache=type=local,src=/tmp/buildkit-cache
github.com/moby/buildkit/client/ociindex.StoreIndex.Put
    /src/client/ociindex/ociindex.go:65
github.com/moby/buildkit/client.(*Client).solve
    /src/client/solve.go:349
github.com/moby/buildkit/client.(*Client).Build
    /src/client/build.go:64
main.buildAction.func5
    /src/cmd/buildctl/build.go:369
golang.org/x/sync/errgroup.(*Group).Go.func1
    /src/vendor/golang.org/x/sync/errgroup/errgroup.go:78
runtime.goexit
    /usr/local/go/src/runtime/asm_arm64.s:1222

@crazy-max
Member

@SaschaSchwarze0 Thanks, do you repro with v0.11.0 as well? I wonder if this issue is related to this change gofrs/flock#87

@SaschaSchwarze0

SaschaSchwarze0 commented Jul 21, 2024

> @SaschaSchwarze0 Thanks, do you repro with v0.11.0 as well? I wonder if this issue is related to this change gofrs/flock#87

buildctl v0.15.0 compiled with github.com/gofrs/flock@v0.11.0: works

So yeah, change must be in https://github.com/gofrs/flock/releases/tag/v0.12.0.

EDIT1: one second, copied the wrong file to my test setup

EDIT2: no. I also found the change you refer to suspicious, but the results actually look like this:

buildctl v0.15.0 compiled with github.com/gofrs/flock@v0.10.0: works
buildctl v0.15.0 compiled with github.com/gofrs/flock@v0.11.0: fails

So it must be something in gofrs/flock@v0.10.0...v0.11.0.

@crazy-max
Member

Seems to be gofrs/flock@b659e1e, where f.flag would not be in read-write mode (gofrs/flock@b659e1e#diff-87c2c4fe0fb43f4b38b4bee45c1b54cfb694c61e311f93b369caa44f6c1323ffR192) but read-only (gofrs/flock@b659e1e#diff-22145325dded38eb5288ed3321a113d8260ccc70747ee04d4551bfd2fba975fdR69).

@crazy-max
Member

@SaschaSchwarze0 Should be fixed with moby/buildkit#5183

@jamshid

jamshid commented Jul 24, 2024

Is there an easy workaround for docker-ce users on Ubuntu 20.04?
I just did an apt-get upgrade to get a reported docker security fix and now my build scripts are DOA.

$ docker buildx prune -f --verbose
ERROR: bad file descriptor

OK, downgrading the docker-buildx-plugin package seems to fix it. Hopefully that's the right approach and the next upgrade fixes it.

# apt list --all-versions docker-buildx-plugin
Listing... Done
docker-buildx-plugin/focal,now 0.16.1-1~ubuntu.20.04~focal amd64 [installed]
docker-buildx-plugin/focal 0.15.1-1~ubuntu.20.04~focal amd64
...

# apt-get install docker-buildx-plugin=0.15.1-1~ubuntu.20.04~focal
...
The following packages will be DOWNGRADED:
  docker-buildx-plugin
...
Setting up docker-buildx-plugin (0.15.1-1~ubuntu.20.04~focal)
...
