CI infrastructure that can run Docker containers in a reliable way #6887
By "move away from Kubernetes" do you mean just for the buildkite agents, for buildkite in total, or something else? Is this something like https://github.com/macstadium/vmkite but for GCP? |
Just for the buildkite-agents. So we would run an instance pool of buildkite-agents, not just-in-time spinning up of agents.
Thought about this a bit more, and we would have to start caring about a few things we previously didn't, like how to handle processes dying, switching to something clunky like Vagrant, etc. On a whim I checked which version of docker is now running on GCE, and it is very new! This means we can switch back to using the docker daemon on the host. We then just need to ensure we don't overload a single node, and we can scale everything out. We could potentially even switch to a DaemonSet for buildkite-agent, like I tried in the past.
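For illustration, a minimal sketch of what that DaemonSet shape could look like, assuming the agent reuses the host docker daemon via the docker socket (the image tag, secret name, and resource numbers below are placeholders, not our actual config):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: buildkite-agent
  namespace: buildkite
spec:
  selector:
    matchLabels:
      app: buildkite-agent
  template:
    metadata:
      labels:
        app: buildkite-agent
    spec:
      containers:
        - name: buildkite-agent
          image: buildkite/agent:3            # placeholder tag
          env:
            - name: BUILDKITE_AGENT_TOKEN
              valueFrom:
                secretKeyRef:
                  name: buildkite-agent       # hypothetical secret
                  key: token
          resources:
            requests:                         # illustrative numbers
              cpu: "2"
              memory: 4Gi
          volumeMounts:
            # Reuse the host docker daemon instead of running dind.
            - name: docker-sock
              mountPath: /var/run/docker.sock
      volumes:
        - name: docker-sock
          hostPath:
            path: /var/run/docker.sock
            type: Socket
```

Running one agent pod per node this way would also naturally spread builds across the pool instead of piling them onto a single node.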
Tried out using the system docker, but I can't get it to build the server image. It fails at the docker build step with an error that doesn't really say what went wrong: https://buildkite.com/sourcegraph/sourcegraph-e2e/builds/102#d6b331f4-a8bd-4783-b4d0-2705a26ab44f @ggilmore, can you see what is going wrong? Next up I'll try dind instead.
Out of interest, I used a different way to test this without disrupting CI. I modified the agent so we had a new deployment which listened on a different queue:

```diff
diff --git a/kubernetes/ci/buildkite/buildkite-agent/buildkite-agent.Deployment.yaml b/kubernetes/ci/buildkite/buildkite-agent/buildkite-agent.Deployment.yaml
index 07f900489..9426900fa 100644
--- a/kubernetes/ci/buildkite/buildkite-agent/buildkite-agent.Deployment.yaml
+++ b/kubernetes/ci/buildkite/buildkite-agent/buildkite-agent.Deployment.yaml
@@ -4,17 +4,17 @@ metadata:
annotations:
description: Agent for running CI builds.
labels:
- app: buildkite-agent
+ app: buildkite-agent-test
deploy: buildkite
- name: buildkite-agent
+ name: buildkite-agent-test
namespace: buildkite
spec:
minReadySeconds: 10
- replicas: 5
+ replicas: 1
revisionHistoryLimit: 10
selector:
matchLabels:
- app: buildkite-agent
+ app: buildkite-agent-test
strategy:
rollingUpdate:
maxSurge: 50%
@@ -23,10 +23,12 @@ spec:
template:
metadata:
labels:
- app: buildkite-agent
+ app: buildkite-agent-test
spec:
containers:
- env:
+ - name: BUILDKITE_AGENT_TAGS
+ value: queue=testing
- name: BUILDKITE_AGENT_TOKEN
value: a56589af5160e4c3c9c15323932dff37c5125e701c9369a207
- name: BUILDKITE_TIMESTAMP_LINES
```

I then pushed a commit to sourcegraph which specified the use of the queue:

```diff
diff --git a/.buildkite/pipeline.e2e.yml b/.buildkite/pipeline.e2e.yml
index 2f5ea90118..b87be6aaaa 100644
--- a/.buildkite/pipeline.e2e.yml
+++ b/.buildkite/pipeline.e2e.yml
@@ -16,6 +16,8 @@ steps:
VERSION: $BUILDKITE_BUILD_NUMBER
PUPPETEER_SKIP_CHROMIUM_DOWNLOAD: "true"
label: ':docker:'
+ agents:
+ queue: testing
- wait
@@ -26,9 +28,13 @@ steps:
env:
IMAGE: sourcegraph/server:e2e_$BUILDKITE_BUILD_NUMBER
label: ':chromium:'
+ agents:
+ queue: testing
- wait
- command: docker image rm -f sourcegraph/server:e2e_$BUILDKITE_BUILD_NUMBER
label: ':sparkles:'
soft_fail: true
+ agents:
+ queue: testing
```
I was able to reproduce this with this simple setup:

Dockerfile:

```dockerfile
FROM golang:1.13.4-buster@sha256:8081f3c8700ee81291688ded9ba63e551a6290ac4617f0e2c3fd1a6487569f3f as builder
COPY main.go main.go
RUN go build -o /usr/local/bin/hello main.go
FROM ubuntu:18.04@sha256:6e9f67fa63b0323e9a1e587fd71c561ba48a034504fb804fd26fd8800039835d
COPY --from=builder /usr/local/bin/hello /usr/local/bin/hello
CMD ["hello"] main.go package main
import "fmt"
func main() {
fmt.Println("hello world")
}
```

docker build output on a GKE node:

```
> DOCKER_BUILDKIT=1 docker build -t sg-test .
[+] Building 0.6s (8/11)
=> [internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 392B 0.0s
=> ERROR [internal] load metadata for docker.io/library/ubuntu:18.04@sha256:6e9f67fa63b0323e9a1e587fd71c561ba48a034504fb804fd26fd8800039835d 0.3s
=> ERROR [internal] load metadata for docker.io/library/golang:1.13.4-buster@sha256:8081f3c8700ee81291688ded9ba63e551a6290ac4617f0e2c3fd1a6487569f3f 0.2s
=> [internal] load build context 0.0s
=> CANCELED [builder 1/3] FROM docker.io/library/golang:1.13.4-buster@sha256:8081f3c8700ee81291688ded9ba63e551a6290ac4617f0e2c3fd1a6487569f3f 0.2s
=> => resolve docker.io/library/golang:1.13.4-buster@sha256:8081f3c8700ee81291688ded9ba63e551a6290ac4617f0e2c3fd1a6487569f3f 0.2s
=> ERROR [internal] helper image for file operations 0.2s
=> => resolve docker.io/docker/dockerfile-copy:v0.1.9@sha256:e8f159d3f00786604b93c675ee2783f8dc194bb565e61ca5788f6a6e9d304061 0.2s
=> CANCELED [stage-1 1/2] FROM docker.io/library/ubuntu:18.04@sha256:6e9f67fa63b0323e9a1e587fd71c561ba48a034504fb804fd26fd8800039835d 0.2s
=> => resolve docker.io/library/ubuntu:18.04@sha256:6e9f67fa63b0323e9a1e587fd71c561ba48a034504fb804fd26fd8800039835d 0.2s
------
> [internal] load metadata for docker.io/library/ubuntu:18.04@sha256:6e9f67fa63b0323e9a1e587fd71c561ba48a034504fb804fd26fd8800039835d:
------
------
> [internal] load metadata for docker.io/library/golang:1.13.4-buster@sha256:8081f3c8700ee81291688ded9ba63e551a6290ac4617f0e2c3fd1a6487569f3f:
------
------
> [internal] helper image for file operations:
------
docker.io/docker/dockerfile-copy:v0.1.9@sha256:e8f159d3f00786604b93c675ee2783f8dc194bb565e61ca5788f6a6e9d304061 not found
```

This seems to be an instance of moby/buildkit#606. It looks like, from that thread, this was fixed in Docker 19.03 (docker-archive/engine#212, https://docs.docker.com/engine/release-notes/#19030), but the GKE nodes are still running an older version.
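One possible short-term workaround, assuming the failure really is specific to the BuildKit code path, would be to fall back to the classic builder on the affected step by setting DOCKER_BUILDKIT=0. A sketch against a made-up build step (the command shown is a placeholder, not the actual pipeline):

```yaml
# Sketch only: force the legacy builder until the GKE nodes pick up a Docker
# release that includes the BuildKit fix (19.03+).
steps:
  - command: ./build-server-image.sh        # placeholder build command
    label: ':docker:'
    env:
      DOCKER_BUILDKIT: "0"                  # sidestep moby/buildkit#606
      VERSION: $BUILDKITE_BUILD_NUMBER
```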
It seems like running a dind sidecar alongside each buildkite-agent would be the better approach, then.
Agreed. It will also give us better resource isolation if we choose to run more than one agent per node. I was quite worried about multiple agents with no isolation on the system docker daemon. I will set up a sidecar today. @ggilmore, out of interest, how do you think that compares to baking dind into the buildkite-agent image?
I think it'd be easier to administer a sidecar since it'd be a first-class k8s object: we can apply resource constraints through YAML, view historical resource usage, easily grab logs from it, etc. All of these are pieces of functionality that we'd have to re-implement if it were baked into the same container.
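To make that concrete, a hedged sketch of the sidecar arrangement (the image tags, TLS setting, and resource numbers are assumptions for illustration, not the actual manifests):

```yaml
# Sketch: buildkite-agent pod with a docker-in-docker sidecar.
apiVersion: v1
kind: Pod
metadata:
  name: buildkite-agent-dind
  namespace: buildkite
spec:
  containers:
    - name: buildkite-agent
      image: buildkite/agent:3          # placeholder tag
      env:
        # Builds talk to the sidecar daemon instead of the host daemon.
        - name: DOCKER_HOST
          value: tcp://localhost:2375
    - name: dind
      image: docker:19.03-dind          # a release containing the BuildKit fix
      securityContext:
        privileged: true                # dind requires a privileged container
      env:
        # Disable TLS so the daemon listens on plain tcp://localhost:2375.
        - name: DOCKER_TLS_CERTDIR
          value: ""
      resources:
        limits:                         # illustrative isolation numbers
          cpu: "4"
          memory: 8Gi
      volumeMounts:
        - name: dind-storage
          mountPath: /var/lib/docker
  volumes:
    - name: dind-storage
      emptyDir: {}
```

Since the build containers run inside the dind daemon, their resource usage is accounted to the sidecar's cgroup and bounded by its limits, rather than competing on a shared host daemon.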
Dear all, this is your release captain speaking. 🚂🚂🚂 Branch cut for the 3.11 release is scheduled for tomorrow. Is this issue / PR going to make it in time? Please change the milestone accordingly. Thank you.
I kept running into roadblocks on this, plus it has a very slow turnaround time for testing changes. For context, I tried introducing a […]. This allows us to use the host docker daemon and get the resource isolation we want (pretty nifty). I had to disable […]. I'm gonna timebox 1 more hour on this. If that fails, I am gonna defer working on this until a later stage, after we have discussed this further. Note: to get this to work, I also had to […].
@nicksnyder: Did you add this to the 3.14 milestone intentionally? I don't believe we have the bandwidth for it this milestone.
When we were talking about e2e issues and looking for an owner, this was one of the issues that I wanted that person to take a look at. Given that e2e doesn't currently feel like a pain point and customer issues are more important, I am OK moving this to the backlog.
@sourcegraph/distribution I believe you have been looking into this, so I'm changing labels.
Closing in favor of https://github.com/sourcegraph/sourcegraph/issues/12101
We currently run one docker daemon for the whole cluster (technically two: one for building Docker images and a separate one for e2e tests). This is due to limitations of using kubernetes + buildkite, and it means we can only run one e2e test at a time. I propose we have a docker daemon per buildkite agent (this would also be more in line with how it would work on other CI systems). To achieve this, I propose we move away from Kubernetes and instead use a VM per agent.
Previously we attempted docker-in-docker (which failed due to resource constraints) and using the host docker (which failed due to an old version and contention). Having all the resources of a whole VM is a much more understandable situation, and it also allows for simpler debugging (simple ssh).
We currently rely on kubernetes to do autoscaling (with some help from a script we maintain that monitors the build queue). This functionality will likely need to be ported.
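As a rough sketch of what a per-agent VM could boot with under that proposal (cloud-init style; the package names, bootstrap script path, queue tag, and token placeholder are all illustrative assumptions, not the actual setup):

```yaml
#cloud-config
# Sketch: one VM = one buildkite-agent plus its own docker daemon.
package_update: true
packages:
  - docker.io                              # assumes a Debian/Ubuntu image
write_files:
  - path: /etc/buildkite-agent/buildkite-agent.cfg
    content: |
      token="REPLACE_WITH_AGENT_TOKEN"     # injected from secret management
      tags="queue=vm"                      # hypothetical queue for VM agents
runcmd:
  - systemctl enable --now docker
  # Install the agent per the Buildkite docs; this script path is a placeholder.
  - /opt/ci/install-buildkite-agent.sh
  - systemctl enable --now buildkite-agent
```

Autoscaling would then mean growing or shrinking the pool of these VMs based on queue depth, which is the piece that would replace the current Kubernetes-based script.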
RFC 79: Refactor CI Pipeline