This repository has been archived by the owner on Sep 30, 2024. It is now read-only.

CI infrastructure that can run Docker containers in a reliable way #6887

Closed
keegancsmith opened this issue Nov 27, 2019 · 15 comments
Labels
e2e-tests (everything related to our E2E test suite), ops & tools & dev

Comments

@keegancsmith
Member

We currently run one docker daemon for the whole cluster (technically two: one for building Docker images and a separate one for e2e tests). This is due to limitations of using kubernetes + buildkite, and it means we can only run one e2e test at a time. I propose we run a docker daemon per buildkite agent (this would also be more in line with how it would work on other CI systems). To achieve this I propose we move away from Kubernetes, and instead use a VM per agent.

Previously we attempted docker-in-docker (failed due to resource constraints) and using the host docker (failed due to an old version and contention). Having all the resources of a whole VM is a much easier situation to reason about, and it also allows for simpler debugging (plain ssh).
We currently rely on kubernetes for autoscaling (with some help from a script we maintain that monitors the build queue). This functionality will likely need to be ported.
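
For reference, the ported version of that script could look something along these lines; a minimal sketch only, assuming the Buildkite REST API and a GCE managed instance group (the org slug, group name, zone, and sizing policy below are made up):

#!/usr/bin/env bash
# Sketch of a queue-watching autoscaler; names and sizing policy are assumptions.
set -euo pipefail

ORG="sourcegraph"          # hypothetical Buildkite organization slug
GROUP="buildkite-agents"   # hypothetical GCE managed instance group
ZONE="us-central1-f"       # hypothetical zone

# Count builds that are waiting for an agent.
scheduled=$(curl -fsS -H "Authorization: Bearer $BUILDKITE_API_TOKEN" \
  "https://api.buildkite.com/v2/organizations/$ORG/builds?state=scheduled" | jq 'length')

# One VM per queued build, clamped between 1 and 10 agents.
desired=$(( scheduled < 1 ? 1 : (scheduled > 10 ? 10 : scheduled) ))

gcloud compute instance-groups managed resize "$GROUP" --size "$desired" --zone "$ZONE"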

RFC 79: Refactor CI Pipeline

@slimsag
Member

slimsag commented Nov 27, 2019

By "move away from Kubernetes" do you mean just for the buildkite agents, for buildkite in total, or something else?

Is this something like https://github.com/macstadium/vmkite but for GCP?

@keegancsmith
Member Author

Just for the buildkite-agents. So we'd run an instance pool of buildkite-agents, not just-in-time spinning up of agents.

@keegancsmith
Member Author

Thought about this a bit more, and we'd have to start caring about a few things we previously didn't, like how to handle processes dying, switching to something clunky like Vagrant, etc.

On a whim I checked which version of docker is now running on GCE, and it is very new! This means we can switch back to using the docker daemon on the host. We then just need to ensure we don't overload a single node, and we can scale everything out. Potentially we could even switch to a DaemonSet for buildkite-agent like I tried in the past.
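
For context, using the host docker from the agents roughly means mounting the node's docker socket into the agent pods; a minimal sketch of what such a DaemonSet might look like (image tag and secret names are assumptions, not our actual config):

# Sketch only: buildkite-agent as a DaemonSet using the node's docker daemon.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: buildkite-agent
  namespace: buildkite
spec:
  selector:
    matchLabels:
      app: buildkite-agent
  template:
    metadata:
      labels:
        app: buildkite-agent
    spec:
      containers:
        - name: buildkite-agent
          image: buildkite/agent:3             # assumed image tag
          env:
            - name: BUILDKITE_AGENT_TOKEN
              valueFrom:
                secretKeyRef:                  # assumed secret name/key
                  name: buildkite
                  key: agent-token
          volumeMounts:
            - name: docker-sock
              mountPath: /var/run/docker.sock  # builds go to the host daemon
      volumes:
        - name: docker-sock
          hostPath:
            path: /var/run/docker.sock
            type: Socket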

@keegancsmith
Member Author

Tried out using the system docker, but couldn't get it to build the server image. It fails at the docker build step with an error that doesn't really say what went wrong: https://buildkite.com/sourcegraph/sourcegraph-e2e/builds/102#d6b331f4-a8bd-4783-b4d0-2705a26ab44f

@ggilmore can you see what is going wrong? Next up I'll try DIND instead.

@keegancsmith
Member Author

Out of interest, I used a different way to test this without disrupting CI. I modified the agent config so we had a new deployment that listened on a different queue:

diff --git a/kubernetes/ci/buildkite/buildkite-agent/buildkite-agent.Deployment.yaml b/kubernetes/ci/buildkite/buildkite-agent/buildkite-agent.Deployment.yaml
index 07f900489..9426900fa 100644
--- a/kubernetes/ci/buildkite/buildkite-agent/buildkite-agent.Deployment.yaml
+++ b/kubernetes/ci/buildkite/buildkite-agent/buildkite-agent.Deployment.yaml
@@ -4,17 +4,17 @@ metadata:
   annotations:
     description: Agent for running CI builds.
   labels:
-    app: buildkite-agent
+    app: buildkite-agent-test
     deploy: buildkite
-  name: buildkite-agent
+  name: buildkite-agent-test
   namespace: buildkite
 spec:
   minReadySeconds: 10
-  replicas: 5
+  replicas: 1
   revisionHistoryLimit: 10
   selector:
     matchLabels:
-      app: buildkite-agent
+      app: buildkite-agent-test
   strategy:
     rollingUpdate:
       maxSurge: 50%
@@ -23,10 +23,12 @@ spec:
   template:
     metadata:
       labels:
-        app: buildkite-agent
+        app: buildkite-agent-test
     spec:
       containers:
       - env:
+        - name: BUILDKITE_AGENT_TAGS
+          value: queue=testing
         - name: BUILDKITE_AGENT_TOKEN
           value: a56589af5160e4c3c9c15323932dff37c5125e701c9369a207
         - name: BUILDKITE_TIMESTAMP_LINES

I then pushed a commit to sourcegraph which specified the use of that queue:

diff --git a/.buildkite/pipeline.e2e.yml b/.buildkite/pipeline.e2e.yml
index 2f5ea90118..b87be6aaaa 100644
--- a/.buildkite/pipeline.e2e.yml
+++ b/.buildkite/pipeline.e2e.yml
@@ -16,6 +16,8 @@ steps:
     VERSION: $BUILDKITE_BUILD_NUMBER
     PUPPETEER_SKIP_CHROMIUM_DOWNLOAD: "true"
   label: ':docker:'
+  agents:
+    queue: testing
 
 - wait
 
@@ -26,9 +28,13 @@ steps:
   env:
     IMAGE: sourcegraph/server:e2e_$BUILDKITE_BUILD_NUMBER
   label: ':chromium:'
+  agents:
+    queue: testing
 
 - wait
 
 - command: docker image rm -f sourcegraph/server:e2e_$BUILDKITE_BUILD_NUMBER
   label: ':sparkles:'
   soft_fail: true
+  agents:
+    queue: testing

@ggilmore
Contributor

ggilmore commented Dec 5, 2019

@keegancsmith

I was able to reproduce this with this simple setup:

Dockerfile:

FROM golang:1.13.4-buster@sha256:8081f3c8700ee81291688ded9ba63e551a6290ac4617f0e2c3fd1a6487569f3f as builder

COPY main.go main.go

RUN go build -o /usr/local/bin/hello main.go

FROM ubuntu:18.04@sha256:6e9f67fa63b0323e9a1e587fd71c561ba48a034504fb804fd26fd8800039835d

COPY --from=builder /usr/local/bin/hello /usr/local/bin/hello

CMD ["hello"]

main.go

package main

import "fmt"

func main() {
    fmt.Println("hello world")
}

docker build output on GKE node:

> DOCKER_BUILDKIT=1 docker build -t sg-test .
[+] Building 0.6s (8/11)
 => [internal] load .dockerignore                                                                                                                                                                                                                                                   0.0s
 => => transferring context: 2B                                                                                                                                                                                                                                                     0.0s
 => [internal] load build definition from Dockerfile                                                                                                                                                                                                                                0.0s
 => => transferring dockerfile: 392B                                                                                                                                                                                                                                                0.0s
 => ERROR [internal] load metadata for docker.io/library/ubuntu:18.04@sha256:6e9f67fa63b0323e9a1e587fd71c561ba48a034504fb804fd26fd8800039835d                                                                                                                                       0.3s
 => ERROR [internal] load metadata for docker.io/library/golang:1.13.4-buster@sha256:8081f3c8700ee81291688ded9ba63e551a6290ac4617f0e2c3fd1a6487569f3f                                                                                                                               0.2s
 => [internal] load build context                                                                                                                                                                                                                                                   0.0s
 => CANCELED [builder 1/3] FROM docker.io/library/golang:1.13.4-buster@sha256:8081f3c8700ee81291688ded9ba63e551a6290ac4617f0e2c3fd1a6487569f3f                                                                                                                                      0.2s
 => => resolve docker.io/library/golang:1.13.4-buster@sha256:8081f3c8700ee81291688ded9ba63e551a6290ac4617f0e2c3fd1a6487569f3f                                                                                                                                                       0.2s
 => ERROR [internal] helper image for file operations                                                                                                                                                                                                                               0.2s
 => => resolve docker.io/docker/dockerfile-copy:v0.1.9@sha256:e8f159d3f00786604b93c675ee2783f8dc194bb565e61ca5788f6a6e9d304061                                                                                                                                                      0.2s
 => CANCELED [stage-1 1/2] FROM docker.io/library/ubuntu:18.04@sha256:6e9f67fa63b0323e9a1e587fd71c561ba48a034504fb804fd26fd8800039835d                                                                                                                                              0.2s
 => => resolve docker.io/library/ubuntu:18.04@sha256:6e9f67fa63b0323e9a1e587fd71c561ba48a034504fb804fd26fd8800039835d                                                                                                                                                               0.2s
------
 > [internal] load metadata for docker.io/library/ubuntu:18.04@sha256:6e9f67fa63b0323e9a1e587fd71c561ba48a034504fb804fd26fd8800039835d:
------
------
 > [internal] load metadata for docker.io/library/golang:1.13.4-buster@sha256:8081f3c8700ee81291688ded9ba63e551a6290ac4617f0e2c3fd1a6487569f3f:
------
------
 > [internal] helper image for file operations:
------
docker.io/docker/dockerfile-copy:v0.1.9@sha256:e8f159d3f00786604b93c675ee2783f8dc194bb565e61ca5788f6a6e9d304061 not found

This seems to be an instance of moby/buildkit#606. From that thread, it looks like this was fixed in Docker 19.03 (docker-archive/engine#212, https://docs.docker.com/engine/release-notes/#19030), but the GKE nodes are still running 18.09.7.

@ggilmore
Contributor

ggilmore commented Dec 5, 2019

It seems like running a dind sidecar is the best way out of this - it'll allow us to use whatever docker version we want.

@keegancsmith
Member Author

It seems like running a dind sidecar is the best way out of this - it'll allow us to use whatever docker version we want.

Agreed. It will also give us better resource isolation if we choose to run more than one agent per node. I was quite worried about running multiple agents with no isolation on the system docker daemon. I will set up a sidecar today. @ggilmore, out of interest, how do you think that compares to baking dind into the buildkite-agent image?

@ggilmore
Contributor

ggilmore commented Dec 6, 2019

@keegancsmith

I think it'd be easier to administer a sidecar since it'd be a first-class k8s object - we can apply resource constraints through yaml, view historical resource usage, easily grab logs from it, etc. All of these are pieces of functionality that we'd have to re-implement if it were baked into the same container.
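
For illustration, the sidecar would roughly be an extra container in the existing buildkite-agent Deployment's pod template, along these lines; a sketch only, with assumed image tags, port, and resource numbers:

# Sketch only: pod template fragment adding a dind sidecar to buildkite-agent.
spec:
  containers:
    - name: buildkite-agent
      image: buildkite/agent:3           # assumed tag
      env:
        - name: DOCKER_HOST
          value: tcp://localhost:2375    # talk to the sidecar instead of the node
    - name: dind
      image: docker:19.03-dind           # lets us pick the docker version we want
      securityContext:
        privileged: true                 # dind needs a privileged container
      env:
        - name: DOCKER_TLS_CERTDIR
          value: ""                      # plain TCP on localhost only
      resources:                         # constraints applied through yaml, as noted above
        limits:
          cpu: "4"
          memory: 8Gi
      volumeMounts:
        - name: dind-storage
          mountPath: /var/lib/docker
  volumes:
    - name: dind-storage
      emptyDir: {}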

@beyang
Member

beyang commented Dec 14, 2019

Dear all,

This is your release captain speaking. 🚂🚂🚂

Branch cut for the 3.11 release is scheduled for tomorrow.

Is this issue / PR going to make it in time? Please change the milestone accordingly.
When in doubt, reach out!

Thank you

@keegancsmith keegancsmith removed this from the 3.11 milestone Dec 17, 2019
@keegancsmith
Member Author

I kept running into roadblocks on this, plus it has a very slow turnaround time for testing changes. For context, I tried introducing a docker wrapper script which adds the current cgroup to the command-line arguments as --cgroup-parent: https://gist.github.com/keegancsmith/74ee9c5dcc887a9b81ab255baecb8b55

This allows us to use the host docker daemon and still get the resource isolation we want (pretty nifty). I had to disable DOCKER_BUILDKIT since the docker daemon on our nodes has a bug that prevents it from building our Dockerfiles.
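
Roughly, the wrapper figures out which cgroup the calling agent container is in and forwards it to docker run; a minimal sketch of the idea (not the contents of the linked gist):

#!/usr/bin/env bash
# Sketch only: wrap docker so containers started via the host daemon are
# accounted against this pod's cgroup, giving per-pod resource isolation.
set -euo pipefail

# Find the cgroup this (containerized) process is running in.
cgroup="$(awk -F: '/memory/ {print $3; exit}' /proc/self/cgroup)"

if [ "${1:-}" = "run" ]; then
  shift
  exec /usr/bin/docker run --cgroup-parent="$cgroup" "$@"
fi

exec /usr/bin/docker "$@"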

I'm gonna timebox 1 more hour on this. If that fails, I am gonna defer working on this until a later stage after we have discussed this further.

Note: To get this to work, I also had to
I am gonna try it out one more time

@nicksnyder nicksnyder added the e2e-tests and team/core-services labels Jan 20, 2020
@nicksnyder nicksnyder added this to the 3.14 milestone Jan 27, 2020
@tsenart tsenart modified the milestones: 3.14, Backlog Feb 20, 2020
@tsenart
Contributor

tsenart commented Feb 20, 2020

@nicksnyder: Did you add this to the 3.14 milestone intentionally? I don't believe we have the bandwidth this milestone.

@nicksnyder
Contributor

When we were talking about e2e issues and looking for an owner, this was one of the issues that I wanted that person to take a look at. Given the e2e doesn’t currently feel like a pain point and customer issues are more important, I am ok moving to backlog.

@keegancsmith keegancsmith removed their assignment Feb 20, 2020
@keegancsmith
Member Author

@sourcegraph/distribution I believe you have been looking into this, so I'm changing the labels.

@slimsag slimsag changed the title ci: Horizontally scaleable CI ci: Horizontally scaleable and reliable CI that can run Docker containers Jun 29, 2020
@slimsag slimsag removed this from the Backlog milestone Jul 1, 2020
@slimsag slimsag changed the title ci: Horizontally scaleable and reliable CI that can run Docker containers CI infrastructure that can run Docker containers in a reliable way Jul 1, 2020
@slimsag
Member

slimsag commented Jul 20, 2020

@slimsag slimsag closed this as completed Jul 20, 2020