This repository has been archived by the owner on Sep 30, 2024. It is now read-only.

CI infrastructure that can run Docker containers in a reliable way #6887

Closed
keegancsmith opened this issue Nov 27, 2019 · 15 comments
Labels
e2e-tests (everything related to our E2E test suite), ops & tools & dev

Comments

@keegancsmith
Member

We currently run one docker daemon for the whole cluster (technically two: one for building Docker images and a separate one for e2e tests). This is due to limitations of using kubernetes + buildkite, and it means we can only run one e2e test at a time. I propose we run a docker daemon per buildkite agent (this would also be more in line with how it would work on other CI systems). To achieve this I propose we move away from Kubernetes, and instead use a VM per agent.

Previously we attempted docker-in-docker (failed due to resource constraints) and using the host docker (failed due to an old version and contention). Having all the resources of a whole VM is a much easier situation to reason about, and it also allows for simpler debugging (plain ssh).
We currently rely on kubernetes for autoscaling (with some help from a script we maintain that monitors the build queue). This functionality will likely need to be ported.
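
For reference, the ported version of that script could look something along these lines; a minimal sketch only, assuming the Buildkite REST API and a GCE managed instance group (the org slug, group name, zone, and sizing policy below are made up):

#!/usr/bin/env bash
# Sketch of a queue-watching autoscaler; names and sizing policy are assumptions.
set -euo pipefail

ORG="sourcegraph"          # hypothetical Buildkite organization slug
GROUP="buildkite-agents"   # hypothetical GCE managed instance group
ZONE="us-central1-f"       # hypothetical zone

# Count builds that are waiting for an agent.
scheduled=$(curl -fsS -H "Authorization: Bearer $BUILDKITE_API_TOKEN" \
  "https://api.buildkite.com/v2/organizations/$ORG/builds?state=scheduled" | jq 'length')

# One VM per queued build, clamped between 1 and 10 agents.
desired=$(( scheduled < 1 ? 1 : (scheduled > 10 ? 10 : scheduled) ))

gcloud compute instance-groups managed resize "$GROUP" --size "$desired" --zone "$ZONE"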

RFC 79: Refactor CI Pipeline

@slimsag
Member

slimsag commented Nov 27, 2019

By "move away from Kubernetes" do you mean just for the buildkite agents, for buildkite in total, or something else?

Is this something like https://github.com/macstadium/vmkite but for GCP?

@keegancsmith
Member Author

Just for the buildkite-agents. So we'd run an instance pool of buildkite-agents, not just-in-time spinning up of agents.

@keegancsmith
Member Author

Thought about this a bit more, and we'd have to start caring about a few things we previously didn't, like how to handle processes dying, switching to something clunky like Vagrant, etc.

On a whim I checked which version of docker is now running on GCE, and it is very new! This means we can switch back to using the docker daemon on the host. We then just need to ensure we don't overload a single node, and we can scale everything out. Potentially we could even switch to a DaemonSet for buildkite-agent like I tried in the past.
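
For context, using the host docker from the agents roughly means mounting the node's docker socket into the agent pods; a minimal sketch of what such a DaemonSet might look like (image tag and secret names are assumptions, not our actual config):

# Sketch only: buildkite-agent as a DaemonSet using the node's docker daemon.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: buildkite-agent
  namespace: buildkite
spec:
  selector:
    matchLabels:
      app: buildkite-agent
  template:
    metadata:
      labels:
        app: buildkite-agent
    spec:
      containers:
        - name: buildkite-agent
          image: buildkite/agent:3             # assumed image tag
          env:
            - name: BUILDKITE_AGENT_TOKEN
              valueFrom:
                secretKeyRef:                  # assumed secret name/key
                  name: buildkite
                  key: agent-token
          volumeMounts:
            - name: docker-sock
              mountPath: /var/run/docker.sock  # builds go to the host daemon
      volumes:
        - name: docker-sock
          hostPath:
            path: /var/run/docker.sock
            type: Socket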

@keegancsmith
Member Author

Tried out using the system docker, but couldn't get it to build the server image. It fails at the docker build step with an error that doesn't really say what went wrong: https://buildkite.com/sourcegraph/sourcegraph-e2e/builds/102#d6b331f4-a8bd-4783-b4d0-2705a26ab44f

@ggilmore can you see what is going wrong? Next up I'll try DIND instead.

@keegancsmith
Member Author

Out of interest, I used a different way to test this without disrupting CI. I modified the agent config so we had a new deployment that listened on a different queue:

diff --git a/kubernetes/ci/buildkite/buildkite-agent/buildkite-agent.Deployment.yaml b/kubernetes/ci/buildkite/buildkite-agent/buildkite-agent.Deployment.yaml
index 07f900489..9426900fa 100644
--- a/kubernetes/ci/buildkite/buildkite-agent/buildkite-agent.Deployment.yaml
+++ b/kubernetes/ci/buildkite/buildkite-agent/buildkite-agent.Deployment.yaml
@@ -4,17 +4,17 @@ metadata:
   annotations:
     description: Agent for running CI builds.
   labels:
-    app: buildkite-agent
+    app: buildkite-agent-test
     deploy: buildkite
-  name: buildkite-agent
+  name: buildkite-agent-test
   namespace: buildkite
 spec:
   minReadySeconds: 10
-  replicas: 5
+  replicas: 1
   revisionHistoryLimit: 10
   selector:
     matchLabels:
-      app: buildkite-agent
+      app: buildkite-agent-test
   strategy:
     rollingUpdate:
       maxSurge: 50%
@@ -23,10 +23,12 @@ spec:
   template:
     metadata:
       labels:
-        app: buildkite-agent
+        app: buildkite-agent-test
     spec:
       containers:
       - env:
+        - name: BUILDKITE_AGENT_TAGS
+          value: queue=testing
         - name: BUILDKITE_AGENT_TOKEN
           value: a56589af5160e4c3c9c15323932dff37c5125e701c9369a207
         - name: BUILDKITE_TIMESTAMP_LINES

I then pushed a commit to sourcegraph which specified the use of that queue:

diff --git a/.buildkite/pipeline.e2e.yml b/.buildkite/pipeline.e2e.yml
index 2f5ea90118..b87be6aaaa 100644
--- a/.buildkite/pipeline.e2e.yml
+++ b/.buildkite/pipeline.e2e.yml
@@ -16,6 +16,8 @@ steps:
     VERSION: $BUILDKITE_BUILD_NUMBER
     PUPPETEER_SKIP_CHROMIUM_DOWNLOAD: "true"
   label: ':docker:'
+  agents:
+    queue: testing
 
 - wait
 
@@ -26,9 +28,13 @@ steps:
   env:
     IMAGE: sourcegraph/server:e2e_$BUILDKITE_BUILD_NUMBER
   label: ':chromium:'
+  agents:
+    queue: testing
 
 - wait
 
 - command: docker image rm -f sourcegraph/server:e2e_$BUILDKITE_BUILD_NUMBER
   label: ':sparkles:'
   soft_fail: true
+  agents:
+    queue: testing

@ggilmore
Contributor

ggilmore commented Dec 5, 2019

@keegancsmith

I was able to reproduce this with this simple setup:

Dockerfile:

FROM golang:1.13.4-buster@sha256:8081f3c8700ee81291688ded9ba63e551a6290ac4617f0e2c3fd1a6487569f3f as builder

COPY main.go main.go

RUN go build -o /usr/local/bin/hello main.go

FROM ubuntu:18.04@sha256:6e9f67fa63b0323e9a1e587fd71c561ba48a034504fb804fd26fd8800039835d

COPY --from=builder /usr/local/bin/hello /usr/local/bin/hello

CMD ["hello"]

main.go

package main

import "fmt"

func main() {
    fmt.Println("hello world")
}

docker build output on GKE node:

> DOCKER_BUILDKIT=1 docker build -t sg-test .
[+] Building 0.6s (8/11)
 => [internal] load .dockerignore                                                                                                                                                                                                                                                   0.0s
 => => transferring context: 2B                                                                                                                                                                                                                                                     0.0s
 => [internal] load build definition from Dockerfile                                                                                                                                                                                                                                0.0s
 => => transferring dockerfile: 392B                                                                                                                                                                                                                                                0.0s
 => ERROR [internal] load metadata for docker.io/library/ubuntu:18.04@sha256:6e9f67fa63b0323e9a1e587fd71c561ba48a034504fb804fd26fd8800039835d                                                                                                                                       0.3s
 => ERROR [internal] load metadata for docker.io/library/golang:1.13.4-buster@sha256:8081f3c8700ee81291688ded9ba63e551a6290ac4617f0e2c3fd1a6487569f3f                                                                                                                               0.2s
 => [internal] load build context                                                                                                                                                                                                                                                   0.0s
 => CANCELED [builder 1/3] FROM docker.io/library/golang:1.13.4-buster@sha256:8081f3c8700ee81291688ded9ba63e551a6290ac4617f0e2c3fd1a6487569f3f                                                                                                                                      0.2s
 => => resolve docker.io/library/golang:1.13.4-buster@sha256:8081f3c8700ee81291688ded9ba63e551a6290ac4617f0e2c3fd1a6487569f3f                                                                                                                                                       0.2s
 => ERROR [internal] helper image for file operations                                                                                                                                                                                                                               0.2s
 => => resolve docker.io/docker/dockerfile-copy:v0.1.9@sha256:e8f159d3f00786604b93c675ee2783f8dc194bb565e61ca5788f6a6e9d304061                                                                                                                                                      0.2s
 => CANCELED [stage-1 1/2] FROM docker.io/library/ubuntu:18.04@sha256:6e9f67fa63b0323e9a1e587fd71c561ba48a034504fb804fd26fd8800039835d                                                                                                                                              0.2s
 => => resolve docker.io/library/ubuntu:18.04@sha256:6e9f67fa63b0323e9a1e587fd71c561ba48a034504fb804fd26fd8800039835d                                                                                                                                                               0.2s
------
 > [internal] load metadata for docker.io/library/ubuntu:18.04@sha256:6e9f67fa63b0323e9a1e587fd71c561ba48a034504fb804fd26fd8800039835d:
------
------
 > [internal] load metadata for docker.io/library/golang:1.13.4-buster@sha256:8081f3c8700ee81291688ded9ba63e551a6290ac4617f0e2c3fd1a6487569f3f:
------
------
 > [internal] helper image for file operations:
------
docker.io/docker/dockerfile-copy:v0.1.9@sha256:e8f159d3f00786604b93c675ee2783f8dc194bb565e61ca5788f6a6e9d304061 not found

This seems to be an instance of moby/buildkit#606. From that thread, it looks like this was fixed in Docker 19.03 (docker-archive/engine#212, https://docs.docker.com/engine/release-notes/#19030), but the GKE nodes are still running 18.09.7.

@ggilmore
Contributor

ggilmore commented Dec 5, 2019

It seems like running a dind sidecar is the best way out of this - it'll allow us to use whatever docker version we want.

@keegancsmith
Member Author

It seems like running a dind sidecar is the best way out of this - it'll allow us to use whatever docker version we want.

Agreed. It will also give us better resource isolation if we choose to run more than one agent per node. I was quite worried about running multiple agents with no isolation on the system docker daemon. I will set up a sidecar today. @ggilmore, out of interest, how do you think that compares to baking dind into the buildkite-agent image?

@ggilmore
Contributor

ggilmore commented Dec 6, 2019

@keegancsmith

I think it'd be easier to administer a sidecar since it'd be a first-class k8s object - we can apply resource constraints through yaml, view historical resource usage, easily grab logs from it, etc. All of these are pieces of functionality that we'd have to re-implement if it were baked into the same container.
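
For illustration, the sidecar would roughly be an extra container in the existing buildkite-agent Deployment's pod template, along these lines; a sketch only, with assumed image tags, port, and resource numbers:

# Sketch only: pod template fragment adding a dind sidecar to buildkite-agent.
spec:
  containers:
    - name: buildkite-agent
      image: buildkite/agent:3           # assumed tag
      env:
        - name: DOCKER_HOST
          value: tcp://localhost:2375    # talk to the sidecar instead of the node
    - name: dind
      image: docker:19.03-dind           # lets us pick the docker version we want
      securityContext:
        privileged: true                 # dind needs a privileged container
      env:
        - name: DOCKER_TLS_CERTDIR
          value: ""                      # plain TCP on localhost only
      resources:                         # constraints applied through yaml, as noted above
        limits:
          cpu: "4"
          memory: 8Gi
      volumeMounts:
        - name: dind-storage
          mountPath: /var/lib/docker
  volumes:
    - name: dind-storage
      emptyDir: {}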

@beyang
Member

beyang commented Dec 14, 2019

Dear all,

This is your release captain speaking. 🚂🚂🚂

Branch cut for the 3.11 release is scheduled for tomorrow.

Is this issue / PR going to make it in time? Please change the milestone accordingly.
When in doubt, reach out!

Thank you

@keegancsmith keegancsmith removed this from the 3.11 milestone Dec 17, 2019
@keegancsmith
Member Author

I kept running into roadblocks on this, plus it has a very slow turnaround time for testing changes. For context, I tried introducing a docker wrapper script which adds the current cgroup to the command-line arguments as --cgroup-parent: https://gist.github.com/keegancsmith/74ee9c5dcc887a9b81ab255baecb8b55

This allows us to use the host docker daemon and still get the resource isolation we want (pretty nifty). I had to disable DOCKER_BUILDKIT since the docker daemon on our nodes has a bug that prevents it from building our Dockerfiles.
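
Roughly, the wrapper figures out which cgroup the calling agent container is in and forwards it to docker run; a minimal sketch of the idea (not the contents of the linked gist):

#!/usr/bin/env bash
# Sketch only: wrap docker so containers started via the host daemon are
# accounted against this pod's cgroup, giving per-pod resource isolation.
set -euo pipefail

# Find the cgroup this (containerized) process is running in.
cgroup="$(awk -F: '/memory/ {print $3; exit}' /proc/self/cgroup)"

if [ "${1:-}" = "run" ]; then
  shift
  exec /usr/bin/docker run --cgroup-parent="$cgroup" "$@"
fi

exec /usr/bin/docker "$@"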

I'm gonna timebox 1 more hour on this. If that fails, I am gonna defer working on this until a later stage after we have discussed this further.

Note: To get this to work, I also had to
I am gonna try it out one more time

@nicksnyder nicksnyder added the e2e-tests and team/core-services labels Jan 20, 2020
@nicksnyder nicksnyder added this to the 3.14 milestone Jan 27, 2020
@tsenart tsenart modified the milestones: 3.14, Backlog Feb 20, 2020
@tsenart
Contributor

tsenart commented Feb 20, 2020

@nicksnyder: Did you add this to the 3.14 milestone intentionally? I don't believe we have the bandwidth this milestone.

@nicksnyder
Contributor

When we were talking about e2e issues and looking for an owner, this was one of the issues that I wanted that person to take a look at. Given the e2e doesn’t currently feel like a pain point and customer issues are more important, I am ok moving to backlog.

@keegancsmith keegancsmith removed their assignment Feb 20, 2020
@keegancsmith
Member Author

@sourcegraph/distribution I believe you have been looking into this, so I'm changing the labels.

@slimsag slimsag changed the title ci: Horizontally scaleable CI ci: Horizontally scaleable and reliable CI that can run Docker containers Jun 29, 2020
@slimsag slimsag removed this from the Backlog milestone Jul 1, 2020
@slimsag slimsag changed the title ci: Horizontally scaleable and reliable CI that can run Docker containers CI infrastructure that can run Docker containers in a reliable way Jul 1, 2020
@slimsag
Member

slimsag commented Jul 20, 2020

@slimsag slimsag closed this as completed Jul 20, 2020