DockerMachine Runner stuck (sudo: docker: command not found) #470

Closed
gerbenoostra opened this issue Apr 15, 2022 · 5 comments

gerbenoostra commented Apr 15, 2022

I'm on version 4.41.1 of the module, but my runner gets stuck:

Running with gitlab-runner 14.8.2 (c6e7e194)
  on cluster-docker-default _ey-_Zvr
Preparing the "docker+machine" executor

I have the following variable definitions:

gitlab_runner_version               = "14.8.2"
gitlab_runners_name                 = "cluster-docker-default"
gitlab_url                          = "https://gitlab.com"
gitlab_build_image                  = "docker:20.10.12"
gitlab_docker_machine_version       = "0.16.2-gitlab.12"
gitlab_runner_instance_type         = "t3a.nano"
gitlab_docker_machine_instance_type = "t3a.medium"
gitlab_runner_ami_filter = {
  name = ["amzn2-ami-hvm-2.*-x86_64-ebs"]
}

With the following setup:

locals {
  gitlab_env = "${var.application}-spot-runners-${var.environment.shortname}"
}

module "gitlab-runner" {
  # https://registry.terraform.io/modules/npalm/gitlab-runner/aws/
  source  = "npalm/gitlab-runner/aws"
  version = "4.41.1"

  aws_region  = var.region
  environment = local.gitlab_env

  vpc_id                   = module.vpc.vpc_id
  subnet_ids_gitlab_runner = module.vpc.private_subnet_ids
  subnet_id_runners        = element(module.vpc.public_subnet_ids, 0)
  metrics_autoscaling      = ["GroupDesiredCapacity", "GroupInServiceCapacity"]

  runners_name             = var.gitlab_runners_name
  runners_gitlab_url       = var.gitlab_url
  enable_runner_ssm_access = true

  cache_bucket_name_include_account_id = false
  cache_bucket_prefix                  = local.gitlab_env
  cache_expiration_days                = 8

  gitlab_runner_security_group_ids = [data.aws_security_group.default.id]

  gitlab_runner_version  = var.gitlab_runner_version
  runners_image          = var.gitlab_build_image
  docker_machine_version = var.gitlab_docker_machine_version

  instance_type                = var.gitlab_runner_instance_type
  docker_machine_instance_type = var.gitlab_docker_machine_instance_type

  ami_filter        = var.gitlab_runner_ami_filter
  runner_ami_filter = var.gitlab_docker_machine_ami_filter

  runner_instance_spot_price    = "on-demand-price"
  docker_machine_spot_price_bid = "on-demand-price"

  gitlab_runner_registration_config = {
    registration_token = var.gitlab_registration_token
    tag_list           = "docker_spot_runner"
    description        = "runner default - auto"
    locked_to_project  = "true"
    run_untagged       = "true"
    maximum_timeout    = "3600"
  }

  runners_privileged         = "true"
  runners_additional_volumes = ["/certs/client"]

  runners_volumes_tmpfs = [
    {
      volume  = "/var/opt/cache",
      options = "rw,noexec"
    }
  ]

  runners_services_volumes_tmpfs = [
    {
      volume  = "/var/lib/mysql",
      options = "rw,noexec"
    }
  ]

  # scheduling for the agent / gitlab runner
  enable_schedule = true
  schedule_config = {
    scale_in_recurrence  = "0 18 * * 1-5"
    scale_in_count       = 0
    scale_out_recurrence = "0 8 * * 1-5"
    scale_out_count      = 1
  }
  # recurrence is in UTC timezone(!)

  tags        = merge(local.tags_gitlab, { gitlab : true })
  agent_tags  = merge(local.tags_gitlab, { gitlab : "agent" })
  runner_tags = merge(local.tags_gitlab, { gitlab : "runner" })

  runners_monitoring = true
  runners_concurrent = 2  # upper limit of runners
  runners_idle_time  = 60 # in seconds
  runners_idle_count = 0  # minimum nr of idle runners
  cache_shared       = true

  enable_cloudwatch_logging            = true
  cloudwatch_logging_retention_in_days = 90
  runners_pre_build_script = <<EOT
  '''
  echo 'multiline 1'
  echo 'multiline 2'
  '''
  EOT

  runners_post_build_script = "\"echo 'single line'\""
}

resource "null_resource" "cancel_spot_requests" {
  # Cancel active and open spot requests, terminate instances
  triggers = {
    environment = local.gitlab_env
  }

  provisioner "local-exec" {
    when    = destroy
    command = "AWS_REGION=us-east-1 .terraform/modules/gitlab-runner/bin/cancel-spot-instances.sh ${self.triggers.environment}"
  }
}

The logs (in the log group instance_id/messages) show the following (each line is prefixed with gitlab-runner:):

{"driver":"amazonec2","level":"info","msg":"(runner-ey-zvr-runner-1650016364-03cafbdc) Created spot instance request sir-r4x6bbzn","name":"runner-ey-zvr-runner-1650016364-03cafbdc","operation":"create","time":"2022-04-15T09:53:02Z"}
{"driver":"amazonec2","level":"info","msg":"Waiting for machine to be running, this may take a few minutes...","name":"runner-ey-zvr-runner-1650016364-03cafbdc","operation":"create","time":"2022-04-15T09:53:02Z"}
{"driver":"amazonec2","level":"info","msg":"Detecting operating system of created instance...","name":"runner-ey-zvr-runner-1650016364-03cafbdc","operation":"create","time":"2022-04-15T09:53:02Z"}
{"driver":"amazonec2","level":"info","msg":"Waiting for SSH to be available...","name":"runner-ey-zvr-runner-1650016364-03cafbdc","operation":"create","time":"2022-04-15T09:53:02Z"}
{"driver":"amazonec2","level":"info","msg":"Detecting the provisioner...","name":"runner-ey-zvr-runner-1650016364-03cafbdc","operation":"create","time":"2022-04-15T09:53:28Z"}
{"driver":"amazonec2","level":"info","msg":"Provisioning with ubuntu(systemd)...","name":"runner-ey-zvr-runner-1650016364-03cafbdc","operation":"create","time":"2022-04-15T09:53:28Z"}
{"creating":1,"idle":0,"idleCount":0,"idleCountMin":0,"idleScaleFactor":0,"level":"info","maxMachineCreate":0,"maxMachines":0,"msg":"IdleCount is set to 0 so the machine will be created on demand in job context","removing":0,"runner":"_ey-_Zvr","time":"2022-04-15T09:53:3..}
{"driver":"amazonec2","level":"info","msg":"Installing Docker...","name":"runner-ey-zvr-runner-1650016364-03cafbdc","operation":"create","time":"2022-04-15T09:54:02Z"}

After that, it repeatedly shows:

{"driver":"amazonec2","level":"info","msg":"Error getting SSH command to check if the daemon is up: ssh command error:","name":"runner-ey-zvr-runner-1650016364-03cafbdc","operation":"create","time":"2022-04-15T10:04:56Z"}
{"driver":"amazonec2","level":"info","msg":"command : sudo docker version","name":"runner-ey-zvr-runner-1650016364-03cafbdc","operation":"create","time":"2022-04-15T10:04:56Z"}
{"driver":"amazonec2","level":"info","msg":"err : exit status 1","name":"runner-ey-zvr-runner-1650016364-03cafbdc","operation":"create","time":"2022-04-15T10:04:56Z"}
{"driver":"amazonec2","level":"info","msg":"output : sudo: docker: command not found","name":"runner-ey-zvr-runner-1650016364-03cafbdc","operation":"create","time":"2022-04-15T10:04:56Z"}
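
To see which message dominates a log like this, one option is to count the distinct "msg" values. The following is only a minimal sketch: it assumes the log lines were exported to a local file, and the function name summarize_msgs and the file name messages.log are made up for illustration. It relies on grep -o (available in GNU and BSD grep) rather than a JSON parser, so a message containing escaped quotes would be cut short.

```shell
#!/bin/sh
# Hypothetical helper: read docker-machine JSON log lines on stdin and print
# each distinct "msg" value with its number of occurrences, most frequent first.
# Assumption: messages contain no escaped double quotes.
summarize_msgs() {
  grep -o '"msg":"[^"]*"' \
    | sed -e 's/^"msg":"//' -e 's/"$//' \
    | sort | uniq -c | sort -rn
}

# Example usage (messages.log is an assumed local export of the log group):
#   summarize_msgs < messages.log
```

A message such as "output : sudo: docker: command not found" at the top of the result would indicate that the Docker installation on the spawned machine never completed.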

The /user-data log shows (among other things):

Installing : git-core-2.32.0-1.amzn2.0.1.x86_64                           1/8 
Installing : git-core-doc-2.32.0-1.amzn2.0.1.noarch                       2/8 
Installing : 1:perl-Error-0.17020-2.amzn2.noarch                          3/8 
Installing : 1:emacs-filesystem-27.2-4.amzn2.0.1.noarch                   4/8 
Installing : perl-TermReadKey-2.30-20.amzn2.0.2.x86_64                    5/8 
Installing : perl-Git-2.32.0-1.amzn2.0.1.noarch                           6/8 
Installing : git-2.32.0-1.amzn2.0.1.x86_64 7/8 
Installing : gitlab-runner-14.8.2-1.x86_64 8/8 
GitLab Runner: creating gitlab-runner...
Home directory skeleton not used
gitlab-runner: the service is not installed
gitlab-ci-multi-runner: the service is not installed
INFO: Docker installation not found, skipping clear-docker-cache
  Verifying  : perl-TermReadKey-2.30-20.amzn2.0.2.x86_64                    1/8 
  Verifying  : gitlab-runner-14.8.2-1.x86_64                                2/8 
  Verifying  : git-core-doc-2.32.0-1.amzn2.0.1.noarch                       3/8 
  Verifying  : perl-Git-2.32.0-1.amzn2.0.1.noarch                           4/8 
  Verifying  : 1:emacs-filesystem-27.2-4.amzn2.0.1.noarch                   5/8 
  Verifying  : git-2.32.0-1.amzn2.0.1.x86_64                                6/8 
  Verifying  : git-core-2.32.0-1.amzn2.0.1.x86_64                           7/8 
  Verifying  : 1:perl-Error-0.17020-2.amzn2.noarch                          8/8 
Installed: gitlab-runner.x86_64 0:14.8.2-1 

Note the "Docker installation not found" message.

Any ideas?

@gerbenoostra gerbenoostra changed the title Runner stuck DockerMachine Runner stuck Apr 15, 2022
@gerbenoostra gerbenoostra changed the title DockerMachine Runner stuck DockerMachine Runner stuck (sudo: docker: command not found) Apr 19, 2022
@gerbenoostra
Author

I'm using docker:20.10.12 as the GitLab build image, which is mentioned in the following issues as being problematic:

@gerbenoostra
Author

Following another proposal, I added:

  userdata_pre_install = <<EOT
echo "=========="
echo "userdata pre script"
cat /var/lib/cloud/instance/scripts/part-001
echo "=========="
cat > /etc/gitlab-runner/docker_rebuild.txt <<- EOF
runcmd:
  - |
    while sleep 1; do
      if [ -e /etc/systemd/system/docker.service.d/10-machine.conf ]; then
        sleep 15
        systemctl restart docker
        break
      fi
    done &
EOF
echo "=========="
EOT
  userdata_post_install = "echo 'userdata post install'"

  docker_machine_options = [
    "amazonec2-userdata=/etc/gitlab-runner/docker_rebuild.txt"
  ]

Though this runs successfully on the GitLab Runner agent (the EC2 instance gets launched), it did not resolve the issue.
I also don't see logs from the spawned build runners in CloudWatch, so I can't diagnose further.

@gerbenoostra
Author

In another attempt, I reverted my changes to the images and instance types and used version 4.41.1 as specified in the runner-default example, keeping everything at the defaults.

However, that gave the same issue:
GitLab shows the job stuck at Preparing the "docker+machine" executor,
and the CloudWatch logs show Error getting SSH command to check if the daemon is up, with sudo: docker: command not found.

@npalm
Collaborator

npalm commented Apr 25, 2022

Will run a check later this week, sorry really busy.

@gerbenoostra
Author

I had to change

  subnet_ids_gitlab_runner = module.vpc.private_subnet_ids
  subnet_id_runners        = element(module.vpc.public_subnet_ids, 0)

into

  subnet_ids_gitlab_runner = module.vpc.private_subnets
  subnet_id_runners        = element(module.vpc.private_subnets, 0)

as apparently putting the runners in the public subnet left them without an (outbound) internet connection, so Docker could not actually be installed on them.
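
For anyone hitting the same symptom, a quick way to confirm this kind of problem is to test outbound access from an instance in the runners' subnet. The snippet below is a rough sketch, not part of the module. It assumes curl is installed on the instance and that docker-machine downloads its Docker install script from https://get.docker.com (believed to be the usual default; adjust the URL if you override the engine install URL). The function name check_outbound is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: report whether an HTTPS endpoint is reachable from this host.
# Assumption: https://get.docker.com is where the Docker install script comes from.
check_outbound() {
  url=${1:-https://get.docker.com}
  # --fail makes curl return non-zero on HTTP errors; --max-time bounds the wait.
  if curl --silent --show-error --fail --max-time 10 --output /dev/null "$url"; then
    echo "outbound OK: $url"
  else
    echo "no outbound access to: $url"
    return 1
  fi
}

# Example usage on an instance in the runners' subnet:
#   check_outbound
#   check_outbound https://gitlab.com
```

If the check fails in the subnet used for subnet_id_runners, the spawned machines most likely cannot download Docker either, which matches the repeated "sudo: docker: command not found" error.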
