Cannot pass nodeSelector, tolerations and resources in containerMode: kubernetes #1730
@aacecandev Hey! Are you saying that the job pod created when you're using the kubernetes container mode, in addition to the runner pod, is missing those fields? Can I take it that the runner pod has all the expected fields but the job pod does not?
Hi @mumoshu. I'm working with @aacecandev and can confirm that is exactly what is happening.
Hi @mumoshu, the workflow pod doesn't have these values. The pod created by the RunnerDeployment runs with the configured fields as expected, but the workflow pod that the hook launches doesn't have the same fields. We need the workflow pod to run with these particular fields so it can be scheduled on a GPU node.
That's right @mumoshu, the above response describes exactly the buggy behavior. We think the problem could be located in the runnerdeployment_controller not receiving those fields correctly, but we're not totally sure, since we've been busy rolling back from k8s 1.23 to 1.22 and trying to achieve pass-through of the GPU from GKE to DinD (which we've achieved successfully).
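For concreteness, a sketch of the RunnerDeployment fields in question (values illustrative for a GKE GPU node pool): the runner pod honors them, but the workflow pod comes up without them.

```yaml
# Illustrative RunnerDeployment excerpt: these fields end up on the
# runner pod but are NOT propagated to the hook-created workflow pod.
spec:
  template:
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-t4
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      resources:
        limits:
          nvidia.com/gpu: 1
```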
Looking forward to having nodeSelector on RunnerSet as well, so I can assign pods to my runner nodes.
Hi @mumoshu, I recently started to use ARC and I'm noticing the same; it doesn't seem to work for me. ARC version: v0.25.2. Error:
Hi, I noticed why it was rejecting: I was passing an incorrect label. Please ignore.
Hey everyone, sorry for the delayed response. Workflow job pods are created by the runner container hooks, which are currently owned by GitHub. ARC uses the hooks without any modifications. AFAIK, the pod spec of the workflow job pods is generated in https://github.com/actions/runner-container-hooks/blob/d988d965c57642f972246e28567301f6b4c054e1/packages/k8s/src/hooks/run-container-step.ts#L78-L110 within the "k8s" runner container hooks, which we embed into our runner images via https://github.com/actions-runner-controller/actions-runner-controller/blob/18077a1e83e346a5c3f3ae57ae9b8792ceb7c292/runner/actions-runner.dockerfile#L87-L90

That said, I guess the right way forward would be to file a feature request with the runner container hooks project too, so that we can collaborate on a potential solution. Maybe we can fork/modify the hooks to accept additional envvars or a config file to customize some pod and container fields of workflow job pods. Maybe they can do it for us, which is ideal because then we don't need to repeatedly rebase our fork onto their work.
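For context, a hedged sketch of the kind of job that exercises this code path; in kubernetes container mode the hooks create a separate workflow pod for the container job below (workflow contents hypothetical):

```yaml
# Hypothetical workflow: with containerMode: kubernetes, the k8s hooks
# create a dedicated workflow pod for this container job.
name: ci
on: push
jobs:
  build:
    runs-on: self-hosted
    container:
      image: node:16
    steps:
      - run: node --version
```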
Yes, I'm pretty sure the right solution is the one described above. A couple of other important examples of fields you want to set are the security context and service account for the job.
I'd like to mention volumeMounts as another example of a field you'd want to set!
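Collecting the examples so far, a minimal sketch of the overrides one would want the hooks to accept for the workflow job pod (all names and values hypothetical):

```yaml
# Hypothetical workflow-pod overrides gathering the fields mentioned
# above: security context, service account, and volume mounts.
spec:
  serviceAccountName: ci-workflow   # hypothetical service account
  securityContext:
    runAsUser: 1000
  containers:
    - name: job
      volumeMounts:
        - name: ci-cache
          mountPath: /ci-cache
  volumes:
    - name: ci-cache
      emptyDir: {}
```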
I've taken a crack at this problem here: actions/runner-container-hooks#50. It solves it by providing functionality to pass a template file for the newly created pod.
I think I got a workaround (not the prettiest solution, but at least it works): I simply patched the RunnerDeployment resource that manages the runner pod, adding tolerations and nodeSelector values. You can do this with kubectl patch or (in my case) the kubectl_manifest Terraform resource:

```hcl
resource "kubectl_manifest" "github_actions_runner_patch" {
  depends_on         = [helm_release.github_actions_runner]
  override_namespace = "actions-runner-system"
  yaml_body          = <<YAML
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: github-agent-runner
spec:
  template:
    spec:
      nodeSelector:
        node: github-actions-runner
      tolerations:
        - key: github-actions-runner
          effect: NoSchedule
YAML
}
```
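For those not using Terraform, a minimal sketch of the equivalent patch file (same names and values as above, assuming a recent kubectl), applied with `kubectl patch runnerdeployment github-agent-runner -n actions-runner-system --type merge --patch-file runner-patch.yaml`:

```yaml
# runner-patch.yaml: merge patch carrying the same scheduling fields
spec:
  template:
    spec:
      nodeSelector:
        node: github-actions-runner
      tolerations:
        - key: github-actions-runner
          effect: NoSchedule
```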
The issue here is that when the kubernetes mode is used, those values (nodeSelector, tolerations, and so on) are only applied to the runner pod, not to the workflow pod. We use Kyverno to work around this: when a *-workflow pod is created, it mutates it to fit our needs.
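For reference, a minimal sketch of such a Kyverno mutation (policy name and field values are hypothetical; it assumes Kyverno 1.7+ and the *-workflow pod naming convention mentioned above):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: mutate-workflow-pods   # hypothetical name
spec:
  rules:
    - name: add-scheduling-fields
      match:
        any:
          - resources:
              kinds:
                - Pod
              names:
                - "*-workflow"
      mutate:
        patchStrategicMerge:
          spec:
            nodeSelector:
              node: github-actions-runner
            tolerations:
              - key: github-actions-runner
                effect: NoSchedule
```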
Looks like actions/runner-container-hooks#96 is going to solve this issue in the future.
actions/runner-container-hooks#75 solved the issue on the container-hooks side; now I guess this repository should implement its part.
This feature is crucial to our application. Any chance of addressing it in the next release?
👍 here. We need to use tolerations and nodeSelectors to target ARM nodes for faster multi-platform Docker image builds.
I've taken a stab at implementing this, guys -> #3174. Would be very grateful for a maintainer's input! 🙌
Actually, working on the implementation made me realize it's already possible without any code changes to this project, if you're willing to jump through some configuration hoops:

```dockerfile
FROM summerwind/actions-runner:v2.311.0-ubuntu-20.04

ARG RUNNER_CONTAINER_HOOKS_VERSION=0.5.0

RUN cd "$RUNNER_ASSETS_DIR" \
    && sudo rm -rf ./k8s && pwd \
    && curl -fLo runner-container-hooks.zip https://github.com/actions/runner-container-hooks/releases/download/v${RUNNER_CONTAINER_HOOKS_VERSION}/actions-runner-hooks-k8s-${RUNNER_CONTAINER_HOOKS_VERSION}.zip \
    && unzip ./runner-container-hooks.zip -d ./k8s \
    && rm -f runner-container-hooks.zip

USER runner

ENTRYPOINT ["/bin/bash", "-c"]
CMD ["entrypoint.sh"]
```

Then make sure you use this image for your runners (can be set in the Helm chart):

```yaml
image:
  actionsRunnerRepositoryAndTag: "myrepo/runner:0.5.0"
```

Next, create a ConfigMap that holds the pod template for the workflow pods:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: podtemplates
data:
  gpu.yaml: |
    spec:
      securityContext:
        runAsUser: 0
      containers:
        - name: $job # overwrites job container
          env:
            - name: POETRY_CACHE_DIR
              value: "/ci-cache/poetry"
          volumeMounts:
            - name: ci-cache
              mountPath: /ci-cache
      volumes:
        - name: ci-cache
          hostPath:
            path: /root
```
Then reference the template from your RunnerDeployment:

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: runner-gpu
spec:
  replicas: 1
  template:
    spec:
      containerMode: "kubernetes"
      labels:
        - my-runners
      # manually add this env var that points to the file location
      # of your template
      env:
        - name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
          value: "/templates/gpu.yaml"
      # we set the GPU resources on the runners and abuse the fact that
      # all GPUs become available to pods on the same node
      resources:
        limits:
          nvidia.com/gpu: 1
      # mount your configmap into your runner
      volumeMounts:
        - name: templates
          mountPath: /templates
      volumes:
        - name: templates
          configMap:
            name: podtemplates
```

And now run a CI job that uses this runner. You should then see that the template's fields are applied to the workflow job pod.
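To exercise it, a minimal sketch of a workflow targeting this runner (workflow name, labels, and image are hypothetical); the job pod it spawns should carry the template's security context, env, and mounts:

```yaml
# Hypothetical smoke test for the runner-gpu deployment above.
name: gpu-smoke-test
on: workflow_dispatch
jobs:
  smoke:
    runs-on: my-runners
    container:
      image: nvidia/cuda:12.2.0-base-ubuntu22.04   # illustrative image
    steps:
      - run: nvidia-smi   # should show the GPU passed through
```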
Any updates on this? I'm also interested in an application similar to @nielstenboom's.
Controller Version
0.25.2
Helm Chart Version
0.20.2
CertManager Version
1.9.1
Deployment Method
Helm
cert-manager installation
Cert-manager is installed using helmfile
Contents of values.yaml
Contents of helmfile.yaml
Then install it by executing `helmfile apply`.
Checks
Resource Definitions
To Reproduce
Describe the bug
Once everything has been deployed, a runner pod is created on the GPU node. This pod has the correct fields set, including:
nvidia.com/gpu: 1
After a while, the workflow is launched in a new pod, but this pod doesn't contain any of the above fields in its manifest, so I don't have GPU resources, binaries, etc. mounted in the pod.
Describe the expected behavior
It is expected that the pod running the actual workflow inherits, or can be configured with, fields such as:
nvidia.com/gpu: 1
With these fields specified on the pod, I could schedule it on a GPU-enabled node and configure the resources so the GKE NVIDIA device plugin can read the limits and pass the GPU through to the workflow pod.
Controller Logs