Runner container does not restart even if the job ends normally #77
Perhaps it is related to the issue below?
Also related: #69
This appears to be tied to when jobs land in a failed state
@ZacharyBenamram Thanks for raising another point to look at! @ahmad-hamade has submitted a detailed report related to this in #198, and it seems this happens even when the job succeeds.
As the stopped runner is eventually removed and recreated, this might be because none of our controllers detects the stopped container until the next resync period. Generally speaking, any controller that creates pods should "watch" events from those pods and enqueue a reconciliation of the parent "runner" resource on every pod event. Maybe our runner-controller isn't working like that?
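For reference, here is a minimal controller-runtime sketch of that wiring, assuming the project's Runner type and reconciler names (this is an illustration, not a quote of the actual controller code). Declaring the Runner as the owner of its Pod makes every pod event enqueue a reconcile of the parent Runner instead of waiting for the resync period:

```go
import (
	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"

	"github.com/summerwind/actions-runner-controller/api/v1alpha1"
)

// Sketch only: by owning the runner Pod, every pod create/update/delete event
// enqueues a reconcile for the parent Runner rather than waiting for a resync.
func (r *RunnerReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&v1alpha1.Runner{}). // reconcile Runner objects
		Owns(&corev1.Pod{}).     // pod events map back to the owner Runner
		Complete(r)
}
```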
Today I faced the same problem again; my job completed with the result "succeeded". Here are the logs:
@ahmad-hamade / @naka-gawa - do you see the job running and completing within a similar time window as either an "Updated registration token" event or a "the object has been modified; please apply your changes to the latest version and try again" error?
The top-right graph is a sample of stopped Docker containers - this also seems to align with the workqueue_depth dropping to zero. The sync period is set to 5-minute intervals, as seen in the spacing between emitted metrics. Also note the spike in workqueue_total_retries during the period of stopped containers.
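The 5-minute spacing in those metrics corresponds to the manager's resync period. As a rough sketch of where that interval lives in a controller-runtime manager (the exact value and options this project uses are an assumption here):

```go
import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// Sketch only: a manager whose watched objects are fully resynced every 5 minutes.
// Between resyncs, the controller reacts only to events delivered by its watches.
func newManager() (ctrl.Manager, error) {
	syncPeriod := 5 * time.Minute
	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		SyncPeriod: &syncPeriod,
	})
}
```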
I think we may resolve this issue if we upgrade the controller-runtime to v0.6.3 - testing now |
Hi @ZacharyBenamram, yes, I can see these logs in the controller once the runner goes into the completed state, and my workflow jobs are stuck in the queued state for around 5 minutes until the controller deletes and re-creates the pod.
@ahmad-hamade - can you provide timestamps for when the container terminates, when the logs say the token has been updated, and any other relevant logs around pod creation/deletion timestamps?
I think these errors are not related to the controller-runtime, but are the result of updating the token while a pod is running a job. I think you'll also see that the "Updated registration token" event happens multiple times for the same runner. It appears as if the Runner resource is not being updated quickly enough, so the reconcile function doesn't get the opportunity to delete the pod, since the deletion logic is the last portion of the runner reconciliation. @mumoshu thoughts?
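For context on the "object has been modified" error mentioned above: it is the standard optimistic-concurrency conflict the API server returns when a controller updates a stale copy of a resource. A minimal, hypothetical sketch of how a reconciler typically surfaces and retries it (not the project's actual code; the reconciler is assumed to embed a controller-runtime client):

```go
import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	ctrl "sigs.k8s.io/controller-runtime"

	"github.com/summerwind/actions-runner-controller/api/v1alpha1"
)

// Sketch only: updating a Runner with a stale resourceVersion yields a conflict
// error; requeueing lets the next reconcile work against the latest version.
func (r *RunnerReconciler) updateRunner(ctx context.Context, runner *v1alpha1.Runner) (ctrl.Result, error) {
	if err := r.Update(ctx, runner); err != nil {
		if apierrors.IsConflict(err) {
			// "the object has been modified; please apply your changes to the
			// latest version and try again" - retry against a fresh copy.
			return ctrl.Result{Requeue: true}, nil
		}
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, nil
}
```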
Sure @ZacharyBenamram, I will provide you the logs once the issue occurs again.
@ZacharyBenamram Below are the logs:

`k describe po -n runner-infra runner-infra-hr9bs-x687s`

`k describe runners.actions.summerwind.dev -n runner-infra runner-infra-hr9bs-x687s`
Controller logs
The error is here: https://github.com/summerwind/actions-runner-controller/blob/ee8fb5a3886ef5a75df5c126bcd3c846e13c801e/github/github.go#L87 The token does not get refreshed for a 10-minute interval: the old token is returned, and the reconcile function returns with Requeue: true after the token update function is called. If you reduce this 10-minute interval, it should fix the problem. @mumoshu Since token expiration is not a huge deal (the runner can still make calls even if the registration token has expired), I suggest we change this function to:
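The suggested replacement code was not captured in this thread. Purely as an illustration of the idea (reuse the cached registration token only while it is still valid, rather than holding it for a fixed window), a sketch along these lines might look like the following; the type, field, and go-github import version are assumptions, not the project's actual implementation:

```go
import (
	"context"
	"fmt"
	"net/http"
	"sync"
	"time"

	"github.com/google/go-github/v33/github"
)

// Client caches registration tokens per repository. Names here are assumed for
// illustration only.
type Client struct {
	*github.Client

	mu        sync.Mutex
	regTokens map[string]*github.RegistrationToken
}

// GetRegistrationToken reuses a cached token only while it is still valid and
// refreshes it as soon as it expires, instead of keeping it for a fixed window.
func (c *Client) GetRegistrationToken(ctx context.Context, owner, repo string) (*github.RegistrationToken, error) {
	c.mu.Lock()
	defer c.mu.Unlock()

	key := owner + "/" + repo
	if rt, ok := c.regTokens[key]; ok && rt.GetExpiresAt().After(time.Now()) {
		return rt, nil
	}

	rt, res, err := c.Actions.CreateRegistrationToken(ctx, owner, repo)
	if err != nil {
		return nil, fmt.Errorf("failed to create registration token: %w", err)
	}
	if res.StatusCode != http.StatusCreated {
		return nil, fmt.Errorf("unexpected status while creating registration token: %d", res.StatusCode)
	}

	c.regTokens[key] = rt
	return rt, nil
}
```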
Seems like @ZacharyBenamram's fix worked out well and we haven't seen this since then! Thanks. Closing as resolved |
Hi! Thank you as always for your Kubernetes operator!
I have a question.
Even when an Actions job finishes successfully, certain pods get stuck and the runner container is never restarted...
RunnerDeployment's manifest:
runner's status:
I believe the runner container should be restarted when an Actions job finishes normally, via the processing linked below (a rough sketch follows the link).
Is my understanding correct?
https://github.com/summerwind/actions-runner-controller/blob/ba8f61141b30268a00387795a66abdd72b60c78c/controllers/runner_controller.go#L194-L196
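The linked lines are not quoted here, but the behaviour being asked about - the controller noticing that the runner container has terminated and deleting the pod so a fresh one is created - would look roughly like this sketch (the container name and helper are assumptions, not the project's actual code):

```go
import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Sketch only: if the "runner" container in the pod has terminated, delete the
// pod so that the next reconcile recreates it and the runner re-registers.
func restartCompletedRunnerPod(ctx context.Context, c client.Client, pod *corev1.Pod) error {
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.Name == "runner" && cs.State.Terminated != nil {
			return c.Delete(ctx, pod)
		}
	}
	return nil
}
```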
runner's status
Thank you!