
Handle pods that have been evicted via the eviction API. #1560

Merged: 6 commits into jenkinsci:master from eviction-api-message, May 29, 2024

Conversation

@Vlatombe Vlatombe (Member) commented May 27, 2024

Adds an explicit message about an agent pod that has been evicted using the eviction API (cf. https://kubernetes.io/docs/concepts/workloads/pods/disruptions/#pod-disruption-conditions)

Adds a test that verifies retry behaviour when an agent pod has been evicted (for example, when the pod runs on a Kubernetes node that gets drained). See the sketch below.
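For reference, a minimal sketch of how such a check could look with the fabric8 Kubernetes client the plugin builds on; the helper below is illustrative only and is not the PR's actual diff:

import java.util.Optional;

import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.api.model.PodCondition;

// Illustrative helper: find the DisruptionTarget condition whose reason is
// EvictionByEvictionAPI on a pod's status (see the pod disruption conditions
// documentation linked above).
class EvictionConditions {
    static Optional<PodCondition> findEvictionCondition(Pod pod) {
        if (pod.getStatus() == null || pod.getStatus().getConditions() == null) {
            return Optional.empty();
        }
        return pod.getStatus().getConditions().stream()
                .filter(c -> "DisruptionTarget".equals(c.getType()))
                .filter(c -> "EvictionByEvictionAPI".equals(c.getReason()))
                .findFirst();
    }
}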

Testing done

Submitter checklist

  • Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
  • Ensure that the pull request title represents the desired changelog entry
  • Please describe what you did
  • Link to relevant issues in GitHub or Jira
  • Link to relevant pull requests, esp. upstream and downstream changes
  • Ensure you have provided tests that demonstrate the feature works or the issue is fixed

@Vlatombe Vlatombe requested a review from a team as a code owner May 27, 2024 13:58
@Vlatombe Vlatombe added the enhancement Improvements label May 27, 2024
@jglick jglick (Member) left a comment

KubernetesPipelineTest.podDeadlineExceeded and other failures. Might need to convert some text literals to constants. (Messages is not so useful since localized text in build logs does not really work.)
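For illustration, a hedged sketch of what that could look like; the class and field names below are invented for this example and are not the plugin's actual code:

// Hypothetical constant holder so the production code and the test share
// the same literal instead of duplicating it.
public final class PodStatusMessages {
    private PodStatusMessages() {}

    public static final String EVICTED_BY_EVICTION_API =
            " Pod was evicted by the Kubernetes Eviction API.";
}

The test could then assert with JenkinsRule.assertLogContains(PodStatusMessages.EVICTED_BY_EVICTION_API, b) rather than repeating the string.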

        .filter(c -> "EvictionByEvictionAPI".equals(c.getReason()))
        .findFirst();
if (evictionCondition.isPresent()) {
    sb.append(" Pod was evicted by the Kubernetes Eviction API.");
Member

"by the Kubernetes Eviction API" does not really add much here. Maybe explain that the node might have been drained, etc.?

@Vlatombe (Member Author)

There's no way to know precisely from this single condition; a drain is only one of the root causes that can lead to an eviction.

@Vlatombe (Member Author)

The behaviour doesn't seem fully consistent; sometimes, by the time the Reaper queries the pod, it is already gone, so the conditions can't be looked up.

@Vlatombe (Member Author)

(only one actual timeout, causing other tests to fail)

@Vlatombe (Member Author)

"it is already gone"

Double-checked the expected behaviour for evicted pods: they should not be deleted automatically, so this may be a race condition with another part of the plugin. I'll check whether I can make it behave as I want for the test.

@Vlatombe Vlatombe merged commit 41b3ef1 into jenkinsci:master May 29, 2024
6 checks passed
@Vlatombe Vlatombe deleted the eviction-api-message branch May 29, 2024 12:34
@jonesbusy (Contributor)

Not sure if this is a bug, but I'm running 4246.v5a_12b_1fe120e and just hit a pod being evicted due to disk pressure.

The message is displayed correctly, but no retry was performed, and I'm not sure why:

[09:34:56.651+02:00] - [Pipeline] podTemplate
[09:34:56.653+02:00] - [Pipeline] {
[09:34:56.670+02:00] - [Pipeline] retry
[09:34:56.671+02:00] - [Pipeline] {
[09:34:56.688+02:00] - [Pipeline] node
[09:35:01.173+02:00] - Created Pod: openshift ********-tests-pr-80-1-clgth-s5lfg-pmn55
[09:35:01.317+02:00] - ***********-configuration-tests-pr-80-1-clgth-s5lfg-pmn55 Pod just failed. Reason: Evicted, Message: Pod was rejected: The node had condition: [DiskPressure]. .
[09:35:01.358+02:00] - [Pipeline] // node
[09:35:01.373+02:00] - [Pipeline] }
[09:35:01.382+02:00] -  did not look like a Kubernetes agent judging by [12, Debian, Debian-12, Linux, amd64, amd64-Debian, amd64-Debian-12, linux]; make sure retry is inside podTemplate, not outside
[09:35:01.389+02:00] - [Pipeline] // retry
[09:35:01.394+02:00] - [Pipeline] }
[09:35:01.401+02:00] - [Pipeline] // podTemplate
[09:35:01.411+02:00] - [Pipeline] stage
[09:35:01.412+02:00] - [Pipeline] { (Declarative: Post Actions)
[09:35:01.439+02:00] - [Pipeline] recordIssues
[09:35:01.458+02:00] - Error when executing always post condition:
[09:35:01.463+02:00] - Also:   org.jenkinsci.plugins.workflow.actions.ErrorAction$ErrorId: 9cb86d78-e218-45ee-ba1d-51c095b9c5a9
[09:35:01.463+02:00] - org.jenkinsci.plugins.workflow.steps.MissingContextVariableException: Required context class hudson.FilePath is missing

I'm running a declarative pipeline:

agent {
	kubernetes {
		cloud('openshift')
		retries(3)
		workspaceVolume(
			emptyDirWorkspaceVolume()
		)
		yaml('*****')
	}
}

It is indeed a Kubernetes agent. Labels are added using the platformlabeler plugin (not sure if that could be the cause of the issue).

Thanks

@Vlatombe (Member Author) commented Jun 28, 2024

Looks like a bug with platformlabeler, which is removing the unique label generated by the kubernetes plugin (also noting that this is unrelated to this particular PR).

@jonesbusy (Contributor)

Thanks, I will check this. I was also confused not to see the pod template's unique label in the logs.

In the meantime, I'm wondering whether using retry at the options level will catch this case:

options {
    retry(conditions: [kubernetesAgent(handleNonKubernetes: true), nonresumable()], count: 3)
}

@jonesbusy (Contributor)

I did some tests on my OKD cluster using

options {
    retry(conditions: [kubernetesAgent(handleNonKubernetes: true), nonresumable()], count: 3)
}

and then drained the node:

pod/*****-8-m5g2b evicted
node/**** evicted
...
[14:14:36.964+02:00] - *******-m5g2b Containers flux,jf,jnlp,kubedock,maven,oc were terminated. Pod was evicted by the Kubernetes Eviction API.
...
[14:14:37.716+02:00] - Ignored termination reason(s) for eature-2fglobal-retry-with-with-kubernetes-agents-retry-8-m5g2b for purposes of retry: [Completed, EvictionByEvictionAPI, Error]
...

because of

if (terminationReasons.stream().anyMatch(IGNORED_CONTAINER_TERMINATION_REASONS::contains)) {
	listener.getLogger()
			.println("Ignored termination reason(s) for " + node + " for purposes of retry: "
					+ terminationReasons);
	return false;
}

and terminationReasons contains [Completed, EvictionByEvictionAPI, Error].

Looking also at the exit codes, not all containers return the same status code:

[14:14:37.130+02:00] - - flux -- terminated (137)
[14:14:37.130+02:00] - -----Logs-------------
[14:14:37.130+02:00] - - jf -- terminated (137)
[14:14:37.130+02:00] - -----Logs-------------
[14:14:37.130+02:00] - - jnlp -- terminated (143)
[14:14:37.130+02:00] - -----Logs-------------
[14:14:37.130+02:00] - - kubedock -- terminated (0)
[14:14:37.130+02:00] - -----Logs-------------
[14:14:37.130+02:00] - - maven -- terminated (137)
[14:14:37.130+02:00] - -----Logs-------------
[14:14:37.130+02:00] - - oc -- terminated (137)
[14:14:37.130+02:00] - -----Logs-------------

One fix would be to return true directly if terminationReasons contains EvictionByEvictionAPI. What do you think about that?

@Vlatombe (Member Author)

Ah, nice catch. I think the condition should be changed to skip the retry only when terminationReasons contains nothing but ignored reasons.
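A minimal sketch of that adjusted check, reusing the names from the snippet quoted above (the actual fix is in the PR linked below and may differ):

// Skip the retry only when every termination reason is in the ignored set;
// a mix such as [Completed, EvictionByEvictionAPI, Error] then no longer
// suppresses the retry.
if (IGNORED_CONTAINER_TERMINATION_REASONS.containsAll(terminationReasons)) {
    listener.getLogger()
            .println("Ignored termination reason(s) for " + node + " for purposes of retry: "
                    + terminationReasons);
    return false;
}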

@jonesbusy (Contributor)

I've opened a PR here: #1582
