
Handle pods that have been evicted via the eviction API. #1560

Merged: 6 commits into jenkinsci:master from eviction-api-message, May 29, 2024

Conversation

@Vlatombe Vlatombe (Member) commented May 27, 2024

Adds an explicit message about an agent pod that has been evicted using the eviction API (cf. https://kubernetes.io/docs/concepts/workloads/pods/disruptions/#pod-disruption-conditions)

Adds a test that verifies retry behaviour when an agent pod has been evicted (for example, when the pod runs on a Kubernetes node that gets drained). See the sketch below.
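For reference, a minimal sketch of how such a check could look with the fabric8 Kubernetes client the plugin builds on; the helper below is illustrative only and is not the PR's actual diff:

import java.util.Optional;

import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.api.model.PodCondition;

// Illustrative helper: find the DisruptionTarget condition whose reason is
// EvictionByEvictionAPI on a pod's status (see the pod disruption conditions
// documentation linked above).
class EvictionConditions {
    static Optional<PodCondition> findEvictionCondition(Pod pod) {
        if (pod.getStatus() == null || pod.getStatus().getConditions() == null) {
            return Optional.empty();
        }
        return pod.getStatus().getConditions().stream()
                .filter(c -> "DisruptionTarget".equals(c.getType()))
                .filter(c -> "EvictionByEvictionAPI".equals(c.getReason()))
                .findFirst();
    }
}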

Testing done

Submitter checklist

  • Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
  • Ensure that the pull request title represents the desired changelog entry
  • Please describe what you did
  • Link to relevant issues in GitHub or Jira
  • Link to relevant pull requests, esp. upstream and downstream changes
  • Ensure you have provided tests that demonstrate the feature works or the issue is fixed

@Vlatombe Vlatombe requested a review from a team as a code owner May 27, 2024 13:58
@Vlatombe Vlatombe added the enhancement Improvements label May 27, 2024
@jglick jglick (Member) left a comment

KubernetesPipelineTest.podDeadlineExceeded and other failures. Might need to convert some text literals to constants. (Messages is not so useful since localized text in build logs does not really work.)
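For illustration, a hedged sketch of what that could look like; the class and field names below are invented for this example and are not the plugin's actual code:

// Hypothetical constant holder so the production code and the test share
// the same literal instead of duplicating it.
public final class PodStatusMessages {
    private PodStatusMessages() {}

    public static final String EVICTED_BY_EVICTION_API =
            " Pod was evicted by the Kubernetes Eviction API.";
}

The test could then assert with JenkinsRule.assertLogContains(PodStatusMessages.EVICTED_BY_EVICTION_API, b) rather than repeating the string.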

        .filter(c -> "EvictionByEvictionAPI".equals(c.getReason()))
        .findFirst();
if (evictionCondition.isPresent()) {
    sb.append(" Pod was evicted by the Kubernetes Eviction API.");
Member

"by the Kubernetes Eviction API" does not really add much here. Maybe explain that the node might have been drained, etc.?

@Vlatombe (Member Author)

There's no way to know precisely from this single condition; a drain is only one of the root causes that can lead to an eviction.

@Vlatombe (Member Author)

The behaviour doesn't seem fully consistent; sometimes, by the time the Reaper queries the pod, it is already gone, so the conditions can't be looked up.

@Vlatombe (Member Author)

(only one actual timeout, causing other tests to fail)

@Vlatombe (Member Author)

"it is already gone"

Double-checked the expected behaviour for evicted pods: they should not be deleted automatically, so this may be a race condition with another part of the plugin. I'll check whether I can make it behave as I want for the test.

@Vlatombe Vlatombe merged commit 41b3ef1 into jenkinsci:master May 29, 2024
6 checks passed
@Vlatombe Vlatombe deleted the eviction-api-message branch May 29, 2024 12:34
@jonesbusy (Contributor)

Not sure if this is a bug, but I'm running 4246.v5a_12b_1fe120e and just hit a pod being evicted due to disk pressure.

The message is displayed correctly, but no retry was performed, and I'm not sure why:

[09:34:56.651+02:00] - [Pipeline] podTemplate
[09:34:56.653+02:00] - [Pipeline] {
[09:34:56.670+02:00] - [Pipeline] retry
[09:34:56.671+02:00] - [Pipeline] {
[09:34:56.688+02:00] - [Pipeline] node
[09:35:01.173+02:00] - Created Pod: openshift ********-tests-pr-80-1-clgth-s5lfg-pmn55
[09:35:01.317+02:00] - ***********-configuration-tests-pr-80-1-clgth-s5lfg-pmn55 Pod just failed. Reason: Evicted, Message: Pod was rejected: The node had condition: [DiskPressure]. .
[09:35:01.358+02:00] - [Pipeline] // node
[09:35:01.373+02:00] - [Pipeline] }
[09:35:01.382+02:00] -  did not look like a Kubernetes agent judging by [12, Debian, Debian-12, Linux, amd64, amd64-Debian, amd64-Debian-12, linux]; make sure retry is inside podTemplate, not outside
[09:35:01.389+02:00] - [Pipeline] // retry
[09:35:01.394+02:00] - [Pipeline] }
[09:35:01.401+02:00] - [Pipeline] // podTemplate
[09:35:01.411+02:00] - [Pipeline] stage
[09:35:01.412+02:00] - [Pipeline] { (Declarative: Post Actions)
[09:35:01.439+02:00] - [Pipeline] recordIssues
[09:35:01.458+02:00] - Error when executing always post condition:
[09:35:01.463+02:00] - Also:   org.jenkinsci.plugins.workflow.actions.ErrorAction$ErrorId: 9cb86d78-e218-45ee-ba1d-51c095b9c5a9
[09:35:01.463+02:00] - org.jenkinsci.plugins.workflow.steps.MissingContextVariableException: Required context class hudson.FilePath is missing

I'm running a declarative pipeline:

agent {
	kubernetes {
		cloud('openshift')
		retries(3)
		workspaceVolume(
			emptyDirWorkspaceVolume()
		)
		yaml('*****')
	}
}

It is indeed a Kubernetes agent. Labels are added using the platformlabeler plugin (not sure if that could be the cause of the issue).

Thanks

@Vlatombe (Member Author) commented Jun 28, 2024

Looks like a bug with platformlabeler, which is removing the unique label generated by the kubernetes plugin (also noting that this is unrelated to this particular PR).

@jonesbusy (Contributor)

Thanks, I will check this. I was also confused not to see the pod template's unique label in the logs.

In the meantime, I'm wondering whether using retry at the options level will catch this case:

options {
    retry(conditions: [kubernetesAgent(handleNonKubernetes: true), nonresumable()], count: 3)
}

@jonesbusy (Contributor)

I did some tests on my OKD cluster using

options {
    retry(conditions: [kubernetesAgent(handleNonKubernetes: true), nonresumable()], count: 3)
}

and then drained the node:

pod/*****-8-m5g2b evicted
node/**** evicted
...
[14:14:36.964+02:00] - *******-m5g2b Containers flux,jf,jnlp,kubedock,maven,oc were terminated. Pod was evicted by the Kubernetes Eviction API.
...
[14:14:37.716+02:00] - Ignored termination reason(s) for eature-2fglobal-retry-with-with-kubernetes-agents-retry-8-m5g2b for purposes of retry: [Completed, EvictionByEvictionAPI, Error]
...

because of

if (terminationReasons.stream().anyMatch(IGNORED_CONTAINER_TERMINATION_REASONS::contains)) {
	listener.getLogger()
			.println("Ignored termination reason(s) for " + node + " for purposes of retry: "
					+ terminationReasons);
	return false;
}

and terminationReasons contains [Completed, EvictionByEvictionAPI, Error].

Looking also at the exit codes, not all containers return the same status code:

[14:14:37.130+02:00] - - flux -- terminated (137)
[14:14:37.130+02:00] - -----Logs-------------
[14:14:37.130+02:00] - - jf -- terminated (137)
[14:14:37.130+02:00] - -----Logs-------------
[14:14:37.130+02:00] - - jnlp -- terminated (143)
[14:14:37.130+02:00] - -----Logs-------------
[14:14:37.130+02:00] - - kubedock -- terminated (0)
[14:14:37.130+02:00] - -----Logs-------------
[14:14:37.130+02:00] - - maven -- terminated (137)
[14:14:37.130+02:00] - -----Logs-------------
[14:14:37.130+02:00] - - oc -- terminated (137)
[14:14:37.130+02:00] - -----Logs-------------

One fix would be to return true directly if terminationReasons contains EvictionByEvictionAPI. What do you think about that?

@Vlatombe (Member Author)

Ah, nice catch. I think the condition should be changed to skip the retry only when terminationReasons contains nothing but ignored reasons.
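A minimal sketch of that adjusted check, reusing the names from the snippet quoted above (the actual fix is in the PR linked below and may differ):

// Skip the retry only when every termination reason is in the ignored set;
// a mix such as [Completed, EvictionByEvictionAPI, Error] then no longer
// suppresses the retry.
if (IGNORED_CONTAINER_TERMINATION_REASONS.containsAll(terminationReasons)) {
    listener.getLogger()
            .println("Ignored termination reason(s) for " + node + " for purposes of retry: "
                    + terminationReasons);
    return false;
}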

@jonesbusy (Contributor)

I've opened a PR here: #1582
