Deploy a local minikube (with the metrics server) or crc (with monitoring enabled), run the same set of tests we run in our pipeline, and check the API pods' resource utilization.
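A minimal sketch of how that local check might look, assuming the stock minikube metrics-server addon and a crc cluster with monitoring turned on (the namespace and label selector below are placeholders, not necessarily what the operator sets):

```bash
# local Kubernetes with the metrics server enabled (needed for `kubectl top`)
minikube start
minikube addons enable metrics-server

# or, for the OpenShift variant, a crc cluster with monitoring enabled:
# crc config set enable-cluster-monitoring true
# crc start

# after running the same tests the pipeline runs, watch the API pods'
# CPU/memory usage (namespace and selector are placeholders)
kubectl top pods -n pulp-operator-system -l app.kubernetes.io/component=api --containers
```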
Actually, the API pods were not crashing but getting restarted because of liveness probe failures.
After running some tests and modifying the show_logs.sh script from CI, I could see:
Warning Unhealthy 4m15s (x5 over 5m35s) kubelet Liveness probe failed: Get "http://10.244.0.9:24817/pulp/api/v3/status/": dial tcp 10.244.0.9:24817: connect: connection refused
Checking the logs from a restarted pod showed that it was running migration tasks when the container got terminated.
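For anyone reproducing this, the terminated container's output can be pulled back with the previous-container flag (pod name and namespace below are placeholders):

```bash
# events show the liveness probe failures and the restart count
kubectl describe pod <api-pod> -n <namespace>

# logs of the container instance that was terminated; in this case it was
# still running the migration tasks when the probe killed it
kubectl logs -p <api-pod> -n <namespace>
```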
As a workaround, we can increase the default failure threshold for the liveness probe until #991 is implemented.
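A rough sketch of that workaround applied by hand to the API deployment (the deployment name and the threshold value are assumptions; ideally the operator itself would set this):

```bash
# raise the liveness probe failureThreshold so a slow start (e.g. while
# migrations are still running) doesn't get the container restarted;
# "example-pulp-api" is a placeholder deployment name
kubectl patch deployment example-pulp-api --type=json \
  -p='[{"op":"replace","path":"/spec/template/spec/containers/0/livenessProbe/failureThreshold","value":10}]'
```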
git-hyagi added a commit to git-hyagi/pulp-operator that referenced this issue on Jul 5, 2023.
Verify whether the API pods' crashes are related to a memory leak (https://discourse.pulpproject.org/t/api-server-memory-leak/851/12).
Some ideas to help with the investigation:
Add the previous-container flag (kubectl logs -p) to the show_logs.sh script to see if we can get more information: https://github.com/pulp/pulp-operator/blob/main/.github/workflows/scripts/show_logs.sh#L30