Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

custom metrics gc no longer cleans old emitted metrics #11542

Closed
2 of 3 tasks
fhochleitner opened this issue Aug 8, 2023 · 5 comments · Fixed by #11553
Closed
2 of 3 tasks

custom metrics gc no longer cleans old emitted metrics #11542

fhochleitner opened this issue Aug 8, 2023 · 5 comments · Fixed by #11553
Labels
area/metrics P1 High priority. All bugs with >=5 thumbs up that aren’t P0, plus: Any other bugs deemed high priority type/bug type/regression Regression from previous behavior (a specific type of bug)

Comments

@fhochleitner
Copy link
Contributor

fhochleitner commented Aug 8, 2023

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issues exists when I tested with :latest
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

After upgrading to v3.4.9 (and latest) custom metrics will no longer get cleaned up after "metricsTTL" time has been reached and will stay exposed at the metrics endpoint endlessly.

e.g. example workflow emits the following metric argo_workflows_result_task_counter{id="b68baed7-b2a5-4718-bac4-221d559c750f",status="Succeeded",task_name="example-metrics"} 1 which will stay until pod get recreated.

downgrade to v3.4.8 without any configuration change leads to expected behaviour of metric no longer being exposed after TTL time has been reached.

workflow-controller-configmap for reference:

parallelism: 10
namespaceParallelism: 5
mainContainer:
  imagePullPolicy: IfNotPresent
  resources:
    limits:
      cpu: 1
      memory: 4Gi
    requests:
      cpu: 250m
      memory: 512Mi
  securityContext:
    runAsGroup: 0
    runAsNonRoot: true
    runAsUser: 1000
executor:
  imagePullPolicy: IfNotPresent
  securityContext:
    runAsGroup: 0
    runAsNonRoot: true
    runAsUser: 1000
artifactRepository:
  archiveLogs: true
  s3:
    # --- omitted ---
metricsConfig:
  enabled: true
  path: /metrics
  port: 9090
  metricsTTL: 5m
  ignoreErrors: false
  secure: false
persistence:
  archive: true
  archiveTTL: 120d
  connectionPool:
    maxIdleConns: 50
  nodeStatusOffLoad: true
  postgresql:
    # --- omitted --- 
workflowDefaults:
  spec:
    securityContext:
      fsGroup: 0
    ttlStrategy:
      secondsAfterCompletion: 3600
sso:
 # --- omitted ---
nodeEvents:
  enabled: true

Version

v3.4.9, latest

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflows that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: metrics-regressions-example-
  namespace: ${NAMESPACE}
spec:
  entrypoint: reproduceable-example
  templates:
    - name: reproduceable-example
      steps:
        - - name: workflow-metrics
            template: workflow-metrics
      container:
        name: alpine
        image: alpine:3.18
        command:
          - sh
          - -c
        args:
          - 'echo "Hello World"'
      metrics:
        prometheus:
          - name: result_task_counter
            help: "Count of task execution by result status  - task level"
            labels:
              - key: "task_name"
                value: "example-metrics"
              - key: "status"
                value: "{{ status }}"
              - key: "id"
                value: "{{ workflow.uid }}"
            counter:
              value: "1"

Logs from the workflow controller

fhochleitner@fhochleitner-gepardec ~  $ kubectl logs -n gp-cicd-tools deploy/argo-workflows-workflow-controller | grep metrics-regressions-example-jcbsw
time="2023-08-08T09:53:56.719Z" level=info msg="Processing workflow" namespace=demo-microservice-cicd workflow=metrics-regressions-example-jcbsw
time="2023-08-08T09:53:56.730Z" level=info msg="Updated phase  -> Running" namespace=demo-microservice-cicd workflow=metrics-regressions-example-jcbsw
time="2023-08-08T09:53:56.730Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=demo-microservice-cicd workflow=metrics-regressions-example-jcbsw
time="2023-08-08T09:53:56.731Z" level=info msg="was unable to obtain node for , letting display name to be nodeName" namespace=demo-microservice-cicd workflow=metrics-regressions-example-jcbsw
time="2023-08-08T09:53:56.733Z" level=info msg="Steps node metrics-regressions-example-jcbsw initialized Running" namespace=demo-microservice-cicd workflow=metrics-regressions-example-jcbsw
time="2023-08-08T09:53:56.733Z" level=info msg="StepGroup node metrics-regressions-example-jcbsw-1123282000 initialized Running" namespace=demo-microservice-cicd workflow=metrics-regressions-example-jcbsw
time="2023-08-08T09:53:56.733Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=demo-microservice-cicd workflow=metrics-regressions-example-jcbsw
time="2023-08-08T09:53:56.734Z" level=info msg="Pod node metrics-regressions-example-jcbsw-866873007 initialized Pending" namespace=demo-microservice-cicd workflow=metrics-regressions-example-jcbsw
time="2023-08-08T09:53:56.816Z" level=info msg="Created pod: metrics-regressions-example-jcbsw[0].workflow-metrics (metrics-regressions-example-jcbsw-workflow-metrics-866873007)" namespace=demo-microservice-cicd workflow=metrics-regressions-example-jcbsw
time="2023-08-08T09:53:56.816Z" level=info msg="Workflow step group node metrics-regressions-example-jcbsw-1123282000 not yet completed" namespace=demo-microservice-cicd workflow=metrics-regressions-example-jcbsw
time="2023-08-08T09:53:56.816Z" level=info msg="TaskSet Reconciliation" namespace=demo-microservice-cicd workflow=metrics-regressions-example-jcbsw
time="2023-08-08T09:53:56.816Z" level=info msg=reconcileAgentPod namespace=demo-microservice-cicd workflow=metrics-regressions-example-jcbsw
time="2023-08-08T09:53:56.835Z" level=info msg="Workflow update successful" namespace=demo-microservice-cicd phase=Running resourceVersion=160787809 workflow=metrics-regressions-example-jcbsw
time="2023-08-08T09:54:06.822Z" level=info msg="Processing workflow" namespace=demo-microservice-cicd workflow=metrics-regressions-example-jcbsw
time="2023-08-08T09:54:06.829Z" level=info msg="Task-result reconciliation" namespace=demo-microservice-cicd numObjs=1 workflow=metrics-regressions-example-jcbsw
time="2023-08-08T09:54:06.829Z" level=info msg="task-result changed" namespace=demo-microservice-cicd nodeID=metrics-regressions-example-jcbsw-866873007 workflow=metrics-regressions-example-jcbsw
time="2023-08-08T09:54:06.830Z" level=info msg="node changed" namespace=demo-microservice-cicd new.message= new.phase=Succeeded new.progress=0/1 nodeID=metrics-regressions-example-jcbsw-866873007 old.message= old.phase=Pending old.progress=0/1 workflow=metrics-regressions-example-jcbsw
time="2023-08-08T09:54:06.832Z" level=info msg="Step group node metrics-regressions-example-jcbsw-1123282000 successful" namespace=demo-microservice-cicd workflow=metrics-regressions-example-jcbsw
time="2023-08-08T09:54:06.832Z" level=info msg="node metrics-regressions-example-jcbsw-1123282000 phase Running -> Succeeded" namespace=demo-microservice-cicd workflow=metrics-regressions-example-jcbsw
time="2023-08-08T09:54:06.832Z" level=info msg="node metrics-regressions-example-jcbsw-1123282000 finished: 2023-08-08 09:54:06.832304852 +0000 UTC" namespace=demo-microservice-cicd workflow=metrics-regressions-example-jcbsw
time="2023-08-08T09:54:06.832Z" level=info msg="Outbound nodes of metrics-regressions-example-jcbsw-866873007 is [metrics-regressions-example-jcbsw-866873007]" namespace=demo-microservice-cicd workflow=metrics-regressions-example-jcbsw
time="2023-08-08T09:54:06.832Z" level=info msg="Outbound nodes of metrics-regressions-example-jcbsw is [metrics-regressions-example-jcbsw-866873007]" namespace=demo-microservice-cicd workflow=metrics-regressions-example-jcbsw
time="2023-08-08T09:54:06.832Z" level=info msg="node metrics-regressions-example-jcbsw phase Running -> Succeeded" namespace=demo-microservice-cicd workflow=metrics-regressions-example-jcbsw
time="2023-08-08T09:54:06.832Z" level=info msg="node metrics-regressions-example-jcbsw finished: 2023-08-08 09:54:06.832442226 +0000 UTC" namespace=demo-microservice-cicd workflow=metrics-regressions-example-jcbsw
time="2023-08-08T09:54:06.832Z" level=info msg="TaskSet Reconciliation" namespace=demo-microservice-cicd workflow=metrics-regressions-example-jcbsw
time="2023-08-08T09:54:06.832Z" level=info msg=reconcileAgentPod namespace=demo-microservice-cicd workflow=metrics-regressions-example-jcbsw
time="2023-08-08T09:54:06.832Z" level=info msg="Updated phase Running -> Succeeded" namespace=demo-microservice-cicd workflow=metrics-regressions-example-jcbsw
time="2023-08-08T09:54:06.832Z" level=info msg="Marking workflow completed" namespace=demo-microservice-cicd workflow=metrics-regressions-example-jcbsw
time="2023-08-08T09:54:06.832Z" level=info msg="Marking workflow as pending archiving" namespace=demo-microservice-cicd workflow=metrics-regressions-example-jcbsw
time="2023-08-08T09:54:06.839Z" level=info msg="cleaning up pod" action=deletePod key=demo-microservice-cicd/metrics-regressions-example-jcbsw-1340600742-agent/deletePod
time="2023-08-08T09:54:06.846Z" level=info msg="Workflow update successful" namespace=demo-microservice-cicd phase=Succeeded resourceVersion=160788047 workflow=metrics-regressions-example-jcbsw
time="2023-08-08T09:54:06.859Z" level=info msg="archiving workflow" namespace=demo-microservice-cicd uid=b68baed7-b2a5-4718-bac4-221d559c750f workflow=metrics-regressions-example-jcbsw
time="2023-08-08T09:54:06.863Z" level=info msg="cleaning up pod" action=labelPodCompleted key=demo-microservice-cicd/metrics-regressions-example-jcbsw-workflow-metrics-866873007/labelPodCompleted
time="2023-08-08T09:54:06.893Z" level=info msg="Queueing Succeeded workflow demo-microservice-cicd/metrics-regressions-example-jcbsw for delete in 1h0m0s due to TTL"
time="2023-08-08T10:03:10.656Z" level=info msg="Queueing Succeeded workflow demo-microservice-cicd/metrics-regressions-example-jcbsw for delete in 50m56s due to TTL"
time="2023-08-08T10:08:34.645Z" level=info msg="Queueing Succeeded workflow demo-microservice-cicd/metrics-regressions-example-jcbsw for delete in 45m32s due to TTL"

Logs from in your workflow's wait container

fhochleitner@fhochleitner-gepardec ~  $ kubectl logs -n demo-microservice-cicd -c wait -l workflows.argoproj.io/workflow=metrics-regressions-example-jcbsw
time="2023-08-08T09:54:04.077Z" level=info msg="No output parameters"
time="2023-08-08T09:54:04.077Z" level=info msg="No output artifacts"
time="2023-08-08T09:54:04.079Z" level=info msg="S3 Save path: /tmp/argo/outputs/logs/main.log, key: metrics-regressions-example-jcbsw/metrics-regressions-example-jcbsw-workflow-metrics-866873007/main.log"
time="2023-08-08T09:54:04.079Z" level=info msg="Creating minio client using static credentials" endpoint=sos-at-vie-1.exo.io
time="2023-08-08T09:54:04.131Z" level=info msg="Saving file to s3" bucket=c-gepa-play-argo-workflows endpoint=sos-at-vie-1.exo.io key=metrics-regressions-example-jcbsw/metrics-regressions-example-jcbsw-workflow-metrics-866873007/main.log path=/tmp/argo/outputs/logs/main.log
time="2023-08-08T09:54:04.289Z" level=info msg="Save artifact" artifactName=main-logs duration=209.993381ms error="<nil>" key=metrics-regressions-example-jcbsw/metrics-regressions-example-jcbsw-workflow-metrics-866873007/main.log
time="2023-08-08T09:54:04.289Z" level=info msg="not deleting local artifact" localArtPath=/tmp/argo/outputs/logs/main.log
time="2023-08-08T09:54:04.289Z" level=info msg="Successfully saved file: /tmp/argo/outputs/logs/main.log"
time="2023-08-08T09:54:04.316Z" level=info msg="Alloc=11600 TotalAlloc=20129 Sys=32893 NumGC=5 Goroutines=10"
time="2023-08-08T09:54:04.316Z" level=info msg="stopping progress monitor (context done)" error="context canceled"
@fhochleitner fhochleitner added type/bug type/regression Regression from previous behavior (a specific type of bug) labels Aug 8, 2023
@agilgur5
Copy link

agilgur5 commented Aug 8, 2023

Follow-up from Slack.
At least two regressions being tracked against 3.4.9

@agilgur5
Copy link

agilgur5 commented Aug 9, 2023

This seems to have the same root cause as #11541, which is due to a regression caused by #11178.
metricsTTL also relies on the config and is set-up during initialization. #11178 moved the config update to post-init.

Seems like the entrypoint may also be affected, and the metrics config affects rate limiting.

Will have to check if all of that logic can be moved to post-init without breaking anything.

@terrytangyuan
Copy link
Member

cc @astraw99

@terrytangyuan
Copy link
Member

Fixing in #11553 to revert the commit.

@sarabala1979 sarabala1979 added the P1 High priority. All bugs with >=5 thumbs up that aren’t P0, plus: Any other bugs deemed high priority label Aug 10, 2023
@fhochleitner
Copy link
Contributor Author

thank you all for the very quick help and fix with the bug. you're amazing!

terrytangyuan added a commit that referenced this issue Aug 11, 2023
…, #11541 (#11553)

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
dpadhiar pushed a commit to dpadhiar/argo-workflows that referenced this issue May 9, 2024
…proj#11542, argoproj#11541 (argoproj#11553)

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
Signed-off-by: Dillen Padhiar <dillen_padhiar@intuit.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/metrics P1 High priority. All bugs with >=5 thumbs up that aren’t P0, plus: Any other bugs deemed high priority type/bug type/regression Regression from previous behavior (a specific type of bug)
Projects
None yet
4 participants