webserver keeps restarting on k8s environment #909

Open
2 tasks done
yalattas opened this issue Feb 26, 2025 · 0 comments
Labels
kind/bug kind - things not working properly

Comments

@yalattas

Checks

Chart Version

8.9.0

Kubernetes Version

Client Version: v1.32.0
Kustomize Version: v5.5.0
Server Version: v1.32.1-eks-8cce635

Helm Version

ArgoCD deployment

Description

Everything worked fine initially. After a few DAGs were imported via GitSync, the webserver GUI stopped working, especially once we integrated Google SSO.
The issue had occurred before, but the GUI is now unusable after enabling Google SSO and adding a few DAGs (3 DAGs in total).

Suddenly, the webserver started restarting repeatedly because of the liveness probe. We increased the probe period to 60 seconds and raised the number of Gunicorn workers to 10. We don't have a resource issue: Kubernetes clearly shows minimal memory utilization (about 1.2 GB per pod).

To explain the problem:
The GUI takes forever to load. Sometimes the page loads only partially, with many 503 errors on CSS and other assets, which makes the webserver impossible to use.

Ignore the SSL configs. We're using a self-signed cert generated by an init container to complete the Google SSO setup, and it's working fine.

Relevant Logs

2025-02-26 19:25:41.646	[2025-02-26T16:25:41.645+0000] {impl.py:170} INFO - Pool recreating
2025-02-26 19:25:41.646	[2025-02-26T16:25:41.645+0000] {impl.py:170} INFO - Pool recreating
2025-02-26 19:25:41.646	[2025-02-26T16:25:41.644+0000] {impl.py:195} INFO - Pool disposed. Pool size: 5  Connections in pool: 0 Current Overflow: -5 Current Checked out connections: 0
2025-02-26 19:25:41.646	[2025-02-26T16:25:41.644+0000] {impl.py:195} INFO - Pool disposed. Pool size: 5  Connections in pool: 0 Current Overflow: -5 Current Checked out connections: 0
2025-02-26 19:25:41.646	[2025-02-26 16:25:41 +0000] [104] [INFO] Worker exiting (pid: 104)
2025-02-26 19:25:41.636	[2025-02-26 16:25:41 +0000] [38] [ERROR] Worker (pid:95) was sent SIGTERM!
2025-02-26 19:25:41.624	[2025-02-26T16:25:41.618+0000] {impl.py:170} INFO - Pool recreating
2025-02-26 19:25:41.624	[2025-02-26T16:25:41.618+0000] {impl.py:170} INFO - Pool recreating
2025-02-26 19:25:41.623	[2025-02-26T16:25:41.616+0000] {impl.py:195} INFO - Pool disposed. Pool size: 5  Connections in pool: 0 Current Overflow: -5 Current Checked out connections: 0
2025-02-26 19:25:41.623	[2025-02-26T16:25:41.616+0000] {impl.py:195} INFO - Pool disposed. Pool size: 5  Connections in pool: 0 Current Overflow: -5 Current Checked out connections: 0
2025-02-26 19:25:41.623	[2025-02-26T16:25:41.614+0000] {impl.py:170} INFO - Pool recreating
2025-02-26 19:25:41.623	[2025-02-26T16:25:41.614+0000] {impl.py:170} INFO - Pool recreating
2025-02-26 19:25:41.623	[2025-02-26T16:25:41.613+0000] {impl.py:195} INFO - Pool disposed. Pool size: 5  Connections in pool: 0 Current Overflow: -5 Current Checked out connections: 0
2025-02-26 19:25:41.623	[2025-02-26T16:25:41.613+0000] {impl.py:195} INFO - Pool disposed. Pool size: 5  Connections in pool: 0 Current Overflow: -5 Current Checked out connections: 0
2025-02-26 19:25:41.623	[2025-02-26T16:25:41.603+0000] {impl.py:170} INFO - Pool recreating
2025-02-26 19:25:41.623	[2025-02-26T16:25:41.603+0000] {impl.py:170} INFO - Pool recreating
2025-02-26 19:25:41.623	[2025-02-26T16:25:41.610+0000] {impl.py:170} INFO - Pool recreating
2025-02-26 19:25:41.623	[2025-02-26T16:25:41.610+0000] {impl.py:170} INFO - Pool recreating
2025-02-26 19:25:41.623	[2025-02-26T16:25:41.610+0000] {impl.py:170} INFO - Pool recreating
2025-02-26 19:25:41.623	[2025-02-26T16:25:41.610+0000] {impl.py:170} INFO - Pool recreating
2025-02-26 19:25:41.623	[2025-02-26T16:25:41.609+0000] {impl.py:195} INFO - Pool disposed. Pool size: 5  Connections in pool: 0 Current Overflow: -5 Current Checked out connections: 0
2025-02-26 19:25:41.623	[2025-02-26T16:25:41.608+0000] {impl.py:195} INFO - Pool disposed. Pool size: 5  Connections in pool: 0 Current Overflow: -5 Current Checked out connections: 0
2025-02-26 19:25:41.623	[2025-02-26T16:25:41.609+0000] {impl.py:195} INFO - Pool disposed. Pool size: 5  Connections in pool: 0 Current Overflow: -5 Current Checked out connections: 0
2025-02-26 19:25:41.623	[2025-02-26T16:25:41.608+0000] {impl.py:195} INFO - Pool disposed. Pool size: 5  Connections in pool: 0 Current Overflow: -5 Current Checked out connections: 0
2025-02-26 19:25:41.623	[2025-02-26T16:25:41.606+0000] {impl.py:170} INFO - Pool recreating
2025-02-26 19:25:41.623	[2025-02-26T16:25:41.606+0000] {impl.py:170} INFO - Pool recreating
2025-02-26 19:25:41.623	[2025-02-26T16:25:41.606+0000] {impl.py:170} INFO - Pool recreating
2025-02-26 19:25:41.623	[2025-02-26T16:25:41.606+0000] {impl.py:170} INFO - Pool recreating
2025-02-26 19:25:41.623	[2025-02-26T16:25:41.604+0000] {impl.py:195} INFO - Pool disposed. Pool size: 5  Connections in pool: 0 Current Overflow: -5 Current Checked out connections: 0
2025-02-26 19:25:41.623	[2025-02-26T16:25:41.604+0000] {impl.py:195} INFO - Pool disposed. Pool size: 5  Connections in pool: 0 Current Overflow: -5 Current Checked out connections: 0
2025-02-26 19:25:41.623	[2025-02-26T16:25:41.604+0000] {impl.py:195} INFO - Pool disposed. Pool size: 5  Connections in pool: 0 Current Overflow: -5 Current Checked out connections: 0
2025-02-26 19:25:41.623	[2025-02-26T16:25:41.604+0000] {impl.py:195} INFO - Pool disposed. Pool size: 5  Connections in pool: 0 Current Overflow: -5 Current Checked out connections: 0
2025-02-26 19:25:41.623	[2025-02-26 16:25:41 +0000] [100] [INFO] Worker exiting (pid: 100)
2025-02-26 19:25:41.623	[2025-02-26 16:25:41 +0000] [101] [INFO] Worker exiting (pid: 101)
2025-02-26 19:25:41.623	[2025-02-26 16:25:41 +0000] [38] [INFO] Handling signal: term
2025-02-26 19:25:41.623	[2025-02-26 16:25:41 +0000] [96] [INFO] Worker exiting (pid: 96)
2025-02-26 19:25:41.623	[2025-02-26 16:25:41 +0000] [99] [INFO] Worker exiting (pid: 99)
2025-02-26 19:25:41.623	[2025-02-26 16:25:41 +0000] [97] [INFO] Worker exiting (pid: 97)
2025-02-26 19:25:41.623	[2025-02-26 16:25:41 +0000] [98] [INFO] Worker exiting (pid: 98)
2025-02-26 19:25:41.603	[2025-02-26T16:25:41.601+0000] {impl.py:195} INFO - Pool disposed. Pool size: 5  Connections in pool: 0 Current Overflow: -5 Current Checked out connections: 0
2025-02-26 19:25:41.603	[2025-02-26T16:25:41.601+0000] {impl.py:195} INFO - Pool disposed. Pool size: 5  Connections in pool: 0 Current Overflow: -5 Current Checked out connections: 0
2025-02-26 19:25:41.601	[2025-02-26 16:25:41 +0000] [102] [INFO] Worker exiting (pid: 102)
2025-02-26 19:25:41.599	[2025-02-26T16:25:41.598+0000] {impl.py:170} INFO - Pool recreating
2025-02-26 19:25:41.599	[2025-02-26T16:25:41.598+0000] {impl.py:170} INFO - Pool recreating
2025-02-26 19:25:41.598	[2025-02-26T16:25:41.597+0000] {impl.py:195} INFO - Pool disposed. Pool size: 5  Connections in pool: 0 Current Overflow: -5 Current Checked out connections: 0
2025-02-26 19:25:41.598	[2025-02-26T16:25:41.597+0000] {impl.py:195} INFO - Pool disposed. Pool size: 5  Connections in pool: 0 Current Overflow: -5 Current Checked out connections: 0
2025-02-26 19:25:41.598	[2025-02-26T16:25:41.597+0000] {impl.py:170} INFO - Pool recreating
2025-02-26 19:25:41.597	[2025-02-26T16:25:41.597+0000] {impl.py:170} INFO - Pool recreating
2025-02-26 19:25:41.597	[2025-02-26T16:25:41.596+0000] {impl.py:195} INFO - Pool disposed. Pool size: 5  Connections in pool: 0 Current Overflow: -5 Current Checked out connections: 0
2025-02-26 19:25:41.597	[2025-02-26T16:25:41.596+0000] {impl.py:195} INFO - Pool disposed. Pool size: 5  Connections in pool: 0 Current Overflow: -5 Current Checked out connections: 0
2025-02-26 19:25:41.596	[2025-02-26T16:25:41.594+0000] {webserver_command.py:429} INFO - Received signal: 15. Closing gunicorn.
2025-02-26 19:25:41.595	[2025-02-26 16:25:41 +0000] [95] [INFO] Worker exiting (pid: 95)
2025-02-26 19:25:41.595	[2025-02-26 16:25:41 +0000] [103] [INFO] Worker exiting (pid: 103)
2025-02-26 19:25:35.544	10.0.30.87 - - [26/Feb/2025:16:25:35 +0000] "GET /health HTTP/1.1" 200 318 "-" "kube-probe/1.32+"
2025-02-26 19:25:35.541	[2025-02-26T16:25:35.540+0000] {base.py:1092} INFO - COMMIT
2025-02-26 19:25:35.541	[2025-02-26T16:25:35.540+0000] {base.py:1092} INFO - COMMIT
2025-02-26 19:25:35.539	[2025-02-26T16:25:35.538+0000] {base.py:1868} INFO - [cached since 0.01995s ago] {'job_type_1': 'DagProcessorJob', 'param_1': <JobState.RUNNING: 'running'>, 'param_2': 0, 'param_3': 1, 'param_4': 1}
2025-02-26 19:25:35.539	[2025-02-26T16:25:35.538+0000] {base.py:1868} INFO - [cached since 0.01995s ago] {'job_type_1': 'DagProcessorJob', 'param_1': <JobState.RUNNING: 'running'>, 'param_2': 0, 'param_3': 1, 'param_4': 1}
2025-02-26 19:25:35.539	 LIMIT %(param_4)s
2025-02-26 19:25:35.539	WHERE job.job_type = %(job_type_1)s ORDER BY CASE job.state WHEN %(param_1)s THEN %(param_2)s ELSE %(param_3)s END, job.latest_heartbeat DESC 
2025-02-26 19:25:35.539	FROM job 
2025-02-26 19:25:35.539	[2025-02-26T16:25:35.537+0000] {base.py:1863} INFO - SELECT job.id, job.dag_id, job.state, job.job_type, job.start_date, job.end_date, job.latest_heartbeat, job.executor_class, job.hostname, job.unixname 
2025-02-26 19:25:35.539	 LIMIT %(param_4)s
2025-02-26 19:25:35.539	WHERE job.job_type = %(job_type_1)s ORDER BY CASE job.state WHEN %(param_1)s THEN %(param_2)s ELSE %(param_3)s END, job.latest_heartbeat DESC 
2025-02-26 19:25:35.539	FROM job 
2025-02-26 19:25:35.539	[2025-02-26T16:25:35.537+0000] {base.py:1863} INFO - SELECT job.id, job.dag_id, job.state, job.job_type, job.start_date, job.end_date, job.latest_heartbeat, job.executor_class, job.hostname, job.unixname 
2025-02-26 19:25:35.539	[2025-02-26T16:25:35.536+0000] {base.py:1032} INFO - BEGIN (implicit)
2025-02-26 19:25:35.537	[2025-02-26T16:25:35.536+0000] {base.py:1032} INFO - BEGIN (implicit)
2025-02-26 19:25:35.531	[2025-02-26T16:25:35.530+0000] {base.py:1092} INFO - COMMIT
2025-02-26 19:25:35.531	[2025-02-26T16:25:35.530+0000] {base.py:1092} INFO - COMMIT
2025-02-26 19:25:35.528	[2025-02-26T16:25:35.528+0000] {base.py:1868} INFO - [cached since 0.009677s ago] {'job_type_1': 'TriggererJob', 'param_1': <JobState.RUNNING: 'running'>, 'param_2': 0, 'param_3': 1, 'param_4': 1}
2025-02-26 19:25:35.528	[2025-02-26T16:25:35.528+0000] {base.py:1868} INFO - [cached since 0.009677s ago] {'job_type_1': 'TriggererJob', 'param_1': <JobState.RUNNING: 'running'>, 'param_2': 0, 'param_3': 1, 'param_4': 1}
2025-02-26 19:25:35.528	 LIMIT %(param_4)s
2025-02-26 19:25:35.528	WHERE job.job_type = %(job_type_1)s ORDER BY CASE job.state WHEN %(param_1)s THEN %(param_2)s ELSE %(param_3)s END, job.latest_heartbeat DESC 
2025-02-26 19:25:35.528	FROM job 
2025-02-26 19:25:35.528	[2025-02-26T16:25:35.527+0000] {base.py:1863} INFO - SELECT job.id, job.dag_id, job.state, job.job_type, job.start_date, job.end_date, job.latest_heartbeat, job.executor_class, job.hostname, job.unixname 
2025-02-26 19:25:35.528	 LIMIT %(param_4)s
2025-02-26 19:25:35.528	WHERE job.job_type = %(job_type_1)s ORDER BY CASE job.state WHEN %(param_1)s THEN %(param_2)s ELSE %(param_3)s END, job.latest_heartbeat DESC 
2025-02-26 19:25:35.527	FROM job 
2025-02-26 19:25:35.527	[2025-02-26T16:25:35.527+0000] {base.py:1863} INFO - SELECT job.id, job.dag_id, job.state, job.job_type, job.start_date, job.end_date, job.latest_heartbeat, job.executor_class, job.hostname, job.unixname 
2025-02-26 19:25:35.526	[2025-02-26T16:25:35.526+0000] {base.py:1032} INFO - BEGIN (implicit)
2025-02-26 19:25:35.526	[2025-02-26T16:25:35.526+0000] {base.py:1032} INFO - BEGIN (implicit)
2025-02-26 19:25:35.523	[2025-02-26T16:25:35.522+0000] {base.py:1092} INFO - COMMIT
2025-02-26 19:25:35.522	[2025-02-26T16:25:35.522+0000] {base.py:1092} INFO - COMMIT
2025-02-26 19:25:35.520	[2025-02-26T16:25:35.519+0000] {base.py:1868} INFO - [generated in 0.00081s] {'job_type_1': 'SchedulerJob', 'param_1': <JobState.RUNNING: 'running'>, 'param_2': 0, 'param_3': 1, 'param_4': 1}
2025-02-26 19:25:35.520	[2025-02-26T16:25:35.519+0000] {base.py:1868} INFO - [generated in 0.00081s] {'job_type_1': 'SchedulerJob', 'param_1': <JobState.RUNNING: 'running'>, 'param_2': 0, 'param_3': 1, 'param_4': 1}
2025-02-26 19:25:35.519	 LIMIT %(param_4)s
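
The excerpt above is pasted newest-first, which makes the shutdown sequence hard to follow. Below is a minimal sketch for reading it oldest-first, assuming every relevant line starts with the collector timestamp in the `YYYY-MM-DD HH:MM:SS.mmm` form seen above. The helper names `chronological` and `lifecycle_events` are hypothetical and exist only for this illustration; they are not part of Airflow or Gunicorn.

```python
import re

# Leading collector timestamp, e.g. "2025-02-26 19:25:41.646", then the message.
TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s+(.*)$")


def chronological(lines):
    """Return (timestamp, message) pairs oldest first; skip lines without a timestamp."""
    parsed = [m.groups() for m in map(TS.match, lines) if m]
    # Assumption: the paste is newest-first, so reverse it before the stable sort.
    # Lines sharing a millisecond then come out in reverse pasted order.
    # The fixed-width timestamp format sorts correctly as a plain string.
    return sorted(reversed(parsed), key=lambda pair: pair[0])


def lifecycle_events(lines):
    """Keep only probe requests and signal-related lines."""
    keep = ("GET /health", "Handling signal", "was sent SIGTERM", "Received signal")
    return [(ts, msg) for ts, msg in chronological(lines) if any(k in msg for k in keep)]


# Three lines copied from the excerpt above, still in pasted (newest-first) order.
sample = [
    "2025-02-26 19:25:41.636\t[2025-02-26 16:25:41 +0000] [38] [ERROR] Worker (pid:95) was sent SIGTERM!",
    "2025-02-26 19:25:41.623\t[2025-02-26 16:25:41 +0000] [38] [INFO] Handling signal: term",
    '2025-02-26 19:25:35.544\t10.0.30.87 - - [26/Feb/2025:16:25:35 +0000] "GET /health HTTP/1.1" 200 318 "-" "kube-probe/1.32+"',
]

for ts, msg in lifecycle_events(sample):
    print(ts, msg)
```

Run on the excerpt, this lists the `GET /health` request answered with 200 at 19:25:35 before the `Handling signal: term` line at 19:25:41.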

Custom Helm Values

config:
      # AIRFLOW__WEBSERVER__BASE_URL: http://airflow.example.com
      # https://airflow.apache.org/docs/apache-airflow/2.5.0/configurations-ref.html#config-metrics
      AIRFLOW__WEBSERVER__EXPOSE_CONFIG: "True"
      AIRFLOW__WEBSERVER__WORKERS: 10
      AIRFLOW__DATABASE__SQL_ALCHEMY_MAX_OVERFLOW: 100
      AIRFLOW__DATABASE__SQL_ALCHEMY_POOL_SIZE: 5
      AIRFLOW__METRICS__STATSD_ON: "True"
      AIRFLOW__METRICS__STATSD_PORT: 8080
      AIRFLOW__METRICS__STATSD_PREFIX: airflow
      AIRFLOW__METRICS__STATSD_HOST: "0.0.0.0"
      AIRFLOW__LOGGING__CELERY_LOGGING_LEVEL: INFO # Supported values: CRITICAL, ERROR, WARNING, INFO, DEBUG
      # AIRFLOW__LOGGING__DAG_PROCESSOR_LOG_FORMAT: "[%%(asctime)s] [SOURCE:DAG_PROCESSOR] {%%(filename)s:%%(lineno)d} %%(levelname)s - %%(message)s"
      AIRFLOW__LOGGING__DAG_PROCESSOR_LOG_TARGET: s3
      # AIRFLOW__LOGGING__DAG_PROCESSOR_MANAGER_LOG_STDOUT: true
      AIRFLOW__LOGGING__DELETE_LOCAL_LOGS: "True"
      AIRFLOW__LOGGING__ENCRYPT_S3_LOGS: "False"
      AIRFLOW__LOGGING__EXTRA_LOGGER_NAMES: connexion,sqlalchemy
      AIRFLOW__LOGGING__FAB_LOGGING_LEVEL: INFO
      AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID: aws_devops
      AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER: s3://airflow-bucket/logs
      AIRFLOW__LOGGING__REMOTE_LOGGING: "True"
      # AIRFLOW__LOGGING__FILE_TASK_HANDLER_NEW_FOLDER_PERMISSIONS: 0o775 # or [other writable: 0o777, different owner: 0o755, readonly: 0o700]
      AIRFLOW__LOGGING__LOGGING_LEVEL: INFO
      # AIRFLOW__LOGGING__LOG_FORMAT: "[%%(asctime)s] {%%(filename)s:%%(lineno)d} %%(levelname)s - %%(message)s"
      AIRFLOW__CORE__TEST_CONNECTION: Enabled
      AIRFLOW__WEBSERVER__WEB_SERVER_SSL_CERT: /devops/certs/server.crt
      AIRFLOW__WEBSERVER__WEB_SERVER_SSL_KEY: /devops/certs/server.key
      AIRFLOW__WEBSERVER__UPDATE_FAB_PERMS: "False"
---
 web:
    replicas: 2
    resources:
      requests:
        cpu: 700m
        memory: 1200Mi
        ephemeral-storage: 1Gi
      limits:
        memory: 2500Mi
    podLabels:
      component: web
    readinessProbe:
      periodSeconds: 60
    livenessProbe:
      periodSeconds: 60
    safeToEvict: true
    podDisruptionBudget:
      enabled: false
      apiVersion: policy/v1
      maxUnavailable: ""
      minAvailable: ""
    extraPipPackages:
      - airflow-exporter>=1.5,<2 # https://github.com/epoch8/airflow-exporter
      - apache-airflow[amazon]>=2.7,<3
      - apache-airflow[google_auth]>=2.7,<3
    extraVolumeMounts:
    - name: ssl-cert-volume
      mountPath: /devops
    extraVolumes:
    - name: ssl-cert-volume
      emptyDir: {}
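
For context on the database settings in the values above, here is a rough sketch of the worst-case number of metadata-database connections the webserver tier could open. It is not a diagnosis. It assumes each Gunicorn worker process keeps its own SQLAlchemy pool (capped at `pool_size + max_overflow`) and that both replicas run the same settings. The function name `max_db_connections` is hypothetical.

```python
def max_db_connections(workers, pool_size, max_overflow, replicas=1):
    """Upper bound on connections if every worker fills its pool and overflow."""
    per_process = pool_size + max_overflow  # SQLAlchemy QueuePool ceiling per process
    return workers * per_process * replicas


# Values taken from the Helm values above:
# AIRFLOW__WEBSERVER__WORKERS=10, SQL_ALCHEMY_POOL_SIZE=5,
# SQL_ALCHEMY_MAX_OVERFLOW=100, web.replicas=2
ceiling = max_db_connections(workers=10, pool_size=5, max_overflow=100, replicas=2)
print(ceiling)  # 2100
```

This counts only the webserver pods; the scheduler, triggerer, and workers open their own connections on top of it.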
@yalattas added the kind/bug (kind - things not working properly) label Feb 26, 2025