[Bug] DAG Run not being marked as successful due to failing telemetry call #1438

Closed
1 task done
mjohansenwork opened this issue Jan 2, 2025 · 2 comments · Fixed by #1439
Assignees
Labels
area:execution: Related to the execution environment/mode, like Docker, Kubernetes, Local, VirtualEnv, etc.
bug: Something isn't working
triage-needed: Items need to be reviewed / assigned to milestone

Comments

@mjohansenwork

Astronomer Cosmos Version

1.8.0

dbt-core version

1.8.6

Versions of dbt adapters

No response

LoadMode

AUTOMATIC

ExecutionMode

AWS_EKS

InvocationMode

None

airflow version

2.10.3

Operating System

Astro docker image

If you think it's a UI issue, what browsers are you seeing the problem on?

No response

Deployment

Other Docker-based deployment

Deployment details

Astronomer local docker image

What happened?

After updating cosmos from 1.7.0 to 1.8.0, the DAG run is not being marked as successful even after the final task succeeds. The scheduler logs show an HTTPS error when the DAG run telemetry listener tries to submit telemetry data. This is likely because the telemetry URL is not permitted by the firewall on my machine. The issue appears to be related to the telemetry support added in #1397. Specifically, it seems like the on_dag_run_success() hook throws an exception when the HTTPS call fails, which prevents Airflow from proceeding with the DAG run lifecycle and actually marking the DAG run as successful.
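To illustrate the failure mode described above: if a listener hook lets an exception escape, whatever the caller would have done after the hook never runs. Below is a minimal sketch of one defensive pattern, a decorator that logs and swallows any exception raised by a hook. The decorator name `never_raise` and the stand-in hook body are hypothetical and are not Cosmos or Airflow code; the `dag_run` and `msg` parameters are illustrative, and how this composes with Airflow's plugin registration is not shown here.

```python
import functools
import logging

logger = logging.getLogger(__name__)


def never_raise(hook):
    """Wrap a listener hook so any exception is logged instead of propagated."""

    @functools.wraps(hook)
    def wrapper(*args, **kwargs):
        try:
            return hook(*args, **kwargs)
        except Exception:
            # Log the full traceback, then return None so the caller continues.
            logger.exception("Listener hook %s failed; ignoring", hook.__name__)
            return None

    return wrapper


@never_raise
def on_dag_run_success(dag_run, msg):
    # Hypothetical stand-in for the telemetry call: simulate the failing HTTPS request.
    raise ConnectionError("certificate verify failed")


# The error is logged and the call returns None rather than raising.
on_dag_run_success(dag_run=None, msg="")
```

Swallowing every exception is a deliberate trade-off here: telemetry is best-effort, so a failure in it should arguably never affect the DAG run state.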

I tried setting the env var DO_NOT_TRACK=True to disable telemetry collection and that seems to have resolved the issue.

Relevant log output

File "/usr/local/lib/python3.12/site-packages/cosmos/listeners/dag_run_listener.py", line 60, in on_dag_run_success
     telemetry.emit_usage_metrics_if_enabled(DAG_RUN, additional_telemetry_metrics)
   File "/usr/local/lib/python3.12/site-packages/cosmos/telemetry.py", line 73, in emit_usage_metrics_if_enabled
     is_success = emit_usage_metrics(metrics)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/usr/local/lib/python3.12/site-packages/cosmos/telemetry.py", line 50, in emit_usage_metrics
     response = httpx.get(telemetry_url, timeout=constants.TELEMETRY_TIMEOUT, follow_redirects=True)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/usr/local/lib/python3.12/site-packages/httpx/_api.py", line 198, in get
     return request(
            ^^^^^^^^
   File "/usr/local/lib/python3.12/site-packages/httpx/_api.py", line 106, in request
     return client.request(
            ^^^^^^^^^^^^^^^
   File "/usr/local/lib/python3.12/site-packages/httpx/_client.py", line 827, in request
     return self.send(request, auth=auth, follow_redirects=follow_redirects)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/usr/local/lib/python3.12/site-packages/httpx/_client.py", line 914, in send
     response = self._send_handling_auth(
                ^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/usr/local/lib/python3.12/site-packages/httpx/_client.py", line 942, in _send_handling_auth
     response = self._send_handling_redirects(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/usr/local/lib/python3.12/site-packages/httpx/_client.py", line 979, in _send_handling_redirects
     response = self._send_single_request(request)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/usr/local/lib/python3.12/site-packages/httpx/_client.py", line 1015, in _send_single_request
     response = transport.handle_request(request)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/usr/local/lib/python3.12/site-packages/httpx/_transports/default.py", line 232, in handle_request
     with map_httpcore_exceptions():
          ^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/usr/local/lib/python3.12/contextlib.py", line 158, in __exit__
     self.gen.throw(value)
   File "/usr/local/lib/python3.12/site-packages/httpx/_transports/default.py", line 86, in map_httpcore_exceptions
     raise mapped_exc(message) from exc
 httpx.ConnectError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self-signed certificate in certificate chain (_ssl.c:1000)

How to reproduce

This depends on the particular firewall configuration on our machines, so it's likely not easily reproducible by external parties.

Anything else :)?

It looks like there is some error handling code here https://github.com/astronomer/astronomer-cosmos/pull/1397/files#diff-e39094327c419564d75b9530a764f213c57e83b62492e39d3fd042344b779458R50, but it doesn't seem to cover the case where httpx.ConnectError: [SSL: CERTIFICATE_VERIFY_FAILED] is raised by the HTTP call itself. Perhaps the call needs to be wrapped in a try/except block.
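The try/except idea could look roughly like the sketch below. To keep it runnable without extra dependencies it uses the standard library's `urllib` in place of httpx, so it is not the Cosmos code and not necessarily what the eventual fix does. The `opener` parameter and the default timeout are assumptions added for illustration and testing. With httpx, the analogous `except` clause would name `httpx.HTTPError`, which is the base class of `httpx.ConnectError`.

```python
import logging
import urllib.request

logger = logging.getLogger(__name__)


def emit_usage_metrics(telemetry_url, timeout=1.0, opener=urllib.request.urlopen):
    """Send one telemetry request; return False instead of raising on network errors."""
    try:
        with opener(telemetry_url, timeout=timeout) as response:
            status = response.status
    except OSError as exc:
        # urllib.error.URLError, ssl.SSLError and TimeoutError all derive from OSError,
        # so certificate failures, refused connections and timeouts end up here.
        logger.warning("Telemetry request to %s failed: %s", telemetry_url, exc)
        return False
    # Treat only 2xx responses as a successful emission.
    return 200 <= status < 300
```

Returning a boolean keeps the caller's existing `is_success = emit_usage_metrics(metrics)` style of check usable, while a blocked or intercepted HTTPS connection only produces a warning in the logs.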

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Contact Details

No response

@mjohansenwork mjohansenwork added the bug and triage-needed labels Jan 2, 2025
@mjohansenwork mjohansenwork changed the title from "[Bug]" to "[Bug] DAG Run not being marked as successful due to failing telemetry call" Jan 2, 2025

dosubot bot commented Jan 2, 2025

The issue you're experiencing is related to the telemetry feature in Cosmos, which attempts to send data over HTTPS. The error occurs because the on_dag_run_success() hook does not handle httpx.ConnectError exceptions, specifically SSL certificate verification failures, which prevents the DAG run from being marked as successful.

To resolve this, you can disable telemetry by setting the environment variable DO_NOT_TRACK=True, which you've already done successfully. This prevents the telemetry data from being sent, allowing the DAG run to complete without errors.
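As a rough illustration of how an opt-out variable like this can gate telemetry, here is a minimal sketch. The helper name `telemetry_opted_out` and the set of accepted truthy spellings are assumptions; how Cosmos actually parses `DO_NOT_TRACK` is not shown in this thread.

```python
import os

# Hypothetical set of values treated as "opted out"; the real parsing may differ.
_TRUTHY = {"1", "true", "yes", "on"}


def telemetry_opted_out(environ=os.environ):
    """Return True when DO_NOT_TRACK is set to a truthy value."""
    value = environ.get("DO_NOT_TRACK", "")
    return value.strip().lower() in _TRUTHY


# Check the opt-out before attempting any network call.
if telemetry_opted_out():
    print("telemetry disabled, skipping request")
```

Checking the flag before building or sending the request means no HTTPS call is attempted at all, which is why the workaround sidesteps the certificate error.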

For a more permanent solution, you could improve the error handling in the emit_usage_metrics function by wrapping the HTTP request in a try/except block to catch httpx.ConnectError exceptions. This would prevent such errors from affecting the DAG run lifecycle. If you're willing to submit a pull request, you could implement this change to enhance the resilience of the telemetry feature.

@dosubot dosubot bot added the area:execution Related to the execution environment/mode, like Docker, Kubernetes, Local, VirtualEnv, etc label Jan 2, 2025
@tatiana tatiana self-assigned this Jan 3, 2025
tatiana added a commit that referenced this issue Jan 3, 2025
Closes: #1438

A user reported that httpx raised an exception while trying to emit telemetry, which prevented the DAG run from being marked as successful.
This PR aims to solve the issue:

```
 httpx.ConnectError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self-signed certificate in certificate chain (_ssl.c:1000)
```
@tatiana
Collaborator

tatiana commented Jan 3, 2025

Thank you very much for reporting this, @mjohansenwork !
I just opened a PR with the fix, #1439. We aim to release it early next week as part of Cosmos 1.8.2.

@tatiana tatiana closed this as completed in e002858 Jan 3, 2025
tatiana added a commit to astronomer/dag-factory that referenced this issue Jan 3, 2025
Handle errors similar to:
```
 httpx.ConnectError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self-signed certificate in certificate chain (_ssl.c:1000)
```

As observed in Cosmos:
astronomer/astronomer-cosmos#1438