
feat(backend): add e2e useful tracing #2037

Merged: 2 commits into develop on Apr 10, 2024

Conversation

@fatih-acar (Contributor) commented on Jan 25, 2024

Enabled the tracing component inside the git agent in order to have end-to-end spans for a single request.
Added spans at the most useful places, such as the database's execute_query.

You can use the Jaeger all-in-one container (or Grafana Tempo, as you wish :)) to try it out; don't forget to enable tracing in the config.

Note the custom span I added as an example on the BranchCreate mutation. You can either use the decorator or the longer start_as_current_span context manager if you want custom dynamic attributes.
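
For reference, here is a minimal sketch of the two styles using the opentelemetry API; the span name and attribute below are illustrative placeholders rather than the exact ones used in this PR:

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)


# Decorator style: the entire function body runs inside a span.
@tracer.start_as_current_span("branch-create")
def create_branch(name: str) -> None:
    ...


# Context-manager style: useful when you want to attach dynamic attributes.
def create_branch_verbose(name: str) -> None:
    with tracer.start_as_current_span("branch-create") as span:
        span.set_attribute("branch.name", name)  # illustrative attribute
        ...
```

Without a configured TracerProvider both calls are no-ops, so the snippet is safe to run on its own.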

Note: I will leave this as a draft since the internet says the OpenTelemetry lib slows down async apps; I'd like to ensure there is no perf regression with this one...

This is how it renders in Jaeger:

[screenshot: the resulting trace in the Jaeger UI]

@dgarros (Collaborator) left a comment


Looks like a great improvement, well done.

It's probably nothing, but just in case: looking at open-telemetry/opentelemetry-python-contrib#1835 (comment), it looks like we need to add an ENTRYPOINT, but that doesn't seem to be part of this PR... is that on purpose, or should we add it?

@dgarros dgarros requested a review from ogenstad January 26, 2024 07:39
@ogenstad (Contributor) commented:

Nice :) I'll take a closer look, but first a comment and a question.

Type hints

We get a type hint error due to the endpoint; we already have the same thing in infrahub/server.py, but that file is ignored. I think this should be fixed on the config side, i.e. infrahub/config.py, so that either the endpoint is required or we use a property that returns a string. (This could also be handled as a separate PR, as it's not related to your work here.)
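
One possible shape for that config-side fix, sketched with a plain dataclass (the class and field names here are hypothetical, not the actual infrahub/config.py layout):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TraceSettings:  # hypothetical stand-in for the real settings class
    enable: bool = False
    # Kept optional so tracing can stay disabled by default in the config.
    exporter_endpoint: Optional[str] = None

    @property
    def endpoint(self) -> str:
        """Always return a str so callers don't have to narrow an Optional."""
        if self.exporter_endpoint is None:
            raise ValueError("tracing is enabled but no exporter endpoint is configured")
        return self.exporter_endpoint
```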

Asyncio

As you mention in the issue text, there might be some async/sync concerns. When we set up this tracing originally, we left it off by default in the config; I have some memory that the library used for this wasn't using asyncio and that it needed to export traces to an external URL. I haven't looked into the libraries used for this, but I can at least see that we're now executing the messages within a synchronous context manager. Would that mean we'd be sending synchronous requests when we do so? We already had a sync context manager for the db queries, but that part didn't touch the network; it only allowed information to be gathered from the API server (or the HTTP endpoint on the git agents).

If we could abstract `with get_tracer(__name__).start_as_current_span("execute_message") as span:` into something that gets replaced depending on how the config is set up, I think we could perhaps move forward regardless. I.e. if tracing wasn't enabled, we'd just use a dummy context manager that didn't actually do anything. But it would be great if there was some work to add proper asyncio support; have you seen any indication that this is being worked on?
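
A minimal sketch of that idea, assuming a module-level config flag (the flag and helper names are hypothetical):

```python
from contextlib import nullcontext
from typing import ContextManager, Optional

from opentelemetry import trace
from opentelemetry.trace import Span

TRACING_ENABLED = False  # would be read from the config in practice


def maybe_span(name: str) -> ContextManager[Optional[Span]]:
    """Return a real span when tracing is enabled, otherwise a no-op context manager."""
    if TRACING_ENABLED:
        return trace.get_tracer(__name__).start_as_current_span(name)
    return nullcontext()


# Call sites stay identical whether tracing is on or off.
with maybe_span("execute_message") as span:
    if span is not None:
        span.set_attribute("messaging.operation", "process")  # illustrative attribute
```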

@fatih-acar (Contributor, Author) commented:

> Looks like a great improvement, well done.
>
> It's probably nothing, but just in case: looking at open-telemetry/opentelemetry-python-contrib#1835 (comment), it looks like we need to add an ENTRYPOINT, but that doesn't seem to be part of this PR... is that on purpose, or should we add it?

This is required to apply the monkeypatch through Poetry, but I couldn't make it work.
I used a manual call to the patch function instead.

@fatih-acar (Contributor, Author) commented:

Thanks Patrick :)

> Asyncio
>
> As you mention in the issue text, there might be some async/sync concerns. When we set up this tracing originally, we left it off by default in the config; I have some memory that the library used for this wasn't using asyncio and that it needed to export traces to an external URL. I haven't looked into the libraries used for this, but I can at least see that we're now executing the messages within a synchronous context manager. Would that mean we'd be sending synchronous requests when we do so? We already had a sync context manager for the db queries, but that part didn't touch the network; it only allowed information to be gathered from the API server (or the HTTP endpoint on the git agents).
>
> If we could abstract `with get_tracer(__name__).start_as_current_span("execute_message") as span:` into something that gets replaced depending on how the config is set up, I think we could perhaps move forward regardless. I.e. if tracing wasn't enabled, we'd just use a dummy context manager that didn't actually do anything. But it would be great if there was some work to add proper asyncio support; have you seen any indication that this is being worked on?

Yes, this is exactly my point. The otel library cannot use async context managers yet, so as you mentioned, traces seem to be exported synchronously, which would lead to a performance drop. I'll take a look and maybe implement what you proposed if we can't make much progress. At least the feature would be available for debugging purposes, even if performance is worse when it's enabled.

@fatih-acar (Contributor, Author) commented:

Some pointers for later:

WIP PR to add the async decorator upstream (so we may be able to remove the third-party otel_extensions lib): open-telemetry/opentelemetry-python#3633

For the sync context issue, this might be the solution, but we still have to confirm the performance impact ourselves: open-telemetry/opentelemetry-python#62 (comment)

@github-actions github-actions bot added the group/ci (Issue related to the CI pipeline) label on Mar 31, 2024
@fatih-acar fatih-acar force-pushed the fac-improve-tracing branch 4 times, most recently from 2abd296 to 8e109ae on March 31, 2024 at 15:43
@fatih-acar (Contributor, Author) commented on Apr 1, 2024

I managed to remove the dependency on the otel-extensions third-party library now that https://github.com/open-telemetry/opentelemetry-python/releases/tag/v1.24.0 has been released.

I don't see any performance issue in the different test suites we have: we make use of BatchSpanProcessor, which batches the exports as the name suggests, so the performance overhead should be minimal as long as we don't have too many concurrent requests (I guess it could become problematic in that case). There is WIP to implement an async exporter: open-telemetry/opentelemetry-python#3489, and here is a list of the different upstream tasks around that: open-telemetry/opentelemetry-python#62 (comment) (not up to date).
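
For context, the batching comes from wiring the exporter through a BatchSpanProcessor, roughly as below; the endpoint and service name are placeholders, not Infrahub's actual configuration:

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Spans are queued in memory and exported in batches on a background thread,
# so individual requests don't block on the network call to the collector.
provider = TracerProvider(resource=Resource.create({"service.name": "example-service"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)
```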

We can either wait for this last feature to be done upstream, or merge this as-is... I don't have a strong opinion on this one, but at least the feature is ready.

I have also enabled tracing for the CI e2e tests.

And the message bus RPCs are now auto-instrumented (less custom code) thanks to 501a1f8.
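
If the message bus client is aio-pika (an assumption on my part; the referenced commit isn't shown here), the auto-instrumentation typically boils down to something like this:

```python
# Assumes the opentelemetry-instrumentation-aio-pika contrib package is installed
# and that the broker client is aio-pika; adjust for the client actually in use.
from opentelemetry.instrumentation.aio_pika import AioPikaInstrumentor

# Patches publish/consume so each message gets a span automatically,
# replacing hand-written start_as_current_span calls around RPC handling.
AioPikaInstrumentor().instrument()
```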

@fatih-acar fatih-acar marked this pull request as ready for review April 1, 2024 09:15
@ogenstad (Contributor) commented on Apr 9, 2024

@fatih-acar, is the tooling around this in a state where you'd want to merge it now? I missed that it was marked for review again. :(

As it's semi-related, it would be nice to include the cleanup required to remove this from pyproject.toml (not required, just nice :) ):

[[tool.mypy.overrides]]
module = "infrahub.server"
ignore_errors = true

codspeed-hq bot commented on Apr 9, 2024

CodSpeed Performance Report

Merging #2037 will not alter performance

Comparing fac-improve-tracing (44cbe7b) with develop (bb765ca)

Summary

✅ 12 untouched benchmarks

@fatih-acar (Contributor, Author) commented:

@ogenstad yes, as stated in my previous comment, I think we can merge this now since I've removed the third-party lib I previously used. There's still the async exporter issue upstream that won't be resolved anytime soon, I guess... but batching should help avoid perf issues.

I have removed the mypy ignore statement; tests are still passing :)

@ogenstad (Contributor) left a comment


LGTM, sorry for the delay.

Review comment on backend/infrahub/config.py (outdated, resolved)
@fatih-acar fatih-acar force-pushed the fac-improve-tracing branch 5 times, most recently from fa756d9 to ace1aae on April 9, 2024 at 21:16
Enabled tracing component inside git agent in order to have end to end
spans for a single request.
Added spans at most useful places such as message bus' execute_message
and database execute_query.

Signed-off-by: Fatih Acar <fatih@opsmill.com>
Signed-off-by: Fatih Acar <fatih@opsmill.com>
@fatih-acar fatih-acar merged commit d8d35c7 into develop Apr 10, 2024
44 checks passed
@fatih-acar fatih-acar deleted the fac-improve-tracing branch April 10, 2024 13:25
@fatih-acar fatih-acar added the type/feature (New feature or request) label on Apr 23, 2024
Labels: group/backend (Issue related to the backend: API Server, Git Agent), group/ci (Issue related to the CI pipeline), type/feature (New feature or request)
3 participants