Add default Prometheus metrics for Thrift server #657
Conversation
baseplate/server/__init__.py
Outdated
start_prometheus_exporter()
from baseplate.server.prometheus import start_prometheus_exporter

start_prometheus_exporter()
I'm not totally sold on requiring prometheus on everything that uses baseplate libs. Was this discussed somewhere?
Since the company is migrating from Wavefront to Prometheus, we want to add the same functionality that exists for the statsd exporter, which ships metrics to Wavefront. The baseplate.spec documents what the baseplate frameworks should implement, and there is a section for Prometheus metrics: https://github.snooguts.net/reddit/baseplate.spec/blob/master/component-apis/prom-metrics.md
[discussed offline] Let's swap this back to the try block
@nsheaps A follow-up from our previous discussions about replacing this try/except block with a check of the config for whether metrics are enabled:
It turns out there are two different ways to enable the current statsd metrics (and you cannot set both at the same time):
- metrics.tagging enables the tagged metrics (described here)
- metrics.namespace enables the non-tagged metrics (described here)
Searching in Sourcegraph, it looks like there are approx. 82 repos that set metrics.namespace and 20 repos that set metrics.tagging. For reference, here are the Sourcegraph queries I ran:
repo:reddit-service-* repo:contains.file(requirements.txt) file:ini metrics.tagging select:repo
repo:reddit-service-* repo:contains.file(requirements.txt) file:ini metrics.namespace select:repo
Since the goal is to enable Prometheus metrics alongside the current statsd metrics (until Wavefront is turned off), we need to enable Prometheus when either metrics.tagging or metrics.namespace is set.
Another consideration is to create a new config option specifically for Prometheus. It would be nice to have a separate config for the Prometheus metrics to keep it untangled from the current metrics config options; however, there would then be no way to turn on Prometheus automatically without updating the .ini file, which would make upgrades not backwards compatible. But I'm open to a discussion either way.
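For illustration, a minimal sketch of that enablement check; should_enable_prometheus is a hypothetical helper, and app_config is assumed to be the raw dict of dotted config keys parsed from the service's .ini file:

def should_enable_prometheus(app_config: dict) -> bool:
    # Prometheus runs alongside the existing statsd metrics, so it is enabled
    # whenever either of the current statsd options is configured.
    return bool(
        app_config.get("metrics.tagging") or app_config.get("metrics.namespace")
    )

# Hypothetical usage during server startup:
# if should_enable_prometheus(app_config):
#     start_prometheus_exporter()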
Thanks for all the info!
I think the way you have it implemented is fine, and easiest to upgrade to without requiring a change to configs.
I do see at least one project that actually has both of those configs, a difference between development and production.
I think a modified config section might be warranted in a future version, addressed separately, since the configurations for both metrics.namespace and metrics.tagging are mostly about where to send the metrics, which doesn't map well onto Prometheus' poll-based model. We can definitely address the implementation of it separately, but I'd picture the config being somewhat pluggable, such as the example below. The reason this didn't come up is that the Prometheus observer doesn't actually have a config at the moment, though I could imagine adding things like TLS configuration. Rolling this out can be done more easily with a batch change later, without blocking this progress.
metrics.enabled = true # used to be metrics.tagging, more understandable
metrics.whitelist = success, error, endpoint, client
# statsd/telegraf config
metrics.statsd.enabled = true
metrics.statsd.namespace = mine
metrics.statsd.endpoint = telegraf:8125
metrics.statsd.swallow_network_errors = true
# prometheus config
metrics.prometheus.enabled = true
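As a rough sketch of how that proposed layout could be parsed, assuming baseplate.lib.config's parse_config helpers (Optional, Boolean, String, TupleOf, Endpoint); none of these keys exist in baseplate today, so the spec below is purely illustrative:

from baseplate.lib import config

def parse_metrics_config(raw_config: dict):
    # Hypothetical spec mirroring the proposed layout above.
    return config.parse_config(
        raw_config,
        {
            "metrics": {
                "enabled": config.Optional(config.Boolean, default=False),
                "whitelist": config.Optional(config.TupleOf(config.String), default=()),
                "statsd": {
                    "enabled": config.Optional(config.Boolean, default=False),
                    "namespace": config.Optional(config.String, default=""),
                    "endpoint": config.Optional(config.Endpoint),
                    "swallow_network_errors": config.Optional(config.Boolean, default=False),
                },
                "prometheus": {
                    "enabled": config.Optional(config.Boolean, default=False),
                },
            }
        },
    )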
Since this is one of my first baseplate PRs, can we hop on a call and go over it together? Sorry for the delay
@nsheaps Thanks for the help with the review. Yes, jumping on a call would be great. I will DM you to set it up.
@@ -203,7 +203,7 @@ def _call_thrift_method(self: Any, *args: Any, **kwargs: Any) -> Any:
except Error as exc:
    # a 5xx error is an unexpected exception but not 5xx are
    # not.
🙃
def protocol(self) -> str:
    return self.tags.get("protocol", "unknown")

def set_metrics_by_protocol(self) -> None:
I've been staring myself a bit blind trying to figure this out, so someone more familiar with baseplate should verify this. I think we're breaking the abstraction here. We should implement a Prometheus Span observer, but it should only be used for shipping the results of span traces to prometheus. The actual metrics specific to thrift or http should either be implemented as spans in the server (e.g. making a child span in the thrift server when handling a request), or - my favorite - just directly in the thrift or http server.
i like this idea. i would love to implement this differently. i like the idea of doing the metrics directly in the server. let's discuss more
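To illustrate the "metrics directly in the server" idea, a minimal sketch using prometheus_client; the metric names, labels, and the handle wrapper are hypothetical, and the real names would come from baseplate.spec:

import time

from prometheus_client import Counter, Histogram

# Hypothetical metric families for the Thrift server.
THRIFT_REQUESTS = Counter(
    "thrift_server_requests_total",
    "Total Thrift requests handled",
    ["method", "success"],
)
THRIFT_LATENCY = Histogram(
    "thrift_server_latency_seconds",
    "Thrift request handling latency in seconds",
    ["method"],
)

def handle(method_name, fn, *args, **kwargs):
    # Record latency and a success/failure count around the handler call itself,
    # without going through a span observer.
    start = time.perf_counter()
    success = "true"
    try:
        return fn(*args, **kwargs)
    except Exception:
        success = "false"
        raise
    finally:
        THRIFT_LATENCY.labels(method=method_name).observe(time.perf_counter() - start)
        THRIFT_REQUESTS.labels(method=method_name, success=success).inc()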
@@ -237,6 +237,7 @@ def _on_new_request(self, event: pyramid.events.ContextFound) -> None:
    name=request.matched_route.name,
    trace_info=trace_info,
)
span.set_tag("protocol", "http")
🔕 does this need an https distinction?
Answer: no
Spoke offline and got a full rundown of how this is working, future expectations, and the like.
A few points to document:
- For reddit's purposes, this is more than fine, as we can make a solid call of "we support thrift, HTTP and grpc", but the mechanism for "if span protocol is HTTP" could likely be improved if it needed to be more flexible
- Similarly, the span observers called from the switch statement are explicit due to the slight disparities in metric names between the traced protocols. It would be great to have a generic generator that you pass the protocol into and get one made for you (or a generic observer that looks at the protocol and adjusts the metric names as necessary); see the sketch after this list
- Splitting the 4 implementation tasks was very helpful for review, thank you
- gRPC is coming, but not in this PR, even in a "not implemented" capacity (like how HTTP is done here)
- Some improvements we discussed are likely not worth the effort for perfection, considering the long-term trend towards Go
- Because of the explicit requirement of metrics collection, we finally decided that the prometheus client is required if metrics are turned on, rather than an optional import that logs a warning
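As a sketch of the generic generator mentioned above (names hypothetical; the actual metric names are defined in baseplate.spec), one factory could build the per-protocol metric families instead of hand-writing a separate observer per protocol:

from prometheus_client import Counter, Histogram

def make_server_metrics(protocol: str):
    # Build metric families whose names follow the protocol, so the thrift,
    # HTTP, and grpc observers can share one code path.
    requests = Counter(
        f"{protocol}_server_requests_total",
        f"Total {protocol} server requests",
        ["endpoint", "success"],
    )
    latency = Histogram(
        f"{protocol}_server_latency_seconds",
        f"{protocol} server request latency in seconds",
        ["endpoint"],
    )
    return requests, latency

# Hypothetical usage:
# thrift_requests, thrift_latency = make_server_metrics("thrift")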
# mark 5xx errors as failures since those are still "unexpected"
if exc.code // 100 == 5:
if 500 <= exc.code < 600:
🔕 what other errors come through this code path? Should they maybe be split out into specific exceptions?
sys.exc_info() will indicate what the exception is. I'm not sure what you mean by "split out"?
except TException:
except TException as e:
    span.set_tag("exception_type", type(e).__name__)
    span.set_tag("success", "false")
🔕 should there be a span.set_tag("success", "true") somewhere?
When span.finish() gets called, it calls the on_finish method in the Prometheus observer, and span.set_tag("success", "true") gets set if there are no exceptions.
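For reference, a minimal sketch of that on_finish behavior (the class and attribute names are hypothetical, and the actual metric updates are omitted):

from typing import Any, Optional, Tuple

class PrometheusServerSpanObserver:
    def __init__(self) -> None:
        self.tags: dict = {}

    def set_tag(self, key: str, value: Any) -> None:
        self.tags[key] = value

    def on_finish(self, exc_info: Optional[Tuple] = None) -> None:
        # "success" is only decided when the span finishes: an exception that
        # reached the span marks it false, otherwise it is marked true.
        if exc_info is not None:
            self.set_tag("exception_type", exc_info[0].__name__)
            self.set_tag("success", "false")
        else:
            self.set_tag("success", "true")
        # ...update the Prometheus counters/histograms from self.tags here...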
The original PR was getting too big, so this PR splits out one part. There will be approx. 4 PRs split out in total.
Background:
We are adding support to export Prometheus metrics from baseplate.py services. As of baseplate.py v2.3.0, baseplate-serve also runs a server that exports Prometheus metrics on the /metrics endpoint on port 6060. This PR adds the default Prometheus metrics for Thrift servers. The default Thrift metrics to be exported are defined in the baseplate.spec.
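For context, a minimal sketch of what such an exporter can look like with prometheus_client's multiprocess support; the function name and default port simply mirror the description above, and the details are illustrative rather than baseplate's actual implementation:

from prometheus_client import CollectorRegistry, multiprocess, start_http_server

def start_prometheus_exporter(port: int = 6060) -> None:
    # Aggregate metrics written by all worker processes; this relies on the
    # PROMETHEUS_MULTIPROC_DIR environment variable being set.
    registry = CollectorRegistry()
    multiprocess.MultiProcessCollector(registry)
    # Serve the metrics over HTTP so Prometheus can scrape /metrics.
    start_http_server(port, registry=registry)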
Testing:
I made a test thrift app from baseplate-cookiecutter and tested out this new Prometheus code. Here are the steps I used to test:
- Make sure prometheus-client==0.12.0 is in the requirements.txt file
- Set PROMETHEUS_MULTIPROC_DIR to a directory where metrics from multiprocess python can be written (this is already included in the prod dockerfile)
- Set metrics.tagging or metrics.namespace in the service's ini config file
- Run make thrift, then run baseplate-serve example.ini
- Hit the is_healthy endpoint
It appears the HELP text is clobbered in multiprocess mode (see issue).