Add default Prometheus metrics for Thrift server #657

Merged
10 commits merged into reddit:develop from prom-metrics-thrift-server on Apr 12, 2022

Conversation

JessicaGreben
Contributor

@JessicaGreben commented Mar 28, 2022

The original PR was getting too big, so this PR splits out one part. There will be approximately 4 PRs split out in total:

  • 1st PR (this PR) adds default Prometheus metrics for the Thrift server
  • 2nd PR will do the same for the Thrift client
  • 3rd PR will add default Prometheus metrics for the HTTP server
  • 4th PR will do the same for the HTTP client

Background:
We are adding support for exporting Prometheus metrics from baseplate.py services. As of baseplate.py v2.3.0, baseplate-serve also runs a server that exports Prometheus metrics at the /metrics endpoint on port 6060.

This PR adds the default Prometheus metrics for Thrift servers. The default Thrift metrics to be exported are defined in the baseplate.spec.

Testing:
I made a test Thrift app from baseplate-cookiecutter and tested this new Prometheus code. Here are the steps I used to test:

  1. Create a Thrift service with baseplate-cookiecutter (make sure it uses baseplate v2.3.0 or greater) so that Prometheus metrics are exported on port 6060 by default.
  2. Make sure prometheus-client==0.12.0 is in the requirements.txt file.
  3. Set the PROMETHEUS_MULTIPROC_DIR environment variable to a directory where metrics from multiprocess Python can be written (this is already included in the prod Dockerfile).
  4. Set either metrics.tagging or metrics.namespace in the service's .ini config file (see the config sketch after the metrics output below).
  5. Build the Thrift code with make thrift, then run baseplate-serve example.ini.
  6. Use the test client to hit the is_healthy endpoint.
  7. Check the Prometheus metrics that were created:
$ curl localhost:6060/metrics
# HELP thrift_server_active_requests Multiprocess metric
# TYPE thrift_server_active_requests gauge
thrift_server_active_requests{pid="23318",thrift_method="is_healthy"} 0.0
# HELP thrift_server_latency_seconds Multiprocess metric
# TYPE thrift_server_latency_seconds histogram
thrift_server_latency_seconds_sum{thrift_method="is_healthy",thrift_success="true"} 0.041624628
thrift_server_latency_seconds_bucket{le="0.0001",thrift_method="is_healthy",thrift_success="true"} 0.0
thrift_server_latency_seconds_bucket{le="0.00025",thrift_method="is_healthy",thrift_success="true"} 0.0
thrift_server_latency_seconds_bucket{le="0.000625",thrift_method="is_healthy",thrift_success="true"} 0.0
thrift_server_latency_seconds_bucket{le="0.0015625",thrift_method="is_healthy",thrift_success="true"} 0.0
thrift_server_latency_seconds_bucket{le="0.00390625",thrift_method="is_healthy",thrift_success="true"} 0.0
thrift_server_latency_seconds_bucket{le="0.009765625",thrift_method="is_healthy",thrift_success="true"} 0.0
thrift_server_latency_seconds_bucket{le="0.0244140625",thrift_method="is_healthy",thrift_success="true"} 0.0
thrift_server_latency_seconds_bucket{le="0.06103515625",thrift_method="is_healthy",thrift_success="true"} 1.0
thrift_server_latency_seconds_bucket{le="0.152587890625",thrift_method="is_healthy",thrift_success="true"} 1.0
thrift_server_latency_seconds_bucket{le="0.3814697265625",thrift_method="is_healthy",thrift_success="true"} 1.0
thrift_server_latency_seconds_bucket{le="0.95367431640625",thrift_method="is_healthy",thrift_success="true"} 1.0
thrift_server_latency_seconds_bucket{le="2.384185791015625",thrift_method="is_healthy",thrift_success="true"} 1.0
thrift_server_latency_seconds_bucket{le="5.9604644775390625",thrift_method="is_healthy",thrift_success="true"} 1.0
thrift_server_latency_seconds_bucket{le="14.901161193847656",thrift_method="is_healthy",thrift_success="true"} 1.0
thrift_server_latency_seconds_bucket{le="+Inf",thrift_method="is_healthy",thrift_success="true"} 1.0
thrift_server_latency_seconds_count{thrift_method="is_healthy",thrift_success="true"} 1.0
# HELP thrift_server_requests_total Multiprocess metric
# TYPE thrift_server_requests_total counter
thrift_server_requests_total{thrift_baseplate_status="",thrift_baseplate_status_code="",thrift_exception_type="",thrift_method="is_healthy",thrift_success="true"} 1.0

It appears the HELP text is clobbered in multiprocess mode; see the linked issue.
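
For reference, a rough sketch of the config and environment setup from steps 3-5. The metrics.tagging value syntax, the commented-out namespace value, and the /tmp path are placeholders; the exact .ini layout comes from the cookiecutter template and may differ per service:

# relevant lines in example.ini (only one of the two metrics keys is needed)
metrics.tagging = true
# metrics.namespace = my_service

# environment and commands
export PROMETHEUS_MULTIPROC_DIR=/tmp/prom-multiproc
make thrift
baseplate-serve example.ini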

@JessicaGreben requested a review from a team as a code owner March 28, 2022 17:57
@MelissaCole requested a review from nsheaps March 28, 2022 18:04
start_prometheus_exporter()
from baseplate.server.prometheus import start_prometheus_exporter

start_prometheus_exporter()
Contributor

I'm not totally sold on requiring prometheus on everything that uses baseplate libs. Was this discussed somewhere?

Contributor Author

@JessicaGreben Mar 30, 2022

Since the company is migrating from Wavefront to Prometheus, we want to add the same functionality that exists for the statsd exporter, which ships metrics to Wavefront. The baseplate.spec documents what the baseplate frameworks should implement, and it has a section for Prometheus metrics: https://github.snooguts.net/reddit/baseplate.spec/blob/master/component-apis/prom-metrics.md

Contributor

[discussed offline] Let's swap this back to the try block.

Contributor Author

@nsheaps A follow-up from our previous discussions about replacing this try/except block with a check of the config for whether metrics are enabled:

It turns out there are two different ways to enable the current statsd metrics (and you cannot set both at the same time):

  1. metrics.tagging enables the tagged metrics (described here)
  2. metrics.namespace enables non-tagged metrics (described here).

Searching in Sourcegraph, it looks like there are approx. 82 repos that set metrics.namespace and 20 repos that set metrics.tagging. For reference, here are the Sourcegraph queries I ran:

repo:reddit-service-* repo:contains.file(requirements.txt) file:ini metrics.tagging select:repo
repo:reddit-service-* repo:contains.file(requirements.txt) file:ini metrics.namespace select:repo

Since the goal is to enable Prometheus metrics alongside the current statsd metrics (until Wavefront is turned off), we need to enable Prometheus when either metrics.tagging or metrics.namespace is set.

Another consideration is creating a new config option specifically for Prometheus. It would be nice to have a separate config for the Prometheus metrics to keep it untangled from the current metrics config options; however, there would then be no way to turn on Prometheus automatically without updating the .ini file, which would make upgrades not backwards compatible. But I'm open to a discussion either way.
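
To make that concrete, a minimal sketch of the kind of check described above (the helper name and the shape of the parsed config dict are assumptions, not baseplate's actual API):

def should_export_prometheus_metrics(raw_config: dict) -> bool:
    # Hypothetical helper: export Prometheus metrics whenever either of the
    # existing statsd-enabling keys is present, so Prometheus runs alongside
    # statsd without requiring any .ini changes on upgrade.
    return "metrics.tagging" in raw_config or "metrics.namespace" in raw_config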

Contributor

Thanks for all the info!

I think the way you have it implemented is fine, and easiest to upgrade to without requiring a change to configs.

I do see at least one project that actually has both of those configs set, with a difference between development and production.

I think a modified config section might be warranted in a future version, addressed separately, since the configurations for both metrics.namespace and metrics.tagging are fairly specific about where to send the metrics, which doesn't map well to Prometheus' poll-based model. We can definitely address the implementation of that separately, but I'd picture the config being somewhat pluggable, such as the example below. The reason this didn't come up is that the Prometheus observer doesn't actually have a config at the moment, though I could imagine adding things like TLS configuration. Rolling this out later with a batch change can be done more easily without blocking this progress.

metrics.enabled = true # used to be metrics.tagging, more understandable
metrics.whitelist = success, error, endpoint, client

# statsd/telegraf config
metrics.statsd.enabled = true
metrics.statsd.namespace = mine
metrics.statsd.endpoint = telegraf:8125
metrics.statsd.swallow_network_errors = true

# prometheus config
metrics.prometheus.enabled = true

Contributor

@nsheaps left a comment

Since this is one of my first baseplate PRs, can we hop on a call and go over it together? Sorry for the delay

@JessicaGreben
Contributor Author

@nsheaps Thanks for the help with the review. Yes jumping on a call would be great. I will DM you to set it up.

@@ -203,7 +203,7 @@ def _call_thrift_method(self: Any, *args: Any, **kwargs: Any) -> Any:
except Error as exc:
# a 5xx error is an unexpected exception but not 5xx are
# not.
Collaborator

🙃

baseplate/frameworks/thrift/__init__.py (outdated comment, resolved)
def protocol(self) -> str:
    return self.tags.get("protocol", "unknown")

def set_metrics_by_protocol(self) -> None:
Collaborator

I've been staring myself a bit blind trying to figure this out, so someone more familiar with baseplate should verify this. I think we're breaking the abstraction here. We should implement a Prometheus Span observer, but it should only be used for shipping the results of span traces to prometheus. The actual metrics specific to thrift or http should either be implemented as spans in the server (e.g. making a child span in the thrift server when handling a request), or - my favorite - just directly in the thrift or http server.

Contributor Author

I like this idea. I would love to implement this differently, and I like the idea of doing the metrics directly in the server. Let's discuss more.
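
For illustration, a rough sketch of what "doing the metrics directly in the server" could look like with prometheus_client. The metric names mirror the output in the PR description, but the label set is simplified and the wrapper function is hypothetical, not what this PR implements:

import time

from prometheus_client import Counter, Histogram

THRIFT_SERVER_LATENCY = Histogram(
    "thrift_server_latency_seconds",
    "Latency of Thrift server method calls",
    ["thrift_method", "thrift_success"],
)
THRIFT_SERVER_REQUESTS = Counter(
    "thrift_server_requests_total",
    "Total Thrift server method calls",
    ["thrift_method", "thrift_success"],
)

def observe_thrift_call(method_name, fn, *args, **kwargs):
    # Hypothetical wrapper the Thrift server could apply around each handler call.
    start = time.perf_counter()
    success = "true"
    try:
        return fn(*args, **kwargs)
    except Exception:
        success = "false"
        raise
    finally:
        elapsed = time.perf_counter() - start
        THRIFT_SERVER_LATENCY.labels(
            thrift_method=method_name, thrift_success=success
        ).observe(elapsed)
        THRIFT_SERVER_REQUESTS.labels(
            thrift_method=method_name, thrift_success=success
        ).inc()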

@@ -237,6 +237,7 @@ def _on_new_request(self, event: pyramid.events.ContextFound) -> None:
name=request.matched_route.name,
trace_info=trace_info,
)
span.set_tag("protocol", "http")
Contributor

@nsheaps Mar 31, 2022

🔕 Does this need an https distinction?

Contributor

Answer: no

Contributor

@nsheaps left a comment

Spoke offline and got a full rundown of how this is working, future expectations and the like.

A few points to document:

  1. For reddit's purposes, this is more than fine, as we can make a solid call that "we support Thrift, HTTP, and gRPC", but the mechanism for "if span protocol is HTTP" could likely be improved if it needed to be more flexible.
  2. Similarly, the span observers called from the switch statement are explicit due to the slight disparities in metric names between the traced protocols. It would be great to have a generic factory where you pass in the protocol and get an observer made for you (or a generic one that looks at the protocol and adjusts the metric names as necessary); see the sketch after this list.
  3. Splitting out the 4 implementation tasks was very helpful for review, thank you.
  4. gRPC is coming, but not in this PR, even in a "not implemented" capacity (like how HTTP is handled here).
  5. Some improvements we discussed are likely not worth the effort for perfection, considering the long-term trend towards Go.
  6. Because of the explicit requirement for metrics collection, we finally decided that prometheus-client is required if metrics are turned on, rather than being an optional import that logs a warning.
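
A rough sketch of the "generic factory" idea from point 2, using prometheus_client (the function name, caching scheme, and label names are illustrative, not part of this PR):

from prometheus_client import Counter, Histogram

_PROTOCOL_METRICS = {}

def get_protocol_metrics(protocol: str):
    # Hypothetical factory: build (and cache) one latency/request metric pair
    # per protocol so each per-protocol observer does not need hand-written
    # metric definitions.
    if protocol not in _PROTOCOL_METRICS:
        _PROTOCOL_METRICS[protocol] = (
            Histogram(
                f"{protocol}_server_latency_seconds",
                f"Latency of {protocol} server requests",
                [f"{protocol}_method", f"{protocol}_success"],
            ),
            Counter(
                f"{protocol}_server_requests_total",
                f"Total {protocol} server requests",
                [f"{protocol}_method", f"{protocol}_success"],
            ),
        )
    return _PROTOCOL_METRICS[protocol]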

# mark 5xx errors as failures since those are still "unexpected"
if exc.code // 100 == 5:
if 500 <= exc.code < 600:
Contributor

🔕 what other errors come through this code path? Should they maybe be split out into specific exceptions?

Contributor Author

sys.exc_info() will indicate what the exception is. I'm not sure what you mean by "split out"?

except TException:
except TException as e:
span.set_tag("exception_type", type(e).__name__)
span.set_tag("success", "false")
Contributor

🔕 should there be a span.set_tag("success", "true") somewhere?

Contributor Author

When span.finish() gets called, it calls the on_finish method in the Prometheus observer, and span.set_tag("success", "true") is set there if there are no exceptions.
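
A rough sketch of the pattern being described, for anyone following along (the class and method names are illustrative, not necessarily the exact baseplate observer interface):

class PrometheusServerSpanObserver:
    # Illustrative only: shows how "success" can be derived once at span
    # finish time instead of being tagged on every code path.
    def __init__(self):
        self.tags = {}

    def on_set_tag(self, key, value):
        self.tags[key] = value

    def on_finish(self, exc_info=None):
        # An explicit set_tag("success", "false") from an except block wins;
        # otherwise success is inferred from whether an exception reached finish.
        self.tags.setdefault("success", "false" if exc_info else "true")
        # ...update the Prometheus counters/histograms from self.tags here...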

@JessicaGreben merged commit 192bd4c into reddit:develop Apr 12, 2022
@JessicaGreben deleted the prom-metrics-thrift-server branch April 12, 2022 15:50