Add default Prometheus metrics for Thrift server #657

Merged
10 commits merged into reddit:develop from prom-metrics-thrift-server on Apr 12, 2022

Conversation

JessicaGreben
Contributor

@JessicaGreben commented Mar 28, 2022

The original PR was getting too big, so this PR splits out one part. There will be approximately 4 PRs split out in total:

  • 1st PR (this PR) adds default Prometheus metrics for the Thrift server
  • 2nd PR will do the same for the Thrift client
  • 3rd PR will add default Prometheus metrics for the HTTP server
  • 4th PR will do the same for the HTTP client

Background:
We are adding support for exporting Prometheus metrics from baseplate.py services. As of baseplate.py v2.3.0, baseplate-serve also runs a server that exports Prometheus metrics at the /metrics endpoint on port 6060.

This PR adds the default Prometheus metrics for Thrift servers. The default Thrift metrics to be exported are defined in the baseplate.spec.

Testing:
I made a test Thrift app from baseplate-cookiecutter and tested this new Prometheus code. Here are the steps I used to test:

  1. Create a Thrift service with baseplate-cookiecutter (make sure it uses baseplate v2.3.0 or greater) so that Prometheus metrics are exported on port 6060 by default.
  2. Make sure prometheus-client==0.12.0 is in the requirements.txt file.
  3. Set the PROMETHEUS_MULTIPROC_DIR environment variable to a directory where metrics from multiprocess Python can be written (this is already included in the prod Dockerfile).
  4. Set either metrics.tagging or metrics.namespace in the service's .ini config file (see the config sketch after the metrics output below).
  5. Build the Thrift code with make thrift, then run baseplate-serve example.ini.
  6. Use the test client to hit the is_healthy endpoint.
  7. Check the Prometheus metrics that were created:
$ curl localhost:6060/metrics
# HELP thrift_server_active_requests Multiprocess metric
# TYPE thrift_server_active_requests gauge
thrift_server_active_requests{pid="23318",thrift_method="is_healthy"} 0.0
# HELP thrift_server_latency_seconds Multiprocess metric
# TYPE thrift_server_latency_seconds histogram
thrift_server_latency_seconds_sum{thrift_method="is_healthy",thrift_success="true"} 0.041624628
thrift_server_latency_seconds_bucket{le="0.0001",thrift_method="is_healthy",thrift_success="true"} 0.0
thrift_server_latency_seconds_bucket{le="0.00025",thrift_method="is_healthy",thrift_success="true"} 0.0
thrift_server_latency_seconds_bucket{le="0.000625",thrift_method="is_healthy",thrift_success="true"} 0.0
thrift_server_latency_seconds_bucket{le="0.0015625",thrift_method="is_healthy",thrift_success="true"} 0.0
thrift_server_latency_seconds_bucket{le="0.00390625",thrift_method="is_healthy",thrift_success="true"} 0.0
thrift_server_latency_seconds_bucket{le="0.009765625",thrift_method="is_healthy",thrift_success="true"} 0.0
thrift_server_latency_seconds_bucket{le="0.0244140625",thrift_method="is_healthy",thrift_success="true"} 0.0
thrift_server_latency_seconds_bucket{le="0.06103515625",thrift_method="is_healthy",thrift_success="true"} 1.0
thrift_server_latency_seconds_bucket{le="0.152587890625",thrift_method="is_healthy",thrift_success="true"} 1.0
thrift_server_latency_seconds_bucket{le="0.3814697265625",thrift_method="is_healthy",thrift_success="true"} 1.0
thrift_server_latency_seconds_bucket{le="0.95367431640625",thrift_method="is_healthy",thrift_success="true"} 1.0
thrift_server_latency_seconds_bucket{le="2.384185791015625",thrift_method="is_healthy",thrift_success="true"} 1.0
thrift_server_latency_seconds_bucket{le="5.9604644775390625",thrift_method="is_healthy",thrift_success="true"} 1.0
thrift_server_latency_seconds_bucket{le="14.901161193847656",thrift_method="is_healthy",thrift_success="true"} 1.0
thrift_server_latency_seconds_bucket{le="+Inf",thrift_method="is_healthy",thrift_success="true"} 1.0
thrift_server_latency_seconds_count{thrift_method="is_healthy",thrift_success="true"} 1.0
# HELP thrift_server_requests_total Multiprocess metric
# TYPE thrift_server_requests_total counter
thrift_server_requests_total{thrift_baseplate_status="",thrift_baseplate_status_code="",thrift_exception_type="",thrift_method="is_healthy",thrift_success="true"} 1.0

It appears the HELP text is clobbered in multiprocess mode; see the linked issue.
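
For reference, a rough sketch of the config and environment setup from steps 3-5. The metrics.tagging value syntax, the commented-out namespace value, and the /tmp path are placeholders; the exact .ini layout comes from the cookiecutter template and may differ per service:

# relevant lines in example.ini (only one of the two metrics keys is needed)
metrics.tagging = true
# metrics.namespace = my_service

# environment and commands
export PROMETHEUS_MULTIPROC_DIR=/tmp/prom-multiproc
make thrift
baseplate-serve example.ini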

@JessicaGreben requested a review from a team as a code owner March 28, 2022 17:57
@MelissaCole requested a review from nsheaps March 28, 2022 18:04
start_prometheus_exporter()
from baseplate.server.prometheus import start_prometheus_exporter

start_prometheus_exporter()
Contributor

I'm not totally sold on requiring prometheus on everything that uses baseplate libs. Was this discussed somewhere?

Contributor Author

@JessicaGreben Mar 30, 2022

Since the company is migrating from Wavefront to Prometheus, we want to add the same functionality that exists for the statsd exporter, which ships metrics to Wavefront. The baseplate.spec documents what the baseplate frameworks should implement, and it has a section for Prometheus metrics: https://github.snooguts.net/reddit/baseplate.spec/blob/master/component-apis/prom-metrics.md

Contributor

[discussed offline] Let's swap this back to the try block.

Contributor Author

@nsheaps A follow-up from our previous discussions about replacing this try/except block with a check of the config for whether metrics are enabled:

It turns out there are two different ways to enable the current statsd metrics (and you cannot set both at the same time):

  1. metrics.tagging enables the tagged metrics (described here)
  2. metrics.namespace enables non-tagged metrics (described here).

Searching in Sourcegraph, it looks like there are approx. 82 repos that set metrics.namespace and 20 repos that set metrics.tagging. For reference, here are the Sourcegraph queries I ran:

repo:reddit-service-* repo:contains.file(requirements.txt) file:ini metrics.tagging select:repo
repo:reddit-service-* repo:contains.file(requirements.txt) file:ini metrics.namespace select:repo

Since the goal is to enable Prometheus metrics alongside the current statsd metrics (until Wavefront is turned off), we need to enable Prometheus when either metrics.tagging or metrics.namespace is set.

Another consideration is creating a new config option specifically for Prometheus. It would be nice to have a separate config for the Prometheus metrics to keep it untangled from the current metrics config options; however, there would then be no way to turn on Prometheus automatically without updating the .ini file, which would make upgrades not backwards compatible. But I'm open to a discussion either way.
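
To make that concrete, a minimal sketch of the kind of check described above (the helper name and the shape of the parsed config dict are assumptions, not baseplate's actual API):

def should_export_prometheus_metrics(raw_config: dict) -> bool:
    # Hypothetical helper: export Prometheus metrics whenever either of the
    # existing statsd-enabling keys is present, so Prometheus runs alongside
    # statsd without requiring any .ini changes on upgrade.
    return "metrics.tagging" in raw_config or "metrics.namespace" in raw_config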

Contributor

Thanks for all the info!

I think the way you have it implemented is fine, and easiest to upgrade to without requiring a change to configs.

I do see at least one project that actually has both of those configs set, with a difference between development and production.

I think a modified config section might be warranted in a future version, addressed separately, since the configurations for both metrics.namespace and metrics.tagging are fairly specific about where to send the metrics, which doesn't map well to Prometheus' poll-based model. We can definitely address the implementation of that separately, but I'd picture the config being somewhat pluggable, such as the example below. The reason this didn't come up is that the Prometheus observer doesn't actually have a config at the moment, though I could imagine adding things like TLS configuration. Rolling this out later with a batch change can be done more easily without blocking this progress.

metrics.enabled = true # used to be metrics.tagging, more understandable
metrics.whitelist = success, error, endpoint, client

# statsd/telegraf config
metrics.statsd.enabled = true
metrics.statsd.namespace = mine
metrics.statsd.endpoint = telegraf:8125
metrics.statsd.swallow_network_errors = true

# prometheus config
metrics.prometheus.enabled = true

Contributor

@nsheaps left a comment

Since this is one of my first baseplate PRs, can we hop on a call and go over it together? Sorry for the delay

@JessicaGreben
Contributor Author

@nsheaps Thanks for the help with the review. Yes jumping on a call would be great. I will DM you to set it up.

@@ -203,7 +203,7 @@ def _call_thrift_method(self: Any, *args: Any, **kwargs: Any) -> Any:
except Error as exc:
# a 5xx error is an unexpected exception but not 5xx are
# not.
Collaborator

🙃

baseplate/frameworks/thrift/__init__.py (outdated comment, resolved)
def protocol(self) -> str:
    return self.tags.get("protocol", "unknown")

def set_metrics_by_protocol(self) -> None:
Collaborator

I've been staring myself a bit blind trying to figure this out, so someone more familiar with baseplate should verify this. I think we're breaking the abstraction here. We should implement a Prometheus Span observer, but it should only be used for shipping the results of span traces to prometheus. The actual metrics specific to thrift or http should either be implemented as spans in the server (e.g. making a child span in the thrift server when handling a request), or - my favorite - just directly in the thrift or http server.

Contributor Author

I like this idea. I would love to implement this differently, and I like the idea of doing the metrics directly in the server. Let's discuss more.
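
For illustration, a rough sketch of what "doing the metrics directly in the server" could look like with prometheus_client. The metric names mirror the output in the PR description, but the label set is simplified and the wrapper function is hypothetical, not what this PR implements:

import time

from prometheus_client import Counter, Histogram

THRIFT_SERVER_LATENCY = Histogram(
    "thrift_server_latency_seconds",
    "Latency of Thrift server method calls",
    ["thrift_method", "thrift_success"],
)
THRIFT_SERVER_REQUESTS = Counter(
    "thrift_server_requests_total",
    "Total Thrift server method calls",
    ["thrift_method", "thrift_success"],
)

def observe_thrift_call(method_name, fn, *args, **kwargs):
    # Hypothetical wrapper the Thrift server could apply around each handler call.
    start = time.perf_counter()
    success = "true"
    try:
        return fn(*args, **kwargs)
    except Exception:
        success = "false"
        raise
    finally:
        elapsed = time.perf_counter() - start
        THRIFT_SERVER_LATENCY.labels(
            thrift_method=method_name, thrift_success=success
        ).observe(elapsed)
        THRIFT_SERVER_REQUESTS.labels(
            thrift_method=method_name, thrift_success=success
        ).inc()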

@@ -237,6 +237,7 @@ def _on_new_request(self, event: pyramid.events.ContextFound) -> None:
name=request.matched_route.name,
trace_info=trace_info,
)
span.set_tag("protocol", "http")
Contributor

@nsheaps Mar 31, 2022

🔕 Does this need an https distinction?

Contributor

Answer: no

Contributor

@nsheaps left a comment

Spoke offline and got a full rundown of how this is working, future expectations and the like.

A few points to document:

  1. For reddit's purposes, this is more than fine, as we can make a solid call that "we support Thrift, HTTP, and gRPC", but the mechanism for "if span protocol is HTTP" could likely be improved if it needed to be more flexible.
  2. Similarly, the span observers called from the switch statement are explicit due to the slight disparities in metric names between the traced protocols. It would be great to have a generic factory where you pass in the protocol and get an observer made for you (or a generic one that looks at the protocol and adjusts the metric names as necessary); see the sketch after this list.
  3. Splitting out the 4 implementation tasks was very helpful for review, thank you.
  4. gRPC is coming, but not in this PR, even in a "not implemented" capacity (like how HTTP is handled here).
  5. Some improvements we discussed are likely not worth the effort for perfection, considering the long-term trend towards Go.
  6. Because of the explicit requirement for metrics collection, we finally decided that prometheus-client is required if metrics are turned on, rather than being an optional import that logs a warning.
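
A rough sketch of the "generic factory" idea from point 2, using prometheus_client (the function name, caching scheme, and label names are illustrative, not part of this PR):

from prometheus_client import Counter, Histogram

_PROTOCOL_METRICS = {}

def get_protocol_metrics(protocol: str):
    # Hypothetical factory: build (and cache) one latency/request metric pair
    # per protocol so each per-protocol observer does not need hand-written
    # metric definitions.
    if protocol not in _PROTOCOL_METRICS:
        _PROTOCOL_METRICS[protocol] = (
            Histogram(
                f"{protocol}_server_latency_seconds",
                f"Latency of {protocol} server requests",
                [f"{protocol}_method", f"{protocol}_success"],
            ),
            Counter(
                f"{protocol}_server_requests_total",
                f"Total {protocol} server requests",
                [f"{protocol}_method", f"{protocol}_success"],
            ),
        )
    return _PROTOCOL_METRICS[protocol]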

# mark 5xx errors as failures since those are still "unexpected"
if exc.code // 100 == 5:
if 500 <= exc.code < 600:
Contributor

🔕 what other errors come through this code path? Should they maybe be split out into specific exceptions?

Contributor Author

sys.exc_info() will indicate what the exception is. I'm not sure what you mean by "split out"?

except TException:
except TException as e:
span.set_tag("exception_type", type(e).__name__)
span.set_tag("success", "false")
Contributor

🔕 should there be a span.set_tag("success", "true") somewhere?

Contributor Author

When span.finish() gets called, it calls the on_finish method in the Prometheus observer, and span.set_tag("success", "true") is set there if there are no exceptions.
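
A rough sketch of the pattern being described, for anyone following along (the class and method names are illustrative, not necessarily the exact baseplate observer interface):

class PrometheusServerSpanObserver:
    # Illustrative only: shows how "success" can be derived once at span
    # finish time instead of being tagged on every code path.
    def __init__(self):
        self.tags = {}

    def on_set_tag(self, key, value):
        self.tags[key] = value

    def on_finish(self, exc_info=None):
        # An explicit set_tag("success", "false") from an except block wins;
        # otherwise success is inferred from whether an exception reached finish.
        self.tags.setdefault("success", "false" if exc_info else "true")
        # ...update the Prometheus counters/histograms from self.tags here...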

@JessicaGreben merged commit 192bd4c into reddit:develop Apr 12, 2022
@JessicaGreben deleted the prom-metrics-thrift-server branch April 12, 2022 15:50