Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Expose Rest API to return metrics in prometheus format #21599

Closed
wants to merge 6 commits into from

Conversation

karteekmurthys
Copy link
Contributor

@karteekmurthys karteekmurthys commented Dec 22, 2023

Description

This PR is still a work-in-progress.
There are 2 approaches implemented in this PR:

  1. This PR implements the BaseStatsReporter interface as StatsReporterImpl. The current design of StatsReporterImpl holds the metrics in-memory and exposes a method getMetrics(MetricsSerializer) that serializes the metrics with the serializer passed down to the function call. MetricsSerializer is an interface that users can implement to customize metric serialization. This approach doesn't support Histogram

  2. Implements PrometheusReporter using the prometheus-cpp library. The library has interfaces to create all metric types (COUNTER, GAUGE, Histograms and Summaries). The library exposes a Registry which can be used to hold the references to the counters. Also, a textSerializer interface that can convert the metrics into Prometheus Data Model.

We may chose one of the above or both.

To the outside world prestissimo exposes a rest API /v1/info/health/metrics, which in turn uses StatsReporterImpl::getMetrics(MetricsSerializer).

The Prometheus server must be configured to regularly probe this Prestissimo rest API to collect metrics.

Test Plan

Added a unit test class for StatsReporterImpl and also PrometheusReporter.
Also, Tested REST API locally and results are attached here.
prometheus-format-metrics.txt using prometheus-cpp library

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.
== NO RELEASE NOTE ==

@karteekmurthys karteekmurthys requested a review from a team as a code owner December 22, 2023 23:33
Copy link

linux-foundation-easycla bot commented Dec 22, 2023

CLA Missing ID CLA Not Signed

@tdcmeehan
Copy link
Contributor

Ahana is not a great name for this. How about PrometheusStatsReporter?

@@ -994,6 +1003,11 @@ void PrestoServer::reportServerInfo(proxygen::ResponseHandler* downstream) {
http::sendOkResponse(downstream, json(serverInfo));
}

void PrestoServer::reportHealthMetrics(proxygen::ResponseHandler* downstream) {
auto reporter = std::dynamic_pointer_cast<ahana::PrometheusStatsReporter>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the ahana namespace.


namespace ahana {

class AhanaStatsReporterTest : public testing::Test {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the use of the ahana namespace and use PrometheusStatsReporterTest as the name of the test instead.


#include "PrometheusStatsReporter.h"

namespace ahana {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class should be under facebook::presto namespace.

const std::vector<int32_t>& /* pcts */) const override {}

void addMetricValue(const std::string& key, size_t value = 1)
const override{};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this an empty method ? We should do the same logic as the other addMetricValue methods.

@karteekmurthys karteekmurthys force-pushed the prestissimo-metrics branch 2 times, most recently from bdff158 to 13e58ac Compare December 27, 2023 07:34
@karteekmurthys karteekmurthys changed the title Expose Rest API to return metrics in prometheus format [WIP] Expose Rest API to return metrics in prometheus format Dec 27, 2023
@karteekmurthys karteekmurthys marked this pull request as draft December 27, 2023 07:55
@aditi-pandit
Copy link
Contributor

@karteekmurthys : Please add some documentation. The doc should have information about:

  • Background on Prometheus metrics. What do gauge and counter metrics mean ? Add some screenshots to help the reviewer understand.
  • How do Prestissimo metrics map to Prometheus metrics ? Give some examples of metrics that are mapped.
  • How to setup this ?

In addition, I feel that Prometheus StatsReporter should be added based on some config (and not default).

@steveburnett
Copy link
Contributor

@karteekmurthys let me know when you have added the documentation and I'll review it then. If I can help you get the docs drafted into review-ready status, just ask!

@karteekmurthys
Copy link
Contributor Author

@karteekmurthys let me know when you have added the documentation and I'll review it then. If I can help you get the docs drafted into review-ready status, just ask!

@steveburnett thanks! I am currently drafting it. Will reach out to you soon.

cc: @aditi-pandit


private:
/// Mapping of registered stats key to StatType.
mutable std::unordered_map<std::string, facebook::velox::StatType>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you considered using https://github.com/jupp0r/prometheus-cpp or some opensource implementation for stats collection? If not any specific reason?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The data model is not too complex to rely on a library at this point. We want to keep external dependencies minimal as well. If we discover that writing serialization ourselves is getting complex we may consider using the library.
The library has additional support to create exporter as well as a push support to a gateway.

void RegisterCollectable(const std::weak_ptr<Collectable>& collectable,
                           const std::string& uri = std::string("/metrics"));

  void RegisterAuth(
      std::function<bool(const std::string&, const std::string&)> authCB,
      const std::string& realm = "Prometheus-cpp Exporter",
      const std::string& uri = std::string("/metrics"));

I don't think we need these capabilities at this point.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the other hand, it supports all metrics, is production tested and widely used (though like an unofficial library)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have created a draft PR - #21753, we believe that using the library is the fastest way to production since its production tested. Not sure how to integrate

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using the library seems better if it is production-ready and widely used.
This can be an optional dependency that we can turn on/off at compile time.
I can help with the dependency management.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@karteekmurthys I believe Jay already has a PR in progress here #21753
We should base our work on that.
You can help with the dependency management in that PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jaystarshot I took a shot at using prometheus-cpp in this PR itself. I have added a compile time settings to include prometheus-directory. Added some unit tests. I will further modify the circle ci configs to run unit tests for this change. Feel free to use this PR for your testing. I will be opening an RFC soon to get this merged with Presto.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you tried this in local? I don't think this will work since there is no exposer defined. My PR #21753 shows how to do that. Can you share the RFC with me too? I can also input from my end thanks

Copy link
Contributor Author

@karteekmurthys karteekmurthys Jan 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need an Exposer. We already expose a REST API, check PrestoServer.cpp changes in this PR.
You can configure your Prometheus server to scrape the enpoint: /v1/info/health/metrics exposed by the worker.
Or you can configure whatever intermediary process to scrape from worker endpoint.
Missed adding the test results in previous comment, here it is: prometheus-format-metrics.txt

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see thanks! I think having an exposer is better since it separates the http server and makes it configurable based on the number of threads etc. In future we can also move this easily to a sidecar. We can sync offline regarding the RFC too if you have a draft prepared already

@@ -0,0 +1,52 @@
ARG PRESTO_VERSION
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's move the Docker/Jenkins changes to its own PR.

@@ -210,6 +211,10 @@ void PrestoServer::run() {
exit(EXIT_FAILURE);
}

if (systemConfig->enableRuntimeStatsCollection()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@karteekmurthys Karteek, let me add this config in a separte PR. Inside Meta there is a small wrapper on PrestoServer that does following:

if (gflag enabled):
   initialize stats reporting
initialize presto server

I think we can use the config and remove the internal flag altogether. Let me introduce this flag in a PR and modify internal code. You can rebase on top of that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@karteekmurthys Here is the small PR
#21745
Once we merge it, I will refactor internal logic to use the cofnig property.


private:
/// Mapping of registered stats key to StatType.
mutable std::unordered_map<std::string, facebook::velox::StatType>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use mutable folly::ConcurrentHashMap

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we normally use folly::Synchronized<> to do this. Lmk revisit this after finishing the RFC. I wanted to unblock any testing you wanted to do, so focussed on getting a working version. I need to clean up the tests as well.

/// A mapping from stats key of type COUNT to value.
mutable std::unordered_map<std::string, size_t> metricsMap_;
// Mutex to control access to registeredStats_ and metricMap_ members.
mutable std::mutex mutex_;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not need mutex

proxygen::HTTPMessage* /*message*/,
const std::vector<std::unique_ptr<folly::IOBuf>>& /*body*/,
proxygen::ResponseHandler* downstream) {
server->reportHealthMetrics(downstream);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this? If we just define an exposer that will create another http server on /metrics or /v1/metrics? We can also define the number of threads to be used by the new http server

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is exactly what we are trying to avoid. We didn't find a strong need to start a new server sharing space with our worker containers. Reduces the overhead of maintaining and launching another process. Why do you want multiple threads to be spawned by the exposer? are you expecting high traffic.
CMIW, prometheus server is configured to periodically call the scrape endpoint. So, we can expect 1 HTTP request at X seconds interval. If you are expecting huge traffic, then it is not recommended to share space with worker instance.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think isolation is also one point, we do not want to degrade the presto process as much as we can

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amitkdutta : What approach did Meta take for this ? Does your metric collection use any endpoint in Presto worker process itself, or did you start another server ?

@karteekmurthys
Copy link
Contributor Author

Moved it here: #22360 (review)

@ethanyzhang ethanyzhang deleted the prestissimo-metrics branch July 8, 2024 02:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants