Skip to content
This repository has been archived by the owner on May 26, 2022. It is now read-only.

expose some Prometheus metrics #200

Merged
merged 2 commits into from
Apr 5, 2021
Merged

expose some Prometheus metrics #200

merged 2 commits into from
Apr 5, 2021

Conversation

marten-seemann
Copy link
Collaborator

No description provided.

@marten-seemann marten-seemann requested a review from mxinden April 2, 2021 03:35
Copy link
Member

@mxinden mxinden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great to see instrumentation for QUIC to come in the future!

now := time.Now()
c.mutex.Lock()
for _, conn := range c.conns {
if rtt, valid := conn.getSmoothedRTT(); valid {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do I understand correctly that the RTT is recorded each time Prometheus scrapes the node? Thus the metric data resolution is dependent on the scrape interval of Prometheus?

I would deem this a bit problematic for the following reasons:

  • Data exposed is dependent on external processes.
  • The way to achieve high availability is to run 2 or 3 instances of Prometheus each scraping the same set of targets (nodes). This would directly influence data resolution.
  • In case Prometheus does not scrape the node at all (e.g. due to downtime) no data is recorded in the histogram.

Would it be feasible to record the RTT on each round trip instead?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are 2 problems with recording the RTT every roundtrip:

  1. If a connection is not sending any data, it will not generate any RTT samples.
  2. If we have one connection with 50ms RTT and one connection with 500ms RTT (assuming both are sending data and both are using the same congestion window), the shorter-RTT connection will generate 10x as many RTT measurements.

What I'm trying to measure is the RTT distribution of all active connections, so we have to make sure that every connection generates the same number of measurements per scraping interval.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a connection is not sending any data, it will not generate any RTT samples.

This applies to both the status quo and my suggestion, correct?

If we have one connection with 50ms RTT and one connection with 500ms RTT (assuming both are sending data and both are using the same congestion window), the shorter-RTT connection will generate 10x as many RTT measurements.

Good point. Thanks for clarifying!

What I'm trying to measure is the RTT distribution of all active connections, so we have to make sure that every connection generates the same number of measurements per scraping interval.

Would a collection across connections every X seconds do the trick? Thus data collection would not depend on external factors (number of Prometheus instances and scrape interval) but at the same time uphold the same amount of measurements per connection.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This applies to both the status quo and my suggestion, correct?

We're saving the most recent RTT sample for every connection, so this problem is solved in the PR. We might use an RTT measurements that a few seconds old, but considering that we're sending a keep-alive PING every 15s, we're generating fresh RTT measurements quite frequently.

Would a collection across connections every X seconds do the trick? Thus data collection would not depend on external factors (number of Prometheus instances and scrape interval) but at the same time uphold the same amount of measurements per connection.

What's the advantage of that? Looping over all connections is moderately expensive, so it would be good to not do it more frequently than necessary.

Copy link
Member

@mxinden mxinden Apr 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the advantage of that?

Looping over the connections on a fixed interval would make the data returned to Prometheus independent of its scrape interval and the amount of Prometheus instances. E.g. if you have 3 Prometheus instances for high availability, you would be looping over the connections 3 times every scrape interval (e.g. minute). Instead, if you collect on a fixed interval (e.g. 1 minute) you would only loop over all collections once per minute, instead of 3 times per minute.

I think it is fine either way.


var collector *aggregatingCollector

func init() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If my Golang skills are not misleading me, all packages importing this package will now expose the metrics below, even if they just make use of some small part of the library, e.g. a helper function. Can the same be achieved with local instead of global variables? E.g. can one pass a metric registry at construction time and keep a reference to the metrics in some local struct?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the best practice here? If package wants to expose Prometheus metrics, does the user of that package call a package.InitializeMetrics() function? We could certainly do that, if that's the preferred way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the best practice here?

If I recall correctly the best practice in Golang is to use global metric variables and initialization via init like you did here. I am not in favor of this instrumentation style, due to the various issues that global state brings along. Instead I would pass a registry as an argument at object creation, e.g. NewQuicTransport. The returned QuicTransport object would then own the metrics registered with the provided Registry.

I am happy to expand on my architecture, though in the end I think you should go with whatever works best for you.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead I would pass a registry as an argument at object creation, e.g. NewQuicTransport.

What would happen if you create multiple transports then? We can only register each metric once.

If the init way is actually the common way to do this in Go, let's get this PR merged. We can always improve it later on.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would happen if you create multiple transports then? We can only register each metric once.

What I would do is have a RegisterMetrics function that takes a registry and returns a structure with all the metrics. I would then pass a clone of the latter to each transport.

If the init way is actually the common way to do this in Go, let's get this PR merged. We can always improve it later on.

That sounds good to me.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I would do is have a RegisterMetrics function that takes a registry and returns a structure with all the metrics. I would then pass a clone of the latter to each transport.

That sounds like a much cleaner setup. I'm open to making this change in the future.

tracer_metrics.go Outdated Show resolved Hide resolved
tracer_metrics.go Outdated Show resolved Hide resolved
@marten-seemann marten-seemann requested a review from mxinden April 2, 2021 12:50
@marten-seemann marten-seemann force-pushed the prometheus branch 3 times, most recently from ea3afa1 to fb1b1d8 Compare April 4, 2021 05:54
Copy link
Member

@mxinden mxinden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me from the Prometheus side. Not sure my Golang skills are good enough for a general approval. Your call.

tracer_metrics.go Outdated Show resolved Hide resolved
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants