
Implement Telemetry Metrics for Metadata Server #2616

Merged
1 commit merged into restatedev:main on Feb 5, 2025

Conversation


@muhamadazmy muhamadazmy commented Feb 3, 2025

Implement Telemetry Metrics for Metadata Server

Summary:
This pull request introduces telemetry metrics to the metadata server, enhancing monitoring and performance analysis capabilities.

Fixes #2605

@muhamadazmy muhamadazmy changed the title from "metadata server metrics" to "Implement Telemetry Metrics for Metadata Server" on Feb 3, 2025
@muhamadazmy

@tillrohrmann feel free to suggest more metrics to add that you think can be useful.

@muhamadazmy muhamadazmy marked this pull request as ready for review February 3, 2025 16:22
@tillrohrmann tillrohrmann left a comment

Thanks for adding metrics to the RaftMetadataStore @muhamadazmy. The changes look good to me. What we might also add as metrics are the applied index, committed index, snapshot size, trim index, leader, term, etc. (basically what we report in MetadataStoreSummary). This could be done as a follow-up if you want to merge this PR.
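For illustration, these summary values could be exposed as gauges with the same metrics crate already used in this PR. This is only a sketch: the metric-name constants, the helper function, and its parameters below are hypothetical, not the names that were merged.

use metrics::gauge;

// Hypothetical metric names mirroring fields reported in MetadataStoreSummary.
pub(crate) const METADATA_SERVER_APPLIED_INDEX: &str = "restate.metadata_server.applied_index";
pub(crate) const METADATA_SERVER_COMMITTED_INDEX: &str = "restate.metadata_server.committed_index";
pub(crate) const METADATA_SERVER_TERM: &str = "restate.metadata_server.term";
pub(crate) const METADATA_SERVER_SNAPSHOT_SIZE_BYTES: &str = "restate.metadata_server.snapshot_size_bytes";

// Hypothetical helper, called wherever the server already refreshes its summary.
pub(crate) fn report_summary_gauges(applied: u64, committed: u64, term: u64, snapshot_size: u64) {
    gauge!(METADATA_SERVER_APPLIED_INDEX).set(applied as f64);
    gauge!(METADATA_SERVER_COMMITTED_INDEX).set(committed as f64);
    gauge!(METADATA_SERVER_TERM).set(term as f64);
    gauge!(METADATA_SERVER_SNAPSHOT_SIZE_BYTES).set(snapshot_size as f64);
}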

@tillrohrmann

It would be great to investigate why the e2e test is failing. I don't expect that your changes are causing it but we need to harden these test failures and make sure that it is not pointing towards a real problem.

@muhamadazmy

@tillrohrmann Great! Thank you for your review. I wanted to add some of those metrics but wanted to make sure they would be useful, since they are already reported in the summary.

I will investigate why the tests are failing.

@muhamadazmy muhamadazmy force-pushed the pr2616 branch 2 times, most recently from 86d4c7d to bf1ecd6 Compare February 4, 2025 13:24
@muhamadazmy

Thank you @tillrohrmann for your review. I added more metrics as you suggested. I also added a few simple metrics to the metadata store client as well, as Ahmed suggested.

use restate_core::metadata_store::{serialize_value, Precondition};
use restate_types::config::Configuration;
use restate_types::metadata_store::keys::NODES_CONFIG_KEY;
use restate_types::nodes_config::NodesConfiguration;
use restate_types::storage::StorageCodec;
use std::ops::Deref;
use std::time::Instant;
Contributor

Can we use tokio's Instant? Thanks :)

@@ -299,28 +306,43 @@ impl MetadataStoreClient {
        &self,
        key: ByteString,
    ) -> Result<Option<T>, ReadError> {
        let value = self.inner.get(key).await?;
        let start_time = Instant::now();
        counter!(METADATA_CLIENT_GET_TOTAL, "key" => key.to_string()).increment(1);
Contributor

I'm a little worried about the cardinality of this counter. We have a key per partition for epoch metadata.

Contributor Author

Yeah, I wasn't sure about this, to be honest. On the other hand, it can also help debug irregular access patterns to a certain key.

What about only using the prefix for the metrics when the key is prefixed with pp_epoch?
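For illustration, a sketch of that prefix idea, assuming the per-partition epoch keys share the pp_epoch prefix and that the key is available as a string; the helper name is hypothetical.

/// Collapse high-cardinality keys into a bounded label value:
/// all per-partition epoch keys map to a single "pp_epoch" label.
fn key_label(key: &str) -> &str {
    if key.starts_with("pp_epoch") {
        "pp_epoch"
    } else {
        key
    }
}

The counter call would then become counter!(METADATA_CLIENT_GET_TOTAL, "key" => key_label(&key).to_string()).increment(1).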

Contributor

It's okay, let's leave the key for counters, but a histogram per key is not great.

@@ -26,6 +32,7 @@ use restate_types::storage::{StorageCodec, StorageDecode, StorageEncode, Storage
use restate_types::{flexbuffers_storage_encode_decode, Version, Versioned};
use std::future::Future;
use std::sync::Arc;
use std::time::Instant;
Contributor

In general, it's better to use tokio's Instant, but that's not an issue in this case. Just a reminder to form the habit.

Comment on lines 339 to 343
        counter!(METADATA_CLIENT_GET_VERSION_TOTAL, "key" => key.to_string()).increment(1);
        let result = self.inner.get_version(key.clone()).await;

        histogram!(METADATA_CLIENT_GET_VERSION_DURATION, "key" => key.to_string())
            .record(start_time.elapsed());
Contributor

That's a lot of histograms (cardinality is a problem here as well)

Contributor Author

Yeah, indeed. How do you feel about the solution suggested above? Another option, of course, is to drop the key label completely.

@@ -53,6 +60,9 @@ impl MetadataStoreHandler {
#[async_trait]
impl MetadataServerSvc for MetadataStoreHandler {
    async fn get(&self, request: Request<GetRequest>) -> Result<Response<GetResponse>, Status> {
        counter!(METADATA_SERVER_GET_TOTAL).increment(1);
Contributor

It would be nice if the counter had a dimension for status (success vs. failure). Imagine wanting to monitor the success rate of a given metadata server; how would we do that?

This comment applies to all fallible operations.
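For example, a minimal sketch of what that could look like for the server-side get counter, reusing the METADATA_SERVER_GET_TOTAL constant from the diff above; the helper function and the status label values are hypothetical, not the merged code.

use metrics::counter;

// One metric name, with the outcome as a label, so the success rate can be
// computed at query time from a single series.
fn record_get_outcome(is_ok: bool) {
    let status = if is_ok { "success" } else { "failure" };
    counter!(METADATA_SERVER_GET_TOTAL, "status" => status).increment(1);
}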

Contributor Author

Yeah, that's definitely a good idea. I will update it.


// Raft specific metrics
pub(crate) const METADATA_SERVER_RAFT_SENT_MESSAGE_TOTAL: &str =
    "restate.metadata_server.raft.sent_messages.total";
Contributor

Do we need the raft part in this?

Contributor Author

It's a Raft-specific metric. But maybe it should be renamed to embedded or replicated instead, since Raft is just an implementation detail.

@muhamadazmy

Thanks @AhmedSoliman for your review. I have processed all your comments. I also dropped the key label from the client metrics, and recorded separate counters for success and error, on both the client and the server.

@AhmedSoliman

Do we have per-key histograms on the server metrics?

Comment on lines 329 to 330
@success=METADATA_CLIENT_GET_TOTAL_SUCCESS,
@error=METADATA_CLIENT_GET_TOTAL_ERROR,
@duration=METADATA_CLIENT_GET_DURATION,
Contributor

Separating successes and errors into different metric names makes it hard to aggregate at query time. Maybe take a look at an example in task_center.rs that emits the status as a dimension instead.

counter!(TC_FINISHED, "kind" => kind_str, "status" => TC_STATUS_FAILED)
    .increment(1);

or from ingress:

counter!(
    INGRESS_REQUESTS,
    "status" => REQUEST_COMPLETED,
    "rpc.service" => service_name,
    "rpc.method" => handler_name,
)
.increment(1);

Contributor Author

You are right! I should have thought of that :D

Comment on lines 277 to 291
macro_rules! metrics_helper {
    (@success=$ok:ident, @error=$err:ident, @duration=$dur:ident, $block:block) => {{
        let start_time = tokio::time::Instant::now();
        let result = $block;
        metrics::histogram!($dur).record(start_time.elapsed());
        if result.is_ok() {
            metrics::counter!($ok).increment(1);
        } else {
            metrics::counter!($err).increment(1);
        }

        result
    }};
}

Contributor

I'm not sure it's worth a macro if this is used only a couple of times, to be honest, especially if it's just the dimension name that will report the status.

Contributor Author

It's only so I don't repeat myself; it makes changes easier, for example.
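For what it's worth, a sketch of how the helper macro could look once the status becomes a dimension instead of separate metric names; the macro name, pattern, and label values below are illustrative, not the merged code.

macro_rules! instrumented {
    (@counter=$total:ident, @duration=$dur:ident, $block:block) => {{
        let start_time = tokio::time::Instant::now();
        let result = $block;
        metrics::histogram!($dur).record(start_time.elapsed());
        // Record a single counter with the outcome as a "status" dimension.
        let status = if result.is_ok() { "success" } else { "failure" };
        metrics::counter!($total, "status" => status).increment(1);
        result
    }};
}

A call site could then look like instrumented!(@counter=METADATA_CLIENT_GET_TOTAL, @duration=METADATA_CLIENT_GET_DURATION, { self.inner.get(key).await }).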

@muhamadazmy muhamadazmy force-pushed the pr2616 branch 4 times, most recently from 3f1d9b7 to 3a770d5 Compare February 5, 2025 12:57
Comment on lines 141 to 164
"Raft Metadata raft term number"
);

describe_gauge!(
METADATA_SERVER_EMBEDDED_APPLIED_LSN,
Unit::Count,
"Raft Metadata raft applied lsn"
);

describe_gauge!(
METADATA_SERVER_EMBEDDED_COMMITTED_LSN,
Unit::Count,
"Raft Metadata raft committed lsn"
);

describe_gauge!(
METADATA_SERVER_EMBEDDED_FIRST_INDEX,
Unit::Count,
"Raft Metadata raft first index"
);
describe_gauge!(
METADATA_SERVER_EMBEDDED_LAST_INDEX,
Unit::Count,
"Raft Metadata raft last index"
Contributor

embedded vs. raft vs. replicated? cc. @tillrohrmann

@tillrohrmann tillrohrmann Feb 5, 2025

I've renamed the Raft metadata server kind to "replicated" in this PR: #2632.

Comment on lines 130 to 133
        counter!(METADATA_SERVER_EMBEDDED_RECV_MESSAGE_TOTAL, "peer" => self.remote_peer.to_string())
            .increment(1);
        counter!(METADATA_SERVER_EMBEDDED_RECV_MESSAGE_BYTES, "peer" => self.remote_peer.to_string())
            .increment(message.payload.len() as u64);
Contributor

Let's remove the peer address; this is literally infinite cardinality, as those sockets get random ports.

@muhamadazmy muhamadazmy Feb 5, 2025

This is not the peer address; it only holds the plain node id. But if it doesn't add any value, I can remove it.

Contributor

If it's not the address, then it's okay.

Comment on lines 215 to 218
        counter!(METADATA_SERVER_EMBEDDED_SENT_MESSAGE_TOTAL, "peer" => self.remote_peer.to_string())
            .increment(1);
        counter!(METADATA_SERVER_EMBEDDED_SENT_MESSAGE_BYTES, "peer" => self.remote_peer.to_string())
            .increment(size as u64);
Contributor

ditto

Summary:
This pull request introduces telemetry metrics to the metadata server, enhancing monitoring and performance analysis capabilities.

Fixes restatedev#2605
@muhamadazmy

@AhmedSoliman, thank you for your helpful input. I processed all your comments. Feel free to take another look, leave comments, or let me know if it's approved :)

@AhmedSoliman AhmedSoliman left a comment

Looks ready to go

@muhamadazmy muhamadazmy merged commit 2760587 into restatedev:main Feb 5, 2025
54 checks passed
@muhamadazmy muhamadazmy deleted the pr2616 branch February 5, 2025 20:21