
Add more metrics #2310

Open
rafal-ch wants to merge 45 commits into master

Conversation

Contributor

@rafal-ch rafal-ch commented Oct 7, 2024

Closes #807

Description

This PR adds a couple of additional metrics (block importer and p2p) and also contains a slight refactor of how we initialize bucket sizes for histogram-based metrics.

The metrics command-line parameter has been replaced with disable-metrics. Metrics are now enabled by default, with the option to disable them entirely or on a per-module basis. This change is breaking for all dependencies that use the CLI to set up the fuel-core-client.
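For illustration only, a minimal sketch of how such flags could be wired up with clap; the flag and field names below are assumptions, not necessarily the PR's actual CLI definitions:

    use clap::Parser;

    // Hypothetical sketch of the new metrics flags.
    #[derive(Parser, Debug)]
    struct Args {
        /// Disable all metrics.
        #[arg(long)]
        disable_metrics: bool,

        /// Disable metrics only for the listed modules, e.g. "p2p,importer".
        #[arg(long, value_delimiter = ',')]
        disable_module_metrics: Vec<String>,
    }

    fn main() {
        let args = Args::parse();
        println!("{args:?}");
    }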

Checklist

  • Breaking changes are clearly marked as such in the PR description and changelog
  • New behavior is reflected in tests

Before requesting review

  • I have reviewed the code myself

@rafal-ch rafal-ch changed the title from "Add gas_per_block metric" to "Add more metrics" Oct 7, 2024
Comment on lines 234 to 261
let mut swarm = if metrics {
    // we use the global registry to store the metrics without needing to create a new one
    // since libp2p already creates sub-registries
    let mut registry = global_registry().registry.lock();

    swarm_builder
        .with_bandwidth_metrics(&mut registry)
        .with_behaviour(|_| behaviour)?
        .with_swarm_config(|cfg| {
            if let Some(timeout) = config.connection_idle_timeout {
                cfg.with_idle_connection_timeout(timeout)
            } else {
                cfg
            }
        })
        .build()
} else {
    swarm_builder
        .with_behaviour(|_| behaviour)?
        .with_swarm_config(|cfg| {
            if let Some(timeout) = config.connection_idle_timeout {
                cfg.with_idle_connection_timeout(timeout)
            } else {
                cfg
            }
        })
        .build()
};
Member

libp2p generics prevent me from doing something like this:

let swarm_builder = initial_stuff;
let swarm_builder = if metrics {
    let mut registry = ..;
    swarm_builder.with_bandwidth_metrics(&mut registry)
} else {
    swarm_builder
};
// .. rest of the stuff: with_behaviour, etc.
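For context, a minimal self-contained sketch of why that pattern fails to compile; the types here are invented for illustration and are much simpler than the real libp2p builder:

    // with_bandwidth_metrics changes the builder's type parameter, so the
    // two branches of the if/else would have incompatible types.
    struct SwarmBuilder<Phase>(Phase);
    struct Plain;
    struct WithMetrics;

    impl SwarmBuilder<Plain> {
        fn with_bandwidth_metrics(self) -> SwarmBuilder<WithMetrics> {
            SwarmBuilder(WithMetrics)
        }
    }

    fn demo(metrics: bool) {
        let builder = SwarmBuilder(Plain);
        // Does not compile: the branches yield SwarmBuilder<WithMetrics>
        // vs SwarmBuilder<Plain>, which are different types.
        // let builder = if metrics {
        //     builder.with_bandwidth_metrics()
        // } else {
        //     builder
        // };
        let _ = (builder, metrics);
    }

    fn main() {
        demo(true);
    }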

@rafal-ch rafal-ch closed this Oct 8, 2024
@rafal-ch rafal-ch reopened this Oct 8, 2024
@rafal-ch rafal-ch marked this pull request as ready for review October 8, 2024 15:41
],
),
(
// 50 ETH is expected to be the realistic maximum fee for the time being.
Member

Probably should still specify that this is in Gwei.

Also, I don't know if we should bother having so much granularity up high. When I said 50 ETH in our convo, I was thinking in terms of the maximum conceivable fee, well beyond what will happen any time soon.

We should have many buckets at much lower values.

For example, a single tx from testnet here:

gasCosts: {
    fee: "10",
    gasUsed: "87570"
},

This is during low volume, etc., but it gives an idea of what we could be working with, much more likely than even 0.05 ETH.

So perhaps we break it down like: 0, 100, 200, 400, 800, 1_600, etc... It's less aggressive than log_10, and gives us helpful granularity.

The highest for now can just be 52_428_800.0? Are we okay with 20 buckets?
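As a sketch of this doubling scheme (the helper name is illustrative, not the PR's actual code): 100 * 2^n gwei for n in 0..20 tops out at exactly 52_428_800, and a leading 0 bucket could be prepended to distinguish empty blocks:

    fn fee_buckets_gwei() -> Vec<f64> {
        // 100, 200, 400, ..., 52_428_800 (20 buckets).
        (0..20).map(|n| 100.0 * f64::powi(2.0, n)).collect()
    }

    fn main() {
        let buckets = fee_buckets_gwei();
        assert_eq!(buckets.first().copied(), Some(100.0));
        assert_eq!(buckets.last().copied(), Some(52_428_800.0));
        assert_eq!(buckets.len(), 20);
    }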

Contributor Author

Probably should still specify that this is in Gwei.

The metric itself is explicit about the unit:

    registry.register(
        "executor_fee_per_block_gwei",
        "The total fee (gwei) paid by transactions in a block",
        fee_per_block.clone(),
    );

I see that we don't use gwei across the fuel-core codebase (except for the fuel-gas-price-algorithm crate), which I think makes the "gwei" unit a default for developers.

Contributor Author

So perhaps we break it down like: 0, 100, 200, 400, 800, 1_600, etc... It's less aggressive than log_10, and gives us helpful granularity.

The highest for now can just be 52_428_800.0? Are we okay with 20 buckets?

Yes, that's a good observation, thanks. I think we're good with ~20 buckets; we can still re-adjust them once we start observing real-world data.

Updated in b41331b

Member

@MitchTurner MitchTurner left a comment

In general, we should make sure we have the "empty" case distinguished from the "just one" case, imo, because on the testnet we have a lot of empty blocks.

Optimistically, that will be a very empty bucket :)

),
(
// Transaction count is fixed at 65_535.
Buckets::TransactionsCount,
Member

The low end could also be informative in metrics. 1 is good to know, and it's a very different state of the chain between 1 and even 2 or 3, so I'd prefer to have 1-5 all there, I think. And then probably 200, 400, 800, 1600, 3200 on the high end for now. We can always add more.
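An illustrative sketch of this proposal (not the PR's final buckets): 1-5 individually at the low end, then doubling from 200 to 3_200 at the high end:

    fn tx_count_buckets() -> Vec<f64> {
        let low = [1.0, 2.0, 3.0, 4.0, 5.0];
        let high = (0..5).map(|n| 200.0 * f64::powi(2.0, n)); // 200, ..., 3_200
        low.into_iter().chain(high).collect()
    }

    fn main() {
        assert_eq!(
            tx_count_buckets(),
            vec![1.0, 2.0, 3.0, 4.0, 5.0, 200.0, 400.0, 800.0, 1_600.0, 3_200.0]
        );
    }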

Contributor Author

I updated these buckets so that we are very detailed at the low end, reach your proposed high end around the 16th bucket, and keep a few at the very end for the high values. Commit is here: ef13002

Distribution visualisation: [image]

),
(
// Consider blocks up to 256kb in size, one bucket per 16kb.
Buckets::SizeUsed,
Member

I think some smaller values are useful here too. Down to 0... or whatever header + mint comes to :)
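As a sketch only (names and exact values are assumptions, not the PR's code), the original linear scheme extended with a few small buckets so that near-empty blocks (header + mint only) remain visible:

    fn block_size_buckets() -> Vec<f64> {
        let small = [512.0, 1_024.0, 4_096.0, 8_192.0];
        // One bucket per 16 KiB up to 256 KiB, as in the original comment.
        let linear = (1..=16).map(|n| (n * 16 * 1024) as f64);
        small.into_iter().chain(linear).collect()
    }

    fn main() {
        let buckets = block_size_buckets();
        assert_eq!(buckets.last().copied(), Some(262_144.0)); // 256 KiB
    }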

Contributor Author

Updated in 5387bc9

@@ -441,6 +442,19 @@ where
        )?;
        debug_assert!(data.found_mint, "Mint transaction is not found");

        executor_metrics()
Member

can we log these metrics only if config.metrics is enabled?
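For illustration, the kind of gating being asked for might look like this self-contained sketch; the config field and metric types are hypothetical stand-ins, not the PR's actual API:

    use std::sync::atomic::{AtomicU64, Ordering};

    // Hypothetical stand-ins for the PR's config and executor metrics.
    struct Config { metrics: bool }
    struct ExecutorMetrics { fee_per_block_sum: AtomicU64 }

    impl ExecutorMetrics {
        fn observe_fee(&self, fee_gwei: u64) {
            self.fee_per_block_sum.fetch_add(fee_gwei, Ordering::Relaxed);
        }
    }

    fn record_block_fee(config: &Config, metrics: &ExecutorMetrics, fee_gwei: u64) {
        // Only touch the metric when metrics are enabled in the config.
        if config.metrics {
            metrics.observe_fee(fee_gwei);
        }
    }

    fn main() {
        let metrics = ExecutorMetrics { fee_per_block_sum: AtomicU64::new(0) };
        record_block_fee(&Config { metrics: true }, &metrics, 10);
        assert_eq!(metrics.fee_per_block_sum.load(Ordering::Relaxed), 10);
    }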

Contributor Author

I put up a draft PR because I'm not sure how to approach this correctly. Please see the description of that PR: #2323

Contributor Author

As per the convo, metrics have been moved to the importer and this commit makes the importer respect the --metrics parameter.

@rafal-ch rafal-ch marked this pull request as draft October 9, 2024 11:33
Contributor Author

rafal-ch commented Oct 9, 2024

Back to draft. As per the call, we'll be moving metrics back to the "importer".

@rafal-ch rafal-ch marked this pull request as ready for review October 9, 2024 17:56
@@ -492,7 +534,10 @@ impl FuelP2PService {
                );
                None
            }
-           _ => None,
+           _ => {
+               self.update_libp2p_metrics(&event);
Collaborator

Do we want to record undesired events? What about the fuel_behaviour event?

Comment on lines +348 to +355
pub fn update_libp2p_metrics<E>(&self, event: &E)
where
    Metrics: Recorder<E>,
{
    if let Some(registry) = self.libp2p_metrics_registry.as_ref() {
        self.update_metrics(|| registry.record(event));
    }
}
Collaborator

How expensive is it to call this for each event? And what kind of information will it record?

Member

I don't have a screenshot on me right now, but I can share one in a while.

Member

# HELP libp2p_gossipsub_messages Number of messages received.
# TYPE libp2p_gossipsub_messages counter
libp2p_gossipsub_messages_total 0
# HELP libp2p_identify_errors Number of errors while attempting to identify the remote.
# TYPE libp2p_identify_errors counter
libp2p_identify_errors_total 0
# HELP libp2p_identify_pushed Number of times identification information of the local node has been actively pushed to a peer..
# TYPE libp2p_identify_pushed counter
libp2p_identify_pushed_total 0
# HELP libp2p_identify_received Number of times identification information has been received from a peer.
# TYPE libp2p_identify_received counter
libp2p_identify_received_total 6
# HELP libp2p_identify_sent Number of times identification information of the local node has been sent to a peer in response to an identification request.
# TYPE libp2p_identify_sent counter
libp2p_identify_sent_total 6
# HELP libp2p_identify_remote_protocols Number of connected nodes supporting a specific protocol, with "unrecognized" for each peer supporting one or more unrecognized protocols
# TYPE libp2p_identify_remote_protocols gauge
libp2p_identify_remote_protocols{protocol="/ipfs/id/1.0.0"} 3
libp2p_identify_remote_protocols{protocol="/ipfs/id/push/1.0.0"} 3
libp2p_identify_remote_protocols{protocol="unrecognized"} 3
# HELP libp2p_identify_remote_listen_addresses Number of connected nodes advertising a specific listen address
# TYPE libp2p_identify_remote_listen_addresses gauge
libp2p_identify_remote_listen_addresses{listen_address="/dns/tcp/p2p"} 3
libp2p_identify_remote_listen_addresses{listen_address="/ip4/tcp"} 3
# HELP libp2p_identify_local_observed_addresses Number of connected nodes observing the local node at a specific address
# TYPE libp2p_identify_local_observed_addresses gauge
libp2p_identify_local_observed_addresses{observed_address="/ip4/tcp"} 3
# HELP libp2p_kad_query_result_get_record_ok Number of records returned by a successful Kademlia get record query.
# TYPE libp2p_kad_query_result_get_record_ok counter
libp2p_kad_query_result_get_record_ok_total 0
# HELP libp2p_kad_query_result_get_record_error Number of failed Kademlia get record queries.
# TYPE libp2p_kad_query_result_get_record_error counter
# HELP libp2p_kad_query_result_get_closest_peers_ok Number of closest peers returned by a successful Kademlia get closest peers query.
# TYPE libp2p_kad_query_result_get_closest_peers_ok histogram
libp2p_kad_query_result_get_closest_peers_ok_sum 0.0
libp2p_kad_query_result_get_closest_peers_ok_count 0
libp2p_kad_query_result_get_closest_peers_ok_bucket{le="1.0"} 0
libp2p_kad_query_result_get_closest_peers_ok_bucket{le="2.0"} 0
libp2p_kad_query_result_get_closest_peers_ok_bucket{le="4.0"} 0
libp2p_kad_query_result_get_closest_peers_ok_bucket{le="8.0"} 0
libp2p_kad_query_result_get_closest_peers_ok_bucket{le="16.0"} 0
libp2p_kad_query_result_get_closest_peers_ok_bucket{le="32.0"} 0
libp2p_kad_query_result_get_closest_peers_ok_bucket{le="64.0"} 0
libp2p_kad_query_result_get_closest_peers_ok_bucket{le="128.0"} 0
libp2p_kad_query_result_get_closest_peers_ok_bucket{le="256.0"} 0
libp2p_kad_query_result_get_closest_peers_ok_bucket{le="512.0"} 0
libp2p_kad_query_result_get_closest_peers_ok_bucket{le="+Inf"} 0
# HELP libp2p_kad_query_result_get_closest_peers_error Number of failed Kademlia get closest peers queries.
# TYPE libp2p_kad_query_result_get_closest_peers_error counter
# HELP libp2p_kad_query_result_get_providers_ok Number of providers returned by a successful Kademlia get providers query.
# TYPE libp2p_kad_query_result_get_providers_ok histogram
libp2p_kad_query_result_get_providers_ok_sum 0.0
libp2p_kad_query_result_get_providers_ok_count 0
libp2p_kad_query_result_get_providers_ok_bucket{le="1.0"} 0
libp2p_kad_query_result_get_providers_ok_bucket{le="2.0"} 0
libp2p_kad_query_result_get_providers_ok_bucket{le="4.0"} 0
libp2p_kad_query_result_get_providers_ok_bucket{le="8.0"} 0
libp2p_kad_query_result_get_providers_ok_bucket{le="16.0"} 0
libp2p_kad_query_result_get_providers_ok_bucket{le="32.0"} 0
libp2p_kad_query_result_get_providers_ok_bucket{le="64.0"} 0
libp2p_kad_query_result_get_providers_ok_bucket{le="128.0"} 0
libp2p_kad_query_result_get_providers_ok_bucket{le="256.0"} 0
libp2p_kad_query_result_get_providers_ok_bucket{le="512.0"} 0
libp2p_kad_query_result_get_providers_ok_bucket{le="+Inf"} 0
# HELP libp2p_kad_query_result_get_providers_error Number of failed Kademlia get providers queries.
# TYPE libp2p_kad_query_result_get_providers_error counter
# HELP libp2p_kad_query_result_num_requests Number of requests started for a Kademlia query.
# TYPE libp2p_kad_query_result_num_requests histogram
# HELP libp2p_kad_query_result_num_success Number of successful requests of a Kademlia query.
# TYPE libp2p_kad_query_result_num_success histogram
# HELP libp2p_kad_query_result_num_failure Number of failed requests of a Kademlia query.
# TYPE libp2p_kad_query_result_num_failure histogram
# HELP libp2p_kad_query_result_duration_seconds Duration of a Kademlia query.
# TYPE libp2p_kad_query_result_duration_seconds histogram
# UNIT libp2p_kad_query_result_duration_seconds seconds
# HELP libp2p_kad_routing_updated Number of peers added, updated or evicted to, in or from a specific kbucket in the routing table.
# TYPE libp2p_kad_routing_updated counter
libp2p_kad_routing_updated_total{action="Updated",bucket="254"} 2
libp2p_kad_routing_updated_total{action="Updated",bucket="255"} 4
# HELP libp2p_kad_inbound_requests Number of inbound requests.
# TYPE libp2p_kad_inbound_requests counter
# HELP libp2p_swarm_connections_incoming Number of incoming connections per address stack.
# TYPE libp2p_swarm_connections_incoming counter
# HELP libp2p_swarm_connections_incoming_error Number of incoming connection errors.
# TYPE libp2p_swarm_connections_incoming_error counter
# HELP libp2p_swarm_new_listen_addr Number of new listen addresses.
# TYPE libp2p_swarm_new_listen_addr counter
# HELP libp2p_swarm_expired_listen_addr Number of expired listen addresses.
# TYPE libp2p_swarm_expired_listen_addr counter
# HELP libp2p_swarm_external_addr_candidates Number of new external address candidates.
# TYPE libp2p_swarm_external_addr_candidates counter
libp2p_swarm_external_addr_candidates_total{protocols="/ip4/tcp"} 3
# HELP libp2p_swarm_external_addr_confirmed Number of confirmed external addresses.
# TYPE libp2p_swarm_external_addr_confirmed counter
# HELP libp2p_swarm_external_addr_expired Number of expired external addresses.
# TYPE libp2p_swarm_external_addr_expired counter
# HELP libp2p_swarm_listener_closed Number of listeners closed.
# TYPE libp2p_swarm_listener_closed counter
# HELP libp2p_swarm_listener_error Number of listener errors.
# TYPE libp2p_swarm_listener_error counter
libp2p_swarm_listener_error_total 0
# HELP libp2p_swarm_dial_attempt Number of dial attempts.
# TYPE libp2p_swarm_dial_attempt counter
libp2p_swarm_dial_attempt_total 3
# HELP libp2p_swarm_outgoing_connection_error Number outgoing connection errors.
# TYPE libp2p_swarm_outgoing_connection_error counter
libp2p_swarm_outgoing_connection_error_total{peer="Known",error="TransportOther"} 2
# HELP libp2p_swarm_connections_established Number of connections established.
# TYPE libp2p_swarm_connections_established counter
libp2p_swarm_connections_established_total{role="Dialer",protocols="/dns/tcp/p2p"} 3
# HELP libp2p_swarm_connections_establishment_duration Time it took (locally) to establish connections.
# TYPE libp2p_swarm_connections_establishment_duration histogram
libp2p_swarm_connections_establishment_duration_sum{role="Dialer",protocols="/dns/tcp/p2p"} 2.410768541
libp2p_swarm_connections_establishment_duration_count{role="Dialer",protocols="/dns/tcp/p2p"} 3
libp2p_swarm_connections_establishment_duration_bucket{le="0.01",role="Dialer",protocols="/dns/tcp/p2p"} 0
libp2p_swarm_connections_establishment_duration_bucket{le="0.015",role="Dialer",protocols="/dns/tcp/p2p"} 0
libp2p_swarm_connections_establishment_duration_bucket{le="0.0225",role="Dialer",protocols="/dns/tcp/p2p"} 0
libp2p_swarm_connections_establishment_duration_bucket{le="0.03375",role="Dialer",protocols="/dns/tcp/p2p"} 0
libp2p_swarm_connections_establishment_duration_bucket{le="0.050625",role="Dialer",protocols="/dns/tcp/p2p"} 0
libp2p_swarm_connections_establishment_duration_bucket{le="0.0759375",role="Dialer",protocols="/dns/tcp/p2p"} 0
libp2p_swarm_connections_establishment_duration_bucket{le="0.11390625",role="Dialer",protocols="/dns/tcp/p2p"} 0
libp2p_swarm_connections_establishment_duration_bucket{le="0.170859375",role="Dialer",protocols="/dns/tcp/p2p"} 0
libp2p_swarm_connections_establishment_duration_bucket{le="0.2562890625",role="Dialer",protocols="/dns/tcp/p2p"} 0
libp2p_swarm_connections_establishment_duration_bucket{le="0.38443359375",role="Dialer",protocols="/dns/tcp/p2p"} 0
libp2p_swarm_connections_establishment_duration_bucket{le="0.576650390625",role="Dialer",protocols="/dns/tcp/p2p"} 0
libp2p_swarm_connections_establishment_duration_bucket{le="0.8649755859375",role="Dialer",protocols="/dns/tcp/p2p"} 2
libp2p_swarm_connections_establishment_duration_bucket{le="1.29746337890625",role="Dialer",protocols="/dns/tcp/p2p"} 3
libp2p_swarm_connections_establishment_duration_bucket{le="1.946195068359375",role="Dialer",protocols="/dns/tcp/p2p"} 3
libp2p_swarm_connections_establishment_duration_bucket{le="2.9192926025390628",role="Dialer",protocols="/dns/tcp/p2p"} 3
libp2p_swarm_connections_establishment_duration_bucket{le="4.378938903808594",role="Dialer",protocols="/dns/tcp/p2p"} 3
libp2p_swarm_connections_establishment_duration_bucket{le="6.568408355712891",role="Dialer",protocols="/dns/tcp/p2p"} 3
libp2p_swarm_connections_establishment_duration_bucket{le="9.852612533569337",role="Dialer",protocols="/dns/tcp/p2p"} 3
libp2p_swarm_connections_establishment_duration_bucket{le="14.778918800354005",role="Dialer",protocols="/dns/tcp/p2p"} 3
libp2p_swarm_connections_establishment_duration_bucket{le="22.168378200531007",role="Dialer",protocols="/dns/tcp/p2p"} 3
libp2p_swarm_connections_establishment_duration_bucket{le="+Inf",role="Dialer",protocols="/dns/tcp/p2p"} 3
# HELP libp2p_swarm_connections_duration_seconds Time a connection was alive.
# TYPE libp2p_swarm_connections_duration_seconds histogram
# UNIT libp2p_swarm_connections_duration_seconds seconds
# HELP libp2p_bandwidth_bytes Bandwidth usage by direction and transport protocols.
# TYPE libp2p_bandwidth_bytes counter
# UNIT libp2p_bandwidth_bytes bytes
libp2p_bandwidth_bytes_total{protocols="/dns/tcp/p2p",direction="Outbound"} 6041
libp2p_bandwidth_bytes_total{protocols="/ip4/tcp/p2p",direction="Outbound"} 0
libp2p_bandwidth_bytes_total{protocols="/dns/tcp/p2p",direction="Inbound"} 1243523
libp2p_bandwidth_bytes_total{protocols="/ip4/tcp/p2p",direction="Inbound"} 0

Member

As for the overhead: we are collecting a lot of data that will be useful. I suggest we turn off libp2p metrics collection everywhere except for at least one sentry. When there are issues with that one sentry, we can enable it for the others and compare values.

@@ -517,13 +562,23 @@ impl FuelP2PService {
        event: FuelBehaviourEvent,
    ) -> Option<FuelP2PEvent> {
        match event {
-           FuelBehaviourEvent::Gossipsub(event) => self.handle_gossipsub_event(event),
+           FuelBehaviourEvent::Gossipsub(event) => {
+               self.update_libp2p_metrics(&event);
Member

cc: @xgreenx this is where FuelBehaviour metrics are logged

@rafal-ch rafal-ch added the breaking A breaking api change label Oct 14, 2024
Module,
};

#[test]
Member

Can we add a test for invalid values?
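For instance, something along these lines; whether Module implements FromStr (and its exact error type) is an assumption about the PR's API:

    #[test]
    fn invalid_module_name_is_rejected() {
        // Hypothetical test name and parse entry point.
        let result = "not_a_module".parse::<Module>();
        assert!(result.is_err());
    }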
