Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Logging (2/2): Replace DelayDetector #9543

Merged
merged 16 commits into from
Sep 25, 2023
Merged

Logging (2/2): Replace DelayDetector #9543

merged 16 commits into from
Sep 25, 2023

Conversation

nikurt
Copy link
Contributor

@nikurt nikurt commented Sep 20, 2023

Continues #9540

Deletes the DelayDetector class and replaces the uses of it with spans.

@nikurt nikurt changed the title Logging: Replace DelayDetector Logging (2/2): Replace DelayDetector Sep 20, 2023
nikurt and others added 10 commits September 22, 2023 12:03
The `lib.rs` file accumulated structs and functionality from different areas.
Split the file into several smaller files.
…ration (#9539)

* remove force_* features and move runtime selection to runtime configuration

There are a couple of reasons, but the most major one is that these
`force_*` fetures make it impossible to run tools with `--all-features`
enabled on macOS (where use of wasmer0 would be forced somewhat
arbitrarily.)

That said, these options don't make sense, in the world of limited
replayability, and with runtime configuration there is also now a fairly
straightforward approach to forcing the use of a different runtime than
the "default" (just chnge some yamls.)

As an upside this also forced my hand to go and fix the Config::test
nonsense, a little bit, hence

Fixes #8202

and tests will now actually exercise the system in a configuration that
is going to run in production most of the time! Yay!

* Replace !0 with PROTOCOL_VERSION

* Remove the wasmtime feature from runtime-params-estimator

* Fix test expectations
* debug: Add #[perf] to more actors

This allows to observe more details about actor utilization when
compiling with `--features=performance_stats`.

Running some loadtest with this reveals that `ClientActor`
and `ShardsManagerActor` are the busiest in some cases, which
would not be possible to observe without the instrumentation added
in this commit.

* cargo fmt
Chunks can be MBs in size. We should not force a read from DB when it's
not cached already.

Also alters the `chunk_exists` method to check the cache first.
An `exists` call on the DB is still more expensive than looking up an
`Arc` in the in-memory cache.
* Logging: Deprecate --verbose

--verbose is not a bool, but a String argument, which is confusing.

* Logging: Deprecate --verbose

--verbose is not a bool, but a String argument, which is confusing.

* Logging: Deprecate --verbose

--verbose is not a bool, but a String argument, which is confusing.
* Logging: New delay detector

* Logging: New delay detector

* Logging: New delay detector

* Logging: New delay detector

* Rename to SpanDurationLogger

* Use Durations instead of nanoseconds

* refactor: Tracing (#9544)

The `lib.rs` file accumulated structs and functionality from different areas.
Split the file into several smaller files.

* Logging: New delay detector

* Rename to SpanDurationLogger

* Use Durations instead of nanoseconds

* Use Durations instead of nanoseconds

* refactor: Tracing (#9544)

The `lib.rs` file accumulated structs and functionality from different areas.
Split the file into several smaller files.

* Logging: New delay detector

* refactor: Tracing (#9544)

The `lib.rs` file accumulated structs and functionality from different areas.
Split the file into several smaller files.

* Use Durations instead of nanoseconds
@nikurt nikurt marked this pull request as ready for review September 22, 2023 10:04
@nikurt nikurt requested a review from a team as a code owner September 22, 2023 10:04
@nikurt nikurt requested a review from jakmeier September 22, 2023 10:04
Copy link
Contributor

@jakmeier jakmeier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great 👍 Nothing better than deleting old code while also improving what it was doing :D

I left one comment regarding the ClientActor::wrap function. Feel free to ignore or address in a follow up in the undefined future.

@@ -263,8 +263,7 @@ impl ClientActor {
) -> Res {
let (_span, msg) = handler_debug_span!(target: "client", msg, msg_type);
self.check_triggers(ctx);
let _d =
delay_detector::DelayDetector::new(|| format!("NetworkClientMessage {:?}", msg).into());
let _span_inner = tracing::debug_span!(target: "client", "NetworkClientMessage").entered();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, without the actual message printed, it becomes rather hard to figure out what is slow if this triggers.
Probably not a big problem. But if you have an idea how we can improve it, ideally without eagerly debug printing every time, it would be nice.
(If #[track_caller] worked with tracing spans and report the callers location instead of this one inside the wrap, I think that would be perfect. I don't know for sure but I think it doesn't work this way, however.)

Also, I think you can remove the std::fmt::Debug requirement on the function if you don't use debug formatting anymore.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is slow if this triggers

That's when Grafana Tempo should be useful. It (should and will eventually) ingest the tracing information, and will let us search for long running "NetworkClientMessage" spans.

@nikurt nikurt added this pull request to the merge queue Sep 25, 2023
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Sep 25, 2023
@nikurt nikurt added this pull request to the merge queue Sep 25, 2023
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Sep 25, 2023
@nikurt nikurt added this pull request to the merge queue Sep 25, 2023
Merged via the queue into master with commit e9751e9 Sep 25, 2023
@nikurt nikurt deleted the nikurt-dd branch September 25, 2023 11:46
nikurt added a commit that referenced this pull request Sep 26, 2023
Continues #9540

Deletes the `DelayDetector` class and replaces the uses of it with
spans.

---------

Co-authored-by: Anton Puhach <anton@near.org>
Co-authored-by: Simonas Kazlauskas <git@kazlauskas.me>
Co-authored-by: Jakob Meier <mail@jakobmeier.ch>
nikurt added a commit that referenced this pull request Sep 26, 2023
Continues #9540

Deletes the `DelayDetector` class and replaces the uses of it with
spans.

---------

Co-authored-by: Anton Puhach <anton@near.org>
Co-authored-by: Simonas Kazlauskas <git@kazlauskas.me>
Co-authored-by: Jakob Meier <mail@jakobmeier.ch>
nikurt added a commit that referenced this pull request Oct 11, 2023
Continues #9540

Deletes the `DelayDetector` class and replaces the uses of it with
spans.

---------

Co-authored-by: Anton Puhach <anton@near.org>
Co-authored-by: Simonas Kazlauskas <git@kazlauskas.me>
Co-authored-by: Jakob Meier <mail@jakobmeier.ch>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants