
Speed up logging once again #9981

Merged
merged 12 commits into paritytech:master from the master_faster_logging branch on Oct 21, 2021

Conversation

@koute (Contributor) commented Oct 8, 2021

This PR tries to speed up our logging machinery by introducing a buffered semi-asynchronous stderr writer.

(In our profiling on one of the validators we've seen ~5% of total time being spent writing to stderr.)

For more details, please read the doc comment included in stderr_writer.rs.

As far as benchmarks go, this is pretty hard to benchmark; depending on how the benchmark is set up it can show either zero improvement or a huge improvement. Nevertheless, in a microbenchmark which tries to replicate conditions similar to what we're running on our validator nodes (that is: the executable is launched as a systemd service and then just keeps on puking logs into the systemd journal), this PR cuts down the total execution time of the microbenchmark from ~8s to ~2s.

(While I'm at it I've also enabled the parking_lot feature for the tracing-subscriber crate, which measurably lowers the overhead in the case where we're not printing anything; there's still more work that can be done for that case, but that will come in another PR.)
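To give a concrete mental model before reading stderr_writer.rs, here's a minimal, self-contained sketch of the general pattern (not the actual implementation): log lines are appended to a shared, mutex-protected buffer, and a dedicated background thread flushes that buffer to stderr either on a short timer or as soon as the buffered data crosses a size threshold. The constant values, and the use of std's Mutex/Condvar instead of parking_lot, are simplifications for the example.

use std::io::Write;
use std::sync::{Condvar, Mutex};
use std::time::Duration;

// Illustrative values only; the real constants live in stderr_writer.rs.
const ASYNC_FLUSH_THRESHOLD: usize = 16 * 1024;
const AUTOFLUSH_EVERY: Duration = Duration::from_millis(50);

static BUFFER: Mutex<Vec<u8>> = Mutex::new(Vec::new());
static FLUSH_NOW: Condvar = Condvar::new();

fn log_line(line: &str) {
    let mut buffer = BUFFER.lock().unwrap();
    buffer.extend_from_slice(line.as_bytes());
    buffer.push(b'\n');
    if buffer.len() >= ASYNC_FLUSH_THRESHOLD {
        // Too much buffered data: wake the flusher instead of waiting
        // for the next periodic flush.
        FLUSH_NOW.notify_one();
    }
}

fn flusher_thread() {
    let mut buffer = BUFFER.lock().unwrap();
    loop {
        // Sleep until either the autoflush interval elapses or a writer
        // signals that the size threshold was crossed.
        let (guard, _) = FLUSH_NOW.wait_timeout(buffer, AUTOFLUSH_EVERY).unwrap();
        buffer = guard;
        if !buffer.is_empty() {
            let data = std::mem::take(&mut *buffer);
            // Release the lock while doing the (potentially slow) write so
            // that logging threads aren't blocked on stderr.
            drop(buffer);
            let _ = std::io::stderr().write_all(&data);
            buffer = BUFFER.lock().unwrap();
        }
    }
}

fn main() {
    std::thread::spawn(flusher_thread);
    for n in 0..10 {
        log_line(&format!("log line #{}", n));
    }
    // Give the flusher a chance to run; the real writer also installs an
    // exit hook so nothing is lost on shutdown (see the review discussion).
    std::thread::sleep(Duration::from_millis(100));
}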

@koute requested review from gilescope and bkchr on October 8, 2021 12:43
@koute added the following labels on Oct 8, 2021: A0-please_review (Pull request needs code review), C1-low (PR touches the given topic and has a low impact on builders), D3-trivial 🧸 (PR contains trivial changes in a runtime directory that do not require an audit), I9-optimisation (An enhancement to provide better overall performance in terms of time-to-completion for a task).
@bkchr (Member) commented Oct 9, 2021

Why not use this: https://docs.rs/tracing-appender/0.1.2/tracing_appender/non_blocking/index.html ?

@koute (Contributor, Author) commented Oct 11, 2021

> Why not use this: https://docs.rs/tracing-appender/0.1.2/tracing_appender/non_blocking/index.html ?

That's a great question!

  1. It's slower. On the same benchmark where my logger takes ~2s, the one from tracing-appender takes ~5s. (That's assuming I turn on non-lossy mode so that it doesn't drop any logs; if I let it run in lossy mode so that it can drop logs it takes ~2.6s, so still slightly slower than my logger, which doesn't drop anything. See the usage sketch after this list.)
  2. It applies backpressure based only on the number of lines printed, and not on their length. And we do have places in our code where we essentially log untrusted user input verbatim, so while unlikely, in theory it could be a potential vector for a DoS attack.
  3. I'm not a total fan of the shutdown semantics; when that logger is shut down (that is: the guard is dropped) it stops working instead of just printing everything synchronously like mine does. It probably doesn't matter in practice, but still.
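For comparison, this is roughly what wiring up tracing-appender's non-blocking writer in non-lossy mode looks like (a sketch based on its documented non_blocking API; exact builder methods may vary slightly between versions):

use tracing_appender::non_blocking::NonBlockingBuilder;

fn main() {
    // Non-lossy mode: the writer applies backpressure (blocks) instead of
    // dropping lines once its internal channel fills up; this is the
    // configuration used for the ~5s number above.
    let (writer, _guard) = NonBlockingBuilder::default()
        .lossy(false)
        .finish(std::io::stderr());

    tracing_subscriber::fmt().with_writer(writer).init();

    tracing::info!("hello from the non-blocking writer");

    // `_guard` has to stay alive for the duration of the program; once it is
    // dropped the background worker shuts down and further log lines are
    // silently discarded, which is the shutdown behaviour mentioned in
    // point 3 above.
}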

@koute force-pushed the master_faster_logging branch from 3e3bf2d to a07611c on October 11, 2021 05:35
@koute (Contributor, Author) commented Oct 11, 2021

(Force-pushed to resolve conflicts.)

@koute added the B0-silent label (Changes should not be mentioned in any release notes) on Oct 11, 2021
@gilescope (Contributor):

A test would be nice to prove that, while spinning up n threads which are all logging, all generated log messages do get logged and none fall through any cracks.

@koute (Contributor, Author) commented Oct 11, 2021

> A test would be nice to prove that, while spinning up n threads which are all logging, all generated log messages do get logged and none fall through any cracks.

I've added an integration test which does just that.

@gilescope (Contributor):

Nice test. Good that nothing is going astray.

client/tracing/tests/logger.rs — 3 resolved review threads (outdated)
Comment on lines +164 to +174
if !ENABLE_ASYNC_LOGGING.load(Ordering::Relaxed) {
sync_flush_on_drop = true;
}
@bkchr (Member):

Why?

@koute (Contributor, Author):

If the logger was already deinitialized but something still wants to log for some reason, then instead of appending to a buffer which will never be flushed (because the background thread which does the flushing is most likely already dead at that point), it will just log synchronously.

Basically, a just-in-case handling of a potential corner case.
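In other words, the fallback roughly amounts to something like this (a self-contained sketch, not the actual code; ENABLE_ASYNC_LOGGING and append_to_shared_buffer are stand-ins for the real items):

use std::io::Write;
use std::sync::atomic::{AtomicBool, Ordering};

static ENABLE_ASYNC_LOGGING: AtomicBool = AtomicBool::new(true);

fn append_to_shared_buffer(_line: &[u8]) {
    // In the real writer this appends to the mutex-protected buffer that the
    // background thread flushes; elided here.
}

fn write_log_line(line: &[u8]) {
    if ENABLE_ASYNC_LOGGING.load(Ordering::Relaxed) {
        append_to_shared_buffer(line);
    } else {
        // The flusher thread may already be gone, so block and write directly.
        let _ = std::io::stderr().write_all(line);
    }
}

fn main() {
    write_log_line(b"buffered asynchronously\n");
    ENABLE_ASYNC_LOGGING.store(false, Ordering::SeqCst);
    write_log_line(b"written synchronously\n");
}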

pub struct StderrWriter {
buffer: Option<parking_lot::lock_api::MutexGuard<'static, parking_lot::RawMutex, Vec<u8>>>,
sync_flush_on_drop: bool,
original_len: usize,
@bkchr (Member):

When is that not 0?

@koute (Contributor, Author):

The buffer is not always immediately flushed; as long as the amount of data in the buffer is less than ASYNC_FLUSH_THRESHOLD, it is only flushed every AUTOFLUSH_EVERY (50ms in this case). For example, if you call log!("Hello World!") twice in rapid succession, then on the second call original_len will be non-zero, since the buffer will still contain the data from the first call, which hasn't been flushed yet (neither has the amount of data exceeded ASYNC_FLUSH_THRESHOLD, nor has the time since the last flush exceeded AUTOFLUSH_EVERY, so the thread which does the flushing is still sleeping).
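A tiny self-contained illustration of that (using std's Mutex for brevity; the real writer uses parking_lot, and buffered_write is just a stand-in here):

use std::sync::Mutex;

static BUFFER: Mutex<Vec<u8>> = Mutex::new(Vec::new());

// Appends a line and returns the buffer length *before* the append,
// i.e. the equivalent of `original_len`.
fn buffered_write(line: &[u8]) -> usize {
    let mut buffer = BUFFER.lock().unwrap();
    let original_len = buffer.len();
    buffer.extend_from_slice(line);
    original_len
}

fn main() {
    // First call: the buffer is empty, so `original_len` is 0.
    assert_eq!(buffered_write(b"Hello World!\n"), 0);
    // Second call, before any flush: the first line (13 bytes) is still
    // buffered because neither ASYNC_FLUSH_THRESHOLD nor AUTOFLUSH_EVERY
    // has been hit, so `original_len` is now non-zero.
    assert_eq!(buffered_write(b"Hello World!\n"), 13);
}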

client/tracing/src/logging/stderr_writer.rs — resolved review thread (outdated)
Comment on lines +130 to +134
extern "C" fn on_exit() {
ENABLE_ASYNC_LOGGING.store(false, Ordering::SeqCst);

if let Some(buffer) = BUFFER.try_lock_for(ON_EXIT_FLUSH_TIMEOUT) {
flush_logs(buffer);
}
}
@bkchr (Member):

Why do we need this? Couldn't the Drop implementation just set ENABLE_ASYNC_LOGGING = false, and the thread would finish when it sees this?

@koute (Contributor, Author):

We need some way to detect that we're exiting, and yes, we could technically do that through Drop, but that would require us to actually make sure it gets dropped. We can't make it a static, as Rust never calls drop on statics. We could do it through the .fini or .dtors linker sections, but that's unportable (and possibly nightly-only; I haven't used those in a long time, so I don't remember whether they can already be used on stable Rust). We could use a thread local, check whether we're on the thread which initialized the logger, and run drop from there (since TLS destructors do get run), but that's kind of a hack. We could have the logger initializer return a handle which would control the lifetime, but that's inconvenient and potentially error-prone in this case.

And depending on which drop variant we went for, the logger might not be flushed if libc::exit is called, which, yes, probably doesn't normally happen, but I think it's a good idea to cover all of our bases.

So unless I'm missing something, what's left is registering a handler through the standard atexit.

(Of course this still doesn't cover a few cases, e.g. SIGSEGV from a segmentation fault or SIGABRT from abort being called, but I'm not sure we want to go that far, as that's a little more tricky since it involves signal handling.)

As for why we're doing the try_lock_for here: on exit the threads are just going to be forcibly killed, so we want to at least give the flush some time to complete, but not hang indefinitely if for some reason it can't (because, e.g., the other side of stderr stopped reading).
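For reference, registering such a handler boils down to something like this (a hedged sketch using the libc crate, not the PR's exact code; flush_pending_logs is a stand-in):

use std::sync::atomic::{AtomicBool, Ordering};

static ENABLE_ASYNC_LOGGING: AtomicBool = AtomicBool::new(true);

fn flush_pending_logs() {
    // In the real writer this tries to lock the shared buffer with a timeout
    // (`try_lock_for`) and writes whatever is left to stderr; elided here.
}

extern "C" fn on_exit() {
    // Switch every subsequent log call to the synchronous path...
    ENABLE_ASYNC_LOGGING.store(false, Ordering::SeqCst);
    // ...and flush whatever is still buffered.
    flush_pending_logs();
}

fn install_exit_hook() {
    // `atexit`-registered handlers run on normal process exit, including
    // `libc::exit`, but not on SIGSEGV/SIGABRT.
    unsafe {
        libc::atexit(on_exit);
    }
}

fn main() {
    install_exit_hook();
    // ... run the program; `on_exit` fires when it terminates normally.
}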

@bkchr (Member):

But StderrWriter is dropped at some point; couldn't that trigger the logic I described above?

@bkchr (Member):

Sorry, I mean the MakeStderrWriter.

@koute (Contributor, Author):

If drop were actually called for MakeStderrWriter then yeah, but it isn't, from what I can see. (Under the hood the whole structure which owns the MakeStderrWriter is assigned to a static mut, and like any other static in Rust it's never dropped.)

@bkchr (Member):

Yeah. Makes sense.

@bkchr (Member) left a comment

Looks really good. Good work 👍

The only problem I see is around aborts, but those happen very rarely anyway.

One last nitpick and then it is good to be merged.

client/tracing/src/logging/stderr_writer.rs — 2 resolved review threads (outdated)
@koute (Contributor, Author) commented Oct 21, 2021

Hmmm... it looks like the continuous-integration/gitlab-node-bench-regression-guard benchmark test for importing blocks is failing, which is... interesting, as this definitely shouldn't affect it. Apparently it's almost three times slower now, which makes no sense at all. (For one, the benchmark shouldn't be logging that much to begin with, and if it does then it'd be more of a logger + environment benchmark than a block import one; and even if it does log a lot, even in the worst case this should not be slower than what we had before.)

The only reason I can think of as to why it could have regressed is that I've also bumped tracing-subscriber to the newest version, and maybe that has some extra overhead when we're not logging?

I'll try to revert the tracing-subscriber bump (since we're going to bump it during #10072 anyway) and/or run the benches locally to see what's up.

@bkchr (Member) commented Oct 21, 2021

@koute no, don't do this. This test is failing everywhere; it depends on the machine it is executed on...

@koute (Contributor, Author) commented Oct 21, 2021

> @koute no, don't do this. This test is failing everywhere; it depends on the machine it is executed on...

Ah, okay, that's good to know! I was just about to run it locally. (:

@bkchr (Member) commented Oct 21, 2021

@koute merging master now should fix the benchmark test.

@koute force-pushed the master_faster_logging branch from 62981af to 949d1e1 on October 21, 2021 13:41
@koute (Contributor, Author) commented Oct 21, 2021

Okay, rebased on the newest master with no extra changes.

@bkchr (Member) commented Oct 21, 2021

@koute in the future please use merge commits. We squash-merge to master anyway, and merge commits make it easier for reviewers.

@koute (Contributor, Author) commented Oct 21, 2021

> @koute in the future please use merge commits. We squash-merge to master anyway, and merge commits make it easier for reviewers.

My apologies; force of habit. I'll be sure to just use merge commits when possible in the future.

@koute (Contributor, Author) commented Oct 21, 2021

bot merge

@paritytech-processbot (bot) merged commit 632b323 into paritytech:master on Oct 21, 2021
grishasobol pushed a commit to gear-tech/substrate that referenced this pull request Mar 28, 2022
* Update `tracing`-related dependencies

* Enable `parking_lot` feature in `tracing-subscriber`

* Add an asynchronous stderr logger

* Make clippy happy

* Add an integration test for the logger

* Refactor `test_logger_filters`'s subprocess machinery into a separate function

* Use a child process instead of hooking into stderr for the test

* Add a doc comment for `MakeStderrWriter`

* Move the initialization into the `MakeStderrWriter`'s constructor

* Add an extra test case to trigger the logger's emergency flush mechanism

* Use the buffer's mutex for asynchronous flushes

* Remove vestigial `nix` dependency from one of the previous commits
ark0f pushed a commit to gear-tech/substrate that referenced this pull request Feb 27, 2023