Speed up logging once again #9981
Conversation
Why not use this: https://docs.rs/tracing-appender/0.1.2/tracing_appender/non_blocking/index.html?
That's a great question!
Force-pushed from 3e3bf2d to a07611c.
(Force-pushed to resolve conflicts.)
A test would be nice to prove that, while spinning up n threads which all log, every generated log message actually gets logged and none fall through any cracks.
I've added an integration test which does just that.
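Roughly, the idea of such a test looks like this (a simplified sketch only, not the actual test from the PR; the thread/message counts, the `RUN_AS_LOG_SPAMMER` environment variable, and the message format are made up for illustration):

```rust
use std::io::Read;
use std::process::{Command, Stdio};

// Re-run the current test binary as a child process (which, when it sees the
// env var, is assumed to spin up N threads that each emit M log lines through
// the logger) and capture the child's stderr.
fn logs_from_child_process() -> String {
    let mut child = Command::new(std::env::current_exe().unwrap())
        .env("RUN_AS_LOG_SPAMMER", "1")
        .stderr(Stdio::piped())
        .spawn()
        .unwrap();

    let mut output = String::new();
    child.stderr.take().unwrap().read_to_string(&mut output).unwrap();
    child.wait().unwrap();
    output
}

#[test]
fn no_log_message_is_lost() {
    const THREADS: usize = 16;
    const MESSAGES_PER_THREAD: usize = 1000;

    let output = logs_from_child_process();
    // Every message emitted by every thread must appear exactly once on stderr.
    for thread in 0..THREADS {
        for message in 0..MESSAGES_PER_THREAD {
            let needle = format!("thread={} message={}", thread, message);
            assert_eq!(output.matches(&needle).count(), 1, "missing or duplicated: {}", needle);
        }
    }
}
```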
Nice test. Good that nothing is going astray.
```rust
if !ENABLE_ASYNC_LOGGING.load(Ordering::Relaxed) {
    sync_flush_on_drop = true;
}
```
Why?
If the logger was already deinitialized but something still wants to log for some reason, then instead of appending that to a buffer which will never be flushed (because the background thread which does the flushing is most likely already dead at this point) it will just log it synchronously.
Basically, just-in-case handling of a potential corner case.
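In other words, roughly this (a hypothetical sketch of that fallback; the field names mirror the snippet above, but the real `Drop` implementation may differ):

```rust
impl Drop for StderrWriter {
    fn drop(&mut self) {
        if self.sync_flush_on_drop {
            // The background flushing thread is (most likely) already gone, so
            // write the buffered data out synchronously right here instead of
            // leaving it in a buffer which nobody will ever flush.
            if let Some(buffer) = self.buffer.take() {
                flush_logs(buffer);
            }
        }
    }
}
```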
```rust
pub struct StderrWriter {
    buffer: Option<parking_lot::lock_api::MutexGuard<'static, parking_lot::RawMutex, Vec<u8>>>,
    sync_flush_on_drop: bool,
    original_len: usize,
```
When is that not `0`?
The buffer is not always immediately flushed; as long as the amount of data in the buffer is less than `ASYNC_FLUSH_THRESHOLD` it will only be flushed every `AUTOFLUSH_EVERY` (50ms in this case). E.g. if you call `log!("Hello World!")` two times in rapid succession, then on the second call `original_len` will be non-zero, since the buffer will still contain the data from the first call which wasn't yet flushed (because neither did the amount of data exceed `ASYNC_FLUSH_THRESHOLD`, nor did the time which passed since the last flush exceed `AUTOFLUSH_EVERY`, so the thread which does the flushing is still sleeping).
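For illustration, that flushing policy boils down to something like this (a self-contained sketch, not the actual code; the concrete `ASYNC_FLUSH_THRESHOLD` value and the condvar-based early wake-up are assumptions, and it assumes `parking_lot` for the statics):

```rust
use std::io::Write;
use std::sync::atomic::{AtomicBool, Ordering};
use std::time::Duration;

use parking_lot::{Condvar, Mutex};

// Names mirror the ones discussed above; the threshold value is assumed.
const AUTOFLUSH_EVERY: Duration = Duration::from_millis(50);
const ASYNC_FLUSH_THRESHOLD: usize = 16 * 1024;

static ENABLE_ASYNC_LOGGING: AtomicBool = AtomicBool::new(true);
static BUFFER: Mutex<Vec<u8>> = Mutex::new(Vec::new());
static BUFFER_CONDVAR: Condvar = Condvar::new();

// Writer side: append into the shared buffer and only wake the flushing
// thread up early once the buffered data crosses the threshold.
fn append_to_buffer(bytes: &[u8]) {
    let mut buffer = BUFFER.lock();
    buffer.extend_from_slice(bytes);
    if buffer.len() >= ASYNC_FLUSH_THRESHOLD {
        BUFFER_CONDVAR.notify_one();
    }
}

// Flushing thread: otherwise the buffer is only written out every AUTOFLUSH_EVERY.
fn flushing_thread_main() {
    let mut buffer = BUFFER.lock();
    while ENABLE_ASYNC_LOGGING.load(Ordering::Relaxed) {
        let _ = BUFFER_CONDVAR.wait_for(&mut buffer, AUTOFLUSH_EVERY);
        if !buffer.is_empty() {
            let _ = std::io::stderr().write_all(&buffer);
            buffer.clear();
        }
    }
}
```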
extern "C" fn on_exit() { | ||
ENABLE_ASYNC_LOGGING.store(false, Ordering::SeqCst); | ||
|
||
if let Some(buffer) = BUFFER.try_lock_for(ON_EXIT_FLUSH_TIMEOUT) { | ||
flush_logs(buffer); | ||
} | ||
} |
Why do we need this? Couldn't the `Drop` implementation just set `ENABLE_ASYNC_LOGGING = false` and the thread finishes when it sees this?
We need some way to detect that we're exiting, and yes, we could technically do that through `Drop`, but that would require us to actually make sure it gets dropped. We can't make that a `static`, as Rust doesn't actually call `drop` on anything static. We could do it through the `.fini` or `.dtors` linker sections, but that's unportable (and possibly nightly-only; I haven't used those in a long time so I don't remember whether they can be used on stable Rust already). We could use a thread local and check whether we're on the thread which initialized the logger and then run `drop` (since TLS destructors do get run), but that's kind of a hack. We could have the logger initializer return a handle which would control the lifetime, but that's inconvenient and potentially error-prone in this case.

And depending on which drop variant we'd go for, the logger might not be flushed if `libc::exit` is called, which, yes, probably doesn't normally happen, but I think it's a good idea to cover all of your bases.

So unless I'm missing something, what's left is registering a handler through the standard `atexit`.
(Of course this still doesn't cover a few cases, e.g. SIGSEGV from a segmentation fault and SIGABRT from `abort` being called, but I'm not sure we want to go that far, as that's a little more tricky since it involves signal handling.)

As far as why we're doing the `try_lock_for` here: on exit the threads are just going to be forcibly killed, so we want to at least give the flushing thread some time to flush itself, but not hang indefinitely if for some reason it can't (because e.g. the other side of `stderr` stopped reading).
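For reference, registering such a handler is essentially a one-liner (a sketch which assumes the `libc` crate's `atexit` binding is used and ignores its return value; `install_exit_hook` is a hypothetical name):

```rust
extern "C" fn on_exit() {
    // ...flush the buffer, as shown in the snippet above...
}

fn install_exit_hook() {
    // `atexit` handlers run on normal process exit (including `libc::exit`),
    // but not on `abort` or on a fatal signal.
    unsafe {
        libc::atexit(on_exit);
    }
}
```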
But `StderrWriter` is dropped at some point, which could trigger the logic I described above?
Sorry, I meant the `MakeStderrWriter`.
If `drop` would actually be called for `MakeStderrWriter` then yeah, but it isn't from what I can see. (Under the hood the whole structure which owns the `MakeStderrWriter` is assigned to a `static mut`, and like any other `static` in Rust it's never dropped.)
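(A quick standalone example demonstrating that, unrelated to the logger itself:)

```rust
struct Noisy;

impl Drop for Noisy {
    fn drop(&mut self) {
        eprintln!("dropped!");
    }
}

static NOISY: Noisy = Noisy;

fn main() {
    // Prints nothing: destructors of `static` items are never run,
    // not even when `main` returns and the process exits cleanly.
}
```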
Yeah. Makes sense.
Looks really good. Good work 👍
The only problem I see is around aborts, but they are very rare anyway.
One last nitpick and then it is good to be merged.
extern "C" fn on_exit() { | ||
ENABLE_ASYNC_LOGGING.store(false, Ordering::SeqCst); | ||
|
||
if let Some(buffer) = BUFFER.try_lock_for(ON_EXIT_FLUSH_TIMEOUT) { | ||
flush_logs(buffer); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. Makes sense.
Hmmm... it looks like the benchmark test has regressed. The only reason I can think of as to why it could have regressed is that I've also bumped the `tracing`-related dependencies. I'll try to revert the version bumps and see whether that helps.
@koute no, don't do this. This test is failing everywhere; it depends on the machine it is executed on...
Ah, okay, that's good to know! I was just about to run it locally. (:
@koute merging master now should fix the benchmark test.
Force-pushed from 62981af to 949d1e1.
Okay, rebased on the newest `master`.
@koute in the future please use merge commits. We squash-merge to master anyway, and that makes it easier for reviewers.
My apologies; force of habit. I'll be sure to just use merge commits when possible in the future.
bot merge
* Update `tracing`-related dependencies
* Enable `parking_lot` feature in `tracing-subscriber`
* Add an asynchronous stderr logger
* Make clippy happy
* Add an integration test for the logger
* Refactor `test_logger_filters`'s subprocess machinery into a separate function
* Use a child process instead of hooking into stderr for the test
* Add a doc comment for `MakeStderrWriter`
* Move the initialization into the `MakeStderrWriter`'s constructor
* Add an extra test case to trigger the logger's emergency flush mechanism
* Use the buffer's mutex for asynchronous flushes
* Remove vestigial `nix` dependency from one of the previous commits
This PR tries to speed up our logging machinery by introducing a buffered semi-asynchronous stderr writer.
(In our profiling on one of the validators we've seen ~5% of total time being spent writing to stderr.)
For more details please read the doc comment included in `stderr_writer.rs`.

As far as benchmarks go, this is pretty hard to benchmark; depending on how the benchmark is set up it can show either zero improvement or a huge improvement. Nevertheless, in a microbenchmark which tries to replicate conditions similar to what we're running on our validator nodes (the executable is launched as a systemd service and then just keeps on puking logs into the systemd journal), this PR cuts the total execution time of the microbenchmark down from ~8s to ~2s.
(While I'm at it, I've also enabled the `parking_lot` feature for the `tracing-subscriber` crate, which measurably lowers the overhead in the case where we're not printing anything; there's still more work that can be done for that case, but that will come in another PR.)
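For context, this is roughly how a custom writer like this plugs into `tracing-subscriber` (a minimal sketch assuming `tracing-subscriber` 0.3's `MakeWriter` trait; the stand-in `StderrWriter` here just forwards to stderr, whereas the real one appends into the shared buffer described above):

```rust
use std::io::Write;

use tracing_subscriber::fmt::MakeWriter;

// Stand-in for the real `StderrWriter`; the actual implementation appends into
// a shared, mutex-protected buffer which a background thread flushes.
struct StderrWriter;

impl Write for StderrWriter {
    fn write(&mut self, buf: &[u8]) -> std::io::Result<usize> {
        std::io::stderr().write(buf)
    }

    fn flush(&mut self) -> std::io::Result<()> {
        std::io::stderr().flush()
    }
}

struct MakeStderrWriter;

impl<'a> MakeWriter<'a> for MakeStderrWriter {
    type Writer = StderrWriter;

    fn make_writer(&'a self) -> Self::Writer {
        StderrWriter
    }
}

fn main() {
    tracing_subscriber::fmt()
        .with_writer(MakeStderrWriter)
        .init();

    tracing::info!("this goes through the custom writer");
}
```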