With burst-heavy writes, eta and bitrate show incorrect values when using enable_steady_tick #534
Nice work with the detailed repro instructions! Using your log file, I see the same behaviour as you, where it dips below 1000KB/s after about 15-20 seconds. For comparison, I modified your script to use my own moving average calculation: https://gist.github.com/DonSheddow/2cbcd2f334ee1244ae974cff5348a068. This doesn't display the same behaviour - the bitrate stays between 2 and 3 MB/s. What I don't understand is that when the tick interval is set to 250ms, a rolling buffer of the last 16 ticks should be the same as a rolling buffer of the last 4 seconds. So in principle my MovingAvgRate should be the same as the builtin bitrate calculation... The only difference I can see is that the builtin estimator calculates the slope of each step, then averages them to find the number of seconds per step, then takes the reciprocal of that to get the bitrate; maybe all of the floating point operations introduce a non-negligible precision loss?
Haha, thanks for pointing me right at the error. This is a basic math mistake. In other words, what we have in the buffer is the amount of time taken per step over each tick. To take some arbitrary sample data, this looks something like:
When you average these values, you treat the denominators as equivalent. This weights samples with fewer steps more heavily than samples with more steps. So for example, the data above becomes
Averaging these ticks gives you 0.3 secs / thousand steps, which seems reasonable enough. E.g. if there are 10000 steps to go through, that's 3 seconds. Now suppose you have a fourth tick, where you pass essentially no data, let's say 0.25 seconds / 1 step. Averaging these ticks gives you 63 seconds per thousand steps, or 630 seconds for the 10000 step job. Obviously this makes no sense. You and I are doing it correctly (assuming steady ticks), by starting with the reciprocal. This treats "short" ticks as equivalent to any other tick, so it will be somewhat inaccurate if the ticks aren't steady. So with this approach we have:
Averaging these, we get 2551 steps / sec, or 3.9 seconds for 10000 steps. This is obviously the better estimate. That said, you can do it properly by also tracking the time consumed for each tick in a separate buffer. I'll submit a pull request that does this.
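The effect is easy to check numerically. The sketch below uses made-up tick samples (three healthy ticks plus one near-empty "burst gap" tick; the numbers are illustrative, not the actual buffer contents discussed above) and compares the reciprocal-of-average estimate with the average-of-reciprocals and the fully weighted rate:

```rust
/// Returns (buggy_rate, averaged_rate, weighted_rate) in steps per second
/// for a slice of (seconds_elapsed, steps_advanced) tick samples.
fn estimate(ticks: &[(f64, f64)]) -> (f64, f64, f64) {
    let n = ticks.len() as f64;
    // Buggy: average the seconds-per-step of each tick, then take the
    // reciprocal. A near-empty tick dominates the average.
    let avg_secs_per_step: f64 = ticks.iter().map(|(s, p)| s / p).sum::<f64>() / n;
    let buggy = 1.0 / avg_secs_per_step;
    // Better (for steady ticks): average the per-tick rates instead.
    let averaged: f64 = ticks.iter().map(|(s, p)| p / s).sum::<f64>() / n;
    // Properly weighted: total steps over total time.
    let (secs, steps) = ticks.iter().fold((0.0, 0.0), |(a, b), (s, p)| (a + s, b + p));
    (buggy, averaged, steps / secs)
}

fn main() {
    // Three healthy ticks plus one tick that advances a single step.
    let ticks = [(0.25, 1000.0), (0.30, 1000.0), (0.35, 1000.0), (0.25, 1.0)];
    let (buggy, averaged, weighted) = estimate(&ticks);
    println!("buggy: {:.0}, averaged: {:.0}, weighted: {:.0}", buggy, averaged, weighted);
}
```

With this data the buggy estimate collapses to a few steps per second, while both alternatives stay in the expected range of a few thousand.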
Aha, well spotted! So even though the rate is the reciprocal of seconds per step, the average rate is not the reciprocal of the average of seconds per step. Looking further into it, what the builtin estimator is actually calculating is the harmonic mean of the rate. This works correctly only when each sample counts the same number of steps (whereas the arithmetic mean is correct when each sample spans the same amount of time). I think the best way to calculate the slope is just storing the position in the buffer. Just subtract the start position from the end position and divide by the elapsed time; then you get the correct rate, without having to worry about uneven ticks. Plus you save quite a few arithmetic operations. Example:
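The original example isn't reproduced here, but a minimal sketch of the position-buffer idea (hypothetical names, not indicatif's actual types) could look like this: keep recent (timestamp, position) samples and derive the rate from the window's endpoints alone.

```rust
use std::collections::VecDeque;

/// Sketch of a position-based rate estimator: store recent
/// (seconds, absolute position) samples in a bounded ring buffer.
struct PositionBuffer {
    samples: VecDeque<(f64, u64)>,
    capacity: usize,
}

impl PositionBuffer {
    fn new(capacity: usize) -> Self {
        Self { samples: VecDeque::with_capacity(capacity), capacity }
    }

    /// Record a sample; positions are assumed monotonically non-decreasing.
    fn record(&mut self, secs: f64, pos: u64) {
        if self.samples.len() == self.capacity {
            self.samples.pop_front();
        }
        self.samples.push_back((secs, pos));
    }

    /// Steps per second over the whole window: Δposition / Δtime.
    /// Returns None with fewer than two distinct-time samples.
    fn rate(&self) -> Option<f64> {
        let (t0, p0) = *self.samples.front()?;
        let (t1, p1) = *self.samples.back()?;
        if t1 <= t0 {
            return None;
        }
        Some((p1 - p0) as f64 / (t1 - t0))
    }
}

fn main() {
    let mut buf = PositionBuffer::new(16);
    // Bursty writes: big jumps, then a near-idle tick - the endpoint
    // slope is unaffected by how unevenly the middle samples arrive.
    for (t, p) in [(0.0, 0u64), (0.25, 1000), (0.50, 2000), (0.75, 2001), (1.0, 3000)] {
        buf.record(t, p);
    }
    println!("{:?}", buf.rate());
}
```

Because only the endpoints enter the calculation, a near-empty tick in the middle of the window cannot drag the estimate down.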
@DonSheddow good idea. I initially discounted that approach, because it looked hard to reconcile with the existing state code. But upon a second look, it's really not that bad, mostly just requires changing a test. I've updated the pull request to do this instead. Thanks!
@afontenot @DonSheddow thank you for working on this issue! Would be happy if you were able to look at #394 which collects a bunch of other feedback from folks on issues/limitations in the current estimator. I have been thinking that it would be good to adopt some kind of exponential moving average but haven't been able to spend enough time to work out the details. Would love a well-reasoned proposal to improve on the current state!
@djc I'll have a look at that. I think handling this properly would rely on having an opinion on e.g. the correct number of seconds, not ticks, to average over - regardless of whether you used a flat or exponential average. Some sort of default is needed here. Just for clarity's sake: my PR is not intended to change the current behavior away from "the average rate over the last 15 ticks". Instead it just corrects a math error in how that was done (as discussed in the comments above). So you could (when you have the time to review it) merge the PR fixing the issue, and revisit the approach at any point in the future. Maybe we can improve things right now, but I just wanted to clarify the difference between what I reported here and the issues discussed in #394.
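For what it's worth, one shape a time-aware exponential moving average could take is sketched below. This is purely an illustration of the idea under discussion (hypothetical names, not what indicatif adopted): the decay is a function of elapsed time rather than tick count, so uneven ticks are weighted by their duration instead of each counting the same.

```rust
/// Sketch of a time-weighted EMA of the transfer rate. `half_life_secs`
/// controls how quickly old samples fade from the estimate.
struct EmaRate {
    half_life_secs: f64,
    est: Option<f64>,
}

impl EmaRate {
    fn new(half_life_secs: f64) -> Self {
        Self { half_life_secs, est: None }
    }

    /// Record that `steps` progress happened over the last `secs` seconds.
    fn update(&mut self, steps: f64, secs: f64) {
        let rate = steps / secs;
        // Weight of the new sample: a tick spanning a longer interval
        // counts for more, which is what makes uneven ticks safe.
        let alpha = 1.0 - 0.5f64.powf(secs / self.half_life_secs);
        self.est = Some(match self.est {
            Some(prev) => prev + alpha * (rate - prev),
            None => rate,
        });
    }

    fn rate(&self) -> Option<f64> {
        self.est
    }
}

fn main() {
    let mut ema = EmaRate::new(4.0);
    for _ in 0..8 {
        ema.update(1000.0, 0.25); // steady 4000 steps/sec
    }
    println!("{:?}", ema.rate());
}
```

Under a perfectly steady input the estimate simply converges to the true rate; the interesting behaviour is that a long quiet interval fed in as one sample decays the old estimate proportionally to its length.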
This issue is fixed indirectly, because the approach to calculating ETA was changed in #539. |
TL;DR: if you use `enable_steady_tick`, and `set_position` on every write, a burst-oriented workload will show incorrect values for ETA and bitrate - not just heavily fluctuating, but impossibly low values given the true speed of the connection.

I have the following two assumptions about how the progress bar behaves. These might need to be corrected.

- When `enable_steady_tick` is enabled, indicatif only ticks on the clock you set. Based on the docs.

I was debugging a progress bar issue with magic-wormhole.rs that was hard to reproduce. The issue seems to stem from how burst-heavy my system was. I regularly see the program fill the outgoing TCP buffer, and then wait for half a second or more before continuing. The way the program works is that it calls `set_position` every time it writes into the buffer. It sets the steady tick in indicatif to 250 ms. Based on this, I expect the progress bar to fluctuate a decent amount, but on average it should be close to the true bitrate of the file transfer.

In fact, I see extremely strong fluctuations, and the indicatif progress bar is almost always significantly below the real speed of the transfer (based on the file size and transfer time).
Based on the assumptions I gave above, I recreated a moving average (using matplotlib in Python) across writes from the last `15 * 250 ms`.

As you can see, there's some undesirable variability here (it would be nice if we could set indicatif to average over 60 samples or something like that), but there are no extreme dips in the moving average. On the contrary, indicatif frequently shows me bitrates under 1 Mbps for this connection, and appears to tick irregularly as well! Something seems to have gone badly wrong with the estimation.
Reproducing issues like this in a complex program is difficult, because it relies on buffer size, network conditions, etc. For this reason, I've written a simple replay tool for indicatif. This is a single-file script (using rust-script) that allows you to replay logs from indicatif using a trivial log format. This should enable anyone debugging this issue to recreate exactly what I am seeing.
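The log format itself isn't reproduced in this thread, so the sketch below assumes a hypothetical `elapsed_ms position` line format purely for illustration (the real script's format may differ). A replay tool along these lines would parse each record and, at the right moment, feed the position to an indicatif `ProgressBar` configured with a steady tick:

```rust
/// Parse a hypothetical replay log where each non-empty line is
/// "elapsed_ms position". This format is an assumption for illustration.
fn parse_log(log: &str) -> Vec<(u64, u64)> {
    log.lines()
        .filter(|l| !l.trim().is_empty())
        .map(|l| {
            let mut parts = l.split_whitespace();
            let ms = parts.next().unwrap().parse::<u64>().unwrap();
            let pos = parts.next().unwrap().parse::<u64>().unwrap();
            (ms, pos)
        })
        .collect()
}

fn main() {
    let log = "0 0\n250 1000\n900 1001\n1150 2000\n";
    for (ms, pos) in parse_log(log) {
        // In a real replay tool, this is where you would sleep until `ms`
        // milliseconds have elapsed and then call set_position(pos) on a
        // ProgressBar with enable_steady_tick configured, reproducing the
        // exact burst pattern from the recorded session.
        println!("t={}ms pos={}", ms, pos);
    }
}
```

The point of replaying from a log is that the burst pattern becomes deterministic, so the estimator's behaviour can be compared across fixes without depending on live network conditions.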
Here is my log file: indicatif.log