Fix failing test_bandwidth
#3999
Conversation
Looks like there is a deeper issue...
https://travis-ci.org/github/dask/distributed/jobs/713046159#L1582-L1585
I'm guessing we expect to exercise this code path, but that doesn't seem to be happening for some reason.
I should add that I'm not able to reproduce this issue outside of CI.
Thanks for looking at this, John. I am also investigating -- I don't see why this would start failing.
After a bit more investigation I think this issue might come down to sizeof calculations. In dask/dask#6457 we changed the sizeof measurement for bytes objects:

In [1]: import sys

In [2]: arr = b"0" * 1000000

In [3]: len(arr)
Out[3]: 1000000

In [4]: sys.getsizeof(arr)
Out[4]: 1000033

And, if you check out the commit just prior to PR dask/dask#6457, everything passes. I suspect the change here is ok and will merge late afternoon unless others want to comment.
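For anyone checking this locally, a minimal sketch of how to compare the measurements (assuming dask is importable; which value dask.sizeof.sizeof returns for bytes depends on whether the dask/dask#6457 change is in the installed version):

import sys
from dask.sizeof import sizeof  # dask's size-measurement dispatch

arr = b"0" * 1000000
print(len(arr))             # 1000000
print(sys.getsizeof(arr))   # 1000033 on CPython, as in the session above
print(sizeof(arr))          # whichever of the two the installed dask uses for bytes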
@@ -1635,11 +1635,11 @@ async def test_idle_timeout(c, s, a, b):
 @gen_cluster(client=True, config={"distributed.scheduler.bandwidth": "100 GB"})
 async def test_bandwidth(c, s, a, b):
     start = s.bandwidth
-    x = c.submit(operator.mul, b"0", 1000000, workers=a.address)
+    x = c.submit(operator.mul, b"0", 1000001, workers=a.address)
Why this change?
When using len, metrics["bandwidth"]["total"] is reported to be 100000000000, always. When using getsizeof, the metrics["bandwidth"]["total"] at the end is 95012655208.
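For what it's worth, that 95012655208 figure is consistent with a single pass through the 0.95/0.05 moving average quoted from worker.py further down. A rough back-of-the-envelope check (the measured bandwidth here is back-solved for illustration; it is not taken from the CI logs):

default = 100_000_000_000   # the "100 GB" scheduler bandwidth configured in the test
measured = 253_104_160      # hypothetical measured transfer rate, chosen to match the number above

updated = default * 0.95 + measured * 0.05
print(round(updated))       # 95012655208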
Understood. I think I'm still confused on why we changed 1000000 to 1000001 here though.
After a bit more digging: we need to increase the size here because we only update the bandwidth if the total data is larger than 1_000_000:
distributed/distributed/worker.py
Lines 2020 to 2029 in 7cdba1d
if total_bytes > 1000000:
    self.bandwidth = self.bandwidth * 0.95 + bandwidth * 0.05
    bw, cnt = self.bandwidth_workers[worker]
    self.bandwidth_workers[worker] = (bw + bandwidth, cnt + 1)
    types = set(map(type, response["data"].values()))
    if len(types) == 1:
        [typ] = types
        bw, cnt = self.bandwidth_types[typ]
        self.bandwidth_types[typ] = (bw + bandwidth, cnt + 1)
So to wrap this all up: when using getsizeof we are just over the line and we update bandwidth. My guess is that we picked that value because we didn't want to update when passing small data such as heartbeats.
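Putting concrete numbers to that threshold check, a quick sketch using the sizeof values quoted earlier in the thread:

import sys

payload = b"0" * 1000000
print(len(payload))            # 1000000 -> not > 1000000, so the bandwidth estimate never updates
print(sys.getsizeof(payload))  # 1000033 -> crosses the threshold, so the estimate does update

# The tweaked test payload crosses the threshold even with a len-based measurement:
print(len(b"0" * 1000001))     # 1000001 -> > 1000000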
Ah ok, that makes sense. Thanks for the explanation @quasiben !
To answer your questions, I don't know. Just trying things to see how we fix this. Happy to try something different. Suggestions welcome 🙂
@jakirkham I've updated the test in the last commit. I think after it passes CI we are good to merge. Thank you for your work here!
Woot! Tests pass. Thanks again @jakirkham for the work and @mrocklin for the reviews.
Thanks all! 😄
It seems this test has been failing in a few places. Here's one build that failed. This tweaks the test to get it to pass again. Not sure if this is the right solution, so I'm open to other ideas.