Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix failing test_bandwidth #3999

Merged
merged 3 commits into from
Jul 31, 2020
Merged

Conversation

jakirkham
Copy link
Member

It seems this test has been failing in a few places. Here's one build that failed. This tweaks the test to get it to pass again. Not sure if this is the right solution. So am open to other ideas.

@jakirkham
Copy link
Member Author

Looks like there is a deeper issue...

>       assert typename(bytes) in s.bandwidth_types
1695E       assert 'builtins.bytes' in defaultdict(<class 'float'>, {})
1696E        +  where 'builtins.bytes' = typename(bytes)
1697E        +  and   defaultdict(<class 'float'>, {}) = <Scheduler: "tcp://127.0.0.1:38669" processes: 2 cores: 3>.bandwidth_types

https://travis-ci.org/github/dask/distributed/jobs/713046159#L1582-L1585

@jakirkham
Copy link
Member Author

I'm guessing we expect to exercise this code path, but that doesn't seem to be happening for some reason.

@jakirkham
Copy link
Member Author

Should add I'm not able to reproduce this issue outside of CI.

@jakirkham jakirkham requested a review from mrocklin July 29, 2020 22:08
@quasiben
Copy link
Member

Thanks for looking at this John. I am also investigating -- I don't see why this would start failing

@quasiben
Copy link
Member

After a bit more investigation I think this issue might come down to sizeof calculations. In dask/dask#6457 we changes the sizeof measurement for bytes objects from sys.getsizeof to len

In [1]: import sys

In [2]: arr = b"0" * 1000000

In [3]: len(arr)
Out[3]: 1000000

In [4]: sys.getsizeof(arr)
Out[4]: 1000033

And, if you checkout the commit just prior to PR dask/dask#6457 everything passes. I suspect the change here is ok. And will merge late afternoon unless others want to comment

@@ -1635,11 +1635,11 @@ async def test_idle_timeout(c, s, a, b):
@gen_cluster(client=True, config={"distributed.scheduler.bandwidth": "100 GB"})
async def test_bandwidth(c, s, a, b):
start = s.bandwidth
x = c.submit(operator.mul, b"0", 1000000, workers=a.address)
x = c.submit(operator.mul, b"0", 1000001, workers=a.address)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this change?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When using len the metrics["bandwidth"]["total"] is reported to be 100000000000, always. When using getsize of the metrics["bandwidth"]["total"] at the end is 95012655208.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood. I think I'm still confused on why we changed 1000000 to 1000001 here though.

Copy link
Member

@quasiben quasiben Jul 31, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After a bit more digging we need to increase here because we only update the bandwidth if the total data is larger than 1_000_000:

if total_bytes > 1000000:
self.bandwidth = self.bandwidth * 0.95 + bandwidth * 0.05
bw, cnt = self.bandwidth_workers[worker]
self.bandwidth_workers[worker] = (bw + bandwidth, cnt + 1)
types = set(map(type, response["data"].values()))
if len(types) == 1:
[typ] = types
bw, cnt = self.bandwidth_types[typ]
self.bandwidth_types[typ] = (bw + bandwidth, cnt + 1)

So to wrap this all up, when using getsize we are just over the line and we update bandwidth. My guess is that we picked that value because we didn't want to update when passing small data such as heartbeats

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok, that makes sense. Thanks for the explanation @quasiben !

@jakirkham
Copy link
Member Author

To answer your questions, I don't know. Just trying things to see how we fix this. Happy to try something different. Suggestions welcome 🙂

@quasiben
Copy link
Member

@jakirkham I've update the test in the last commit I think after it passes CI we are good to merge. Thank you for your work here!

@quasiben
Copy link
Member

Woot! Tests pass. Thanks again @jakirkham for the work and @mrocklin for the reviews

@quasiben quasiben merged commit e5c544b into dask:master Jul 31, 2020
@jakirkham jakirkham deleted the fix_test_bandwidth branch July 31, 2020 16:39
@jakirkham
Copy link
Member Author

Thanks all! 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants