Deduplicate scheduler requests in P2P #8899

hendrikmakait · 2024-10-17T10:38:30Z

In P2P, a worker may send many concurrent requests to the scheduler if it is not aware of a specific shuffle run. This could lead to DDoS-like scenarios on large-scale clusters with P2P rechunking. This PR prohibits concurrent fetch requests.
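
For readers without context, the idea of deduplicating concurrent fetches can be sketched roughly as below. This is not the code from this PR: `RunCache`, `fake_scheduler_fetch`, and `demo` are hypothetical names, and a per-shuffle `asyncio.Lock` is only one plausible way to get this behavior. Many tasks ask for the same shuffle run at once, and only the first actually contacts the (simulated) scheduler.

```python
import asyncio
from collections import defaultdict


class RunCache:
    """Hypothetical worker-side cache of shuffle runs, keyed by shuffle ID."""

    def __init__(self, fetch):
        # `fetch` is a coroutine function standing in for the scheduler RPC.
        self._fetch = fetch
        self._runs = {}
        # One lock per shuffle ID, created lazily on first access.
        self._locks = defaultdict(asyncio.Lock)

    async def get(self, shuffle_id):
        # Fast path: the run is already known locally.
        if shuffle_id in self._runs:
            return self._runs[shuffle_id]
        async with self._locks[shuffle_id]:
            # Re-check after acquiring the lock: another task may have
            # completed the fetch while this one was waiting.
            if shuffle_id not in self._runs:
                self._runs[shuffle_id] = await self._fetch(shuffle_id)
            return self._runs[shuffle_id]


async def demo(n_tasks=100):
    calls = 0

    async def fake_scheduler_fetch(shuffle_id):
        # Simulated scheduler round trip; counts how often it is hit.
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)
        return f"run-for-{shuffle_id}"

    cache = RunCache(fake_scheduler_fetch)
    results = await asyncio.gather(
        *(cache.get("shuffle-1") for _ in range(n_tasks))
    )
    return calls, results
```

Running `asyncio.run(demo())` in this sketch should hit the simulated scheduler once, however many tasks request the same shuffle ID, while requests for different shuffle IDs would still proceed independently.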

Tests added / passed
Passes pre-commit run --all-files

jacobtomlinson · 2024-10-17T11:26:53Z

Last week @phofl mentioned that it would be great to get more reviews on PRs as that is currently a point of friction for folks actively working on Dask/Distributed.

However I find it hard to dive into reviewing a PR like this as it doesn't close an open issue and it doesn't have a description of what the change does or any examples of what this fixes. I'd love to help out more with reviews here, but I don't know where to begin with this one.

cc @fjetter

hendrikmakait · 2024-10-17T11:32:13Z

@jacobtomlinson, it looks like you wrote this just as I added a description. Is this helpful? (Sorry for having it "ready for review" without a description, lost my internet connection for a couple of minutes so the update to the description didn't go out in sync with the change in readiness.)

github-actions · 2024-10-17T11:33:03Z

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

25 files ±0, 25 suites ±0, 10h 16m 54s ⏱️ +3m 10s
4 131 tests +1: 4 019 ✅ ±0, 110 💤 ±0, 2 ❌ +1
47 718 runs +10: 45 620 ✅ +8, 2 096 💤 +1, 2 ❌ +1

For more details on these failures, see this check.

Results for commit 2257deb. ± Comparison against base commit fa9806b.

jacobtomlinson · 2024-10-17T12:52:02Z

Fair enough. I'd been digging through the code for 10 mins at the point you updated the description. This also isn't a one-off; see #8809, #8852, #8834, and #8898 for some more recent examples.

I think my point is that I'd like to help out more with reviews, but it can be challenging.

jacobtomlinson

Overall this lock makes sense to me to slow down concurrent connections.

fjetter · 2024-10-17T14:17:40Z

@jacobtomlinson The examples you're listing are surprising to me:

Fix if-else for send_recv_from_rpc #8809: An updated warning message. I admit that the title is confusing, but I don't know if this needs a more thorough description.
Update precommit #8852: A precommit update. There's nothing more to say, is there?
Reduce frequency of unmanaged memory use warning #8834: Has a decent title and a short description. The change is a hard-coded config value.
Add configurations for rootish taskgroup threshold #8898: I guess this is legit. It does not include any context for why this is useful, but I also don't expect anybody else to dive right into task queuing.

I understand some criticism about the first (bad title) and last (lacking context) but the others are fine, aren't they?

jacobtomlinson · 2024-10-17T16:19:55Z

That's fair, I just grepped through the last couple of pages of distributed looking for PRs with little/no context. I appreciate some of them may be trivial PRs, but I expect if I went digging I would find more examples like dask/dask-expr#1146 or dask/dask-expr#1138.

I'm very happy to help out with PR review if that's something that would be valuable, but folks outside of the Coiled team have a lot less context about why things are being done, so it would be really helpful to have more descriptive PRs.

fjetter · 2024-10-18T08:31:02Z

My sense is that generally we do provide decent descriptions. I don't really want to dig through the past and discuss bad issues or PRs now. I don't think this is the reason why few people are helping out. I'm happy to revisit this if this is a persistent issue

jacobtomlinson · 2024-10-18T09:11:17Z

> I don't really want to dig through the past and discuss bad issues or PRs now

That's totally fair.

> I don't think this is the reason why few people are helping out

I'm trying to feed back to you all that it's one of the reasons why I help out less than I used to. I hope you can be receptive to that. I'm also trying to communicate that I want to help more.

Sorry that this turned into such a big deal. My intent was to gently nudge so that I can get more involved with reviews. But it's blown up a bit so apologies for any bad feeling generated from this discussion.

hendrikmakait · 2024-10-18T09:35:49Z

> I'm trying to feed back to you all that it's one of the reasons why I help out less than I used to. I hope you can be receptive to that. I'm also trying to communicate that I want to help more.

It's noted and appreciated. To me, this seems like your usual chicken-and-egg problem. With little external engagement around reviews or code contributions, there's little time for writing good descriptions (and less value in doing so). With bad descriptions, there's little external engagement.

Let's try to get to a minimal standard where we write something that's helpful but also doesn't take up much time. Personally, I'd say: Feel free to ping people on PRs you'd like to review but lack sufficient context. We'll just have to find a good middle ground here so that we don't blow the effort for documentation out of proportion.

jacobtomlinson · 2024-10-18T11:10:07Z

> Personally, I'd say: Feel free to ping people on PRs you'd like to review but lack sufficient context.

This is a good suggestion!

Deduplicate requests to scheduler in P2P

2257deb

hendrikmakait marked this pull request as ready for review October 17, 2024 11:17

hendrikmakait requested a review from fjetter as a code owner October 17, 2024 11:17

jacobtomlinson approved these changes Oct 17, 2024

jacobtomlinson merged commit 48509b3 into dask:main Oct 17, 2024
28 of 31 checks passed

hendrikmakait deleted the avoid-ddosing-scheduler-in-p2p branch October 18, 2024 08:35
