Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deduplicate scheduler requests in P2P #8899

Merged

Conversation

hendrikmakait
Copy link
Member

@hendrikmakait hendrikmakait commented Oct 17, 2024

In P2P, a worker may send many concurrent requests to the scheduler if it is not aware of a specific shuffle run. This could lead to DDoS-like scenarios on large-scale clusters with P2P rechunking. This PR prohibits concurrent fetch requests.

  • Tests added / passed
  • Passes pre-commit run --all-files

@hendrikmakait hendrikmakait marked this pull request as ready for review October 17, 2024 11:17
@jacobtomlinson
Copy link
Member

jacobtomlinson commented Oct 17, 2024

Last week @phofl mentioned that it would be great to get more reviews on PRs as that is currently a point of friction for folks actively working on Dask/Distributed.

However I find it hard to dive into reviewing a PR like this as it doesn't close an open issue and it doesn't have a description of what the change does or any examples of what this fixes. I'd love to help out more with reviews here, but I don't know where to begin with this one.

cc @fjetter

@hendrikmakait
Copy link
Member Author

@jacobtomlinson, it looks like you wrote this just as I added a description. Is this helpful? (Sorry for having it "ready for review" without a description, lost my internet connection for a couple of minutes so the update to the description didn't go out in sync with the change in readiness.)

Copy link
Contributor

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

    25 files  ± 0      25 suites  ±0   10h 16m 54s ⏱️ + 3m 10s
 4 131 tests + 1   4 019 ✅ ±0    110 💤 ±0  2 ❌ +1 
47 718 runs  +10  45 620 ✅ +8  2 096 💤 +1  2 ❌ +1 

For more details on these failures, see this check.

Results for commit 2257deb. ± Comparison against base commit fa9806b.

@jacobtomlinson
Copy link
Member

Fair enough. I'd been digging through the code for 10 mins at the point you updated the description. This also isn't a one off, see #8809, #8852, #8834 and #8898 for some more recent examples.

I think my point is that I'd like to help out more with reviews, but it can be challenging.

Copy link
Member

@jacobtomlinson jacobtomlinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this lock makes sense to me to slow down concurrent connections.

@jacobtomlinson jacobtomlinson merged commit 48509b3 into dask:main Oct 17, 2024
28 of 31 checks passed
@fjetter
Copy link
Member

fjetter commented Oct 17, 2024

@jacobtomlinson The examples you're listing are surprising for me

I understand some criticism about the first (bad title) and last (lacking context) but the others are fine, aren't they?

@jacobtomlinson
Copy link
Member

That's fair, I just grepped through the last couple of pages of distributed looking for PRs with little/no context. I appreciate some of them may be trivial PRs, but I expect if I went digging I would find more examples like dask/dask-expr#1146 or dask/dask-expr#1138.

I'm very happy to help out with PR review if that's something that would be valuable, but folks outside of the Coiled team have a lot less context about why things are being done, so it would be really helpful to have more descriptive PRs.

@fjetter
Copy link
Member

fjetter commented Oct 18, 2024

My sense is that generally we do provide decent descriptions. I don't really want to dig through the past and discuss bad issues or PRs now. I don't think this is the reason why few people are helping out. I'm happy to revisit this if this is a persistent issue

@hendrikmakait hendrikmakait deleted the avoid-ddosing-scheduler-in-p2p branch October 18, 2024 08:35
@jacobtomlinson
Copy link
Member

jacobtomlinson commented Oct 18, 2024

I don't really want to dig through the past and discuss bad issues or PRs now

Thats totally fair

I don't think this is the reason why few people are helping out

I'm trying to feed back to you all that it's one of the reasons why I help out less than I used to. I hope you can be receptive to that. I'm also trying to communicate that I want to help more.

Sorry that this turned into such a big deal. My intent was to gently nudge so that I can get more involved with reviews. But it's blown up a bit so apologies for any bad feeling generated from this discussion.

@hendrikmakait
Copy link
Member Author

I'm trying to feed back to you all that it's one of the reasons why I help out less than I used to. I hope you can be receptive to that. I'm also trying to communicate that I want to help more.

It's noted and appreciated. To me, this seems like your usual chicken-and-egg problem. With little external engagement around reviews or code contributions, there's little time for writing good descriptions (and less value in doing so). With bad descriptions, there's little external engagement.

Let's try to get to a minimal standard where we write something that's helpful but also doesn't take up much time. Personally, I'd say: Feel free to ping people on PRs you'd like to review but lack sufficient context. We'll just have to find a good middle ground here so that we don't blow the effort for documentation out of proportion.

@jacobtomlinson
Copy link
Member

Personally, I'd say: Feel free to ping people on PRs you'd like to review but lack sufficient context.

This is a good suggestion!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants