Resolve Ray Futures Decentralized & Concurrently #69
🎉 this looks like the right idea!
Thank you for providing the RayTaskRunner for executing Prefect tasks on a Ray cluster. I wrote down my recent challenges with the actual implementation in the PR description. It looks like implementing the changes proposed in #37 will do the trick.
Have a great start to the week.
@toro-berlin Thanks a lot for submitting this PR. I don't know this code well enough yet to be able to really give useful feedback, but here are some thoughts: I wonder if it would be possible to not have to call I can spend a little more time reading the code tomorrow and seeing if what I'm saying can actually be done / how this all works :) In the meantime @madkinsz and @ahuang11 please feel free to merge the PR and if you have any documentation / pointers for me to understand this code better, that would be very helpful :) |
Thanks for this PR @toro-berlin! I'd like to merge these changes so that we can unblock you and others who are experiencing this issue. Could you please add a short summary of your changes to the Fixed section of the changelog? Once that's added and this PR is marked as ready for review, I'll approve and merge!
Great to hear, @desertaxle.
Closes #37
Problem Statement
We recently hit a performance bottleneck with the current approach to resolving Ray futures. It showed up in a flow that uses task dependencies; please take a look at the flow structure below.
The logs showed that networking was the limiting factor: we saw many httpx and OpenSSL errors (timeout during handshake, IOError, ...).
Hypothesis
Resolving the Ray futures in a centralized manner puts a lot of pressure on the networking stack and Prefect API. Implementing the changes proposed in #37 will eliminate this bottleneck.
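To make the distinction concrete, here is a minimal sketch of the two resolution styles. It uses plain asyncio as a stand-in for Ray, so it is an analogy rather than the prefect-ray implementation: the names `upstream`, `downstream`, `run_centralized`, and `run_decentralized` are hypothetical and do not come from Ray or Prefect.

```python
import asyncio


async def upstream(i: int) -> int:
    # Stand-in for a remote task; the sleep simulates work plus network latency.
    await asyncio.sleep(0.01)
    return i * 2


async def downstream(values: list) -> int:
    # Stand-in for a task that depends on the upstream results.
    return sum(values)


async def run_centralized(n: int) -> int:
    # Centralized: the driver waits on every upstream future itself,
    # one after another, and only then hands plain values downstream.
    futures = [asyncio.create_task(upstream(i)) for i in range(n)]
    values = [await f for f in futures]
    return await downstream(values)


async def downstream_resolving(*futures: asyncio.Task) -> int:
    # Decentralized: the consumer receives the futures and resolves
    # all of its own inputs concurrently before doing its work.
    values = await asyncio.gather(*futures)
    return await downstream(list(values))


async def run_decentralized(n: int) -> int:
    # The driver only wires up the dependency graph; it never touches the values.
    futures = [asyncio.create_task(upstream(i)) for i in range(n)]
    return await downstream_resolving(*futures)


if __name__ == "__main__":
    print(asyncio.run(run_centralized(4)))
    print(asyncio.run(run_decentralized(4)))
```

Both variants compute the same result. The difference is where the waiting happens: in the second variant the driver no longer sits in the middle of every dependency, which is the effect the hypothesis above expects from the changes proposed in #37.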
First Tests
I ran tests with my minimal reproducible example (flow executed on my local machine, Ray cluster provided by Anyscale) and with a real-world flow and workload (flow executed on an AWS EKS flow pod runner, Ray cluster provided by Anyscale).
With the changes proposed in this PR, these runs no longer hit networking errors.
Example
Flow Structure
Screenshots
Checklist
Run `pre-commit` checks: `pre-commit install && pre-commit run --all` locally for formatting and linting.
Run `mkdocs serve` to view documentation locally.