This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Resolve Ray Futures Decentralized & Concurrently #69

Merged

Conversation

toro-berlin
Contributor

@toro-berlin toro-berlin commented Feb 9, 2023

Closes #37

Problem Statement
We recently hit a performance bottleneck with the current approach of resolving Ray futures, specifically in a flow that uses task dependencies. Please take a look at the flow structure below.

Consulting the logs showed that networking was the limiting factor: we saw many httpx and OpenSSL errors (timeouts during the handshake, IOError, ...).

Hypothesis
Resolving the Ray futures in a centralized manner puts a lot of pressure on the networking stack and Prefect API. Implementing the changes proposed in #37 will eliminate this bottleneck.
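The decentralized idea can be sketched with the Python standard library (a hypothetical analogy, not this PR's actual code — names like `upstream`/`downstream` are illustrative): instead of the driver resolving every future centrally, each downstream task receives its upstream future and resolves it itself, spreading the blocking across workers.

```python
from concurrent.futures import Future, ThreadPoolExecutor

def upstream() -> int:
    return 21

def downstream(dep: Future) -> int:
    # Decentralized: the task itself resolves its dependency.
    return dep.result() * 2

with ThreadPoolExecutor() as pool:
    a_future = pool.submit(upstream)
    # Centralized (the old pattern) would call a_future.result() here,
    # in the driver, before submitting any downstream work.
    # Decentralized (this PR's idea): hand the future to the task.
    b_future = pool.submit(downstream, a_future)
    result = b_future.result()

print(result)  # → 42
```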

First Tests
I conducted test runs with my minimum reproducible example (flow executed on my local machine, Ray Cluster provided by Anyscale) and real-world flow + workload (flow executed on AWS EKS flow pod runner, Ray Cluster provided by Anyscale).
They show that we do not run into networking errors anymore with the changes proposed in this PR.

Example

Flow Structure

# a–e are Prefect tasks; each iteration chains a → b, then fans out c, d, and e from b.
for _ in range(do_it_for_a_lot_of_input):
    a_future = a.submit(wait_for=[])
    b_future = b.submit(wait_for=[a_future])

    c_future = c.submit(wait_for=[b_future])
    d_future = d.submit(wait_for=[b_future])
    e_future = e.submit(wait_for=[b_future])

Screenshots

Checklist

  • References any related issue by including "Closes #".
    • If no issue exists and your change is not a small fix, please create an issue first.
  • Includes tests or only affects documentation.
  • Passes pre-commit checks.
    • Run pre-commit install && pre-commit run --all-files locally for formatting and linting.
  • Includes screenshots of documentation updates.
    • Run mkdocs serve to view documentation locally.
  • Summarizes PR's changes in CHANGELOG.md

@zanieb
Contributor

zanieb commented Feb 9, 2023

🎉 this looks like the right idea!

@toro-berlin toro-berlin changed the title WIP: Resolve Ray Futures Decentralized & Concurrently Resolve Ray Futures Decentralized & Concurrently Feb 10, 2023
@toro-berlin
Contributor Author

Hey @madkinsz, hey @ahuang11,

Thank you for providing the RayTaskRunner for executing Prefect tasks on a Ray cluster. I wrote down my recent challenges with the current implementation in the PR description. It looks like implementing the changes proposed in #37 will do the trick.

  • I am happy for a review from your side.
  • Also asked Anyscale for feedback on this PR (cc @pcmoritz).
  • One question regarding the tests: I saw that a standard test suite for task runners exists in the prefect repository (src/prefect/testing/standard_test_suites/task_runners.py). Does it make sense to use it here as well? I would open an issue for that then...

Have a great start to the week
Tobias

@pcmoritz
Contributor

@toro-berlin Thanks a lot for submitting this PR.

I don't know this code well enough yet to be able to really give useful feedback, but here are some thoughts: I wonder if it would be possible to not have to call ray.get on the future at all and let the Ray scheduler figure out the resolution -- you can pass ObjectRefs into ray.remote calls, and the Ray scheduler will figure out when the objects are available and call the task. That would be the most scalable way to handle this.

I can spend a little more time reading the code tomorrow and seeing if what I'm saying can actually be done / how this all works :)

In the meantime @madkinsz and @ahuang11 please feel free to merge the PR and if you have any documentation / pointers for me to understand this code better, that would be very helpful :)

@desertaxle
Member

Thanks for this PR @toro-berlin! I'd like to merge these changes so that we can unblock you and others who are experiencing this issue. Could you please add a short summary of your changes to the Fixed section of the change log? Once that's added and this PR is marked as ready for review, I'll approve and merge!
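A changelog entry for this fix might look like the following (hypothetical wording; the actual entry is in the PR's CHANGELOG.md diff):

```markdown
### Fixed

- Resolve Ray futures in a decentralized and concurrent manner to avoid
  overwhelming the networking stack and the Prefect API — #69
```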

@toro-berlin toro-berlin marked this pull request as ready for review February 15, 2023 11:44
@toro-berlin
Contributor Author

Great to hear, @desertaxle.
Added an entry to the changelog and marked this PR as ready for review.

Successfully merging this pull request may close these issues.

Handle dependent tasks concurrently in the RayTaskRunner