[Questions] Regarding Data Science Pipelines #11
Comments
@andreaTP thanks for the questions! I manage the core team developing Data Science Pipelines. @opendatahub-io/data-science-pipelines-maintainers can you guys chime in here?
@andreaTP Thank you so much for the questions! Our decision with Kubeflow Pipelines came from the following assumptions:
As for the minimum requirements and proposed configurations, the reference you sent is very outdated (it came from v0.6 of the Kubeflow docs; we are currently using v1.6), but let me explain how we implemented the solution. We created an operator to deploy the whole stack in multiple namespaces. We compared a single shared stack vs. multiple stacks, one per namespace, and decided to go with multiple stacks (the ADR describes some of the alternatives we considered). When we say "stack", we mean the whole Kubeflow Pipelines installation, including the database and object store, although these components can also be external services that the stack uses. We also found that a single stack would make Data Science Pipelines more complicated to use and would raise security issues. We ran a perf test where we deployed 2k stacks, and the resource consumption of the operator seemed reasonable to us. Comparing against the minimum requirements in that link may not be valid for our solution because:
Hope that clarifies how we decided on Kubeflow Pipelines and the solution we chose to implement.
Hi @rimolive, thanks a ton for taking the time to share those answers! This sheds light on the motivations and background work supporting the decisions. Let me ask a few follow-up questions to make sure I understand the full picture 🙂
Here I read that Kubeflow has been considered a "natural fit". Does this mean that no other technology has been evaluated in this context?
Can you expand on the "other solutions" compared?
This is a very interesting decision!
Do you have a reference for updated numbers?
This sounds great!
Fair, is there a plan to have an updated estimation? Thanks a lot in advance, your answers are really appreciated!
Not sure how long you have been following our roadmap, but we tried to bring Airflow to the ODH components list. Airflow and Argo were the ones we considered before kfp. Those weren't cloud-native solutions at the time we were evaluating options, and kfp was more focused on MLOps tasks, so we decided on kfp.
See my previous answer
I don't know if we have publicly documented it somewhere. I'll check if we have, and share in this issue.
Unfortunately, no. Those were the numbers collected by the Kubeflow team, and since we have a different configuration, we expect to run these perf tests ourselves.
I'll check that info and share it in this issue.
We'd like to run a perf test to verify the current configuration constraints, but the engineering team has other priorities right now, such as integrating the rest of the kfp components and the v2 migration once GA is released.
Hi all!
It's amazing that we can actually look up the ADRs in this repository! Thanks a lot for the openness! 🙏
I was going through this one: https://github.com/opendatahub-io/architecture-decision-records/blob/main/ODH-ADR-0002-data-science-pipelines-multi-user-approach.md
and I have some follow-up questions, I hope this is the right place 🙂
Here I read that Kubeflow is the technology to be used, is it a requirement? An assumption? Or have we evaluated alternatives and decided to use Kubeflow? In the latter case, I'm super interested in having access to the comparison!
In the Kubeflow documentation, I can see that the minimum requirements are pretty significant. Do we have estimates of the system requirements for the proposed configurations (e.g. shared vs. local minio and postgres), including what the operator needs? How much do we expect this to scale on a user's cluster? Have we explored alternatives, or do we expect users to run a single installation in a dedicated namespace on each cluster? I think the answers to these questions are relevant information that should validate the decision taken.
Thanks a lot in advance!