Implement a Reusable E2E Kubeflow ML Lifecycle #3728
Looks like a typo; it should be Data Producers.
Nice catch! @franciscojavierarceo I am wondering, should we add the Data Producers to the Offline Feature store as well?
E.g. Spark ingests data from Data Producers and extracts features.
It may also be useful to add a Feature Extraction to the Offline Store to make it concrete how the offline store is used.
> Nice catch! @franciscojavierarceo I am wondering, should we add the Data Producers to the Offline Feature store as well?
Yeah, I think that's a great idea! It can get complicated if we get specific, but if we stay generic and create a box like we do for the online store, that works fine.
@andreyvelich What exactly are you trying to accomplish? I didn't fully get this part
Is this diagram supposed to be re-used by each component, and if so, how do you envision that?
That's right. Please check these examples: We can do the same for Model Registry, Spark Operator, Notebooks if other WGs agree with that. What do you think about it @StefanoFioravanzo ?
Oh Ok now I understand your approach, I like this. You are proposing we build a canonical Kubeflow ML lifecycle diagram and then highlight what parts of the diagram each component covers. So, based on this, I propose two things:
If you want to keep the focus smaller and have a quicker iteration on the existing diagram, I am fine with it and you can ignore the two points above.
cc @chasecadet can probably provide some good insight on this
@andreyvelich a very good open source diagram that we can reuse is this one by the AI Infrastructure Alliance. See here https://github.com/ai-infrastructure-alliance/blueprints There is no explicit license, but they do write in the README:
I think this would be a pretty good starting point for a reusable diagram. They have an editable Figma file, and even an interactive version. Take a look at all the folders; there are various versions. We could fork the repository under the Kubeflow org and adapt it to the various components. If we want, we could embed the interactive diagram in our website. If we are unsure about licensing and reusability of that content, I can reach out to a couple of folks at AIIA.
I can see us doing something similar to this interactive version https://ai-infrastructure-alliance.github.io/blueprints/interactive-stack-diagram/stack.html where each option is one of the Kubeflow components. So you can see how the entire Kubeflow platform (we can have an "all" picker) covers the E2E ML lifecycle or based on
That makes sense, renamed it.
To be honest, I have concerns about the existing diagram: it was created ~5 years ago and is very out of date. E.g. it doesn't include model fine-tuning, which is the modern approach to model development, and it doesn't have an online feature store. WDYT @StefanoFioravanzo @franciscojavierarceo ?
I like their diagrams, but they look similar to what we have in this PR, don't they? E.g. the differences:
Maybe we can improve our diagram with additional stages?
I agree the old diagram is outdated. I much prefer a diagram that reflects the view of a Data Scientist and the needs of their workflow, which the diagram you proposed does. The AI Infrastructure Alliance diagram, I think, highlights the needs of different companies with different structures and, while that's helpful, I don't think it makes the value of Kubeflow clear.
@StefanoFioravanzo finally getting to this! Before I say too much I'd like to take a step back because as we all know "tactics without vision is just noise before defeat". I like the idea of an ML diagram. I would love to know what our vision for these documents is and how we are approaching this. Someone reads the diagram, they learn X, and then they start building using Y and deliver Z value to their project/org. Allow me to free associate here a bit on what I think would be interesting. I like the idea of talking about use cases for specific components, but I struggle with the idea of telling users what to do. I want to help them envision using these tools and enable them to creatively solve problems. Another way to say this is I would love it if the users told us what they use these components for, in collaboration with our vision for these components. We as a community can provide guidance. If we act as a ground truth authority on use cases, we might lose out on the value of new community members using the tools in powerful but unexpected ways that we can later integrate into more robust use cases. Questions I'd love to have answers to are:
We can touch on, say, using KFP without a training operator to attempt to run an XGBoost job vs. using and integrating the training operator, to show that you "can" do things in MANY ways but may lose out on overall value by trying to redo our engineering efforts through your own means. That being said (stands on soap box), maybe I missed the point of the CC. I also have a chapter on the model dev lifecycle in that class I built. I officially own the content and we can use it however we see fit to create some MLOps-like documents.
@StefanoFioravanzo @franciscojavierarceo I've made a few updates to the lifecycle diagram based on the feedback.
Looks great!
/hold cancel
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@thesuperzapper @StefanoFioravanzo @franciscojavierarceo @hbelmiro I removed the changes to the start page from this PR; I will create a separate PR to update it.
The only persona shown here is ML Engineer, which in my opinion is not correct, as Data Preparation can be done by a Data Engineer. Similarly, Model Development, Hyperparameter Tuning, and Model Training can/will be done by a data scientist.
My suggestion would be to remove the ML Engineer persona.
That's right, but in other use-cases Data Processing can be done by ML Engineers, especially when Spark is integrated with Jupyter Notebooks.
This is just an example of an ML lifecycle; I am not sure we can cover all use-cases and personas here.
WDYT @StefanoFioravanzo @franciscojavierarceo @hbelmiro ?
I understand that data preparation is done by data engineers, but considering we need to show an e2e flow that covers all Kubeflow components and we just brought Spark Operator into the ecosystem, we should cover data preparation too.
> The only persona shown here is ML Engineer which in my opinion is not correct as Data Preparation can be done by a Data Engineer. Similarly Model Development, Hyperparameter tuning, Model Training can/will be done by data scientist.
This varies heavily by company. I've worked at many places where MLE does this fwiw.
I added the persona to highlight explicitly how an ideal user should think about this workflow. Though maybe this could be amended to add more personas. I worry about the clarity though.
> I understand that data preparation is made by data engineers, but considering we need show an e2e flow that covers all kubeflow components and we just brought spark operator to the ecosystem, we should cover data preparation too.
@rimolive @andreyvelich I am 100% with you on that, and the answer depends on the org structure or the MLOps literature one follows. My question really is: from a tool/platform perspective, should we be putting personas in the documentation, given that a lot of this is a grey area? Also, given that Spark Operator is fully on board with Kubeflow, should we put it in the main architecture diagram or not? I have put this as a comment on the main PR as well.
From my perspective, this is out of scope for this PR. This PR is an initial change to the architecture page to make sure our lifecycle diagrams represent an up-to-date version of the Kubeflow components.
Also, the CNCF white paper already has an explanation of personas, which might be useful for orgs that are looking at Kubernetes as the primary platform for AI/ML infra: https://www.cncf.io/wp-content/uploads/2024/03/cloud_native_ai24_031424a-2.pdf
cc @zanetworker @ronaldpetty @raravena80
I would also suggest splitting the Model Serving box in two, i.e. Model Serving and Model Monitoring/Drift Detection, as KServe has components to do that.
From my point of view, Model Monitoring and Drift Detection are part of model serving. If we want to split this block, we should say Online Inference vs. Batch Inference, but I am not sure we need to explain such details.
It's like with Spark: you can do Data Ingestion, Data Processing, Feature Engineering, etc., but we haven't explained everything in this lifecycle diagram.
I hope that more detailed diagrams can be shown in the KServe docs.
@andreyvelich as a consultant I can vouch that not many people know that KServe has drift detection capabilities, hence my request to put it there.
That's right, and that is why they should explore the individual components' docs for it.
E.g. if you know that you need the model serving component for your AI/ML infra, you will explore the KServe docs.
It is simply impossible to show everything in this end-to-end ML lifecycle diagram.
In line 40, the definition of Data Preparation can be reworded to say:
In the Data Preparation step you ingest/raw data and transfer it to perform feature engineering to extract ML features for the offline feature store, and prepare training data for model development. Usually, this step is associated with data processing tools such as Spark, Dask, Flink, or Ray.
What do you mean by you ingest/raw data raw data?
Sorry, that was a typo.
I guess the idea of this statement is to say that you use Spark to ingest raw data and process it.
The only persona shown here is ML Engineer which in my opinion is not correct as Data Preparation can be done by a Data Engineer. Similarly Model Development, Hyperparameter tuning, Model Training can/will be done by data scientist.
My suggestion will be to remove the ML Engineer Persona or show other personas as well
Also, I would also suggest splitting the Model Serving box in two i.e. Model Serving and ModelMonitoring/Drift detection as KServe has components to do that
> The only persona shown here is ML Engineer which in my opinion is not correct as Data Preparation can be done by a Data Engineer. Similarly Model Development, Hyperparameter tuning, Model Training can/will be done by data scientist.
This varies heavily by company. I've worked at many places where MLE does this fwiw.
I added the persona to highlight explicitly how an ideal user should think about this workflow. Though maybe this could be amended to add more personas. I worry about the clarity though.
@franciscojavierarceo these are my thoughts as well. This gets political over who does what, and there is no simple answer, hence I was wondering whether we should get into personas at all.
Yeah, that makes sense. I definitely understand how it can be a rabbit hole. I am generally customer-centric, so my goal was really just to elicit the value-prop for people who are quickly thinking "why should I, as someone who builds models, care about Kubeflow?"
Yes, the main goal and motivation of this page is to explain the value of the Kubeflow ecosystem to our users.
I think the issue here is that the lines are blurred, and there is no prescriptive authority on how this works. What I would do is call that out. "To scale, you have to specialize," but right now MLOps (and Kubeflow) are incubating, so the average user wears many hats. If an MLE, a data engineer, or a computer engineer wants to do data prep, nothing stops them as long as they aren't leaving other work untouched. Ultimately this is a business and engineering management conversation.
+1 @chasecadet
As mentioned in another comment, I've worked at several places where the MLE was responsible for all of this.
@chasecadet @franciscojavierarceo the question is not who does what, as that is very subjective; the question is whether we should get into personas at all.
Yeah that makes sense.
Really I just wanted to provide high level clarity about the value proposition of Kubeflow for MLEs or data scientists or whatever they're called this week.
@vikas-saxena02 @franciscojavierarceo THIS IS GREAT. So here is the philosophical/KF values question. My biggest power as a solutions architect is saying "My customers commonly do XYZ". So we need to decide: are we doing this textbook style, "this is the world we live in", where we need to point to an authority (@andreyvelich and I were discussing "whose ML Lifecycle are we referencing"), or do we make this more community and experience based, where we say "We commonly see MLEs within the Kubeflow community leverage these tools aligned to what we have defined as the ML lifecycle based on community feedback", etc. Andrey was mentioning that the ML lifecycle we are using was sourced from the CNCF white paper by other professionals who worked to define it. That is totally fine, but we need to give the lineage of our information, call out when it can be considered subjective, and also flavor what we are defining as something based on what we have seen in and agreed upon in our community (something that is powerful but is not necessarily the be-all and end-all) and how new users can align themselves to it. We can also provide a place to discuss and challenge our ML lifecycle opinions, but if we say "we commonly see data engineers using X" then it's not necessarily us telling you what to do, but mentioning what we have seen so far and opening the door to new perspectives. This also helps us stay out of people's scopes if they say "well the KF community said that this is an MLE tool so I didn't use it for data engineering and/or told off my data engineer". We have to be careful when we are being prescriptive because we could be liable and lose credibility as a community. If this is our "current world view open for discussion/growth", we invite discussion and contribution instead of enforcing our world view.
Now that being said, we can 1000% defend our viewpoint as we continue to gather data and understand how organizations do MLOps with KF, and not just let anyone reinvent the lifecycle, but still keep the door open in case someone does have a viewpoint that the community can discuss and that makes sense to adopt or call out.
Yeah, I think it's a great idea to call out that in practice the lines end up blurry between DE/MLE/DS for some orgs versus others.
I definitely welcome feedback and iteration on this! I think having this guidance is very useful though, as it can give the end user a lot more clarity on why an MLOps team may be recommending Kubeflow.
Andrey and I drafted this based on the CNCF diagram and modified it a little bit but, again, the language around personas across the industry is pretty fuzzy, so I think sharing it with an asterisk is very helpful. It would also be valuable to hiring managers/executives who are trying to make staffing decisions but may not have a nuanced view of things.
I generally agree with these points @chasecadet, but again, this is out of scope for this PR.
This PR just explains the value of Kubeflow components in the ML lifecycle, and of course you can integrate other components from the AI/ML landscape into your AI/ML infra.
We can always iterate and improve our architecture page if the Kubeflow community agrees.
@andreyvelich my two cents:
Happy to help with making the changes if you need some help.
We will include Spark Operator + Model Registry in this diagram once we make the first official release for these components.
I'm just adding some details here. I have a ton of content around the ML Lifecycle we can use from the course, and it's free. I own it. https://docs.google.com/document/d/1t2gTTQolI7DfLQJUbhSqd8bxhrIVqZOIU8dKGiTrHoo/edit?usp=sharing @andreyvelich @StefanoFioravanzo, feel free to take a look and see what we can use. I included model monitoring as part of serving and also mentioned model retiring.
also @andreyvelich keep me posted on this. I can update the course with our official ML lifecycle as well as updated architecture diagrams.
That's great @chasecadet, it would be nice if you could present it sometime in one of our community calls and collect feedback.
@franciscojavierarceo @thesuperzapper @vikas-saxena02 @chasecadet @StefanoFioravanzo @hbelmiro @kubeflow/kubeflow-steering-committee I think we can merge this PR if you don't have any strong objections.
🚀
@andreyvelich no strong objection. Just another recommendation to add the CNCF paper as a reference.
/approve
@andreyvelich While we can always make improvements (and I am sure we will in future PRs) this update is a significant improvement to the architecture page and I think it's worth merging now. /lgtm @andreyvelich you will probably need to approve this, as it needs a root approver given the number of files changed.
Great! Thanks, everyone, for your reviews. I am looking forward to sharing this with our users.
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: andreyvelich, franciscojavierarceo, vikas-saxena02 The full list of commands accepted by this bot can be found here. The pull request process is described here
Based on our recent discussion with @franciscojavierarceo, I updated the ML lifecycle diagram in the architecture guides: #3719 (comment)
We can re-use this ML lifecycle diagram in each Kubeflow component and explain the user value of that component.
I like the existing diagrams, but they are a little bit out of date.
I am happy to improve my diagrams based on your feedback.
Also, I removed unused images.
/assign @franciscojavierarceo @kubeflow/kubeflow-steering-committee @thesuperzapper @StefanoFioravanzo @hbelmiro
/hold for review