-
Notifications
You must be signed in to change notification settings - Fork 216
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Start of the e2e work #244
Conversation
Ahh cool! It picks up the tests and run them :) |
As discussed @terrytangyuan here are my initial work. I will continue to work on it but don't know when it will be done :( |
Will continue to work on the actual test now when everything around seems to work! |
Great work! Thanks! |
I hope to have something up after the weekend, have a lot at work this week. |
Great. Please ping me again when ready for review. |
@NikeNano based on your comment in kubeflow/testing#656 it looks like you are using an auto-deployed cluster and following what the tests in kubeflow/examples are doing. This is fine if you aren't installing the operator itself and following what the examples do. I don't think our auto-deployed clusters currently install the mpi-operator. So we might have to update the config to include the mpi operator if we want to use the auto-deployed clusters. Did you consider using a notebook as both an example and as a test? e.g. you could create a dirt simple notebook like the mnist Which just creates an mpi-job and verifies it works. The advantage of this is that it serves both as a tutorial and the test verifies that the tutorial is working correctly. We are in the process of adding support for Tekton to our CI system. See kubeflow/testing#622 One of the reasons we are adding support for Tekton is that hopefully it will make it easier to define and add new tests. For example, if you look at that PR it includes a Tekton task for running a notebook. So hopefully once we get all of that merged it will be dirt simple to add new tests to run Tekton tests. That said we could use help getting the PRs for Tekton support finished and merged. Let me know if that was something you would be interested in helping with. The way that might work would be.
|
Thanks @jlewi! I would be happy to help out with the Tekton integration. I think it would be great to simplify the testing setup, I have struggled a bit to understand all the piece working together currently. I will start with setting up a notebook and then help out with the Tekton work! |
I made some attempts yeasterday and I also had some issues. Thanks for the suggestion! |
/retest |
@NikeNano Nice progress here! Seems like it's working? |
I need to make sure the actual test works as well now, currently running a dummy notebook, but the hardest part should be done :) |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Closing due to inactivity. |
This PR/draft includes initial work to set up the e2e test for the mpi-operator following the |Kubeflow/testing guidelines](https://github.com/kubeflow/testing).
Related to issue: #233
The testing infrastructure is sett up in the following way(the most compete repo with the latest updates seems to be Kubeflow/examples which will be used for examples):
The current PR includes the scripts to run test but not the actually e2e test that should/could be based upon the tensor flow-benchmark.yaml
For the tf-operator the e2e test can be found here as an example. It is built around a simple tf_job_client.py. One approach would be to follow this set up for inspiration. However it should be noted that the tf-operator repo test infra is built around ksonnet which is deprecated for pipelines(read argo workflows) written in python, info here.
I have played around with the current code to get it up and running currently only doing linting on a K8s cluster. I want to start with knowing that the basic infra is working and know how to rung things. However I can't managed to run it end 2 end on my own cluster yet. Due to issues with:NFS that needs to be configuredIssues with the script checking out the code. Might be related to that I run it on my own cluster.I will continue to use the Kubeflow infrastructure to develop this since I have issues with setting up my own and it gets picked up without being merged.
I will continue to look in to this but would be happy for help/suggestions in order to speed up the process.