Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Start of the e2e work #244

Closed
wants to merge 40 commits into from
Closed

Conversation

NikeNano
Copy link
Member

@NikeNano NikeNano commented May 3, 2020

This PR/draft includes initial work to set up the e2e test for the mpi-operator following the |Kubeflow/testing guidelines](https://github.com/kubeflow/testing).

Related to issue: #233

The testing infrastructure is sett up in the following way(the most compete repo with the latest updates seems to be Kubeflow/examples which will be used for examples):

  1. Webhooks between Github and Prow which responds to GitHub events(this should be setup for all Kubeflow repos)
  2. Prow jobs defined in the prow_config.yam that execute Argo workflows based upon python scripts example here.
  3. The logs are written back to GCP buckets from where they are snapped up by Gubernator (as I understand at least).

The current PR includes the scripts to run test but not the actually e2e test that should/could be based upon the tensor flow-benchmark.yaml

For the tf-operator the e2e test can be found here as an example. It is built around a simple tf_job_client.py. One approach would be to follow this set up for inspiration. However it should be noted that the tf-operator repo test infra is built around ksonnet which is deprecated for pipelines(read argo workflows) written in python, info here.

I have played around with the current code to get it up and running currently only doing linting on a K8s cluster. I want to start with knowing that the basic infra is working and know how to rung things. However I can't managed to run it end 2 end on my own cluster yet. Due to issues with :

  • NFS that needs to be configured
  • Issues with the script checking out the code. Might be related to that I run it on my own cluster.

I will continue to use the Kubeflow infrastructure to develop this since I have issues with setting up my own and it gets picked up without being merged.

I will continue to look in to this but would be happy for help/suggestions in order to speed up the process.

@kubeflow-bot
Copy link

This change is Reviewable

@NikeNano NikeNano marked this pull request as draft May 3, 2020 16:24
@NikeNano
Copy link
Member Author

NikeNano commented May 3, 2020

Ahh cool! It picks up the tests and run them :)

@NikeNano
Copy link
Member Author

NikeNano commented May 3, 2020

As discussed @terrytangyuan here are my initial work. I will continue to work on it but don't know when it will be done :(

@NikeNano
Copy link
Member Author

NikeNano commented May 3, 2020

Will continue to work on the actual test now when everything around seems to work!

@terrytangyuan
Copy link
Member

Great work! Thanks!

@NikeNano
Copy link
Member Author

NikeNano commented May 5, 2020

I hope to have something up after the weekend, have a lot at work this week.

@terrytangyuan
Copy link
Member

Great. Please ping me again when ready for review.

@jlewi
Copy link
Contributor

jlewi commented May 9, 2020

@NikeNano based on your comment in kubeflow/testing#656 it looks like you are using an auto-deployed cluster and following what the tests in kubeflow/examples are doing. This is fine if you aren't installing the operator itself and following what the examples do.

I don't think our auto-deployed clusters currently install the mpi-operator. So we might have to update the config to include the mpi operator if we want to use the auto-deployed clusters.

Did you consider using a notebook as both an example and as a test? e.g. you could create a dirt simple notebook like the mnist
https://github.com/kubeflow/examples/blob/master/mnist/mnist_gcp.ipynb

Which just creates an mpi-job and verifies it works. The advantage of this is that it serves both as a tutorial and the test verifies that the tutorial is working correctly.

We are in the process of adding support for Tekton to our CI system. See kubeflow/testing#622

One of the reasons we are adding support for Tekton is that hopefully it will make it easier to define and add new tests. For example, if you look at that PR it includes a Tekton task for running a notebook. So hopefully once we get all of that merged it will be dirt simple to add new tests to run Tekton tests.

That said we could use help getting the PRs for Tekton support finished and merged. Let me know if that was something you would be interested in helping with.

The way that might work would be.

  1. Create a notebook or python program to run your test.
  2. Create a Tekton pipeline to run the test
  3. Integrate the Tekton pipeline into Kubeflow CI.

@NikeNano
Copy link
Member Author

Thanks @jlewi! I would be happy to help out with the Tekton integration.

I think it would be great to simplify the testing setup, I have struggled a bit to understand all the piece working together currently.

I will start with setting up a notebook and then help out with the Tekton work!

@NikeNano
Copy link
Member Author

@NikeNano I don't think the notebook tests are full working yet. At least i ran into various issues. I might suggest using job_types to run them as postsubmits/periodics so you start getting signal but don't make them blocking presubmits just yet.

I made some attempts yeasterday and I also had some issues. Thanks for the suggestion!

@NikeNano
Copy link
Member Author

NikeNano commented Jul 4, 2020

/retest

@terrytangyuan
Copy link
Member

@NikeNano Nice progress here! Seems like it's working?

@NikeNano
Copy link
Member Author

NikeNano commented Jul 7, 2020

@NikeNano Nice progress here! Seems like it's working?

I need to make sure the actual test works as well now, currently running a dummy notebook, but the hardest part should be done :)

@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign terrytangyuan
You can assign the PR to them by writing /assign @terrytangyuan in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@terrytangyuan
Copy link
Member

Closing due to inactivity.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants