Start of the e2e work #244

NikeNano · 2020-05-03T16:23:51Z

This draft PR contains initial work to set up the e2e test for the mpi-operator, following the [Kubeflow/testing guidelines](https://github.com/kubeflow/testing).

Related to issue: #233

The testing infrastructure is set up in the following way (the most complete repo with the latest updates seems to be Kubeflow/examples, which will be used for examples):

- Webhooks between GitHub and Prow, which respond to GitHub events (this should be set up for all Kubeflow repos).
- Prow jobs defined in prow_config.yaml that execute Argo workflows based on Python scripts (example here).
- The logs are written back to GCP buckets, from where they are picked up by Gubernator (as I understand it, at least).
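
To make the second point concrete, here is a minimal sketch of what a Python-defined Argo workflow could look like. It is not the code in this PR: the function name `build_e2e_workflow`, the step names, the namespace and the container image are all hypothetical, and the steps only echo placeholders. The only thing taken as given is the general shape of an Argo `Workflow` manifest with a DAG template.

```python
import json


def build_e2e_workflow(name: str, namespace: str, image: str) -> dict:
    """Return a minimal Argo Workflow manifest as a plain dict.

    Hypothetical helper for illustration only; a real workflow would
    check out the repo and run the actual lint and e2e commands.
    """

    def step(step_name: str, command: str) -> dict:
        # One container template per step; the command is a placeholder.
        return {
            "name": step_name,
            "container": {"image": image, "command": ["sh", "-c", command]},
        }

    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "entrypoint": "e2e",
            "templates": [
                {
                    "name": "e2e",
                    "dag": {
                        "tasks": [
                            {"name": "checkout", "template": "checkout"},
                            # lint only starts once checkout has finished
                            {
                                "name": "lint",
                                "template": "lint",
                                "dependencies": ["checkout"],
                            },
                        ]
                    },
                },
                step("checkout", "echo 'check out the repo here'"),
                step("lint", "echo 'run the linter here'"),
            ],
        },
    }


if __name__ == "__main__":
    # Names below are made up for the example.
    workflow = build_e2e_workflow("mpi-operator-e2e", "example-namespace", "python:3.8")
    print(json.dumps(workflow, indent=2))
```

Building the manifest as a dict keeps the example free of extra dependencies; the JSON it prints could be submitted to a cluster that has Argo installed.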

The current PR includes the scripts to run the tests but not the actual e2e test, which should/could be based on tensorflow-benchmark.yaml.

For the tf-operator, the e2e test can be found here as an example. It is built around a simple tf_job_client.py. One approach would be to follow this setup for inspiration. However, it should be noted that the tf-operator test infra is built around ksonnet, which is deprecated in favor of pipelines (read: Argo workflows) written in Python; info here.

I have played around with the current code to get it up and running, currently only doing linting on a K8s cluster. I want to start by confirming that the basic infra works and learning how to run things. However, I haven't managed to run it end to end on my own cluster yet, due to issues with:

~~NFS that needs to be configured~~
~~Issues with the script checking out the code. Might be related to that I run it on my own cluster~~.

I will continue to use the Kubeflow infrastructure to develop this since I have issues with setting up my own and it gets picked up without being merged.

I will continue to look into this, but would be happy for help/suggestions to speed up the process.

kubeflow-bot · 2020-05-03T16:23:56Z

This change is

NikeNano · 2020-05-03T16:25:16Z

Ahh cool! It picks up the tests and runs them :)

NikeNano · 2020-05-03T16:26:13Z

As discussed, @terrytangyuan, here is my initial work. I will continue to work on it but don't know when it will be done :(

NikeNano · 2020-05-03T19:24:30Z

Will continue to work on the actual test now that everything around it seems to work!

terrytangyuan · 2020-05-03T19:25:06Z

Great work! Thanks!

NikeNano · 2020-05-05T20:15:38Z

I hope to have something up after the weekend; I have a lot at work this week.

terrytangyuan · 2020-05-06T01:31:01Z

Great. Please ping me again when ready for review.

jlewi · 2020-05-09T23:55:24Z

@NikeNano based on your comment in kubeflow/testing#656 it looks like you are using an auto-deployed cluster and following what the tests in kubeflow/examples are doing. This is fine if you aren't installing the operator itself and following what the examples do.

I don't think our auto-deployed clusters currently install the mpi-operator. So we might have to update the config to include the mpi operator if we want to use the auto-deployed clusters.

Did you consider using a notebook as both an example and a test? E.g. you could create a dirt simple notebook like the mnist one:
https://github.com/kubeflow/examples/blob/master/mnist/mnist_gcp.ipynb

Which just creates an mpi-job and verifies it works. The advantage of this is that it serves both as a tutorial and the test verifies that the tutorial is working correctly.
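
As a rough illustration of what such a notebook might contain, here is a minimal sketch in plain Python. It makes several assumptions that are not from this thread: the `kubeflow.org/v1alpha2` API version, the image name, the placeholder `mpirun` command, and the helper names `build_mpijob` and `job_state` are all hypothetical. The sketch only builds the manifest and interprets a status; actually submitting the job and fetching its status from the cluster is left out.

```python
def build_mpijob(name: str, image: str, workers: int) -> dict:
    """Build a small MPIJob manifest (API version is an assumption)."""

    def pod_template(command=None) -> dict:
        container = {"name": name, "image": image}
        if command:
            container["command"] = command
        return {"spec": {"containers": [container]}}

    return {
        "apiVersion": "kubeflow.org/v1alpha2",  # assumed version
        "kind": "MPIJob",
        "metadata": {"name": name},
        "spec": {
            "slotsPerWorker": 1,
            "mpiReplicaSpecs": {
                "Launcher": {
                    "replicas": 1,
                    # Placeholder command: one process per worker
                    "template": pod_template(["mpirun", "-np", str(workers), "hostname"]),
                },
                "Worker": {"replicas": workers, "template": pod_template()},
            },
        },
    }


def job_state(job: dict) -> str:
    """Return 'Succeeded', 'Failed' or 'Pending' from the job's status conditions."""
    for condition in job.get("status", {}).get("conditions", []):
        if condition.get("status") == "True" and condition.get("type") in ("Succeeded", "Failed"):
            return condition["type"]
    return "Pending"


if __name__ == "__main__":
    job = build_mpijob("mpi-smoke-test", "example.com/mpi-test:latest", 2)
    # In a real notebook the status would be read back from the cluster;
    # here it is filled in by hand to show how the check behaves.
    job["status"] = {"conditions": [{"type": "Succeeded", "status": "True"}]}
    assert job_state(job) == "Succeeded"
```

A notebook test would poll `job_state` until it leaves `Pending` or a timeout expires, and fail the cell if the result is not `Succeeded`.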

We are in the process of adding support for Tekton to our CI system. See kubeflow/testing#622

One of the reasons we are adding support for Tekton is that hopefully it will make it easier to define and add new tests. For example, if you look at that PR it includes a Tekton task for running a notebook. So hopefully once we get all of that merged it will be dirt simple to add new tests to run Tekton tests.

That said we could use help getting the PRs for Tekton support finished and merged. Let me know if that was something you would be interested in helping with.

The way that might work would be:

1. Create a notebook or Python program to run your test.
2. Create a Tekton pipeline to run the test.
3. Integrate the Tekton pipeline into Kubeflow CI.

NikeNano · 2020-05-10T10:01:42Z

Thanks @jlewi! I would be happy to help out with the Tekton integration.

I think it would be great to simplify the testing setup; I have struggled a bit to understand how all the pieces currently work together.

I will start with setting up a notebook and then help out with the Tekton work!

NikeNano · 2020-06-17T07:17:26Z

> @NikeNano I don't think the notebook tests are fully working yet. At least I ran into various issues. I might suggest using job_types to run them as postsubmits/periodics so you start getting signal but don't make them blocking presubmits just yet.

I made some attempts yesterday and also had some issues. Thanks for the suggestion!
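
For reference, the job_types idea can be sketched as a small filter. This is an illustration under assumptions, not the actual kubeflow/testing implementation: it assumes each workflow entry carries an optional `job_types` list, that an entry without one runs for every job type, and that the current job type is read from the `JOB_TYPE` environment variable. The workflow names and the function `select_workflows` are hypothetical.

```python
import os


def select_workflows(workflows: list, job_type: str) -> list:
    """Return the names of the workflows that should run for this job type."""
    selected = []
    for workflow in workflows:
        allowed = workflow.get("job_types")
        # Assumption: no job_types list means "run for every job type".
        if not allowed or job_type in allowed:
            selected.append(workflow["name"])
    return selected


# Hypothetical config: lint blocks presubmits, the notebook test does not.
WORKFLOWS = [
    {"name": "lint", "job_types": ["presubmit"]},
    {"name": "notebook-e2e", "job_types": ["postsubmit", "periodic"]},
]

if __name__ == "__main__":
    current = os.environ.get("JOB_TYPE", "presubmit")
    print(select_workflows(WORKFLOWS, current))
```

With this layout the notebook test still produces signal after merges and on a schedule, but a flaky run cannot block a pull request.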

NikeNano · 2020-07-04T12:06:14Z

/retest

terrytangyuan · 2020-07-06T15:00:25Z

@NikeNano Nice progress here! Seems like it's working?

NikeNano · 2020-07-07T18:52:00Z

> @NikeNano Nice progress here! Seems like it's working?

I need to make sure the actual test works as well now; I'm currently running a dummy notebook, but the hardest part should be done :)

k8s-ci-robot · 2020-08-19T15:02:08Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign terrytangyuan
You can assign the PR to them by writing /assign @terrytangyuan in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

terrytangyuan · 2021-06-22T17:40:56Z

Closing due to inactivity.

Start of the e2e work

82662eb

k8s-ci-robot requested review from carmark and gaocegege May 3, 2020 16:23

k8s-ci-robot added the size/XL label May 3, 2020

NikeNano marked this pull request as draft May 3, 2020 16:24

k8s-ci-robot added do-not-merge/work-in-progress size/L and removed size/XL labels May 3, 2020

Initial work with e2e test for the mpi-operator

927ca3c

NikeNano force-pushed the e2e_test_initial branch from f3124a4 to 927ca3c Compare May 3, 2020 19:19

k8s-ci-robot added size/XL and removed size/L labels May 3, 2020

NikeNano force-pushed the e2e_test_initial branch from 1be8ca9 to 927ca3c Compare May 8, 2020 07:13

added logging

fb2da14

NikeNano mentioned this pull request May 8, 2020

Issues with the get_kf_testing_cluster when using get-credentials for e2e test development for the mpi-operator. kubeflow/testing#656

Closed

NikeNano added 4 commits May 8, 2020 10:17

add kwargs

2634815

error fix

7c5ab97

back

d8af476

state version

5f511bf

added notebook

a2cd9c0

k8s-ci-robot removed the size/XL label May 13, 2020

dummy test

43dcc06

NikeNano added 4 commits July 4, 2020 15:46

updated the path

982ac45

updated test

53c16b6

changed notebook

f7ff4b6

removed old test tekton job

fea3d88

clean up

3d2ac5a

k8s-ci-robot added size/M and removed size/XXL labels Jul 10, 2020

added the test folder

01b559a

k8s-ci-robot added size/XXL and removed size/M labels Jul 10, 2020

NikeNano added 10 commits July 10, 2020 17:33

moved around

32d6a9d

update prow

10a67ad

updated now

10cb064

updated prow!

842bcd8

update setting

ee09b2a

added the test run image

f84f304

missed the docker image

068a69f

updated the readme

3addf75

updated notebook

e5cc0c3

updated readme, trigger test

9c6a5f1

NikeNano mentioned this pull request Jul 14, 2020

Test can't find namespace. kubeflow/testing#733

Closed

revert to working stage

d7b34a6

terrytangyuan closed this Jun 22, 2021
