Add Conformance Program Doc for AutoML and Training WG #2048

andreyvelich · 2022-12-01T19:47:17Z

Related: kubeflow/training-operator#1695, #2044.
Original doc: https://docs.google.com/document/d/1TRUKUY1zCCMdgF-nJ7QtzRwifsoQop0V8UnRo-GWlpI/edit#.

I've added the conformance doc for the CRD-based tests for the AutoML and Training WGs.
Please take a look.

/assign @james-jwu @johnugeorge @tenzen-y @jbottum @anencore94

cc other WGs
@kubeflow/wg-training-leads
@kubeflow/wg-pipeline-leads
@kubeflow/wg-notebooks-leads
@kubeflow/wg-manifests-leads

andreyvelich · 2022-12-01T19:50:45Z

/hold for the review

james-jwu · 2022-12-01T23:57:04Z

docs/proposals/conformance-test.md

+That will help to avoid environment variance and improve fault tolerance. Driver is required to trigger the deployment and download the results.
+
+- We are going to use
+  [the unify Makefile](https://github.com/kubeflow/kubeflow/blob/2fa0d3665234125aeb8cebe8fe44f0a5a50791c5/conformance/1.5/Makefile)


'unify' -> 'unified'

james-jwu · 2022-12-01T23:58:47Z

/lgtm

terrytangyuan

LGTM. Thanks for leading this effort!

google-oss-prow · 2022-12-02T00:41:36Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich, terrytangyuan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [andreyvelich]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

jbottum · 2022-12-02T01:36:37Z

docs/proposals/conformance-test.md

+
+## Kubeflow Conformance
+
+Kubeflow conformance consists the 3 category of tests:


Initially, Kubeflow conformance will include CRD-based tests. In the future, API-based and UI-based tests may be added.

jbottum · 2022-12-02T01:37:22Z

docs/proposals/conformance-test.md

+  In the following versions, we should design conformance program for the
+  Katib API-based tests.
+
+- CRD-based tests


This should be the first bullet in this list, before the API and UI tests.

jbottum · 2022-12-02T01:44:24Z

docs/proposals/conformance-test.md

+- The tests should cover basic functionality of Katib and the Training Operator.
+  It will not cover all features.
+- The tests are expected to evolve in the future versions.
+


The tests should have a short, well-documented list of setup requirements.

The tests should install and complete in a relatively short period of time (< 30 minutes) with suggested minimum infrastructure requirements, i.e. 3 nodes, 24 vCPUs, 64 GB RAM, and 500 GB of disk.

andreyvelich · 2022-12-02

I am not sure that we can meet the < 30 minutes requirement.
If we are going to run more than one Katib Experiment in the future, we might need more time. WDYT @johnugeorge?
What about the Pipelines team, @james-jwu?

jbottum · 2022-12-02

One more idea: the Katib and Training Operator configuration and tests should attempt to integrate with the Pipeline configuration and test configuration. (My point is that we should try to minimize the conformance testing configuration and resource requirements if and when possible.)

james-jwu · 2022-12-02

The Pipeline requirement is relatively light. See the following in setup.yaml:
cpu: "2"
memory: 2Gi
requests.storage: "5Gi"

It's been a while since I last ran the Pipeline tests, but they are quite fast (< 15 min for sure).

How long do the current Katib and Training tests take to run?

johnugeorge · 2022-12-03

@james-jwu Are the resources a mandatory requirement? We have been running the Katib deployment and tests on GitHub CI, which has a 2-core CPU and 7 GB of memory. Since the allocated resources are a bit tight, we have seen certain runs exceed the 30-minute limit. However, with slightly more CPU resources, we can easily finish within 30 minutes.

anencore94 · 2022-12-05

Also, Katib's hyperparameter search doesn't depend much on how each training step actually goes, so we could use a very small number of epochs or a very small neural network for the conformance test's experiments.

james-jwu · 2022-12-05

For the first version, I think it is okay to require more resources. Jaeyeon's suggestion also sounds great.
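
The numbers discussed in this thread (3 nodes, 24 vCPUs, 64 GB RAM, 500 GB of disk, and a budget of under 30 minutes) could be turned into a simple pre-flight check along the following lines. This is only a sketch under stated assumptions: the `ClusterCapacity` type and the `check_minimums` and `within_time_budget` helpers are hypothetical names made up for illustration, and nothing here is part of the proposal or of any Kubeflow tooling.

```python
from typing import List, NamedTuple


class ClusterCapacity(NamedTuple):
    """Hypothetical summary of a test cluster's total capacity."""
    nodes: int
    vcpus: int
    memory_gb: int
    disk_gb: int


# Suggested minimums quoted in the review thread; not a confirmed requirement.
SUGGESTED_MINIMUM = ClusterCapacity(nodes=3, vcpus=24, memory_gb=64, disk_gb=500)
TIME_BUDGET_MINUTES = 30


def check_minimums(actual: ClusterCapacity,
                   minimum: ClusterCapacity = SUGGESTED_MINIMUM) -> List[str]:
    """Return human-readable shortfalls; an empty list means the cluster fits."""
    shortfalls = []
    for field in ClusterCapacity._fields:
        have, need = getattr(actual, field), getattr(minimum, field)
        if have < need:
            shortfalls.append(f"{field}: have {have}, need at least {need}")
    return shortfalls


def within_time_budget(elapsed_minutes: float,
                       budget: float = TIME_BUDGET_MINUTES) -> bool:
    """True if a run finished strictly under the budget ('< 30 minutes')."""
    return elapsed_minutes < budget
```

A runner like the 2-core, 7 GB CI machine mentioned above would report shortfalls for nodes, vCPUs, and memory, which matches the observation that some runs exceed the 30-minute limit on tight resources.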

jbottum · 2022-12-02T01:48:31Z

docs/proposals/conformance-test.md

+- The above report can be downloaded from the test deployment by running `make report`.
+
+- When all reports have been collected, the distributions are going to create PR
+  to publish the reports. The Kubeflow Conformance Committee will verify it and


to publish the reports and to update the appropriate Kubeflow.org web pages on conformant Kubeflow distributions.

andreyvelich · 2022-12-02T12:37:27Z

Thank you for the review @jbottum @james-jwu. I addressed your points.

tenzen-y

@andreyvelich Thanks for this proposal!
LGTM

docs/proposals/conformance-test.md

jbottum · 2022-12-07T01:06:52Z

LGTM, thanks!

andreyvelich · 2022-12-07T14:42:09Z

I guess all the comments have been addressed. Thanks for the review!
If you are ok with the proposal, please leave your /lgtm and we will merge the PR.

tenzen-y · 2022-12-07T14:43:17Z

@andreyvelich Thanks!
/lgtm

jbottum · 2022-12-07T20:23:38Z

/lgtm

andreyvelich · 2022-12-08T13:34:36Z

Thanks, everyone, for the review. Looking forward to our next steps.
/hold cancel

google-oss-prow bot assigned anencore94, james-jwu, jbottum, johnugeorge and tenzen-y Dec 1, 2022

google-oss-prow bot requested review from anencore94 and sperlingxx December 1, 2022 19:47

google-oss-prow bot added approved size/L labels Dec 1, 2022

Add Conformance Program Doc for AutoML and Training WG

22e4467

andreyvelich force-pushed the add-conformance-doc branch from 7b3c450 to 22e4467 Compare December 1, 2022 19:50

google-oss-prow bot added the do-not-merge/hold label Dec 1, 2022

james-jwu reviewed Dec 1, 2022

google-oss-prow bot added the lgtm label Dec 1, 2022

terrytangyuan approved these changes Dec 2, 2022

jbottum reviewed Dec 2, 2022

google-oss-prow bot removed the lgtm label Dec 2, 2022

Address Review Comments

604cf2c

andreyvelich force-pushed the add-conformance-doc branch from 46a5be1 to 604cf2c Compare December 2, 2022 12:41

tenzen-y reviewed Dec 2, 2022

anencore94 reviewed Dec 5, 2022

google-oss-prow bot added the lgtm label Dec 7, 2022

google-oss-prow bot removed the do-not-merge/hold label Dec 8, 2022

google-oss-prow bot merged commit 87b7e7d into kubeflow:master Dec 8, 2022

andreyvelich deleted the add-conformance-doc branch December 8, 2022 13:36
