Add Conformance Program Doc for AutoML and Training WG #2048

Merged 2 commits on Dec 8, 2022

andreyvelich
Member

Related: kubeflow/training-operator#1695, #2044.
Original doc: https://docs.google.com/document/d/1TRUKUY1zCCMdgF-nJ7QtzRwifsoQop0V8UnRo-GWlpI/edit#.

I've added a conformance doc for the CRD-based tests for the AutoML and Training WG.
Please take a look.

/assign @james-jwu @johnugeorge @tenzen-y @jbottum @anencore94

cc other WGs
@kubeflow/wg-training-leads
@kubeflow/wg-pipeline-leads
@kubeflow/wg-notebooks-leads
@kubeflow/wg-manifests-leads

@andreyvelich
Member Author

/hold for the review

That will help to avoid environment variance and improve fault tolerance. Driver is required to trigger the deployment and download the results.

- We are going to use
[the unify Makefile](https://github.com/kubeflow/kubeflow/blob/2fa0d3665234125aeb8cebe8fe44f0a5a50791c5/conformance/1.5/Makefile)

'unify' -> 'unified'

@james-jwu

/lgtm

@google-oss-prow google-oss-prow bot added the lgtm label Dec 1, 2022
Member

@terrytangyuan terrytangyuan left a comment

LGTM. Thanks for leading this effort!

@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich, terrytangyuan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


## Kubeflow Conformance

Kubeflow conformance consists the 3 category of tests:

Initially, the Kubeflow conformance will include CRD-based tests. In the future, API- and UI-based tests may be added.

In the following versions, we should design conformance program for the
Katib API-based tests.

- CRD-based tests

This should be the first bullet in this list, before the API and UI tests.

- The tests should cover basic functionality of Katib and the Training Operator.
It will not cover all features.
- The tests are expected to evolve in the future versions.

  • The tests should have a short, well-documented list of set-up requirements.
  • The tests should install and complete in a relatively short period of time (< 30 minutes) on the suggested minimum infrastructure, i.e. 3 nodes, 24 vCPUs, 64 GB RAM, and 500 GB disk.

Member Author

I am not sure that we can meet the < 30 minutes requirement.
If we are going to run more than one Katib Experiment in the future, we might need more time. WDYT @johnugeorge?
What about the Pipelines team, @james-jwu?

One more idea: the Katib and Training Operator configuration and tests should attempt to integrate with the Pipelines configuration and test configuration. (My point is that we should try to minimize the conformance testing configuration and resource requirements if and when possible.)

The Pipelines requirement is relatively light. See the following in setup.yaml:
cpu: "2"
memory: 2Gi
requests.storage: "5Gi"

It's been a while since I last ran the Pipeline tests, but they are quite fast (<15 min for sure).

How long do the current Katib and Training tests take to run?

Member

@james-jwu Are the resources a mandatory requirement? We have been running the Katib deployment and tests on GitHub CI, which has a 2-core CPU and 7 GB of memory. Since the allocated resources are a bit tight, we have seen certain runs exceed the 30-minute limit. However, with slightly more CPU resources, we can easily finish within 30 minutes.
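
To put the numbers in this thread side by side, here is a minimal sketch that compares the GitHub CI runner figures quoted above (2-core CPU, 7 GB memory) with the suggested minimum proposed earlier (24 vCPUs, 64 GB RAM). The dictionary keys and the `shortfalls` helper are hypothetical names chosen for illustration; node count and disk are left out because the comment does not state them for the CI runner.

```python
# Suggested minimum infrastructure proposed earlier in this review thread.
SUGGESTED_MINIMUM = {"vcpu": 24, "memory_gb": 64}

# GitHub CI runner figures quoted in the comment above.
GITHUB_CI_RUNNER = {"vcpu": 2, "memory_gb": 7}


def shortfalls(available, required):
    """Return {resource: missing_amount} for each resource below the requirement."""
    return {
        name: need - available.get(name, 0)
        for name, need in required.items()
        if available.get(name, 0) < need
    }


if __name__ == "__main__":
    print(shortfalls(GITHUB_CI_RUNNER, SUGGESTED_MINIMUM))
```

The CI runner sits far below the suggested minimum on both counts, yet per the comment above it only occasionally exceeds the 30-minute limit, which suggests the suggested minimum leaves considerable headroom for these tests.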

Member

Also, Katib's hyperparameter search doesn't care much about how each training step actually goes, so we could use a very small number of epochs or a very small neural network for the conformance test's experiments.
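
As a rough illustration of this idea, the sketch below builds a deliberately small Katib Experiment manifest as a plain Python dict: one epoch per trial, two trials, one at a time. The field names follow the `kubeflow.org/v1beta1` Experiment schema as I understand it. The function name, container image, script path, CLI flags, and metric name are hypothetical placeholders, not anything defined by this PR.

```python
import json


def tiny_experiment(name, namespace="kubeflow", num_epochs=1, max_trials=2):
    """Build a small Katib Experiment manifest (as a dict) for a quick smoke test.

    The image, script path, flags, and metric name are hypothetical placeholders.
    """
    # Hypothetical trainer invocation; only the epoch count is kept small on purpose.
    command = [
        "python3",
        "/opt/train.py",
        "--lr=${trialParameters.learningRate}",  # substituted by Katib per trial
        f"--num-epochs={num_epochs}",
    ]
    trial_spec = {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {
                            "name": "training-container",
                            "image": "example.com/tiny-trainer:latest",
                            "command": command,
                        }
                    ],
                    "restartPolicy": "Never",
                }
            }
        },
    }
    return {
        "apiVersion": "kubeflow.org/v1beta1",
        "kind": "Experiment",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "objective": {
                "type": "maximize",
                "goal": 0.5,  # placeholder target value
                "objectiveMetricName": "accuracy",
            },
            "algorithm": {"algorithmName": "random"},
            "maxTrialCount": max_trials,  # few trials keep the run short
            "parallelTrialCount": 1,
            "maxFailedTrialCount": 1,
            "parameters": [
                {
                    "name": "lr",
                    "parameterType": "double",
                    "feasibleSpace": {"min": "0.01", "max": "0.05"},
                }
            ],
            "trialTemplate": {
                "primaryContainerName": "training-container",
                "trialParameters": [
                    {
                        "name": "learningRate",
                        "description": "Learning rate for the trainer",
                        "reference": "lr",
                    }
                ],
                "trialSpec": trial_spec,
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(tiny_experiment("conformance-random"), indent=2))
```

The resulting JSON could be saved to a file and applied with `kubectl apply -f`, since Kubernetes accepts JSON manifests as well as YAML. Whether one epoch is enough to emit a usable metric depends on the actual training image.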

For the 1st version I think it is okay to require more resources. Jaeyeon's suggestion also sounds great.

- The above report can be downloaded from the test deployment by running `make report`.

- When all reports have been collected, the distributions are going to create PR
to publish the reports. The Kubeflow Conformance Committee will verify it and

to publish the reports and to update the appropriate Kubeflow.org web pages on conformant Kubeflow distributions.

@google-oss-prow google-oss-prow bot removed the lgtm label Dec 2, 2022
@andreyvelich
Member Author

Thank you for the review @jbottum @james-jwu. I addressed your points.

Member

@tenzen-y tenzen-y left a comment

@andreyvelich Thanks for this proposal!
LGTM

@jbottum

jbottum commented Dec 7, 2022

LGTM, thanks!

@andreyvelich
Member Author

I guess all the comments have been addressed. Thanks for the review!
If you are ok with the proposal, please leave your /lgtm and we will merge the PR.

@tenzen-y
Member

tenzen-y commented Dec 7, 2022

@andreyvelich Thanks!
/lgtm

@google-oss-prow google-oss-prow bot added the lgtm label Dec 7, 2022
@jbottum

jbottum commented Dec 7, 2022

/lgtm

@andreyvelich
Member Author

Thanks everyone for the review, looking forward to our next steps.
/hold cancel

@google-oss-prow google-oss-prow bot merged commit 87b7e7d into kubeflow:master Dec 8, 2022
@andreyvelich andreyvelich deleted the add-conformance-doc branch December 8, 2022 13:36
7 participants