tf_job_simple_test results not being reported #1426
Comments
So the test didn't create a new job but updated an existing one; that's a bit strange.
The list of uploaded XML files is here. It looks like generate_xml might not be logging a message when it writes the output.
* This is intended to help debug kubeflow/kubeflow#1426 by showing whether generate_xml is called.
…ported. * Related to kubeflow#1426
After adding logging in #197 and #1427, we see the following logging output
So it looks like the XML file is named incorrectly, and that probably prevents gubernator from detecting the test.
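To illustrate why a missing suite name can hide results, here is a minimal sketch of a junit writer that derives the file name from the suite name and logs when it writes. The helper names `junit_path` and `write_junit` are hypothetical and are not the actual generate_xml implementation; the `junit_<name>.xml` pattern is an assumption about what gubernator looks for in the artifacts directory.

```python
import logging
import os
import xml.etree.ElementTree as ET


def junit_path(artifacts_dir, suite_name):
    """Return the junit file path for a suite (hypothetical helper).

    Refuses an empty name, because the name becomes part of the file name.
    """
    if not suite_name:
        raise ValueError("suite_name must be set; it is used in the file name")
    # Assumption: result files are discovered by a junit_*.xml pattern.
    return os.path.join(artifacts_dir, "junit_{0}.xml".format(suite_name))


def write_junit(artifacts_dir, suite_name, cases):
    """Write a minimal junit XML file and log where it went.

    cases is a list of (case_name, failure_message_or_None) tuples.
    """
    path = junit_path(artifacts_dir, suite_name)
    failures = sum(1 for _, failure in cases if failure)
    suite = ET.Element(
        "testsuite",
        name=suite_name,
        tests=str(len(cases)),
        failures=str(failures),
    )
    for case_name, failure in cases:
        case = ET.SubElement(
            suite, "testcase", classname=suite_name, name=case_name
        )
        if failure:
            ET.SubElement(case, "failure").text = failure
    os.makedirs(artifacts_dir, exist_ok=True)
    ET.ElementTree(suite).write(path, encoding="utf-8", xml_declaration=True)
    # Logging the path makes it obvious in the workflow logs that the
    # file was written and what it was called.
    logging.info("Wrote junit results to %s", path)
    return path
```

Failing fast on an empty name, rather than writing a file such as `junit_.xml`, would surface the misconfiguration in the test logs instead of silently dropping the result.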
* name is used in the XML file containing the test results. If name isn't set, the XML file won't be created correctly and therefore won't be surfaced in gubernator correctly; see kubeflow/kubeflow#1426
* Related to kubeflow/kubeflow#1426
* Fix kubeflow#1426

  There are two problems with the test:
  1. The test isn't properly reporting results to gubernator, so test failures aren't being noticed.
  2. The test needs to be updated to work with v1alpha2.
* The TestSuite name needs to be set because this is used as the name of the junit XML file.
* simple-prototype-test should set test_dir and artifacts_dir.
* Fix the test; use tf_job_client to wait for the job to be in the Running condition. This should be more reliable than checking for actual pods.
* The test has probably been broken for a while, but this went unnoticed because results weren't being properly surfaced in test grid, since the XML file is improperly named. I suspect things broke as part of the switch to v1alpha2, which changed the names of the pods.
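The idea of waiting on the job's Running condition instead of looking for pods by name can be sketched as a small polling loop. This is not the tf_job_client API: `get_job` is a hypothetical callable returning the job as a dict, and the `status.conditions` layout with `type` and `status` fields is an assumption about the v1alpha2 TFJob status.

```python
import time


def has_condition(job, expected):
    """Return True if the job dict reports the expected condition as True."""
    conditions = (job.get("status") or {}).get("conditions") or []
    return any(
        c.get("type") == expected and c.get("status") == "True"
        for c in conditions
    )


def wait_for_condition(get_job, expected="Running", timeout_s=300,
                       poll_s=5, sleep=time.sleep, clock=time.monotonic):
    """Poll get_job() until the expected condition is set or time runs out.

    get_job is a hypothetical zero-argument callable returning the job as
    a dict; sleep and clock are injectable so the loop can be tested
    without real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        job = get_job()
        if has_condition(job, expected):
            return job
        if clock() >= deadline:
            raise TimeoutError(
                "job did not reach condition {0} within {1}s".format(
                    expected, timeout_s))
        sleep(poll_s)
```

Checking a condition on the job object keeps the test independent of pod naming, which is what changed with v1alpha2.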
This will change the katib-controller and katib-ui roles to clusterroles. Additionally Dominik Fleischmann is being added to the owners of the katib operators.
* Migrate Istio and Dex to V3 * Roll back AWS change
Here's a postsubmit run
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubeflow_kubeflow/kubeflow-postsubmit/619
There are 10 passed tests
test_jsonnet
deploy-kubeflow-deploy_argo-test-argo-deploy
deploy-kubeflow-deploy_minikube
deploy-kubeflow-deploy_model-mnist-cpu
deploy-kubeflow-deploy_pytorchjob-pytorch-job
deploy-kubeflow-teardown
deploy-kubeflow-teardown_minikube
smoke-tfjob-gke
test_jsonnet_formatting test_jsonnet_formatting
tf-serving-image-mnist-cpu
1 Failed test
deploy-kubeflow-deploy_model-mnist-gpu
There is no report for the simple TFJob prototype test
Looking at Argo it looks like the test ran
http://testing-argo.kubeflow.org/workflows/kubeflow-test-infra/kubeflow-postsubmit-kfctl-a41ba72-619-0fc2?tab=workflow&nodeId=kubeflow-postsubmit-kfctl-a41ba72-619-0fc2-203334643
There's no indication that the test completed successfully; i.e.
https://github.com/kubeflow/kubeflow/blob/master/testing/tf_job_simple_test.py#L111
we should print out "TFJob launched successfully."
There's also no indication in the logs that we saved the results/failure to GCS as an example file