Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bubble up pod schedule errors to revision status #4191

Merged
merged 1 commit into from
Jun 3, 2019

Conversation

shashwathi
Copy link
Contributor

Pod schedule error information is updated in revision status so user can debug what's happening from the knative revision, configuration and svc object.

Eg:
Sample configuration status (with pod spec that has unschedulable resource )

    conditions:
    - lastTransitionTime: 2019-05-30T18:42:45Z
      message: 'Revision "configuration-example-jmk22" failed with message: 0/3 nodes
        are available: 3 Insufficient cpu..'
      reason: RevisionFailed
      status: "False"
      type: Ready

Revision status

   - lastTransitionTime: 2019-05-30T18:42:45Z
      message: '0/3 nodes are available: 3 Insufficient cpu.'
      reason: Unschedulable
      status: "False"
      type: ResourcesAvailable

Fixes #4153

@googlebot googlebot added the cla: yes Indicates the PR's author has signed the CLA. label May 30, 2019
@knative-prow-robot knative-prow-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label May 30, 2019
Copy link
Contributor

@knative-prow-robot knative-prow-robot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shashwathi: 1 warning.

In response to this:

Pod schedule error information is updated in revision status so user can debug what's happening from the knative revision, configuration and svc object.

Eg:
Sample configuration status (with pod spec that has unschedulable resource )

   conditions:
   - lastTransitionTime: 2019-05-30T18:42:45Z
     message: 'Revision "configuration-example-jmk22" failed with message: 0/3 nodes
       are available: 3 Insufficient cpu..'
     reason: RevisionFailed
     status: "False"
     type: Ready

Revision status

  - lastTransitionTime: 2019-05-30T18:42:45Z
     message: '0/3 nodes are available: 3 Insufficient cpu.'
     reason: Unschedulable
     status: "False"
     type: ResourcesAvailable

Fixes #4153

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@@ -197,6 +197,10 @@ func (rs *RevisionStatus) MarkResourcesAvailable() {
revCondSet.Manage(rs).MarkTrue(RevisionConditionResourcesAvailable)
}

func (rs *RevisionStatus) MarkResourcesUnavailable(reason, message string) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Golint comments: exported method RevisionStatus.MarkResourcesUnavailable should have comment or be unexported. More info.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix this as well.

@knative-prow-robot knative-prow-robot added the area/API API objects and controllers label May 30, 2019
@knative-metrics-robot
Copy link

The following is the coverage report on pkg/.
Say /test pull-knative-serving-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/serving/v1alpha1/revision_lifecycle.go 84.1% 84.4% 0.2
pkg/reconciler/revision/reconcile_resources.go 91.4% 91.7% 0.3

@jonjohnsonjr
Copy link
Contributor

Nice, I'll need MarkResourcesUnavailable for #4192

@dprotaso
Copy link
Member

/lgtm

@shashwathi after rebasing/resolving conflict you'll need to assign this to Matt because of the API package changes

@knative-prow-robot knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label May 31, 2019
@knative-prow-robot knative-prow-robot removed the lgtm Indicates that a PR is ready to be merged. label May 31, 2019
@shashwathi
Copy link
Contributor Author

/assign @mattmoor

@knative-metrics-robot
Copy link

The following is the coverage report on pkg/.
Say /test pull-knative-serving-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/revision/reconcile_resources.go 91.4% 91.7% 0.3

@knative-metrics-robot
Copy link

The following is the coverage report on pkg/.
Say /test pull-knative-serving-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/serving/v1alpha1/revision_lifecycle.go 84.1% 77.2% -6.9
pkg/reconciler/revision/reconcile_resources.go 91.4% 90.0% -1.4

Copy link
Member

@mattmoor mattmoor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@knative-prow-robot knative-prow-robot added lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jun 3, 2019
@mattmoor
Copy link
Member

mattmoor commented Jun 3, 2019

/hold

Is an e2e test for this possible? If not, feel free to remove the hold.

@knative-prow-robot knative-prow-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 3, 2019
@mattmoor
Copy link
Member

mattmoor commented Jun 3, 2019

Perhaps you could ask for an unreasonably large amount of CPU? Like 50k cores :)

@knative-prow-robot knative-prow-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. area/test-and-release It flags unit/e2e/conformance/perf test issues for product features and removed lgtm Indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 3, 2019
@shashwathi
Copy link
Contributor Author

@mattmoor : I have updated the PR to include e2e test with high CPU like you recommended :) Thanks for the suggestion.

@shashwathi
Copy link
Contributor Author

/hold cancel

@knative-prow-robot knative-prow-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 3, 2019
@knative-metrics-robot
Copy link

The following is the coverage report on pkg/.
Say /test pull-knative-serving-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/serving/v1alpha1/revision_lifecycle.go 84.1% 77.2% -6.9
pkg/reconciler/revision/reconcile_resources.go 91.4% 91.7% 0.3

@shashwathi
Copy link
Contributor Author

/test pull-knative-serving-unit-tests

@@ -0,0 +1,98 @@
package e2e
Copy link
Contributor

@jonjohnsonjr jonjohnsonjr Jun 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs:

// +build e2e

/*
Copyright 2019 The Knative Authors

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much

break
}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we care about overriding this status with the statuses below if they all happen?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean updating revision status if resources are present on nodes?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NM, I think it's fine. But what I meant is code below would override the status value. But we don't have idea of priorities, so I guess it doesn't matter here.


names.Config = serviceresourcenames.Configuration(svc)

errorReason := "RevisionFailed"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

those both are constants.

t.Fatalf("Failed to get revision from configuration %s: %v", names.Config, err)
}

revisionReason := "Unschedulable"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as are these 2.

}

// Get revision name from configuration.
func getRevisionFromConfiguration(clients *test.Clients, configName string) (string, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just revisionFromConfiguration.

@knative-metrics-robot
Copy link

The following is the coverage report on pkg/.
Say /test pull-knative-serving-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/serving/v1alpha1/revision_lifecycle.go 84.1% 77.2% -6.9
pkg/reconciler/revision/reconcile_resources.go 91.4% 91.7% 0.3

Copy link
Contributor

@vagababov vagababov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

superficial issues only

revisionName, revisionReason, errorMsg, cond.Reason, cond.Message)
}
return false, nil
}, "RevisionFailed")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use errorReason here?

if config.Status.LatestCreatedRevisionName != "" {
return config.Status.LatestCreatedRevisionName, nil
}
return "", fmt.Errorf("No valid revision name found in configuration %s", configName)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

errors start with lowercase in Go.

Suggested change
return "", fmt.Errorf("No valid revision name found in configuration %s", configName)
return "", fmt.Errorf("no valid revision name found in configuration %s", configName)

if cond.Reason == revisionReason && strings.Contains(cond.Message, errorMsg) {
return true, nil
}
return true, fmt.Errorf("The revision %s was not marked with expected error condition (Reason=%q, Message=%q), but with (Reason=%q, Message=%q)",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return true, fmt.Errorf("The revision %s was not marked with expected error condition (Reason=%q, Message=%q), but with (Reason=%q, Message=%q)",
return true, fmt.Errorf("the revision %s was not marked with expected error condition (Reason=%q, Message=%q), but with (Reason=%q, Message=%q)",

@shashwathi
Copy link
Contributor Author

shashwathi commented Jun 3, 2019

@vagababov Addressed your comments. Ready for another review

@knative-metrics-robot
Copy link

The following is the coverage report on pkg/.
Say /test pull-knative-serving-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/serving/v1alpha1/revision_lifecycle.go 84.1% 77.2% -6.9
pkg/reconciler/revision/reconcile_resources.go 91.4% 91.7% 0.3

Copy link
Contributor

@vagababov vagababov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do the linter error and found one more error string :)
But I am sure those are the last ones :)

@@ -197,6 +197,10 @@ func (rs *RevisionStatus) MarkResourcesAvailable() {
revCondSet.Manage(rs).MarkTrue(RevisionConditionResourcesAvailable)
}

func (rs *RevisionStatus) MarkResourcesUnavailable(reason, message string) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix this as well.

return true, nil
}
t.Logf("Reason: %s ; Message: %s ; Status: %s", cond.Reason, cond.Message, cond.Status)
return true, fmt.Errorf("The service %s was not marked with expected error condition (Reason=\"%s\", Message=\"%s\", Status=\"%s\"), but with (Reason=\"%s\", Message=\"%s\", Status=\"%s\")",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return true, fmt.Errorf("The service %s was not marked with expected error condition (Reason=\"%s\", Message=\"%s\", Status=\"%s\"), but with (Reason=\"%s\", Message=\"%s\", Status=\"%s\")",
return true, fmt.Errorf("the service %s was not marked with expected error condition (Reason=\"%s\", Message=\"%s\", Status=\"%s\"), but with (Reason=\"%s\", Message=\"%s\", Status=\"%s\")",

@knative-metrics-robot
Copy link

The following is the coverage report on pkg/.
Say /test pull-knative-serving-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/serving/v1alpha1/revision_lifecycle.go 84.1% 77.2% -6.9
pkg/reconciler/revision/reconcile_resources.go 91.4% 91.7% 0.3

Copy link
Contributor

@vagababov vagababov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve
Thanks!

@knative-prow-robot knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 3, 2019
@knative-prow-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mattmoor, shashwathi, vagababov

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/API API objects and controllers area/test-and-release It flags unit/e2e/conformance/perf test issues for product features cla: yes Indicates the PR's author has signed the CLA. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Surface pod scheduling errors in service status
8 participants