This repository has been archived by the owner on Sep 5, 2019. It is now read-only.

Add build timeout #315

Merged
merged 11 commits into from
Aug 27, 2018

Conversation

shashwathi
Contributor

Fixes #254

Proposed Changes

  • Introduce a build timeout, configurable via a timeout field
  • The default timeout is 10 minutes.

@shashwathi shashwathi changed the title Build timeout Add Build timeout Aug 27, 2018
@shashwathi shashwathi changed the title Add Build timeout Add build timeout Aug 27, 2018
Member

@imjasonh imjasonh left a comment


Thanks so much for adding this! 👍

Status BuildStatus `json:"status"`
Spec BuildSpec `json:"spec"`
Status BuildStatus `json:"status"`
Timeout string `json:"timeout,omitempty"`
Member


I think we'll want Timeout to exist inside of BuildSpec

Contributor Author


That makes sense. 👍

@@ -69,6 +69,14 @@ func (op *operation) Checkpoint(status *v1alpha1.BuildStatus) error {
return nil
}

func (op *operation) Terminate() error {
err := op.builder.kubeclient.CoreV1().Pods(op.namespace).Delete(op.name, &metav1.DeleteOptions{})
Member


nit: return op.builder.kubeclient.CoreV1().Pods(op.namespace).Delete(op.name, &metav1.DeleteOptions{})

if err := c.waitForOperationAsync(build, op); err != nil {
return err

// Check if build has timed out
Member


Will syncHandler be called periodically? Or only when there are updates to the build object? If it's periodic, how often is it called?

I worry that if this code is only run when a build updates (i.e., the pod's status changes), then a build that specifies one hour-long sleep step with a 10m timeout will not have any occasion to check for timeout until that step finishes, at which point the build has long exceeded its timeout.

Contributor Author


Yes, syncHandler is called periodically. I need to do some research to find out how often it's called.

Contributor Author


From the logs, it appears the sync interval is 30 seconds.

Member


Hmm, we should document that somewhere, that we effectively only enforce timeouts to a ~30s resolution. Maybe we should also Watch the pod, with a timeout? If we do that, do we need to worry about the controller job restarting while builds are ongoing, and losing state? (All questions for future work, not blocking this PR)

//cleanup operation and update status
timeoutMsg := fmt.Sprintf("Build %q failed to finish within %q", build.Name, build.Timeout)

err = op.Terminate()
Member


In cases where the method only returns an err, I tend to prefer:

if err := op.Terminate(); err != nil {
  c.logger.Error(...)
  return err
}

- name: step1
image: ubuntu
command: ["/bin/bash"]
args: ["-c", "sleep 2000"]
Member


Oh okay, so this test indicates that syncHandler must be called periodically, or at least not only when steps start/finish. Do we have any idea when or how often exactly syncHandler is called?

Contributor Author


30 seconds is the sync interval

metadata:
name: test-with-low-timeout
labels:
expect: failed
Member


This is probably another case where our integration tests can be improved, but it would be great if we could have expect: timeout and have e2e-test.sh assert not only that the test failed, but that it failed due to timeout specifically. If the build was changed to exit 1 the test would still pass. 😕

Anyway, not necessary for this PR, but something we should think about for future integration testing improvements.

Contributor Author


Yeah, failed is very generic, and asserting on the specific error would be ideal. I added a controller-level test for timeout that triggers syncHandler twice manually.

@@ -71,6 +71,10 @@ type BuildSpec struct {
// More info: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/
// +optional
NodeSelector map[string]string `json:"nodeSelector,omitempty"`

// Time after which the build times out.
// Defaults to 10minutes.
Member


typo: 10 minutes.

This should maybe also link to Go's ParseDuration documentation to clarify the expected format: https://golang.org/pkg/time/#ParseDuration

Should we enforce a maximum timeout too? Today, GCB limits builds to 24h for our own implementation purposes, though I suppose there's no strong reason to carry that limitation over to Knative Build. OTOH, it's easier to remove a restriction than to add one later. WDYT?

Contributor Author


I like the idea of maximum timeout. Let me incorporate this into validation of build.
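A validation along those lines could look roughly like this. The function name, error messages, and the exact 24h constant are hypothetical here, sketched from the discussion rather than taken from the actual webhook code:

```go
package main

import (
	"fmt"
	"time"
)

// maxTimeout mirrors the 24h ceiling discussed above (hypothetical constant).
const maxTimeout = 24 * time.Hour

// validateTimeout rejects timeouts that fail to parse or exceed the maximum.
// An empty string is allowed and means "use the default".
func validateTimeout(timeout string) error {
	if timeout == "" {
		return nil
	}
	d, err := time.ParseDuration(timeout)
	if err != nil {
		return fmt.Errorf("invalid timeout %q: %v", timeout, err)
	}
	if d > maxTimeout {
		return fmt.Errorf("timeout %v exceeds maximum of %v", d, maxTimeout)
	}
	return nil
}

func main() {
	fmt.Println(validateTimeout("10m")) // <nil>
	fmt.Println(validateTimeout("48h")) // timeout 48h0m0s exceeds maximum of 24h0m0s
}
```

Because the field is a time.ParseDuration string, the same parse step serves both validation and the later timeout check.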

@@ -70,6 +75,25 @@ func IsDone(status *v1alpha1.BuildStatus) bool {
return false
}

// IsTimeout returns true if the build's status indicates that the build has timed out.
func IsTimeout(status *v1alpha1.BuildStatus, buildTimeout string) bool {
defaultTimeout := "10m"
Member


It might be simpler to have the default already stated in terms of a time.Duration:

var timeout time.Duration
if buildTimeout == "" {
  timeout = 10 * time.Minute
} else {
  var err error
  if timeout, err = time.ParseDuration(buildTimeout); err != nil {
    return false
  }
}

Contributor Author


Yeah I like this. Much simpler

@shashwathi
Contributor Author

shashwathi commented Aug 27, 2018

/assign @pivotal-nader-ziada
/assign @imjasonh

Message: message,
}}
}

Member


Are you using the http package and its error types to indicate a timeout from an external server? Do the other status errors include an http error?

Contributor Author


The status error package includes client timeouts, but in this case I wanted to return a custom timeout error with more detail about the build.

Member


I'm not actually sure what benefit using a StatusError gives us, can we just return fmt.Errorf("Build timed out after %s", timeout)?

Contributor Author


Yeah I am good with that

t.Errorf("Expected no error while terminating operation")
}
// Verify pod is not available
_, err = podsclient.Get(op.Name(), metav1.GetOptions{})
Member


nit: This can be combined:

if _, err := podsclient.Get(...); err == nil {
  t.Fatalf(...)
}

Contributor Author


I seem to have an obsession with this pattern :P I will be more mindful of this going forward. Thanks for your patience

@knative-metrics-robot

The following is the coverage report on pkg/.
Say /test pull-knative-build-go-coverage to run the coverage report again

File Old Coverage New Coverage Delta
pkg/builder/cluster/builder.go 78.4% 78.6% 0.2
pkg/builder/nop/builder.go 82.6% 79.2% -3.4
pkg/controller/build/controller.go 48.3% 50.0% 1.7
pkg/webhook/build.go 88.8% 89.9% 1.1

@shashwathi
Contributor Author

@imjasonh

Hmm, we should document that somewhere, that we effectively only enforce timeouts to a ~30s resolution.

I will submit PR for docs.

Maybe we should also Watch the pod, with a timeout? If we do that, do we need to worry about the controller job restarting while builds are ongoing, and losing state?

I did consider the idea of Watching the Pod with a timeout, but that would require a huge refactor and I wanted to keep the scope small.
I can follow up with a POC PR to get some feedback on Watch. We can use a callbacks map to track the Watch objects for builds to avoid losing state. We also need to be careful not to reset the channel.

@nader-ziada
Member

/lgtm

Member

@imjasonh imjasonh left a comment


/lgtm
/approve

@knative-prow-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ImJasonH, shashwathi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow-robot knative-prow-robot merged commit b0143a0 into knative:master Aug 27, 2018
@shashwathi shashwathi deleted the build-timeout branch October 9, 2018 18:52
vdemeester pushed a commit to vdemeester/knative-build that referenced this pull request Apr 3, 2019
* Introduce build timeout

* Update status for build objects

* Update error msg format

* Remove debug statements

* Update to error msg

* Address comments

* Add test to check timeout error flow

* Add more tests and simplify default timeout

* Introduce maximum timeout

* Fix typo

* Address comments