adding timeout to alert.post and http_post #1462
Conversation
@sputnik13 This looks good; I see one issue. The Update function does not propagate a new Timeout value to existing handlers, since the timeout is set only when the handler is created. One possible change is to use context.WithTimeout together with request.WithContext to apply a timeout per request, so that the client need not be destroyed and recreated when the handler config changes.
@desa Could you also review this PR?
On it.
@nathanielc got it, I'll take a look and push up an update
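For reference, a minimal sketch of the per-request timeout approach described above, with hypothetical type and field names (`handler`, `h.timeout`); the shared client is never rebuilt:

```go
import (
	"context"
	"net/http"
	"time"
)

type handler struct {
	client  *http.Client
	timeout time.Duration
}

// doPost applies the timeout per request via the context, so a changed
// handler config takes effect without recreating the client.
func (h *handler) doPost(req *http.Request) (*http.Response, error) {
	ctx, cancel := context.WithTimeout(req.Context(), h.timeout)
	defer cancel()
	return h.client.Do(req.WithContext(ctx))
}
```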
services/httppost/service.go
Outdated
```
@@ -207,6 +210,7 @@ func (s *Service) Handler(c HandlerConfig, l *log.Logger) alert.Handler {
 	endpoint: e,
 	logger:   l,
 	headers:  c.Headers,
+	hc:       &http.Client{Timeout: c.Timeout},
```
I like the idea of using context.WithTimeout on each request, but if you choose a different approach and use a custom http.Client, be sure to add a transport that uses ProxyFromEnvironment:

```go
&http.Client{
	Timeout: c.Timeout,
	Transport: &http.Transport{
		Proxy: http.ProxyFromEnvironment,
	},
}
```
@nathanielc @desa sorry for the delay; seems like I had this just sitting on my computer. Is this what you had in mind?
(force-pushed from 93bbe35 to c6c66ee)
http_post.go
Outdated
req.Header.Set("Content-Type", "application/json") | ||
for k, v := range n.c.Headers { | ||
req.Header.Set(k, v) | ||
} | ||
|
||
// Set timeout | ||
ctx, cancel := context.WithTimeout(req.Context(), h.timeout) |
Should this be hn.Timeout?
That should have been n.timeout; not sure how I missed that, could have sworn it passed local tests... pushed a new change.
(force-pushed from c6c66ee to ef186e8)
It should be the same suite of tests. Looks like the following tests are failing in CI:
```
--- FAIL: TestBatch_AlertStateChangesOnly (0.03s)
--- FAIL: TestBatch_AlertStateChangesOnlyExpired (0.02s)
--- FAIL: TestStream_HttpPost (0.01s)
--- FAIL: TestStream_HttpPostEndpoint (0.01s)
--- FAIL: TestStream_Alert (0.02s)
--- FAIL: TestStream_Alert_NoRecoveries (0.05s)
--- FAIL: TestStream_Alert_WithReset_0 (0.05s)
--- FAIL: TestStream_Alert_WithReset_1 (0.02s)
--- FAIL: TestStream_AlertDuration (0.03s)
--- FAIL: TestStream_AlertHTTPPost (0.02s)
--- FAIL: TestStream_AlertHTTPPostEndpoint (0.05s)
--- FAIL: TestStream_AlertSigma (0.02s)
--- FAIL: TestStream_AlertComplexWhere (0.02s)
--- FAIL: TestStream_AlertStateChangesOnly (0.04s)
--- FAIL: TestStream_AlertStateChangesOnlyExpired (0.01s)
--- FAIL: TestStream_AlertFlapping (0.04s)
FAIL
FAIL github.com/influxdata/kapacitor/integrations 4.798s
--- FAIL: TestServer_ListServiceTests (0.02s)
--- FAIL: TestServer_AlertHandlers (1.01s)
--- FAIL: TestServer_AlertHandlers/post-8 (0.09s)
FAIL
FAIL github.com/influxdata/kapacitor/server 20.777s
```
Michael Desa

> On Sep 6 2017, at 4:57 pm, Min Pae <notifications@github.com> wrote:
> it's passing tests for me locally, does the ci gate run a different set of tests than test.sh?
looking into it
(force-pushed from be65ab6 to 50f4762)
@desa looks like tests are passing now, thanks for the hint on the test environment
LGTM.
```
ctx, cancel := context.WithTimeout(req.Context(), h.timeout)
defer cancel()
req = req.WithContext(ctx)
}
```
This code is repeated above as well. Since all code to get a request goes through https://github.com/sputnik13/kapacitor/blob/50f47625628a0937873ec60cda54c4b6d880ab0a/services/httppost/service.go#L50 why not move the logic to set a Timeout context on the request into that method?
The function will probably need a new parameter for the desired timeout since the source of the timeout varies, but that seems reasonable.
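A rough sketch of that refactor, with hypothetical names (the actual NewHTTPRequest signature in the service may differ); note that it has to hand the context's cancel function back to the caller:

```go
import (
	"context"
	"io"
	"net/http"
	"time"
)

// Endpoint stands in for the httppost service endpoint (hypothetical).
type Endpoint struct {
	url string
}

// NewHTTPRequest builds the POST request and, when timeout > 0, applies it
// through the request's context. Callers must defer the returned cancel.
func (e *Endpoint) NewHTTPRequest(body io.Reader, timeout time.Duration) (*http.Request, context.CancelFunc, error) {
	req, err := http.NewRequest("POST", e.url, body)
	if err != nil {
		return nil, nil, err
	}
	cancel := context.CancelFunc(func() {}) // no-op when no timeout is configured
	if timeout > 0 {
		var ctx context.Context
		ctx, cancel = context.WithTimeout(req.Context(), timeout)
		req = req.WithContext(ctx)
	}
	return req, cancel, nil
}
```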
I'm not sure that's necessarily better than setting the timeout locally. While there would be less code, I think clarity will suffer, since the function interface for NewHTTPRequest will have to return a cancel function for the context, which could be misconstrued as a cancel function for the request. I wish there were a way to set the timeout on the Request without having to pass in a context object.
If you disagree and think it's clear enough, I can make the change.
(force-pushed from 0416cef to dba9fb9)
@nathanielc @desa I've been digging into the reason for the race condition, and it seems due to how the DefaultTransport used by the DefaultClient uses a separate goroutine to perform the actual write. When .Do is called on the client, it ultimately calls .RoundTrip on the default transport (https://github.com/golang/go/blob/master/src/net/http/transport.go#L341), which writes the request to a channel (https://github.com/golang/go/blob/master/src/net/http/transport.go#L1955), which is then handled by the writeLoop (https://github.com/golang/go/blob/master/src/net/http/transport.go#L1761) in a separate goroutine. When a request or context is cancelled, there seems to be no feedback/feedforward to this writer.

Looking through the http request code, there seems to be an expectation that ownership of the object that provides the Body is handed off to the client (https://github.com/golang/go/blob/master/src/net/http/request.go#L779), since it will attempt to close the object. Coupled with the fact that the http lib does a shallow copy of the request at every opportunity, this means that ownership of the buffer has to be relinquished when it is passed to hc.Do as part of the Request. I understand why a Buffer.Reset() is performed before it is placed back on the Pool, but attempting the Reset results in the caller of hc.Do and the writeLoop potentially accessing the Buffer at the same time.

I'm wondering whether this means that, in order to use a pool of buffers, a new buffer pool implementation is needed, such that closing the buffer returns it to the pool rather than requiring the caller of Pool.Get to also Put it back. Since the http client lib closes the Body on completion (whether successful or not), that would ensure the buffer is returned once the http client is done with it.
Question... is the motivation for using a buffer pool to eventually put a limit on maximum memory usage, or is it to avoid GC?
This latest commit has a quick stab at a buffer pool that provides a closable buffer and uses .Close() to return it to the pool. This seems to resolve the race condition between kapacitor and the http client.
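The idea, roughly (a sketch with hypothetical names, not the exact commit):

```go
package bufpool

import (
	"bytes"
	"sync"
)

// Pool hands out buffers that return themselves when closed, so a caller
// that hands the buffer off (e.g. as an http.Request Body) never calls Put.
type Pool struct {
	p sync.Pool
}

func New() *Pool {
	pl := &Pool{}
	pl.p.New = func() interface{} { return &Buffer{pool: pl} }
	return pl
}

func (pl *Pool) Get() *Buffer {
	return pl.p.Get().(*Buffer)
}

// Buffer is a bytes.Buffer that implements io.ReadCloser, so
// http.NewRequest uses it directly as the request Body.
type Buffer struct {
	bytes.Buffer
	pool *Pool
}

// Close resets the buffer and returns it to its pool. Because the http
// client closes the Body only once the transport's write goroutine is done
// with it, the Reset no longer races with that goroutine.
func (b *Buffer) Close() error {
	b.Reset()
	b.pool.p.Put(b)
	return nil
}
```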
(force-pushed from bc93767 to ae75528)
(force-pushed from ae75528 to f78f6ca)
Adding timeout to alert.post and http_post nodes. Currently http.DefaultClient is used, which blocks indefinitely when the endpoint accepts the connection but does not respond to the HTTP request. Switching to using a non-default http.Client with a timeout defined per node.
bufpool-provided buffers result in a race condition when used with a buffer consumer that expects to take ownership of the buffer (as the http client appears to). Returning an extended version of Buffer that can be closed, where closing the buffer performs a Reset() and returns it to the pool it came from.
(force-pushed from f78f6ca to bf4cc98)
@sputnik13 Sorry for the long back and forth on this one. I say we simplify and remove the use of the buffer pool entirely (reverting the changes to the bufpool package). Then this should be good to go! 🎉
http_post.go
Outdated
```
@@ -160,7 +165,6 @@ func (n *HTTPPostNode) doPost(row *models.Row) int {

func (n *HTTPPostNode) postRow(row *models.Row) (*http.Response, error) {
	body := n.bp.Get()
-	defer n.bp.Put(body)
```
This change can leak byte buffers from the pool. This is possible when there is an error writing to the buffer followed by the early return. This isn't horrible, as the pool will just create new buffers, but it would be best if it were not possible.
Maybe we should just ditch the idea of a buffer pool here and use a new buffer each time. Then we can address the GC pressure later if it's an issue. We should leave a comment, though, that using a buffer pool can cause a race when the request is canceled.
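A sketch of that simpler shape (helper name and signature hypothetical, standing in for postRow):

```go
import (
	"bytes"
	"encoding/json"
	"net/http"
)

// postJSON allocates a fresh buffer per request instead of drawing from a
// pool. NOTE: recycling the buffer through a pool can race here: when the
// request's context is canceled, the transport's write goroutine may still
// be reading the body while Reset() runs before the buffer is pooled again.
func postJSON(client *http.Client, url string, v interface{}) (*http.Response, error) {
	body := new(bytes.Buffer) // fresh per request; left to the GC
	if err := json.NewEncoder(body).Encode(v); err != nil {
		return nil, err
	}
	req, err := http.NewRequest("POST", url, body)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return client.Do(req)
}
```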
Good point, there are at least three places where the code may return without closing the buffer. It's too bad there's no way to cancel defers; that would make things a lot simpler.
adding timeout to alert.post and http_post
@sputnik13 Thanks!
Addresses #1461