Goroutine leak #854

Closed
pebrc opened this issue May 12, 2019 · 8 comments · Fixed by #855
pebrc commented May 12, 2019

When deploying multiple Elasticsearch clusters (here 5), I am seeing the operator process killed repeatedly because it exceeds its memory limit.

[Screenshot 2019-05-12 at 13:48:27]

pebrc changed the title from "Resource limits for all-in-one deployment might be too low" to "Potential memory leak" on May 12, 2019

pebrc commented May 12, 2019

[Screenshot 2019-05-12 at 18:12:30]


pebrc commented May 12, 2019

Showing nodes accounting for 21265.38kB, 100% of 21265.38kB total
Showing top 50 nodes out of 190
      flat  flat%   sum%        cum   cum%
 3079.51kB 14.48% 14.48%  3079.51kB 14.48%  crypto/tls.(*block).reserve
 2072.09kB  9.74% 24.23%  2072.09kB  9.74%  github.com/elastic/cloud-on-k8s/operators/vendor/github.com/beorn7/perks/quantile.newStream
 2050.75kB  9.64% 33.87%  2562.76kB 12.05%  crypto/x509.parseCertificate
 1542.01kB  7.25% 41.12%  1542.01kB  7.25%  bufio.NewReaderSize (inline)
 1542.01kB  7.25% 48.37%  1542.01kB  7.25%  bufio.NewWriterSize (inline)
 1026.25kB  4.83% 53.20%  1026.25kB  4.83%  bytes.makeSlice
 1026.25kB  4.83% 58.02%  5132.01kB 24.13%  crypto/tls.(*Conn).readHandshake
  650.62kB  3.06% 61.08%   650.62kB  3.06%  github.com/elastic/cloud-on-k8s/operators/vendor/golang.org/x/net/http2.(*ClientConn).frameScratchBuffer
  591.75kB  2.78% 63.87%   591.75kB  2.78%  github.com/elastic/cloud-on-k8s/operators/vendor/github.com/modern-go/reflect2.loadGo17Types
  513.50kB  2.41% 66.28%   513.50kB  2.41%  github.com/elastic/cloud-on-k8s/operators/vendor/k8s.io/client-go/restmapper.NewDiscoveryRESTMapper
  512.88kB  2.41% 68.69%   512.88kB  2.41%  encoding/json.typeFields
  512.62kB  2.41% 71.10%   512.62kB  2.41%  encoding/base64.(*Encoding).DecodeString
  512.23kB  2.41% 73.51%   512.23kB  2.41%  net/http.(*Transport).getIdleConnCh
  512.19kB  2.41% 75.92%   512.19kB  2.41%  runtime.malg
  512.17kB  2.41% 78.33%   512.17kB  2.41%  crypto/x509.(*CertPool).AddCert
  512.16kB  2.41% 80.74%   512.16kB  2.41%  crypto/aes.(*aesCipherGCM).NewGCM
  512.14kB  2.41% 83.15%   512.14kB  2.41%  reflect.mapassign
  512.12kB  2.41% 85.55%   512.12kB  2.41%  crypto/aes.newCipher
  512.06kB  2.41% 87.96%   512.06kB  2.41%  github.com/elastic/cloud-on-k8s/operators/vendor/github.com/modern-go/reflect2.newUnsafeStructField
  512.02kB  2.41% 90.37%   512.02kB  2.41%  net.sockaddrToTCP
  512.02kB  2.41% 92.78%   512.02kB  2.41%  github.com/elastic/cloud-on-k8s/operators/vendor/github.com/json-iterator/go.(*Iterator).ReadString
  512.02kB  2.41% 95.18%   512.02kB  2.41%  github.com/elastic/cloud-on-k8s/operators/vendor/github.com/json-iterator/go.processTags
  512.01kB  2.41% 97.59%   512.01kB  2.41%  encoding/asn1.parseField
  512.01kB  2.41%   100%   512.01kB  2.41%  errors.New


pebrc commented May 12, 2019

It looks like we are leaking goroutines: the number of goroutines keeps growing until the process is killed. I believe the offenders are HTTP connections.

goroutine profile: total 1141
451 @ 0x42e00b 0x429339 0x4289e6 0x49525a 0x49536d 0x4960b9 0x59657f 0x5a9848 0x6014a9 0x6019bd 0x604f61 0x6b90b5 0x5483ff 0x54855f 0x6b9c02 0x45be71
#	0x4289e5	internal/poll.runtime_pollWait+0x65	/usr/local/go/src/runtime/netpoll.go:173
#	0x495259	internal/poll.(*pollDesc).wait+0x99	/usr/local/go/src/internal/poll/fd_poll_runtime.go:85
#	0x49536c	internal/poll.(*pollDesc).waitRead+0x3c	/usr/local/go/src/internal/poll/fd_poll_runtime.go:90
#	0x4960b8	internal/poll.(*FD).Read+0x178		/usr/local/go/src/internal/poll/fd_unix.go:169
#	0x59657e	net.(*netFD).Read+0x4e			/usr/local/go/src/net/fd_unix.go:202
#	0x5a9847	net.(*conn).Read+0x67			/usr/local/go/src/net/net.go:177
#	0x6014a8	crypto/tls.(*block).readFromUntil+0x88	/usr/local/go/src/crypto/tls/conn.go:492
#	0x6019bc	crypto/tls.(*Conn).readRecord+0xdc	/usr/local/go/src/crypto/tls/conn.go:593
#	0x604f60	crypto/tls.(*Conn).Read+0xf0		/usr/local/go/src/crypto/tls/conn.go:1145
#	0x6b90b4	net/http.(*persistConn).Read+0x74	/usr/local/go/src/net/http/transport.go:1497
#	0x5483fe	bufio.(*Reader).fill+0x10e		/usr/local/go/src/bufio/bufio.go:100
#	0x54855e	bufio.(*Reader).Peek+0x3e		/usr/local/go/src/bufio/bufio.go:132
#	0x6b9c01	net/http.(*persistConn).readLoop+0x1a1	/usr/local/go/src/net/http/transport.go:1645

451 @ 0x42e00b 0x43e076 0x6bb1c3 0x45be71
#	0x6bb1c2	net/http.(*persistConn).writeLoop+0x112	/usr/local/go/src/net/http/transport.go:1885
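For reference, a goroutine dump like the one above can be obtained by exposing the standard net/http/pprof handlers in the binary and fetching /debug/pprof/goroutine?debug=1. A minimal sketch, assuming a spare local port (not necessarily how the profile above was captured):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Serve the pprof endpoints on a side port (localhost:6060 is an arbitrary choice).
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... the rest of the operator would run here ...
	select {}
}

The heap profile earlier in the thread can be pulled the same way, e.g. go tool pprof http://localhost:6060/debug/pprof/heap followed by the top command at the pprof prompt.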


pebrc commented May 13, 2019

As I understand it we are leaking HTTP connections. I tried the following (sketched below):

  • disabling keep-alives, with no success
  • ensuring we close/consume response bodies even on error, without success
  • dialing down max idle connections (even though my understanding is that turning off keep-alives should make that setting redundant)
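For context, the attempts above roughly correspond to the following transport/request settings. This is only a sketch, assuming an *http.Transport owned by the ES client; the helper, field values, and tls.Config are illustrative, not the operator's actual configuration:

package sketch

import (
	"crypto/tls"
	"io"
	"io/ioutil"
	"net/http"
)

// doRequest is a hypothetical helper illustrating the three mitigations listed above.
func doRequest(url string, tlsConfig *tls.Config) error {
	transport := &http.Transport{
		TLSClientConfig:     tlsConfig, // pre-existing TLS setup (assumption)
		DisableKeepAlives:   true,      // attempt 1: turn off keep-alives
		MaxIdleConnsPerHost: 1,         // attempt 3: dial down idle connections
	}
	client := &http.Client{Transport: transport}

	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	// attempt 2: always consume and close the body, even for error status codes
	defer resp.Body.Close()
	_, err = io.Copy(ioutil.Discard, resp.Body)
	return err
}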

[Screenshot 2019-05-13 at 09:22:06]

pebrc changed the title from "Potential memory leak" to "Goroutine leak" on May 13, 2019

sebgl commented May 13, 2019

I think we might be leaking goroutines from keep-alive connections in the http client's Transport struct we use to talk to Elasticsearch. That client, along with its transport, gets recreated at each reconciliation.

I did some experiments (nice way to spend time in the plane 😄):

package main

import (
	"io"
	"io/ioutil"
	"log"
	"net/http"
	"net/http/httptest"
	"runtime"
	"time"
)

func main() {
	// test http server
	server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		io.WriteString(w, "Hello world")
	}))
	defer server.Close()

	// simulate reconciliation loop
	var memStats runtime.MemStats
	for {
		// create a new client and a new transport at every iteration
		transport := &http.Transport{
			// DisableKeepAlives: true,
		}
		c := &http.Client{Transport: transport}

		resp, err := c.Get(server.URL)
		if err != nil {
			log.Fatal(err)
		}
		// drain and close the body so the connection can return to the idle pool
		io.Copy(ioutil.Discard, resp.Body)
		resp.Body.Close()

		// close idle keep-alive connections
		// transport.CloseIdleConnections()

		runtime.ReadMemStats(&memStats)
		log.Printf("%d goroutines, %d heapalloc", runtime.NumGoroutine(), memStats.HeapAlloc)
		time.Sleep(10 * time.Millisecond)
	}
}

Output:

2019/05/13 12:35:37 5 goroutines, 249008 heapalloc
2019/05/13 12:35:37 8 goroutines, 301040 heapalloc
2019/05/13 12:35:37 11 goroutines, 341984 heapalloc
2019/05/13 12:35:37 14 goroutines, 373976 heapalloc
(...)
2019/05/13 12:35:57 860 goroutines, 7732936 heapalloc
(...)
2019/05/13 12:36:16 2597 goroutines, 22930640 heapalloc

Notice how both the number of goroutines and the heap size increase over time.

Now, output of the same code but this time I uncomment transport.CloseIdleConnections():

2019/05/13 12:37:19 4 goroutines, 252000 heapalloc
2019/05/13 12:37:19 4 goroutines, 299872 heapalloc
2019/05/13 12:37:19 4 goroutines, 331232 heapalloc
2019/05/13 12:37:19 5 goroutines, 360816 heapalloc
2019/05/13 12:37:19 5 goroutines, 388936 heapalloc
2019/05/13 12:37:19 4 goroutines, 409136 heapalloc
(...)
2019/05/13 12:37:40 4 goroutines, 477768 heapalloc
(...)
2019/05/13 12:37:57 5 goroutines, 606672 heapalloc

The number of goroutines stays constant over time (4-5); the heap size increases but is regularly reset (the GC doing its job).

It looks like if we don't explicitly close idle keep-alive connections, the http client's transport is not terminated and garbage collected as we would expect: it keeps long-lived connections around forever, effectively increasing goroutines and heap usage over time.

Some options to fix that:

  1. Add a Close() method to our ES client that we call at the end of each reconciliation: internally, it would call transport.CloseIdleConnections() (see the sketch after this list).
  2. Set DisableKeepAlives: true in the Transport configuration (disabling keep-alives gives the same output as the second run of the test program above).
  3. Persist the http client and its transport across multiple reconciliations.
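A minimal sketch of option 1, assuming our ES client wraps an *http.Client built on its own *http.Transport (the Client type and field names here are hypothetical, not the actual operator code):

package sketch

import "net/http"

// Client is a hypothetical wrapper around the HTTP client used to talk to Elasticsearch.
type Client struct {
	http      *http.Client
	transport *http.Transport
}

// Close drops the idle keep-alive connections held by the underlying transport,
// which lets their readLoop/writeLoop goroutines exit.
func (c *Client) Close() {
	if c.transport != nil {
		c.transport.CloseIdleConnections()
	}
}

Callers would then do something like defer esClient.Close() right after creating the client in each reconciliation.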

Generally, the Go http client and transport are designed to be thread-safe and optimized for reuse. I think we should aim for 3. in the long term. Since we also have Observers that persist a connection to the ES cluster, we could encapsulate a per-cluster client that we keep around, to be reused by both observers and reconciliations for a given cluster.
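For illustration, option 3 could look roughly like a per-cluster client cache shared by observers and reconciliations. This reuses the hypothetical Client type from the previous sketch; the cache key and names are assumptions, not a concrete design:

package sketch

import "sync"

// ClientCache hands out one long-lived ES client per cluster, so that its transport
// and keep-alive connections are reused across observers and reconciliations.
type ClientCache struct {
	mu      sync.Mutex
	clients map[string]*Client // keyed by the cluster's namespace/name (assumption)
}

func NewClientCache() *ClientCache {
	return &ClientCache{clients: map[string]*Client{}}
}

// ForCluster returns the cached client for the given cluster, creating it on first use.
func (c *ClientCache) ForCluster(key string, newClient func() *Client) *Client {
	c.mu.Lock()
	defer c.mu.Unlock()
	client, ok := c.clients[key]
	if !ok {
		client = newClient()
		c.clients[key] = client
	}
	return client
}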

Note that I'm not sure this is the exact (only?) problem we have, I did not test these settings yet in the operator.

Edit: I just noticed @pebrc's last comment about how disabling keep-alive connections does not help, so that's probably not the problem :/ Did you disable it through DisableKeepAlives: true, @pebrc?


sebgl commented May 13, 2019

I did more testing, in the operator codebase this time.

See this debugging branch: master...sebgl:debug-memleak

In that branch I output the current number of goroutines and heap size every 100 ms.
Once we have one ES cluster deployed, it's pretty clear by looking at the logs that every reconciliation attempt + every observer iteration does increase the number of goroutines.
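The periodic logging in that branch boils down to something like this (a sketch, not the exact code in the debug branch):

package sketch

import (
	"log"
	"runtime"
	"time"
)

// logResourceUsage logs the goroutine count and heap size every 100ms,
// for the lifetime of the process.
func logResourceUsage() {
	var memStats runtime.MemStats
	for range time.Tick(100 * time.Millisecond) {
		runtime.ReadMemStats(&memStats)
		log.Printf("%d goroutines, %d heapalloc", runtime.NumGoroutine(), memStats.HeapAlloc)
	}
}

It would be started early in main() with go logResourceUsage().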

I changed the implementation of both the ES client and the process manager client to close leftover idle connections from the transport once the client is no longer needed.

I now get a stable number of goroutines! 🎉 I think we've spotted the problem :)


pebrc commented May 13, 2019

Nice work @sebgl! (and yes I used DisableKeepAlives instead of CloseIdleConnections)


pebrc commented May 13, 2019

[Screenshot 2019-05-13 at 13:59:01]

Disabling keep-alive might have worked after all; I might have been running an old version of the code initially, due to our tagging only picking up new commits.
