nsqd: fix memory leak with large message #996

Merged

Conversation

andyxning
Member

@andyxning andyxning commented Feb 14, 2018

This PR makes the bytes.Buffer a local variable for each message instead of a shared one per connection.

Once we have a very large message, for example 20MB, sending it via SendMessage with the per-connection bytes.Buffer expands the buffer's underlying array to more than 20MB. After the large message is sent, the bytes.Buffer does not shrink, so that memory is not released until the connection is closed and the bytes.Buffer is GCed. And if even larger messages arrive on the same connection, the bytes.Buffer keeps growing.

We should make the bytes.Buffer a local variable for each message. This will increase CPU usage because more objects need to be GCed, but after several iterations of GC improvements in Go, GC performance is good enough in recent releases (>=1.8), so the overhead should be acceptable. In one of our production clusters, memory usage dropped dramatically after deploying this change while CPU usage stayed almost the same. (Disclaimer: our internal nsqd is compiled with Go 1.8.4.) :)
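
The growth behavior can be reproduced in isolation. Here is a minimal, self-contained sketch (plain Go, not nsqd code; the 20MB size simply mirrors the example above) showing that Reset() empties a bytes.Buffer but keeps its grown backing array:

package main

import (
    "bytes"
    "fmt"
)

func main() {
    var buf bytes.Buffer

    // Simulate one 20MB message passing through a shared, per-connection buffer.
    large := make([]byte, 20<<20)
    buf.Write(large)
    fmt.Println("cap after large message:", buf.Cap()) // roughly 20MB or more

    // Reset drops the contents but not the capacity, so the connection
    // keeps ~20MB allocated even if every later message is tiny.
    buf.Reset()
    buf.Write([]byte("small"))
    fmt.Println("cap after small message:", buf.Cap()) // still roughly 20MB or more
}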

/cc @mreiferson @jehiah @ploxiln

@andyxning andyxning changed the title fix memory leak with large message nsqd: fix memory leak with large message Feb 14, 2018
@mreiferson
Member

This was done intentionally. The tradeoff here is that we can vastly reduce allocations by maintaining the buffer at the connection level.

Although your analysis is correct, in the common case, we don't expect wildly different message sizes to be consumed on a single connection (channel).

@andyxning
Member Author

@mreiferson

we don't expect wildly different message sizes to be consumed on a single connection (channel).

I do not think we can control this, since message sizes are application specific and every message is fanned out to every channel. Neither the nsqd server side nor the go-nsq/pynsq client side has an option to control how messages are balanced across connections.

@ploxiln
Member

ploxiln commented Feb 21, 2018

  -max-msg-size int
    	maximum size of a single message in bytes (default 1048576)

should work to limit message size and thus the bytes.Buffer size. (Though, as you say, there will be one per connection.)

@jehiah
Member

jehiah commented Feb 21, 2018

We could possibly try using a per-connection sync.Pool to manage the bytes.Buffer allowing it to still be generally long lived (avoiding GC pressure) but also reset occasionally. That would probably have some performance penalty but it might be acceptable.
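
For illustration, a rough sketch (hypothetical names, not code from this PR) of a per-connection pool along those lines; get/put would wrap each message, and the runtime may drop pooled buffers between GCs so they are not pinned for the whole life of the connection:

package sketch

import (
    "bytes"
    "sync"
)

// connBuffers is a hypothetical per-connection pool of bytes.Buffer.
type connBuffers struct {
    pool sync.Pool
}

func newConnBuffers() *connBuffers {
    c := &connBuffers{}
    c.pool.New = func() interface{} { return new(bytes.Buffer) }
    return c
}

// get returns a buffer to serialize one message into.
func (c *connBuffers) get() *bytes.Buffer {
    return c.pool.Get().(*bytes.Buffer)
}

// put resets the buffer and returns it to the pool once the message is sent.
func (c *connBuffers) put(b *bytes.Buffer) {
    b.Reset()
    c.pool.Put(b)
}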

@mreiferson
Member

mreiferson commented Feb 21, 2018

@jehiah's suggestion seems reasonable, but have you actually observed an issue in your use case @andyxning, or is this just theory?

@andyxning andyxning deleted the fix_memory_leak_with_large_message branch February 23, 2018 00:32
@andyxning andyxning restored the fix_memory_leak_with_large_message branch February 23, 2018 00:33
@andyxning
Member Author

@mreiferson Yes, the memory leak has a big impact on our usage because we set -max-msg-size to 20MB and message sizes vary dramatically. After making the bytes.Buffer per-message, memory usage has decreased significantly.

@andyxning
Member Author

[graph: memory usage before and after the change]

@andyxning
Member Author

[graph: CPU usage before and after the change]

After making the bytes.Buffer per-message, CPU usage has not increased significantly.

@andyxning
Member Author

andyxning commented Feb 26, 2018

@mreiferson @jehiah

We could possibly try using a per-connection sync.Pool to manage the bytes.Buffer allowing it to still be generally long lived (avoiding GC pressure) but also reset occasionally. That would probably have some performance penalty but it might be acceptable.

The key point is to implement shrink logic so that large buffers are released when they are no longer needed. I think this could be implemented with sync.Pool and []byte instead of sync.Pool and bytes.Buffer, since we cannot shrink a bytes.Buffer, but it will definitely increase the complexity. BTW, given the improved GC in Go, I think the message-level buffer also seems like a reasonable option, and it keeps the implementation simple.
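
As an illustration only (hypothetical names and threshold, not code from this PR), the "shrink" idea could amount to refusing to pool buffers that have grown past a cap, so a rare huge message does not pin its backing array:

package sketch

import "sync"

// maxPooledSize is a hypothetical threshold; buffers that grew past it
// are dropped instead of pooled, letting the GC reclaim them.
const maxPooledSize = 1 << 20 // 1MB

var bufPool = sync.Pool{
    // real code might store *[]byte to avoid boxing allocations
    New: func() interface{} { return make([]byte, 0, 4096) },
}

func getBuf() []byte {
    return bufPool.Get().([]byte)[:0]
}

func putBuf(b []byte) {
    if cap(b) > maxPooledSize {
        return // oversized: do not pool, let it be collected
    }
    bufPool.Put(b)
}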

@andyxning
Member Author

Ping @mreiferson @jehiah

@andyxning
Member Author

Ping @mreiferson @jehiah :)

@mreiferson
Member

@andyxning those graphs are helpful, can you run the suite of benchmarks before and after and post the comparison?

@mreiferson mreiferson reopened this Mar 3, 2018
@andyxning
Member Author

@mreiferson OK, I will give feedback ASAP.

@andyxning
Member Author

andyxning commented Mar 5, 2018

@mreiferson @ploxiln @jehiah

I use v0.3.8 internally, so the tests are based on v0.3.8 using ./bench.sh. I have no AWS resources available for a distributed performance test, so below are results from a single-instance test only.

GoLang: go version go1.8.4 linux/amd64
OS: Linux 4.4.0-0.bpo.1-amd64 #1 SMP Debian 4.4.6-1~bpo8+1 (2016-04-21) x86_64 GNU/Linux

Official (i.e., the v0.3.8 branch):

# using --mem-queue-size=1000000 --data-path= --size=200 --batch-size=200
# compiling/running nsqd
# creating topic/channel
# compiling bench_reader/bench_writer
PUB: [bench_writer] 2018/03/05 14:01:07 duration: 10.013884804s - 48.391mb/s - 253707.732ops/s - 3.942us/op
SUB: [bench_reader] 2018/03/05 14:01:17 duration: 10.004050988s - 40.968mb/s - 214791.088ops/s - 4.656us/op
waiting for pprof...

Adjusted Official (i.e., the patch applied on top of the v0.3.8 branch):

# using --mem-queue-size=1000000 --data-path= --size=200 --batch-size=200
# compiling/running nsqd
# creating topic/channel
# compiling bench_reader/bench_writer
PUB: [bench_writer] 2018/03/05 14:04:03 duration: 10.011562901s - 48.044mb/s - 251888.744ops/s - 3.970us/op
SUB: [bench_reader] 2018/03/05 14:04:13 duration: 10.007964687s - 41.703mb/s - 218643.857ops/s - 4.574us/op
waiting for pprof...

It seems the SUB performance does degrade slightly, but the difference is quite small. I do not think this is a major performance penalty. WDYT?

@mreiferson mreiferson added perf and removed wontfix labels Mar 6, 2018
@mreiferson
Member

OK, I'm satisfied with these results, thanks for indulging me @andyxning — any final concerns before merging @ploxiln @jehiah?

@@ -318,6 +316,7 @@ func (p *protocolV2) messagePump(client *clientV2, startedChan chan bool) {

subChannel.StartInFlightTimeout(msg, client.ID, msgTimeout)
client.SendingMessage()
var buf bytes.Buffer
err = p.SendMessage(client, msg, &buf)
Member


Since we're no longer re-using the buf, you can simplify this further: declare it inside SendMessage(), and remove the argument.
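
For context, a self-contained sketch of the shape being suggested (hypothetical types and framing, not nsqd's actual SendMessage): the bytes.Buffer becomes a local scoped to one call, and the buffer argument disappears from the caller.

package sketch

import (
    "bytes"
    "encoding/binary"
    "io"
)

// message is a hypothetical stand-in for nsqd's *Message; all we assume
// is that it can serialize itself.
type message interface {
    WriteTo(w io.Writer) (int64, error)
}

// sendMessage declares the buffer internally; the buffer is still useful
// because the frame length must be known before the payload is written
// (loosely modeled on size-prefixed framing, not the exact nsqd wire format).
func sendMessage(w io.Writer, msg message) error {
    var buf bytes.Buffer
    if _, err := msg.WriteTo(&buf); err != nil {
        return err
    }
    if err := binary.Write(w, binary.BigEndian, int32(buf.Len())); err != nil {
        return err
    }
    _, err := w.Write(buf.Bytes())
    return err
}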

Member Author

@andyxning andyxning Mar 6, 2018


Good suggestion! Done. PTAL. @ploxiln

@ploxiln
Member

ploxiln commented Mar 6, 2018

Do we want this rebased on a more recent commit?

$ git log -n2 nsqio/pull/996
commit 9aa00f42d0ab8a0fd68dce87732751877f704889 (nsqio/pull/996)
Author: Andy Xie <andy.xning@gmail.com>
Date:   Wed Feb 14 14:24:41 2018 +0800

    fix memory leak with large message

commit c53f30f8ea0cbe3bf95116cc0d6f603f5a387620 (tag: v0.3.8)
Author: Matt Reiferson <mreiferson@gmail.com>
Date:   Thu May 26 07:22:12 2016 -0700

    bump v0.3.8 stable

@ploxiln
Member

ploxiln commented Mar 6, 2018

I locally rebased on master and compared ./bench.sh results, and also saw a small and inconclusive performance reduction, maybe 1%, but it depends more on whether my laptop had 4 or 8 seconds to cool down between runs than on which branch I'm testing.

@andyxning andyxning force-pushed the fix_memory_leak_with_large_message branch 3 times, most recently from d490e48 to 3d05881 Compare March 6, 2018 07:28
@andyxning
Member Author

@mreiferson It seems there are no additional suggestions from @ploxiln and @jehiah. Can you merge this?

@ploxiln
Member

ploxiln commented Mar 6, 2018

Looks good to me. I also re-bench-tested out of paranoia, same results. I'll merge in 12 hours or so if there are no objections.

@mreiferson mreiferson force-pushed the fix_memory_leak_with_large_message branch from 3d05881 to 725c653 Compare March 6, 2018 19:34
@mreiferson
Member

fixed up commit msg, thanks @andyxning

@mreiferson mreiferson merged commit aab6d2e into nsqio:master Mar 6, 2018
@andyxning andyxning deleted the fix_memory_leak_with_large_message branch March 7, 2018 02:04
@andyxning
Member Author

Thanks for the review and advice! Learned a lot. @mreiferson @ploxiln @jehiah
