nsqd: fix memory leak with large message #996

Merged

Conversation

andyxning
Member

@andyxning andyxning commented Feb 14, 2018

This PR makes the bytes.Buffer a local variable for each message instead of a shared one per connection.

Once we have a very large message, for example 20MB, sending it via SendMessage with the per-connection bytes.Buffer expands the buffer's underlying array to more than 20MB. After the large message is sent, the bytes.Buffer does not shrink, so that memory is not released until the connection is closed and the bytes.Buffer is GCed. And if even larger messages arrive on the same connection, the bytes.Buffer keeps growing.

We should make the bytes.Buffer a local variable for each message. This will increase CPU usage because more objects need to be GCed, but after several iterations of GC improvements in Go, GC performance is good enough in recent releases (>=1.8), so the overhead should be acceptable. In one of our production clusters, memory usage dropped dramatically after deploying this change while CPU usage stayed almost the same. (Disclaimer: our internal nsqd is compiled with Go 1.8.4.) :)
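
The growth behavior can be reproduced in isolation. Here is a minimal, self-contained sketch (plain Go, not nsqd code; the 20MB size simply mirrors the example above) showing that Reset() empties a bytes.Buffer but keeps its grown backing array:

package main

import (
    "bytes"
    "fmt"
)

func main() {
    var buf bytes.Buffer

    // Simulate one 20MB message passing through a shared, per-connection buffer.
    large := make([]byte, 20<<20)
    buf.Write(large)
    fmt.Println("cap after large message:", buf.Cap()) // roughly 20MB or more

    // Reset drops the contents but not the capacity, so the connection
    // keeps ~20MB allocated even if every later message is tiny.
    buf.Reset()
    buf.Write([]byte("small"))
    fmt.Println("cap after small message:", buf.Cap()) // still roughly 20MB or more
}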

/cc @mreiferson @jehiah @ploxiln

@andyxning andyxning changed the title fix memory leak with large message nsqd: fix memory leak with large message Feb 14, 2018
@mreiferson
Member

This was done intentionally. The tradeoff here is that we can vastly reduce allocations by maintaining the buffer at the connection level.

Although your analysis is correct, in the common case, we don't expect wildly different message sizes to be consumed on a single connection (channel).

@andyxning
Member Author

@mreiferson

we don't expect wildly different message sizes to be consumed on a single connection (channel).

I do not think we can control this, since message sizes are application specific and every message is fanned out to every channel. Neither the nsqd server side nor the go-nsq/pynsq client side has an option to control how messages are balanced across connections.

@ploxiln
Member

ploxiln commented Feb 21, 2018

  -max-msg-size int
    	maximum size of a single message in bytes (default 1048576)

should work to limit message size and thus the bytes.Buffer size. (Though, as you say, there will be one per connection.)

@jehiah
Member

jehiah commented Feb 21, 2018

We could possibly try using a per-connection sync.Pool to manage the bytes.Buffer allowing it to still be generally long lived (avoiding GC pressure) but also reset occasionally. That would probably have some performance penalty but it might be acceptable.
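
For illustration, a rough sketch (hypothetical names, not code from this PR) of a per-connection pool along those lines; get/put would wrap each message, and the runtime may drop pooled buffers between GCs so they are not pinned for the whole life of the connection:

package sketch

import (
    "bytes"
    "sync"
)

// connBuffers is a hypothetical per-connection pool of bytes.Buffer.
type connBuffers struct {
    pool sync.Pool
}

func newConnBuffers() *connBuffers {
    c := &connBuffers{}
    c.pool.New = func() interface{} { return new(bytes.Buffer) }
    return c
}

// get returns a buffer to serialize one message into.
func (c *connBuffers) get() *bytes.Buffer {
    return c.pool.Get().(*bytes.Buffer)
}

// put resets the buffer and returns it to the pool once the message is sent.
func (c *connBuffers) put(b *bytes.Buffer) {
    b.Reset()
    c.pool.Put(b)
}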

@mreiferson
Member

mreiferson commented Feb 21, 2018

@jehiah's suggestion seems reasonable, but have you actually observed an issue in your use case @andyxning, or is this just theory?

@andyxning andyxning deleted the fix_memory_leak_with_large_message branch February 23, 2018 00:32
@andyxning andyxning restored the fix_memory_leak_with_large_message branch February 23, 2018 00:33
@andyxning
Member Author

@mreiferson Yes, the memory leak has a big impact on our usage because we set -max-msg-size to 20MB and message sizes vary dramatically. After making the bytes.Buffer per-message, memory usage has decreased significantly.

@andyxning
Member Author

[graph: memory usage before and after the change]

@andyxning
Member Author

[graph: CPU usage before and after the change]

After making the bytes.Buffer per-message, CPU usage has not increased significantly.

@andyxning
Member Author

andyxning commented Feb 26, 2018

@mreiferson @jehiah

We could possibly try using a per-connection sync.Pool to manage the bytes.Buffer allowing it to still be generally long lived (avoiding GC pressure) but also reset occasionally. That would probably have some performance penalty but it might be acceptable.

The key point is to implement shrink logic so that large buffers are released when they are no longer needed. I think this could be implemented with sync.Pool and []byte instead of sync.Pool and bytes.Buffer, since we cannot shrink a bytes.Buffer, but it will definitely increase the complexity. BTW, given the improved GC in Go, I think the message-level buffer also seems like a reasonable option, and it keeps the implementation simple.
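
As an illustration only (hypothetical names and threshold, not code from this PR), the "shrink" idea could amount to refusing to pool buffers that have grown past a cap, so a rare huge message does not pin its backing array:

package sketch

import "sync"

// maxPooledSize is a hypothetical threshold; buffers that grew past it
// are dropped instead of pooled, letting the GC reclaim them.
const maxPooledSize = 1 << 20 // 1MB

var bufPool = sync.Pool{
    // real code might store *[]byte to avoid boxing allocations
    New: func() interface{} { return make([]byte, 0, 4096) },
}

func getBuf() []byte {
    return bufPool.Get().([]byte)[:0]
}

func putBuf(b []byte) {
    if cap(b) > maxPooledSize {
        return // oversized: do not pool, let it be collected
    }
    bufPool.Put(b)
}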

@andyxning
Member Author

Ping @mreiferson @jehiah

@andyxning
Member Author

Ping @mreiferson @jehiah :)

@mreiferson
Member

@andyxning those graphs are helpful, can you run the suite of benchmarks before and after and post the comparison?

@mreiferson mreiferson reopened this Mar 3, 2018
@andyxning
Member Author

@mreiferson OK, I will give feedback ASAP.

@andyxning
Member Author

andyxning commented Mar 5, 2018

@mreiferson @ploxiln @jehiah

I use v0.3.8 internally, so the tests are based on v0.3.8 using ./bench.sh. I have no AWS resources available for a distributed performance test, so below are results from a single-instance test only.

GoLang: go version go1.8.4 linux/amd64
OS: Linux 4.4.0-0.bpo.1-amd64 #1 SMP Debian 4.4.6-1~bpo8+1 (2016-04-21) x86_64 GNU/Linux

Official (i.e., the v0.3.8 branch):

# using --mem-queue-size=1000000 --data-path= --size=200 --batch-size=200
# compiling/running nsqd
# creating topic/channel
# compiling bench_reader/bench_writer
PUB: [bench_writer] 2018/03/05 14:01:07 duration: 10.013884804s - 48.391mb/s - 253707.732ops/s - 3.942us/op
SUB: [bench_reader] 2018/03/05 14:01:17 duration: 10.004050988s - 40.968mb/s - 214791.088ops/s - 4.656us/op
waiting for pprof...

Adjusted Official (i.e., the patch applied on top of the v0.3.8 branch):

# using --mem-queue-size=1000000 --data-path= --size=200 --batch-size=200
# compiling/running nsqd
# creating topic/channel
# compiling bench_reader/bench_writer
PUB: [bench_writer] 2018/03/05 14:04:03 duration: 10.011562901s - 48.044mb/s - 251888.744ops/s - 3.970us/op
SUB: [bench_reader] 2018/03/05 14:04:13 duration: 10.007964687s - 41.703mb/s - 218643.857ops/s - 4.574us/op
waiting for pprof...

It seems the SUB performance does degrade slightly, but the difference is quite small. I do not think this is a major performance penalty. WDYT?

@mreiferson mreiferson added perf and removed wontfix labels Mar 6, 2018
@mreiferson
Member

OK, I'm satisfied with these results, thanks for indulging me @andyxning — any final concerns before merging @ploxiln @jehiah?

@@ -318,6 +316,7 @@ func (p *protocolV2) messagePump(client *clientV2, startedChan chan bool) {

subChannel.StartInFlightTimeout(msg, client.ID, msgTimeout)
client.SendingMessage()
var buf bytes.Buffer
err = p.SendMessage(client, msg, &buf)
Member


Since we're no longer re-using the buf, you can simplify this further: declare it inside SendMessage(), and remove the argument.
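
For context, a self-contained sketch of the shape being suggested (hypothetical types and framing, not nsqd's actual SendMessage): the bytes.Buffer becomes a local scoped to one call, and the buffer argument disappears from the caller.

package sketch

import (
    "bytes"
    "encoding/binary"
    "io"
)

// message is a hypothetical stand-in for nsqd's *Message; all we assume
// is that it can serialize itself.
type message interface {
    WriteTo(w io.Writer) (int64, error)
}

// sendMessage declares the buffer internally; the buffer is still useful
// because the frame length must be known before the payload is written
// (loosely modeled on size-prefixed framing, not the exact nsqd wire format).
func sendMessage(w io.Writer, msg message) error {
    var buf bytes.Buffer
    if _, err := msg.WriteTo(&buf); err != nil {
        return err
    }
    if err := binary.Write(w, binary.BigEndian, int32(buf.Len())); err != nil {
        return err
    }
    _, err := w.Write(buf.Bytes())
    return err
}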

Member Author

@andyxning andyxning Mar 6, 2018


Good suggestion! Done. PTAL. @ploxiln

@ploxiln
Member

ploxiln commented Mar 6, 2018

Do we want this rebased on a more recent commit?

$ git log -n2 nsqio/pull/996
commit 9aa00f42d0ab8a0fd68dce87732751877f704889 (nsqio/pull/996)
Author: Andy Xie <andy.xning@gmail.com>
Date:   Wed Feb 14 14:24:41 2018 +0800

    fix memory leak with large message

commit c53f30f8ea0cbe3bf95116cc0d6f603f5a387620 (tag: v0.3.8)
Author: Matt Reiferson <mreiferson@gmail.com>
Date:   Thu May 26 07:22:12 2016 -0700

    bump v0.3.8 stable

@ploxiln
Member

ploxiln commented Mar 6, 2018

I locally rebased on master and compared ./bench.sh results, and also saw a small and inconclusive performance reduction, maybe 1%, but it depends more on whether my laptop had 4 or 8 seconds to cool down between runs than on which branch I'm testing.

@andyxning andyxning force-pushed the fix_memory_leak_with_large_message branch 3 times, most recently from d490e48 to 3d05881 Compare March 6, 2018 07:28
@andyxning
Member Author

@mreiferson It seems there are no additional suggestions from @ploxiln and @jehiah. Can you merge this?

@ploxiln
Member

ploxiln commented Mar 6, 2018

Looks good to me. I also re-bench-tested out of paranoia, same results. I'll merge in 12 hours or so if there are no objections.

@mreiferson mreiferson force-pushed the fix_memory_leak_with_large_message branch from 3d05881 to 725c653 Compare March 6, 2018 19:34
@mreiferson
Member

fixed up commit msg, thanks @andyxning

@mreiferson mreiferson merged commit aab6d2e into nsqio:master Mar 6, 2018
@andyxning andyxning deleted the fix_memory_leak_with_large_message branch March 7, 2018 02:04
@andyxning
Member Author

Thanks for the review and advice! Learned a lot. @mreiferson @ploxiln @jehiah
