Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nsqd: OOM when misconfiguring lookupd addresses #615

Merged
merged 1 commit into from
Oct 4, 2015

Conversation

mreiferson
Copy link
Member

Hi there,

I've been reconfiguring our deploy getting ready to launch our product and have been hitting brick wall after brick wall with an out of memory issue. At first I thought it was an issue with my VPC configuration so I created a clean instance and succeeded (with and without docker) to reproduce the issue.

It might be something Amazon has done lately with their AMI images, but everything was working fine a few weeks ago so wanted to check in and see if it's anything on your end.

Steps to reproduce

  1. Set up a new AWS instance (tried t2.micro and t2.small, they've had sufficient memory in the past)
  2. Download and untar the binaries - wget https://s3.amazonaws.com/bitly-downloads/nsq/nsq-0.3.5.linux-amd64.go1.4.2.tar.gz && tar zxvf nsq-0.3.5.linux-amd64.go1.4.2.tar.gz
  3. Make the binaries executable, copy them to the $PATH
  4. Run nsqlookupd - everything looks and smells fine
  5. Run nsqd -lookupd-tcp-address=127.0.0.1:4161 - and I get the following
nsqd -lookupd-tcp-address=127.0.0.1:4161
[nsqd] 2015/08/05 15:35:23.121973 nsqd v0.3.5 (built w/go1.4.2)
[nsqd] 2015/08/05 15:35:23.122003 ID: 985
[nsqd] 2015/08/05 15:35:23.122080 NSQ: persisting topic/channel metadata to nsqd.985.dat
[nsqd] 2015/08/05 15:35:23.125264 TCP: listening on [::]:4150
[nsqd] 2015/08/05 15:35:23.125297 HTTP: listening on [::]:4151
[nsqd] 2015/08/05 15:35:23.126960 LOOKUP: adding peer 127.0.0.1:4161
[nsqd] 2015/08/05 15:35:23.126968 LOOKUP connecting to 127.0.0.1:4161
fatal error: runtime: out of memory

runtime stack:
runtime.SysMap(0xc208100000, 0x48550000, 0x0, 0xa5abf8)
    /usr/local/Cellar/go/1.4.2/libexec/src/runtime/mem_linux.c:149 +0x98
runtime.MHeap_SysAlloc(0xa602e0, 0x48550000, 0x430a62)
    /usr/local/Cellar/go/1.4.2/libexec/src/runtime/malloc.c:284 +0x124
runtime.MHeap_Alloc(0xa602e0, 0x242a3, 0x10100000000, 0xc208001d40)
    /usr/local/Cellar/go/1.4.2/libexec/src/runtime/mheap.c:240 +0x66

goroutine 10 [running]:
runtime.switchtoM()
    /usr/local/Cellar/go/1.4.2/libexec/src/runtime/asm_amd64.s:198 fp=0xc20802b4a0 sp=0xc20802b498
runtime.mallocgc(0x48545450, 0x713f00, 0x1, 0x7fb180)
    /usr/local/Cellar/go/1.4.2/libexec/src/runtime/malloc.go:199 +0x9f3 fp=0xc20802b550 sp=0xc20802b4a0
runtime.newarray(0x713f00, 0x48545450, 0x0)
    /usr/local/Cellar/go/1.4.2/libexec/src/runtime/malloc.go:365 +0xc1 fp=0xc20802b588 sp=0xc20802b550
runtime.makeslice(0x705680, 0x48545450, 0x48545450, 0x0, 0x0, 0x0)
    /usr/local/Cellar/go/1.4.2/libexec/src/runtime/slice.go:32 +0x15c fp=0xc20802b5d0 sp=0xc20802b588
github.com/bitly/go-nsq.ReadResponse(0x7f6e10c31a40, 0xc208010380, 0x0, 0x0, 0x0, 0x0, 0x0)
    /Users/mreiferson/dev/src/github.com/bitly/nsq/.godeps/src/github.com/bitly/go-nsq/protocol.go:60 +0x17e fp=0xc20802b660 sp=0xc20802b5d0
github.com/bitly/nsq/nsqd.(*lookupPeer).Command(0xc208010380, 0xc208036640, 0x0, 0x0, 0x0, 0x0, 0x0)
    /Users/mreiferson/dev/src/github.com/bitly/nsq/nsqd/lookup_peer.go:106 +0x276 fp=0xc20802b708 sp=0xc20802b660
github.com/bitly/nsq/nsqd.func·029(0xc208010380)
    /Users/mreiferson/dev/src/github.com/bitly/nsq/nsqd/lookup.go:39 +0x424 fp=0xc20802b870 sp=0xc20802b708
github.com/bitly/nsq/nsqd.(*lookupPeer).Command(0xc208010380, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
    /Users/mreiferson/dev/src/github.com/bitly/nsq/nsqd/lookup_peer.go:95 +0x132 fp=0xc20802b918 sp=0xc20802b870
github.com/bitly/nsq/nsqd.(*NSQD).lookupLoop(0xc208074200)
    /Users/mreiferson/dev/src/github.com/bitly/nsq/nsqd/lookup.go:57 +0x5b7 fp=0xc20802bfb8 sp=0xc20802b918
github.com/bitly/nsq/nsqd.func·034()
    /Users/mreiferson/dev/src/github.com/bitly/nsq/nsqd/nsqd.go:249 +0x2a fp=0xc20802bfc8 sp=0xc20802bfb8
github.com/bitly/nsq/internal/util.func·001()
    /Users/mreiferson/dev/src/github.com/bitly/nsq/internal/util/wait_group_wrapper.go:14 +0x2f fp=0xc20802bfe0 sp=0xc20802bfc8
runtime.goexit()
    /usr/local/Cellar/go/1.4.2/libexec/src/runtime/asm_amd64.s:2232 +0x1 fp=0xc20802bfe8 sp=0xc20802bfe0
created by github.com/bitly/nsq/internal/util.(*WaitGroupWrapper).Wrap
    /Users/mreiferson/dev/src/github.com/bitly/nsq/internal/util/wait_group_wrapper.go:16 +0xf7

goroutine 1 [chan receive]:
main.main()
    /Users/mreiferson/dev/src/github.com/bitly/nsq/apps/nsqd/nsqd.go:212 +0x929

goroutine 7 [IO wait]:
net.(*pollDesc).Wait(0xc208010300, 0x72, 0x0, 0x0)
    /usr/local/Cellar/go/1.4.2/libexec/src/net/fd_poll_runtime.go:84 +0x47
net.(*pollDesc).WaitRead(0xc208010300, 0x0, 0x0)
    /usr/local/Cellar/go/1.4.2/libexec/src/net/fd_poll_runtime.go:89 +0x43
net.(*netFD).accept(0xc2080102a0, 0x0, 0x7f6e10c2fd08, 0xc208033550)
    /usr/local/Cellar/go/1.4.2/libexec/src/net/fd_unix.go:419 +0x40b
net.(*TCPListener).AcceptTCP(0xc2080340d0, 0xc208086180, 0x0, 0x0)
    /usr/local/Cellar/go/1.4.2/libexec/src/net/tcpsock_posix.go:234 +0x4e
net.(*TCPListener).Accept(0xc2080340d0, 0x0, 0x0, 0x0, 0x0)
    /usr/local/Cellar/go/1.4.2/libexec/src/net/tcpsock_posix.go:244 +0x4c
github.com/bitly/nsq/internal/protocol.TCPServer(0x7f6e10c30858, 0xc2080340d0, 0x7f6e10c318a0, 0xc2080340e0, 0x7f6e10c318c8, 0xc208036460)
    /Users/mreiferson/dev/src/github.com/bitly/nsq/internal/protocol/tcp_server.go:20 +0x184
github.com/bitly/nsq/nsqd.func·030()
    /Users/mreiferson/dev/src/github.com/bitly/nsq/nsqd/nsqd.go:210 +0xe0
github.com/bitly/nsq/internal/util.func·001()
    /Users/mreiferson/dev/src/github.com/bitly/nsq/internal/util/wait_group_wrapper.go:14 +0x2f
created by github.com/bitly/nsq/internal/util.(*WaitGroupWrapper).Wrap
    /Users/mreiferson/dev/src/github.com/bitly/nsq/internal/util/wait_group_wrapper.go:16 +0xf7

goroutine 6 [syscall]:
os/signal.loop()
    /usr/local/Cellar/go/1.4.2/libexec/src/os/signal/signal_unix.go:21 +0x1f
created by os/signal.init·1
    /usr/local/Cellar/go/1.4.2/libexec/src/os/signal/signal_unix.go:27 +0x35

goroutine 8 [IO wait]:
net.(*pollDesc).Wait(0xc208010370, 0x72, 0x0, 0x0)
    /usr/local/Cellar/go/1.4.2/libexec/src/net/fd_poll_runtime.go:84 +0x47
net.(*pollDesc).WaitRead(0xc208010370, 0x0, 0x0)
    /usr/local/Cellar/go/1.4.2/libexec/src/net/fd_poll_runtime.go:89 +0x43
net.(*netFD).accept(0xc208010310, 0x0, 0x7f6e10c2fd08, 0xc2080335c8)
    /usr/local/Cellar/go/1.4.2/libexec/src/net/fd_unix.go:419 +0x40b
net.(*TCPListener).AcceptTCP(0xc2080340f8, 0xa50d00, 0x0, 0x0)
    /usr/local/Cellar/go/1.4.2/libexec/src/net/tcpsock_posix.go:234 +0x4e
net.(*TCPListener).Accept(0xc2080340f8, 0x0, 0x0, 0x0, 0x0)
    /usr/local/Cellar/go/1.4.2/libexec/src/net/tcpsock_posix.go:244 +0x4c
net/http.(*Server).Serve(0xc208038360, 0x7f6e10c30858, 0xc2080340f8, 0x0, 0x0)
    /usr/local/Cellar/go/1.4.2/libexec/src/net/http/server.go:1728 +0x92
github.com/bitly/nsq/internal/http_api.Serve(0x7f6e10c30858, 0xc2080340f8, 0x7f6e10c31968, 0xc2080334f0, 0x7f6e10c318c8, 0xc208036460, 0x82a9b0, 0x4)
    /Users/mreiferson/dev/src/github.com/bitly/nsq/internal/http_api/http_server.go:18 +0x26b
github.com/bitly/nsq/nsqd.func·032()
    /Users/mreiferson/dev/src/github.com/bitly/nsq/nsqd/nsqd.go:245 +0xf6
github.com/bitly/nsq/internal/util.func·001()
    /Users/mreiferson/dev/src/github.com/bitly/nsq/internal/util/wait_group_wrapper.go:14 +0x2f
created by github.com/bitly/nsq/internal/util.(*WaitGroupWrapper).Wrap
    /Users/mreiferson/dev/src/github.com/bitly/nsq/internal/util/wait_group_wrapper.go:16 +0xf7

goroutine 9 [select]:
github.com/bitly/nsq/nsqd.(*NSQD).idPump(0xc208074200)
    /Users/mreiferson/dev/src/github.com/bitly/nsq/nsqd/nsqd.go:519 +0x388
github.com/bitly/nsq/nsqd.func·033()
    /Users/mreiferson/dev/src/github.com/bitly/nsq/nsqd/nsqd.go:248 +0x2a
github.com/bitly/nsq/internal/util.func·001()
    /Users/mreiferson/dev/src/github.com/bitly/nsq/internal/util/wait_group_wrapper.go:14 +0x2f
created by github.com/bitly/nsq/internal/util.(*WaitGroupWrapper).Wrap
    /Users/mreiferson/dev/src/github.com/bitly/nsq/internal/util/wait_group_wrapper.go:16 +0xf7

I realise this could well be operator error but I'd appreciate a prod in the right direction - been banging my head against this all day.

Cheers

@ploxiln
Copy link
Member

ploxiln commented Aug 5, 2015

The problem is probably that the "tcp protocol" port of nsqlookupd is 4160, not 4161 (that's the "http protocol" port).

runtime: out of memory is a suspicious error. nsqd could probably handle the unexpected responses from nsqlookupd better.

@mreiferson
Copy link
Member

good catch @ploxiln

@samhogg it's possible that you've found a bug where nsqd is reading a length off the wire and by receiving garbage (HTTP) data it's allocating a really large buffer 🔥 🚒

@mreiferson mreiferson added the bug label Aug 5, 2015
@mreiferson mreiferson changed the title Out of memory issue nsqd: OOM when misconfiguring lookupd addresses Aug 5, 2015
@sam-silv
Copy link
Author

sam-silv commented Aug 5, 2015

Thanks!
Just figured that out myself when re-reading the docs, think I'm back on the right track now (I hope!)

@mreiferson
Copy link
Member

@mreiferson
Copy link
Member

RFR @jehiah

jehiah added a commit that referenced this pull request Oct 4, 2015
nsqd: OOM when misconfiguring lookupd addresses
@jehiah jehiah merged commit 27c4556 into nsqio:master Oct 4, 2015
@mreiferson mreiferson deleted the lookup_peer_oom_615 branch October 4, 2015 19:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants