
containerpilot spawns many threads (LWP) #490

Closed
roengram opened this issue Aug 22, 2017 · 7 comments

@roengram

containerpilot version: 2.7.7
cloud: samsung private cloud (based on JPC)

I noticed that the containerpilot process is creating many LWPs.
Below is the output of /native/usr/bin/prstat:

  PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP         
 32716 root     1217M 1186M sleep   59    0  11:28:25 0.1% prometheus/65
 32715 root       82M   65M sleep   59    0   1:10:21 0.0% griffin-log-col/63
 32722 root       52M   32M sleep   59    0   1:27:03 0.0% consul/67
 30693 root      247M  196M sleep   54    0   0:01:02 0.0% java/228
 70092 root        0K    0K zombie   0    -   0:00:00 0.0% /0
 37293 root       14M 3536K sleep   59    0   0:00:00 0.0% bash/1
 84507 root       14M 3560K sleep   59    0   0:00:00 0.0% bash/1
 38034 root       14M 3552K sleep   59    0   0:00:00 0.0% bash/1
 76483 root       14M 3588K sleep   59    0   0:00:00 0.0% bash/1
 74903 root       25M 3552K sleep   43    0   0:00:00 0.0% vi/1
 30676 root       14M  408K sleep   59    0   0:00:00 0.0% run-griffin-age/1
 84546 root     4340K 3472K cpu14    1    0   0:00:00 0.0% prstat/1
 30679 root       14M  404K sleep   59    0   0:00:00 0.0% bash/1
 69399 root       14M 3568K sleep   45    0   0:00:00 0.0% bash/1
 32714 root       14M  392K sleep   59    0   0:00:00 0.0% run.sh/1
 32138 root       23M   10M sleep    1    0   0:12:45 0.0% containerpilot/28
 59901 root       14M 3592K sleep   59    0   0:00:00 0.0% bash/1
 42998 root        0K    0K zombie   0    -   0:00:00 0.0% /0
 69398 root       14M 2288K sleep   59    0   0:00:00 0.0% bash/1
 31778 root        0K    0K sleep   60    -   0:00:00 0.0% zsched/1
 31874 root      268M  254M sleep   59    0   1:14:37 0.0% containerpilot/3520

Total: 21 processes, 3984 lwps, load averages: 0.83, 1.44, 1.49

There's a containerpilot process at the bottom, which has 3520 LWPs. We discovered this issue while investigating the cause of "resource temporarily unavailable" errors when forking a new process. Note that even though the problematic containerpilot's PID is 31874 in prstat, it's actually PID 1. The process also uses quite a large amount of memory (254M), probably because of the thread stacks.

Contributor

tgross commented Aug 22, 2017

> Note that even though the problematic containerpilot's PID is 31874 in prstat, it's actually PID 1.

You've got 2 ContainerPilot processes, as we'd expect. Based on the TIME value, it looks like 31874 is the supervisor (which should be PID 1, as you say). That should give us a direction to look -- the problem is almost certainly in this file.

32138 root       23M   10M sleep    1    0   0:12:45 0.0% containerpilot/28
31874 root      268M  254M sleep   59    0   1:14:37 0.0% containerpilot/3520

Contributor

tgross commented Aug 22, 2017

I'm assuming this is an LX-branded zone or a Docker container on sdc-docker? I'd be really interested to see the output of the process tree from the perspective of Linux, via ps -afx.

Contributor

tgross commented Aug 22, 2017

Right now, the only place where I see an opportunity for a resource leak in that code section is where we spawn a goroutine to reap child processes (see sup.go#L45). I'm not sure I can see a case in the handler where we'd find ourselves in the kind of infinite loop that would cause these goroutines to be persistent. But we should probably move the entire signal handler for SIGCHLD into its own goroutine, so that we only ever have at most one goroutine. We'll want to have this in v3 as well.
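
A minimal sketch of that single-goroutine approach, for illustration only (this is not the actual sup.go code, and startReaper/reap are invented names): one long-lived goroutine receives SIGCHLD via signal.Notify and drains zombies with a non-blocking WNOHANG wait loop, so no reaper goroutine ever parks inside wait4.

package main

import (
	"os"
	"os/exec"
	"os/signal"
	"syscall"
	"time"
)

// startReaper launches exactly one long-lived goroutine that listens for
// SIGCHLD and drains every already-exited child without blocking.
func startReaper() {
	sigs := make(chan os.Signal, 1) // buffered so a SIGCHLD arriving mid-reap isn't dropped
	signal.Notify(sigs, syscall.SIGCHLD)
	go func() {
		for range sigs {
			reap()
		}
	}()
}

// reap collects all children that have already exited. WNOHANG makes wait4
// return immediately instead of parking the goroutine (and its OS thread)
// until some future child exit.
func reap() {
	for {
		var status syscall.WaitStatus
		pid, err := syscall.Wait4(-1, &status, syscall.WNOHANG, nil)
		if pid <= 0 || err != nil {
			return // nothing left to reap right now
		}
	}
}

func main() {
	startReaper()
	// Spawn a short-lived child so there is something to reap, then idle briefly.
	_ = exec.Command("true").Start()
	time.Sleep(time.Second)
}

With this shape the worst case is one goroutine blocked on the channel receive, no matter how many SIGCHLDs arrive.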

Contributor

tgross commented Aug 22, 2017

I've opened #493 for v3 and #492 for v2.

@roengram
Author

The container was launched using sdc-docker. And although the container has already died, I was able to get some docker logs from it, which probably confirm your suspicion about that file.

Here is a snippet:

goroutine 1090 [syscall, 879 minutes]:
syscall.Syscall6(0x3d, 0xffffffffffffffff, 0xc822f76f3c, 0x0, 0x0, 0x0, 0x0, 0x6a2e, 0x0, 0x0)
    /usr/local/go/src/syscall/asm_linux_amd64.s:44 +0x5
syscall.wait4(0xffffffffffffffff, 0xc822f76f3c, 0x0, 0x0, 0x6a2e, 0x0, 0x0)
    /usr/local/go/src/syscall/zsyscall_linux_amd64.go:172 +0x7f 
syscall.Wait4(0xffffffffffffffff, 0xc822f76f84, 0x0, 0x0, 0x6a2e, 0x0, 0x0)
    /usr/local/go/src/syscall/syscall_linux.go:256 +0x55 
github.com/joyent/containerpilot/sup.reap()
    /cp/src/github.com/joyent/containerpilot/sup/sup.go:56 +0x48 
created by github.com/joyent/containerpilot/sup.handleSignals.func1
    /cp/src/github.com/joyent/containerpilot/sup/sup.go:45 +0x271

goroutine 1091 [syscall, 913 minutes]:
syscall.Syscall6(0x3d, 0xffffffffffffffff, 0xc822f7773c, 0x0, 0x0, 0x0, 0x0, 0x172b6, 0x0, 0x0)
    /usr/local/go/src/syscall/asm_linux_amd64.s:44 +0x5
syscall.wait4(0xffffffffffffffff, 0xc822f7773c, 0x0, 0x0, 0x172b6, 0x0, 0x0)
    /usr/local/go/src/syscall/zsyscall_linux_amd64.go:172 +0x7f 
syscall.Wait4(0xffffffffffffffff, 0xc822f77784, 0x0, 0x0, 0x172b6, 0x0, 0x0)
    /usr/local/go/src/syscall/syscall_linux.go:256 +0x55 
github.com/joyent/containerpilot/sup.reap()
    /cp/src/github.com/joyent/containerpilot/sup/sup.go:56 +0x48 
created by github.com/joyent/containerpilot/sup.handleSignals.func1
    /cp/src/github.com/joyent/containerpilot/sup/sup.go:45 +0x271

goroutine 1092 [syscall, 830 minutes]:
syscall.Syscall6(0x3d, 0xffffffffffffffff, 0xc822f77f3c, 0x0, 0x0, 0x0, 0x0, 0x11c3e, 0x0, 0x0)
    /usr/local/go/src/syscall/asm_linux_amd64.s:44 +0x5
syscall.wait4(0xffffffffffffffff, 0xc822f77f3c, 0x0, 0x0, 0x11c3e, 0x0, 0x0)
    /usr/local/go/src/syscall/zsyscall_linux_amd64.go:172 +0x7f 
syscall.Wait4(0xffffffffffffffff, 0xc822f77f84, 0x0, 0x0, 0x11c3e, 0x0, 0x0)
    /usr/local/go/src/syscall/syscall_linux.go:256 +0x55 
github.com/joyent/containerpilot/sup.reap()
    /cp/src/github.com/joyent/containerpilot/sup/sup.go:56 +0x48 
created by github.com/joyent/containerpilot/sup.handleSignals.func1
    /cp/src/github.com/joyent/containerpilot/sup/sup.go:45 +0x271

goroutine 1093 [syscall, 819 minutes]:
syscall.Syscall6(0x3d, 0xffffffffffffffff, 0xc82302873c, 0x0, 0x0, 0x0, 0x0, 0x140f2, 0x0, 0x0)
    /usr/local/go/src/syscall/asm_linux_amd64.s:44 +0x5
syscall.wait4(0xffffffffffffffff, 0xc82302873c, 0x0, 0x0, 0x140f2, 0x0, 0x0)
    /usr/local/go/src/syscall/zsyscall_linux_amd64.go:172 +0x7f 
syscall.Wait4(0xffffffffffffffff, 0xc823028784, 0x0, 0x0, 0x140f2, 0x0, 0x0)
    /usr/local/go/src/syscall/syscall_linux.go:256 +0x55 
github.com/joyent/containerpilot/sup.reap()
    /cp/src/github.com/joyent/containerpilot/sup/sup.go:56 +0x48 
created by github.com/joyent/containerpilot/sup.handleSignals.func1
    /cp/src/github.com/joyent/containerpilot/sup/sup.go:45 +0x271
...

The stack traces are identical for all of the goroutines that were logged. There are 2539 goroutines in total in the log, but I'm sure the traces for the other ~1000 goroutines were lost to log rotation.

This is the entire log file: log.txt
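
For what it's worth, these traces look consistent with a reap-per-signal pattern along the following lines. This is a hedged reconstruction for illustration, not the actual sup.go source: each SIGCHLD spawns a goroutine that loops over a blocking wait4(-1, ..., 0, ...); after it collects the zombie that triggered the signal, the next iteration parks in wait4 until any other child exits, and since a long-running child (the worker ContainerPilot) never exits, every such goroutine, and the OS thread it is parked on, leaks.

package main

import (
	"fmt"
	"os"
	"os/exec"
	"os/signal"
	"runtime"
	"syscall"
	"time"
)

func main() {
	sigs := make(chan os.Signal, 128)
	signal.Notify(sigs, syscall.SIGCHLD)

	// Leaky pattern implied by the traces: one new goroutine per SIGCHLD.
	go func() {
		for range sigs {
			go func() {
				// Reap in a loop with a *blocking* wait4 (options=0, as in
				// the traces). Once the zombie that triggered the signal is
				// collected, the next iteration blocks here until ANY other
				// child exits, and the long-running child below never does.
				for {
					var ws syscall.WaitStatus
					if pid, err := syscall.Wait4(-1, &ws, 0, nil); pid <= 0 || err != nil {
						return
					}
				}
			}()
		}
	}()

	// One long-lived child plus a stream of short-lived ones to generate SIGCHLDs.
	_ = exec.Command("sleep", "60").Start()
	for i := 0; i < 50; i++ {
		_ = exec.Command("true").Start()
		time.Sleep(20 * time.Millisecond)
	}

	time.Sleep(time.Second)
	// Each leaked goroutine is parked in a syscall and therefore pins an LWP.
	fmt.Println("goroutines still alive:", runtime.NumGoroutine())
}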

Contributor

jwreagor commented Aug 23, 2017

@roengram Thanks for the detailed information. This is indeed the bug that @tgross fixed earlier today. We'll have the release out tomorrow AM (EDT).

@jwreagor
Contributor

The fix was released in both 2.7.8 and 3.4.2. Let us know if it resolves your issue, and we'll re-open this issue if not.
