Reliability issues in remote-core http server #3228

2opremio · 2020-11-17T19:21:25Z

While chaos-monkey-testing captive-core in the remote HTTP configuration, I killed the core child process to simulate a crash.

Instead of respawning the core process, the captive core http server:

1. Panicked due to a channel double-close (see the panic log at the end of the description).
2. Got stuck, which is more worrisome. Instead of dying, the captive core server process lingered, giving the supervisor (Kubernetes in this case) no opportunity to respawn it:

root@horizon-with-remote-core-6bddf785b9-r9vlg:/# killall stellar-core
root@horizon-with-remote-core-6bddf785b9-r9vlg:/# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1 20.5  1.5 1539976 32040 ?       Ssl  18:13  12:26 /captivecore --port=8080
root      7161  0.0  0.1  18504  3372 pts/0    Ss   19:11   0:00 bash
root      7220  0.0  0.1  34400  2792 pts/0    R+   19:14   0:00 ps aux
root@horizon-with-remote-core-6bddf785b9-r9vlg:/# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1 18.6  1.3 1539976 28016 ?       Ssl  18:13  12:27 /captivecore --port=8080
root      7161  0.0  0.1  18504  3372 pts/0    Ss   19:11   0:00 bash
root      7221  0.0  0.1  34400  2748 pts/0    R+   19:20   0:00 ps aux
root@horizon-with-remote-core-6bddf785b9-r9vlg:/#
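
In the session above, PID 1 (`/captivecore`) is still running after the child was killed, so a supervisor that reacts to process exit has nothing to react to. A minimal sketch of the behavior one would expect instead is shown here: respawn the child a bounded number of times and, if that fails, return an error so the caller can exit with a non-zero status. The names `runWithRestarts`, `errChildCrashed`, and the `run` callback are hypothetical and are not taken from the captive core code; in a real server `run` would presumably wrap waiting on the child process.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// errChildCrashed is a hypothetical sentinel standing in for an
// unexpected child exit (for example, the child being killed).
var errChildCrashed = errors.New("child crashed")

// runWithRestarts calls run until it returns nil (clean shutdown) or until
// maxRestarts respawns have been used up. run is assumed to block for the
// lifetime of the child and to return a non-nil error on an unexpected exit.
func runWithRestarts(run func() error, maxRestarts int) error {
	var err error
	for attempt := 0; attempt <= maxRestarts; attempt++ {
		if err = run(); err == nil {
			return nil
		}
		fmt.Fprintf(os.Stderr, "child exited (run %d): %v\n", attempt+1, err)
	}
	return fmt.Errorf("child failed %d times, giving up: %w", maxRestarts+1, err)
}

func main() {
	// Stand-in child: crashes twice, then shuts down cleanly.
	runs := 0
	child := func() error {
		runs++
		if runs <= 2 {
			return errChildCrashed
		}
		return nil
	}

	if err := runWithRestarts(child, 2); err != nil {
		// Exiting non-zero lets the supervisor see the failure and
		// restart the whole process instead of leaving it lingering.
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("child finished after %d runs\n", runs)
}
```

The design choice in this sketch is that the server never stays alive without a working child: it either recovers on its own or terminates so the supervisor can take over.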

Panic log:

Panic: close of closed channel
goroutine 418366 [running]:
runtime/debug.Stack(0x1f, 0x0, 0x0)
	/usr/local/go/src/runtime/debug/stack.go:24 +0x9d
runtime/debug.PrintStack()
	/usr/local/go/src/runtime/debug/stack.go:16 +0x22
github.com/go-chi/chi/middleware.Recoverer.func1.1(0xc001431600, 0xf89860, 0xc001450380)
	/go/pkg/mod/github.com/go-chi/chi@v4.0.3+incompatible/middleware/recoverer.go:28 +0x1e3
panic(0xc70f60, 0xf61e60)
	/usr/local/go/src/runtime/panic.go:969 +0x166
github.com/stellar/go/ingest/ledgerbackend.(*stellarCoreRunner).close(0xc00011e0c0, 0x40ec08, 0x20)
	/go/src/github.com/stellar/go/ingest/ledgerbackend/stellar_core_runner.go:285 +0x192
github.com/stellar/go/ingest/ledgerbackend.(*CaptiveStellarCore).Close(0xc0003e8180, 0xc0014bffa0, 0x0)
	/go/src/github.com/stellar/go/ingest/ledgerbackend/captive_core_backend.go:502 +0x77
github.com/stellar/go/ingest/ledgerbackend.(*CaptiveStellarCore).GetLedger(0xc0003e8180, 0xc00004e6da, 0xb95222, 0xc0014bd980, 0xd70a63, 0xc, 0x0)
	/go/src/github.com/stellar/go/ingest/ledgerbackend/captive_core_backend.go:467 +0x552
github.com/stellar/go/exp/services/captivecore/internal.(*CaptiveCoreAPI).GetLedger(0xc0003ec040, 0x4e6da, 0xc001283e00, 0x0, 0x0, 0x0, 0x0)
	/go/src/github.com/stellar/go/exp/services/captivecore/internal/api.go:170 +0xbf
github.com/stellar/go/exp/services/captivecore/internal.Handler.func2(0x7fafe80781b8, 0xc0000e2d00, 0xc001431800)
	/go/src/github.com/stellar/go/exp/services/captivecore/internal/server.go:54 +0x152
net/http.HandlerFunc.ServeHTTP(0xc0003ea520, 0x7fafe80781b8, 0xc0000e2d00, 0xc001431800)
	/usr/local/go/src/net/http/server.go:2041 +0x44
github.com/go-chi/chi.(*Mux).routeHTTP(0xc0000bde00, 0x7fafe80781b8, 0xc0000e2d00, 0xc001431800)
	/go/pkg/mod/github.com/go-chi/chi@v4.0.3+incompatible/mux.go:425 +0x278
net/http.HandlerFunc.ServeHTTP(0xc0003ea510, 0x7fafe80781b8, 0xc0000e2d00, 0xc001431800)
	/usr/local/go/src/net/http/server.go:2041 +0x44
github.com/stellar/go/support/http.LoggingMiddleware.func1(0xf89860, 0xc001450380, 0xc001431800)
	/go/src/github.com/stellar/go/support/http/logging_middleware.go:40 +0x392
net/http.HandlerFunc.ServeHTTP(0xc0003ca620, 0xf89860, 0xc001450380, 0xc001431700)
	/usr/local/go/src/net/http/server.go:2041 +0x44
github.com/stellar/go/support/http.SetLoggerMiddleware.func1.1(0xf89860, 0xc001450380, 0xc001431600)
	/go/src/github.com/stellar/go/support/http/logging_middleware.go:20 +0x16c
net/http.HandlerFunc.ServeHTTP(0xc0003ca640, 0xf89860, 0xc001450380, 0xc001431600)
	/usr/local/go/src/net/http/server.go:2041 +0x44
github.com/go-chi/chi/middleware.Recoverer.func1(0xf89860, 0xc001450380, 0xc001431600)
	/go/pkg/mod/github.com/go-chi/chi@v4.0.3+incompatible/middleware/recoverer.go:35 +0x83
net/http.HandlerFunc.ServeHTTP(0xc0003ca660, 0xf89860, 0xc001450380, 0xc001431600)
	/usr/local/go/src/net/http/server.go:2041 +0x44
github.com/go-chi/chi/middleware.RequestID.func1(0xf89860, 0xc001450380, 0xc001431500)
	/go/pkg/mod/github.com/go-chi/chi@v4.0.3+incompatible/middleware/request_id.go:76 +0x1df
net/http.HandlerFunc.ServeHTTP(0xc0003ca680, 0xf89860, 0xc001450380, 0xc001431500)
	/usr/local/go/src/net/http/server.go:2041 +0x44
github.com/go-chi/chi.(*Mux).ServeHTTP(0xc0000bde00, 0xf89860, 0xc001450380, 0xc001431400)
	/go/pkg/mod/github.com/go-chi/chi@v4.0.3+incompatible/mux.go:82 +0x2b2
net/http.serverHandler.ServeHTTP(0xc0001fa0e0, 0xf89860, 0xc001450380, 0xc001431400)
	/usr/local/go/src/net/http/server.go:2836 +0xa3
net/http.(*conn).serve(0xc001a28960, 0xf8d5a0, 0xc0003ed7c0)
	/usr/local/go/src/net/http/server.go:1924 +0x86c
created by net/http.(*Server).Serve
	/usr/local/go/src/net/http/server.go:2962 +0x35c
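
The trace shows the panic being raised inside `(*stellarCoreRunner).close`, reached through `(*CaptiveStellarCore).Close` from `GetLedger`. In Go, closing a channel that is already closed always panics. One common way to make a close method safe to call more than once is to guard it with `sync.Once`. The sketch below illustrates that general technique on a hypothetical `runner` type; it is not the actual `stellarCoreRunner` code and not necessarily how the fix referenced in this thread works.

```go
package main

import (
	"fmt"
	"sync"
)

// runner is a hypothetical stand-in for a type that owns a shutdown channel.
type runner struct {
	shutdown  chan struct{}
	closeOnce sync.Once
}

func newRunner() *runner {
	return &runner{shutdown: make(chan struct{})}
}

// Close signals shutdown. sync.Once ensures the channel is closed exactly
// once, so repeated or concurrent calls to Close do not panic.
func (r *runner) Close() {
	r.closeOnce.Do(func() { close(r.shutdown) })
}

func main() {
	r := newRunner()
	r.Close()
	r.Close() // without the Once this would panic: close of closed channel

	<-r.shutdown // receiving from a closed channel returns immediately
	fmt.Println("Close called twice without panicking")
}
```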


bartekn · 2020-11-17T19:23:08Z

Can you try it again on master or release-horizon-v1.12.0? I believe I fixed it in #3213 (at least the close-of-closed-channel issue).

2opremio · 2020-11-17T19:43:21Z

Hmm, unfortunately remote-captive-core seems to be broken on master (b361462).

I will create a separate ticket for that.

2opremio · 2020-11-17T20:06:14Z

Done: #3230

2opremio added the fast-txmeta label Nov 17, 2020

2opremio assigned tamirms Nov 17, 2020

2opremio added horizon bug labels Nov 17, 2020

2opremio mentioned this issue Nov 17, 2020

Confirm crash/restart works reliably during online mode #2611

Closed

2opremio mentioned this issue Nov 17, 2020

Remote captive core fails to start #3230

Closed

tamirms mentioned this issue Nov 18, 2020

exp/services/captivecore: Captive Core API fixes #3232

Merged

7 tasks

2opremio closed this as completed Nov 18, 2020

tamirms mentioned this issue Dec 10, 2020

ingest/ledgerbackend: Use context to handle termination and cleanup of captive core #3278

Merged

7 tasks
