ingest/ledgerbackend: Fix flaky Captive Core tests #3213

bartekn · 2020-11-10T23:52:42Z

This commit fixes two issues that made the Captive Core tests flaky:

It was possible for the bufferedLedgerMetaReader goroutine to still be running after CaptiveStellarCore.Close() was called, because it returned only after getting an exit signal from stellarCoreRunner (which could happen after CaptiveStellarCore.Close()). To fix this, a new WaitForClose() method was added that blocks until the goroutine returns.
bufferedLedgerMetaReader.readLedgerMetaFromPipe() was returning an error on a graceful exit of the Stellar-Core process. It now returns an error only when Stellar-Core exits with an error.

I'm not very satisfied with the solution here. I think we should research #3200 and automate this.

tamirms · 2020-11-11T11:17:10Z

ingest/ledgerbackend/captive_core_backend.go

+	if processErr != nil {
+		return errors.Wrap(processErr, "stellar-core process exited with an error")
+	}
+	return errors.New("stellar-core process exited unexpectedly without an error")


In the case of an offline replay, after streaming all the ledgers in the catchup range, we expect stellar-core to exit without an error, right?

Good catch! I fixed it in 873797e and I was able to simplify GetLedger: it no longer needs to handle getProcessExitChan() because if Stellar-Core exits, bufferedLedgerMetaReader will send an error to the channel.

tamirms · 2020-11-13T17:53:53Z

ingest/ledgerbackend/buffered_meta_pipe_reader.go

+loop:
+	for {
+		select {
+		case <-b.c:


can this function be simplified to just <-b.closed ?

Unfortunately not. The reason is in the comment above:

// If buffer is full, keep reading to make sure it receives
// a shutdown signal from stellarCoreRunner.

To make it clearer: the start goroutine closes b.closed only on error or after receiving the process exit signal from Stellar-Core. However, the channel is buffered (up to 20 messages now), so if it is full we need to read one message to unblock the pending send so that start can exit.
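
The reasoning above can be sketched with a small stand-alone example. This is an approximation, not the PR's code: the reader type, its stop channel (a stand-in for the process exit signal) and the integer payload are all assumptions made for illustration. The point it shows is that a producer blocked on a full buffered channel cannot observe the exit signal until a consumer reads one message, so waiting on closed alone could hang.

```go
package main

import "fmt"

// reader is a hypothetical stand-in for the buffered meta reader.
type reader struct {
	c      chan int      // buffered results
	closed chan struct{} // closed by start when it returns
	stop   chan struct{} // assumed stand-in for the process exit signal
}

func newReader(size int) *reader {
	return &reader{
		c:      make(chan int, size),
		closed: make(chan struct{}),
		stop:   make(chan struct{}),
	}
}

// start produces values until it sees the stop signal. The send blocks while
// the buffer is full, so the stop check is reached only after a send completes.
func (r *reader) start() {
	defer close(r.closed)
	for i := 0; ; i++ {
		select {
		case <-r.stop:
			return
		default:
		}
		r.c <- i
	}
}

// waitForClose keeps draining c so that a blocked send in start can finish,
// and returns once start has closed r.closed.
func (r *reader) waitForClose() {
	for {
		select {
		case <-r.c:
			// Discard the message to unblock the producer.
		case <-r.closed:
			return
		}
	}
}

func main() {
	r := newReader(20)
	go r.start()
	close(r.stop)
	r.waitForClose()
	fmt.Println("reader goroutine returned")
}
```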

bartekn · 2020-11-13T20:29:49Z

ingest/ledgerbackend/stellar_core_runner.go

@@ -282,7 +282,6 @@ func (r *stellarCoreRunner) close() error {
 	if r.started {
 		close(r.shutdown)
 		r.wg.Wait()
-		close(r.processExit)


This was causing a "close of closed channel" panic, because r.processExit is already closed right after cmd.Wait().
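
For reference, a minimal stand-alone reproduction of that failure mode. closeTwice is a hypothetical helper written for this example and is unrelated to stellarCoreRunner; it only demonstrates that closing an already closed channel panics in Go, which is why a channel should be closed in exactly one place.

```go
package main

import "fmt"

// closeTwice closes ch twice and returns the recovered panic value, if any.
func closeTwice(ch chan struct{}) (recovered interface{}) {
	defer func() { recovered = recover() }()
	close(ch)
	close(ch) // the second close panics at runtime
	return nil
}

func main() {
	fmt.Println(closeTwice(make(chan struct{})))
}
```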

ingest/ledgerbackend: Fix flaky Captive Core tests

c828f31

bartekn requested a review from a team November 10, 2020 23:52

cla-bot bot added the cla: yes label Nov 10, 2020

Merge branch 'master' into fix-buffered-meta-pipe-reader-flacky

4074452

tamirms reviewed Nov 11, 2020

bartekn added 5 commits November 12, 2020 16:49

Multiple fixes

873797e

Merge branch 'fix-buffered-meta-pipe-reader-flacky' of github.com:bartekn/go into fix-buffered-meta-pipe-reader-flacky

e3a4aa8

Merge branch 'master' into fix-buffered-meta-pipe-reader-flacky

7bc7ee4

Add comment

c3ca633

Merge branch 'fix-buffered-meta-pipe-reader-flacky' of github.com:bartekn/go into fix-buffered-meta-pipe-reader-flacky

777694d

tamirms reviewed Nov 13, 2020

Remove channel close

8ce4eb5

bartekn commented Nov 13, 2020

tamirms approved these changes Nov 14, 2020

bartekn merged commit f9ab742 into stellar:master Nov 16, 2020

bartekn deleted the fix-buffered-meta-pipe-reader-flacky branch November 16, 2020 11:03

bartekn mentioned this pull request Nov 17, 2020

Reliability issues in remote-core http server #3228

Closed
