
ingest/ledgerbackend: Handle user initiated shutdown during catchup #3258

Closed

Conversation

bartekn
Contributor

@bartekn bartekn commented Nov 27, 2020

PR Checklist

PR Structure

  • This PR has reasonably narrow scope (if not, break it down into smaller PRs).
  • This PR avoids mixing refactoring changes with feature changes (split into two PRs
    otherwise).
  • This PR's title starts with name of package that is most changed in the PR, ex.
    services/friendbot, or all or doc if the changes are broad or impact many
    packages.

Thoroughness

  • This PR adds tests for the most critical parts of the new functionality or fixes.
  • I've updated any docs (developer docs, .md
    files, etc... affected by this change). Take a look in the docs folder for a given service,
    like this one.

Release planning

  • I've updated the relevant CHANGELOG (here for Horizon) if
    needed with deprecations, added features, breaking changes, and DB schema changes.
  • I've decided if this PR requires a new major/minor version according to
    semver, or if it's mainly a patch change. The PR is targeted at the next
    release branch if it's not a patch change.

What

Update the shutdown code to handle user-initiated shutdown, and convert the shutdown code and goroutines to use tomb.

Why

The shutdown code was getting too complicated. We were missing some obvious cases (like user-initiated shutdown), ran into shutdown-related problems multiple times (ex. closing an already closed channel), and in general the code was messy. In #3200 I found tomb and decided to give it a try. It has a simple yet powerful API that covers sending kill signals, waiting for a goroutine to return, and handling errors. The new code is much clearer than my first attempt in f876d60, which required three extra methods on stellarCoreRunner. It makes methods like bufferedLedgerMetaReader.waitForClose (renamed to Close, as it now allows user-initiated close too) easier to understand. Finally, tests are also easier to write.

Known limitations

Let me know if you like tomb. If it's OK, it would be great to use it in Horizon too.

@cla-bot cla-bot bot added the cla: yes label Nov 27, 2020
bartekn added a commit that referenced this pull request Dec 2, 2020
…exit during catchup (#3260)

Fixes a bug introduced in a10c000 because of which the `PrepareRange` and
`GetLedger` methods could return an error after the Stellar-Core process
exited but before all ledgers were read from the buffer. To fix it, we now
handle process exit in `bufferedLedgerMetaReader` only, and only in case
of errors. In `PrepareRange` we return an error only when Stellar-Core
exits with an error. This won't work with user-initiated shutdown; that
will be fixed in #3258.
@bartekn bartekn marked this pull request as ready for review December 2, 2020 23:21
@bartekn bartekn requested a review from a team December 2, 2020 23:21
@tamirms
Contributor

tamirms commented Dec 3, 2020

@bartekn looks like there's a v2 of tomb https://github.com/go-tomb/tomb/tree/v2

@bartekn
Contributor Author

bartekn commented Dec 3, 2020

@bartekn looks like there's a v2 of tomb https://github.com/go-tomb/tomb/tree/v2

Yes, I saw that but we already have v1 in our deps and v2 doesn't add anything that we need here. We may upgrade in the future if necessary.

@tamirms
Contributor

tamirms commented Dec 8, 2020

@bartekn I think there are potential thread-safety issues if Close() is called concurrently with GetLedger() or PrepareRange(). If we allow user-initiated shutdown, Close() can be called at any time, regardless of the state of the ingestion system

@bartekn
Contributor Author

bartekn commented Dec 8, 2020

@bartekn I think there are potential thread-safety issues if Close() is called concurrently with GetLedger() or PrepareRange(). If we allow user-initiated shutdown, Close() can be called at any time, regardless of the state of the ingestion system

I had this in mind when working on this PR. If you see any issues connected to calling Close() while PrepareRange() or GetLedger() is running, can you add inline comments?

@@ -474,7 +490,7 @@ loop:
return false, xdr.LedgerCloseMeta{}, nil
Contributor

If Close() is called concurrently with this function, is there a possible race condition where the ledger buffer is set to nil right before it is accessed here?

Contributor Author

Good catch, I focused on the shutdown code and forgot about the obvious things. Fixed in 648ebf5.

@bartekn
Copy link
Contributor Author

bartekn commented Dec 8, 2020

I ran the code again after the recent changes to ensure it's working OK, but I noticed a very slow catchup stage. I fixed it in 6acf8f8. The problem was connected to the fact that only one ledger was fetched from the buffer in PrepareRange; I changed it to empty the buffer and then sleep (and also decreased the sleep time from 1s to 100ms).

@tamirms
Contributor

tamirms commented Dec 8, 2020

@bartekn I think the code is thread-safe now. The only issue I see is that calling Close() may take a long time, because it will block on acquiring the lock until PrepareRange() completes

@bartekn
Contributor Author

bartekn commented Dec 8, 2020

the only issue I see is that calling Close() may take a long time because it will block on acquiring the lock until PrepareRange() completes

Good call again, fixed in 0505023.

@tamirms
Contributor

tamirms commented Dec 9, 2020

@bartekn I think there is another scenario which is not covered: the ingestion system's shutdown method calls Close() right before the state machine goroutine calls PrepareRange(). In that case PrepareRange() continues to execute as normal, unaware that the system is shutting down.

@bartekn
Contributor Author

bartekn commented Dec 9, 2020

@tamirms you are right, but apart from the obvious idea of disallowing reuse of CaptiveStellarCore after closing, which would require a larger refactor, I'm not sure how to solve this. I'd vote for creating an issue for this and the other shutdown-related issues and fixing them after the beta release. What do you think?
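The "disallow reuse after closing" idea could be sketched as a closed flag checked under the same lock, so a Close() that wins the race makes the subsequent PrepareRange() fail fast instead of silently starting work on a shut-down backend. All names below are hypothetical:

```go
package main

import (
	"errors"
	"sync"
)

// backend sketches the closed-flag approach: Close sets the flag under the
// lock, and PrepareRange checks it first under the same lock.
type backend struct {
	mu     sync.Mutex
	closed bool
}

func (b *backend) Close() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.closed = true
}

func (b *backend) PrepareRange() error {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.closed {
		return errors.New("backend already closed")
	}
	// ...start Stellar-Core, prepare the ledger buffer...
	return nil
}

func main() {
	b := &backend{}
	if err := b.PrepareRange(); err != nil {
		panic(err)
	}
	b.Close()
	if err := b.PrepareRange(); err == nil {
		panic("expected error after Close")
	}
}
```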

@tamirms
Contributor

tamirms commented Dec 9, 2020

@bartekn sounds good. I think I have an idea on how to tackle it, but I'd need a day or two to verify it. Fixing it after the beta release makes sense

}

time.Sleep(c.waitIntervalPrepareRange)
Contributor

Do we really need an explicit sleep here?

I would (at the very least) use a ticker and intertwine the check with the other channels in the select statement.

Even better, it may be possible to remove the wait and the condition entirely by using an asynchronous goroutine for fast-forwarding.

This may seem outside the scope of this PR, but it affects the shutdown wait time.
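The ticker-in-select suggestion could look roughly like this. `waitReady` and both channel names are hypothetical; the point is only that a shutdown signal interrupts the wait immediately instead of sleeping through it:

```go
package main

import "time"

// waitReady polls ready() on a ticker but aborts as soon as shutdown is
// signalled. It returns true once ready() reports true, or false if the
// shutdown channel is closed first. This sketches the reviewer's
// suggestion; it is not the PR's actual code.
func waitReady(shutdown <-chan struct{}, ready func() bool) bool {
	ticker := time.NewTicker(10 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-shutdown:
			return false // shutdown interrupts the wait immediately
		case <-ticker.C:
			if ready() {
				return true
			}
		}
	}
}

func main() {
	closed := make(chan struct{})
	close(closed)
	if waitReady(closed, func() bool { return false }) {
		panic("shutdown should abort the wait")
	}
	if !waitReady(make(chan struct{}), func() bool { return true }) {
		panic("ready condition should end the wait")
	}
}
```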

Contributor Author

We can try it out for sure! Can you create a new issue? I'm not sure we'll be able to test it before the beta.

Contributor

Sure

// Wait/fast-forward to the expected ledger or an error. We need to check
// buffer length because `GetLedger` may be blocking.
if len(c.ledgerBuffer.getChannel()) > 0 {
break
Contributor

I think it would make sense to add a context to GetLedger() which we can cancel on shutdown.

Contributor Author

I really wanted to avoid using context.WithCancel here because it doesn't provide two things:

  1. a way to wait until the cancel request has been handled, so the caller can be sure that it's safe to exit (e.g. the core process exited, files were removed, or the goroutine in a buffer returned),
  2. an easy way to pass a final error value, if any.

tomb provides mechanisms for both and because of this simplifies the entire process. I'm open to discussion; can we talk in a new issue?
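For comparison, replicating those two missing pieces on top of context.WithCancel requires wiring up a done channel and an error variable by hand, roughly like this (a hedged sketch; `runWithCancel` is a hypothetical name):

```go
package main

import "context"

// runWithCancel shows the two pieces context.WithCancel lacks out of the
// box: (1) waiting until the cancellation has actually been handled, via a
// done channel, and (2) carrying the goroutine's final error, via a
// separate variable.
func runWithCancel() error {
	ctx, cancel := context.WithCancel(context.Background())
	done := make(chan struct{})
	var finalErr error
	go func() {
		defer close(done)
		<-ctx.Done()
		// ...cleanup: kill the process, remove files...
		finalErr = nil // the cleanup's final error would be recorded here
	}()
	cancel()
	<-done // only now is it safe to assume cleanup has finished
	return finalErr
}

func main() {
	if err := runWithCancel(); err != nil {
		panic(err)
	}
}
```

tomb bundles exactly this bookkeeping behind Kill/Wait/Err.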

Contributor

Sure. I will open a new issue.

Contributor Author

@bartekn bartekn left a comment

@Shaptic you asked for some comments explaining how this code works. I recommend reading this article (I know you did), but as requested I added a few comments. As I mentioned in the other comment, the main advantage of tomb is that it allows sending a cancellation signal, propagating errors back to the caller (the original error is not overwritten, which is useful for user-initiated shutdown: you can send a Cancelled error and detect it later), and waiting for the goroutine to actually return. Obviously it's possible to build a similar solution using channels and wait groups, but tomb gives this out of the box. Read on...

Comment on lines +305 to +307
// Kill tomb with context.Canceled. Kill will be called again in start()
// when process exit is handled but the error value will not be overwritten.
r.tomb.Kill(context.Canceled)
Contributor Author

  1. As mentioned in the summary above, there are two ways the Stellar-Core process can be killed. The first option is here. This is user-initiated shutdown, and we can determine that later in the caller (CaptiveStellarCore in our case) by checking if the error (tomb.Err()) equals context.Canceled. The nice property of tomb is that the error is not overwritten, so if Kill is called again later we will still know it was a user-initiated shutdown.

Contributor

The answer to this is probably painfully obvious, but by "user-initiated shutdown" we generally mean something akin to Ctrl+C and/or a code path (e.g. calling app.Close()), right?

Comment on lines +33 to +34
c.tomb.Kill(c.cmd.Wait())
c.tomb.Done()
Contributor Author

  1. The second option is that Stellar-Core exited by itself. We pass the error to the caller using Kill again. It's possible that there is no error and Stellar-Core exited gracefully, but please note that this exit may still be unexpected in some cases, so it may still trigger an error. Also, this is one of the features of tomb that's missing in other solutions: an easy way to propagate the error back to the caller.

Comment on lines +377 to 387
processErr := c.stellarCoreRunner.getTomb().Err()
switch {
case processErr == nil && !ledgerRange.bounded:
return errors.New("stellar-core process exited unexpectedly without an error")
case processErr == nil && ledgerRange.bounded:
return nil
case processErr == context.Canceled:
return processErr
default:
return errors.Wrap(processErr, "stellar-core process exited with an error")
}
Contributor Author

  1. Now, this is where we actually handle the Stellar-Core process exit. I think all the cases are explained pretty well in the comment above. One thing I wanted to note is what I already described in the previous comment: the first case handles the situation in which the error is nil but we still return an error, because the exit was unexpected.

close(r.shutdown)
r.wg.Wait()
if r.tomb != nil {
r.tomb.Wait()
Contributor Author

  1. I just wanted to briefly describe the last property of tomb, which is waiting for the goroutine to return. Obviously you could achieve the same thing using channels or a wait group, but this works really well with the other tomb API methods.


select {
case b.c <- metaResult{meta, err}:
case <-b.tomb.Dying():
Contributor Author

  1. Finally, there are several events we can listen to. In this case, Dying fires when Kill has been called but the tomb may not be Dead yet. We can use this signal to start shutting down the goroutine.

// Range already prepared
if prepared, err := c.IsPrepared(ledgerRange); err != nil {
if prepared, err := c.isPrepared(ledgerRange); err != nil {
c.mutex.Unlock()
Contributor

@Shaptic Shaptic Dec 10, 2020

Would it be safer / cleaner to do something like

defer func() {
  if !unlocked {
    c.mutex.Unlock()
    unlocked = true
  }
}()

instead (and set unlocked=true below, before the wait loop)? Or, alternatively, moving the wait loop into its own function would let us just defer c.mutex.Unlock() directly here.

It just seems like it could be really easy for someone (like me lol) to edit this code in the future, then forget to unlock the mutex and end up in a dangerous state.
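The second alternative, hoisting the locked section into its own method so the mutex is released with a plain defer while the wait loop runs unlocked, could look like this. All names are hypothetical stand-ins for the PR's code:

```go
package main

import "sync"

// core sketches splitting PrepareRange into a locked setup phase and an
// unlocked wait phase, so the unlock cannot be missed on any return path.
type core struct {
	mu sync.Mutex
}

func (c *core) PrepareRange() error {
	if err := c.startPreparing(); err != nil { // lock held only inside
		return err
	}
	return c.waitForRange() // no lock held while waiting
}

func (c *core) startPreparing() error {
	c.mu.Lock()
	defer c.mu.Unlock() // released on every return path automatically
	// ...check state, start Stellar-Core, set up buffers...
	return nil
}

func (c *core) waitForRange() error {
	// ...the wait loop from the original code would live here...
	return nil
}

func main() {
	c := &core{}
	if err := c.PrepareRange(); err != nil {
		panic(err)
	}
	// The mutex is free again; a second call must not deadlock.
	if err := c.PrepareRange(); err != nil {
		panic(err)
	}
}
```

This also addresses the earlier concern that Close() blocks on the lock for the whole duration of PrepareRange(), since the lock is only held during setup.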

@bartekn
Contributor Author

bartekn commented Dec 14, 2020

Closing in favour of: #3278.

@bartekn bartekn closed this Dec 14, 2020
bartekn added a commit that referenced this pull request Oct 21, 2021
…evious instance termination (#4020)

Add code to `CaptiveCoreBackend.startPreparingRange` that ensures the
previously started Stellar-Core instance is not running: check if
`getProcessExitError` returns `true`, which means the Stellar-Core process has
fully terminated. This prevents a situation in which a new instance is started
and clashes with the previous one.

The existing code contains a bug, likely introduced in 0f2d08b. The context
returned by `stellarCoreRunner.context()` is cancelled in
`stellarCoreRunner.close()`, which initiates the termination process. At the
same time, `CaptiveCoreBackend.PrepareRange()` internally calls
`CaptiveCoreBackend.isClosed()`, whose return value depends on the
`stellarCoreRunner` context being cancelled. This is wrong because Stellar-Core
may still not be closed even when the aforementioned context is cancelled: it
can still be closing, so the process can still be running.

Because of this the following chain of events can lead to two Stellar-Core
instances running (briefly) at the same time:

1. The Stellar-Core instance is upgraded, triggering `fileWatcher` to call
   `stellarCoreRunner.close()`, which cancels `stellarCoreRunner.context()`.
2. In another goroutine, `CaptiveBackend.IsPrepared()` is called, which returns
   `false` because `stellarCoreRunner.context()` is cancelled, and then calls
   `CaptiveBackend.PrepareRange()` to restart Stellar-Core. `PrepareRange()`
   also checks if `stellarCoreRunner.context()` is cancelled (it is, but the
   Stellar-Core process can still be running its shutdown procedure) and then
   attempts to start a new instance.

This commit is really a quick fix. The code before 0f2d08b was simpler because
it called `Kill()` on the process, so "terminating" and "terminated" were
exactly the same state. After 0f2d08b there are two events associated with a
Stellar-Core process (as above). Because of this the code requires a larger
refactoring. We may reconsider using the `tomb` package I tried in #3258,
which was later closed in favour of #3278.