ingest/ledgerbackend: Handle user initiated shutdown during catchup #3258
Conversation
…exit during catchup (#3260) Fixes a bug introduced in a10c000, because of which the `PrepareRange` and `GetLedger` methods could return an error after the Stellar-Core process exited but before all ledgers were read from the buffer. To fix it, we now handle process exit in `bufferedLedgerMetaReader` only, and only in case of errors. In `PrepareRange` we return an error only when Stellar-Core exits with an error. This won't work with user-initiated shutdown; that will be fixed in #3258.
@bartekn looks like there's a v2 of tomb: https://github.com/go-tomb/tomb/tree/v2
Yes, I saw that but we already have
@bartekn I think there are potentially thread-safety issues if `Close()` is called concurrently with `GetLedger()` or `PrepareRange()`. If we allow user-initiated shutdown, `Close()` can be called at any time, regardless of the state of the ingestion system.
I had this in mind when working on this PR. If you see any issues connected to calling
```diff
@@ -474,7 +490,7 @@ loop:
 	return false, xdr.LedgerCloseMeta{}, nil
```
If `Close()` is called concurrently with this function, is there a possible race condition where the ledger buffer is set to nil right before it is accessed here?
Good catch, I focused on shutdown code and forgot about the obvious things. Fixed in 648ebf5.
I ran the code again after the recent changes to ensure it's working OK, but I noticed a very slow catchup stage. I fixed it in 6acf8f8. The problem was that only one ledger was fetched from the buffer in
@bartekn I think the code is thread-safe now. The only issue I see is that calling
Good call again, fixed in 0505023.
@bartekn I think there is another scenario which is not covered. Consider what happens if the ingestion system's shutdown method is called right before the state machine goroutine calls
@tamirms you are right, but apart from the obvious idea of disallowing reuse of
@bartekn sounds good. I think I have an idea on how to tackle it, but I'd need a day or two to verify it. Fixing it after the beta release makes sense.
```go
}

time.Sleep(c.waitIntervalPrepareRange)
```
Do we really need an explicit sleep here?
I would (at the very least) use a ticker and intertwine the check with the other channels in the select statement.
Even better, it may be possible to remove the wait and condition entirely by using an asynchronous goroutine for fast-forwarding.
It may seem outside the scope of this PR, but it affects the shutdown wait time.
We can try it out for sure! Can you create a new issue? I'm not sure if we'll be able to test it before beta.
Sure
```go
if len(c.ledgerBuffer.getChannel()) > 0 {
	break
// Wait/fast-forward to the expected ledger or an error. We need to check
// buffer length because `GetLedger` may be blocking.
```
I think it would make sense to add a context to `GetLedger()` which we can cancel on shutdown.
I really wanted to avoid using `context.WithCancel` here because it doesn't provide two things:
- a way to wait until the cancel request has been handled, so the caller can be sure it's safe to exit (e.g. the core process exited, files were removed, or the goroutine in a buffer returned),
- an easy way to pass a final error value, if any.

`tomb` provides a mechanism for both and thereby simplifies the entire process. I'm open to discussion; can we talk in a new issue?
Sure. I will open a new issue.
@Shaptic you asked for some comments explaining how this code works. I recommend reading this article (I know you did), but as requested I added a few comments. As I mentioned in the other comment, the main advantage of `tomb` is that it allows sending a cancellation signal, but also allows propagating errors back to the caller (and the original error is not overwritten, which is useful for user-initiated shutdown: you can send a `Cancelled` error to detect it), and allows waiting for the goroutine to actually return. Obviously it's possible to build a similar solution using channels and wait groups, but `tomb` gives this out of the box. Read on...
```go
// Kill tomb with context.Canceled. Kill will be called again in start()
// when process exit is handled but the error value will not be overwritten.
r.tomb.Kill(context.Canceled)
```
- As mentioned in the summary above, there are two ways we can kill the Stellar-Core process. The first option is here. This is a user-initiated shutdown, and we can detect it later in the caller (`CaptiveStellarCore` in our case) by checking whether the error (`tomb.Err()`) equals `context.Canceled`. The nice property of `tomb` is that the error is not overwritten, so if `Kill` is called again later we will still know it was a user-initiated shutdown.
The answer to this is probably painfully obvious, but by "user-initiated shutdown" we generally mean something akin to Ctrl+C and/or a code path (e.g. calling `app.Close()`), right?
```go
c.tomb.Kill(c.cmd.Wait())
c.tomb.Done()
```
- The second option is that Stellar-Core exited by itself. We pass the error to the caller using `Kill` again. It's possible that there is no error and Stellar-Core exited gracefully, but note that this exit may be unexpected in some cases, so it may still trigger an error. Also, this is one of the features of `tomb` that's missing in other solutions: an easy way to propagate the error back to the caller.
```go
processErr := c.stellarCoreRunner.getTomb().Err()
switch {
case processErr == nil && !ledgerRange.bounded:
	return errors.New("stellar-core process exited unexpectedly without an error")
case processErr == nil && ledgerRange.bounded:
	return nil
case processErr == context.Canceled:
	return processErr
default:
	return errors.Wrap(processErr, "stellar-core process exited with an error")
}
```
- Now, this is where we actually try to handle the Stellar-Core process exit. I think all the cases are explained pretty well in the comment above. One thing I wanted to note is what I already described in the previous comment: the first `case` handles the situation in which the error is `nil` but we still return an error, because the exit was unexpected.
```go
close(r.shutdown)
r.wg.Wait()
if r.tomb != nil {
	r.tomb.Wait()
```
- I just wanted to briefly describe the last property of `tomb`: waiting for the goroutine to return. Obviously you could achieve the same thing using channels or a wait group, but this works really well with the other `tomb` API methods.
```go
select {
case b.c <- metaResult{meta, err}:
case <-b.tomb.Dying():
```
- Finally, there are several events we can listen to. In this case, `Dying` fires when `Kill` was called but the `tomb` may not be `Dead` yet. We can use this signal to start shutting down the goroutine.
```diff
 // Range already prepared
-if prepared, err := c.IsPrepared(ledgerRange); err != nil {
+if prepared, err := c.isPrepared(ledgerRange); err != nil {
 	c.mutex.Unlock()
```
Would it be safer / cleaner to do something like

```go
defer func() {
	if !unlocked {
		c.mutex.Unlock()
		unlocked = true
	}
}()
```

instead (and set `unlocked = true` below, before the `waitloop`)? Or, alternatively, moving the wait loop to its own function would let us just `defer c.mutex.Unlock()` directly here.
It just seems like it could be really easy for someone (like me lol) to edit this code in the future, forget to unlock the mutex, and end up in a dangerous state.
Closing in favour of: #3278.
…evious instance termination (#4020) Add code to `CaptiveCoreBackend.startPreparingRange` that ensures the previously started Stellar-Core instance is not running: check whether `getProcessExitError` returns `true`, which means the Stellar-Core process is fully terminated. This prevents a situation in which a new instance is started and clashes with the previous one. The existing code contains a bug, likely introduced in 0f2d08b. The context returned by `stellarCoreRunner.context()` is cancelled in `stellarCoreRunner.close()`, which initiates the termination process. At the same time, `CaptiveCoreBackend.PrepareRange()` internally calls `CaptiveCoreBackend.isClosed()`, whose return value depends on the `stellarCoreRunner` context being closed. This is wrong because Stellar-Core may not be closed even when the aforementioned context is cancelled: it can still be shutting down, so the process can still be running. Because of this, the following chain of events can lead to two Stellar-Core instances running (briefly) at the same time: 1. A Stellar-Core instance is upgraded, triggering `fileWatcher` to call `stellarCoreRunner.close()`, which cancels `stellarCoreRunner.context()`. 2. In another goroutine, `CaptiveBackend.IsPrepared()` is called; it returns `false` because `stellarCoreRunner.context()` is cancelled, and then `CaptiveBackend.PrepareRange()` is called to restart Stellar-Core. `PrepareRange()` also checks whether `stellarCoreRunner.context()` is cancelled (it is, but the Stellar-Core process can still be running its shutdown procedure) and then attempts to start a new instance. This commit is really a quick fix. The code before 0f2d08b was simpler because it called `Kill()` on the process, so "terminating" and "terminated" were exactly the same state. After 0f2d08b there are two events associated with a Stellar-Core process (as above). Because of this, the code requires a larger refactoring.
We may reconsider using the `tomb` package I tried in #3258, which was later closed in favour of #3278.
PR Checklist
PR Structure
- otherwise).
- `services/friendbot`, or `all` or `doc` if the changes are broad or impact many packages.
Thoroughness
- `.md` files, etc... affected by this change). Take a look in the `docs` folder for a given service, like this one.
Release planning
- needed with deprecations, added features, breaking changes, and DB schema changes.
- semver, or if it's mainly a patch change. The PR is targeted at the next release branch if it's not a patch change.
What
Update the shutdown code to handle user-initiated shutdown. Update the shutdown code and goroutines to use `tomb`.
Why
The shutdown code was getting too complicated. We were missing some obvious cases (like user-initiated shutdown), ran into shutdown-related problems multiple times (e.g. closing an already-closed channel), and in general the code was messy. In #3200 I found `tomb` and decided to give it a try. It has a simple yet powerful API that covers sending kill signals, waiting for a goroutine to return, and handling errors. The new code is much clearer than my first attempt in f876d60, which required three extra methods on `stellarCoreRunner`. It makes methods like `bufferedLedgerMetaReader.waitForClose` (renamed to `Close`, as it now allows user-initiated close too) easier to understand. Finally, tests are also easier to write.
Known limitations
Let me know if you like `tomb`. If it's OK, it would be great to use it in Horizon.