
ingest/ledgerbackend: Handle user initiated shutdown during catchup #3258

Closed

Conversation

bartekn
Contributor

@bartekn bartekn commented Nov 27, 2020

PR Checklist

PR Structure

  • This PR has reasonably narrow scope (if not, break it down into smaller PRs).
  • This PR avoids mixing refactoring changes with feature changes (split into two PRs
    otherwise).
  • This PR's title starts with name of package that is most changed in the PR, ex.
    services/friendbot, or all or doc if the changes are broad or impact many
    packages.

Thoroughness

  • This PR adds tests for the most critical parts of the new functionality or fixes.
  • I've updated any docs (developer docs, .md
    files, etc... affected by this change). Take a look in the docs folder for a given service,
    like this one.

Release planning

  • I've updated the relevant CHANGELOG (here for Horizon) if
    needed with deprecations, added features, breaking changes, and DB schema changes.
  • I've decided if this PR requires a new major/minor version according to
    semver, or if it's mainly a patch change. The PR is targeted at the next
    release branch if it's not a patch change.

What

Update the shutdown code to handle user-initiated shutdown, and convert the shutdown code and goroutines to use tomb.

Why

The shutdown code was getting too complicated. We were missing some obvious cases (like user-initiated shutdown), ran into shutdown-related problems multiple times (ex. closing an already closed channel), and in general the code was messy. In #3200 I found tomb and decided to give it a try. It has a simple yet powerful API that covers sending kill signals, waiting for a goroutine to return, and handling errors. The new code is much clearer than my first attempt in f876d60, which required three extra methods on stellarCoreRunner. It makes methods like bufferedLedgerMetaReader.waitForClose (renamed to Close, as it now allows user-initiated close too) easier to understand. Finally, tests are also easier to write.

Known limitations

Let me know if you like tomb. If it's OK, it would be great to use it in Horizon too.

@cla-bot cla-bot bot added the cla: yes label Nov 27, 2020
bartekn added a commit that referenced this pull request Dec 2, 2020
…exit during catchup (#3260)

Fixes a bug introduced in a10c000 because of which the `PrepareRange` and
`GetLedger` methods could return an error after the Stellar-Core process
exited but before all ledgers were read from the buffer. To fix it, we now
handle process exit in `bufferedLedgerMetaReader` only, and only in case
of errors. In `PrepareRange` we return an error only when Stellar-Core
exits with an error. This won't work with user-initiated shutdown; that
will be fixed in #3258.
@bartekn bartekn marked this pull request as ready for review December 2, 2020 23:21
@bartekn bartekn requested a review from a team December 2, 2020 23:21
@tamirms
Contributor

tamirms commented Dec 3, 2020

@bartekn looks like there's a v2 of tomb https://github.com/go-tomb/tomb/tree/v2

@bartekn
Contributor Author

bartekn commented Dec 3, 2020

@bartekn looks like there's a v2 of tomb https://github.com/go-tomb/tomb/tree/v2

Yes, I saw that but we already have v1 in our deps and v2 doesn't add anything that we need here. We may upgrade in the future if necessary.

@tamirms
Contributor

tamirms commented Dec 8, 2020

@bartekn I think there are potential thread-safety issues if Close() is called concurrently with GetLedger() or PrepareRange(). If we allow user-initiated shutdown, Close() can be called at any time, regardless of the state of the ingestion system

@bartekn
Contributor Author

bartekn commented Dec 8, 2020

@bartekn I think there are potential thread-safety issues if Close() is called concurrently with GetLedger() or PrepareRange(). If we allow user-initiated shutdown, Close() can be called at any time, regardless of the state of the ingestion system

I had this in mind when working on this PR. If you see any issues connected to calling Close() while PrepareRange() or GetLedger() is running, can you add inline comments?

@@ -474,7 +490,7 @@ loop:
return false, xdr.LedgerCloseMeta{}, nil
Contributor

If Close() is called concurrently with this function, is there a possible race condition where the ledger buffer is set to nil right before it is accessed here?

Contributor Author

Good catch, I focused on the shutdown code and forgot about the obvious things. Fixed in 648ebf5.

@bartekn
Copy link
Contributor Author

bartekn commented Dec 8, 2020

I ran the code again after the recent changes to ensure it's working OK, but I noticed a very slow catchup stage. I fixed it in 6acf8f8. The problem was connected to the fact that only one ledger was fetched from the buffer in PrepareRange; I changed it to empty the buffer and then sleep (and also decreased the sleep time from 1s to 100ms).

@tamirms
Contributor

tamirms commented Dec 8, 2020

@bartekn I think the code is thread-safe now. The only issue I see is that calling Close() may take a long time, because it will block on acquiring the lock until PrepareRange() completes

@bartekn
Contributor Author

bartekn commented Dec 8, 2020

the only issue I see is that calling Close() may take a long time because it will block on acquiring the lock until PrepareRange() completes

Good call again, fixed in 0505023.

@tamirms
Contributor

tamirms commented Dec 9, 2020

@bartekn I think there is another scenario which is not covered: the ingestion system's shutdown method calls Close() right before the state machine goroutine calls PrepareRange(). In that case PrepareRange() continues to execute as normal, unaware that the system is shutting down.

@bartekn
Contributor Author

bartekn commented Dec 9, 2020

@tamirms you are right, but apart from the obvious idea of disallowing reuse of CaptiveStellarCore after closing, which would require a larger refactor, I'm not sure how to solve this. I'd vote for creating an issue for this and the other shutdown-related issues and fixing them after the beta release. What do you think?
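The "disallow reuse after closing" idea could be sketched as a closed flag checked under the same lock, so a Close() that wins the race makes the subsequent PrepareRange() fail fast instead of silently starting work on a shut-down backend. All names below are hypothetical:

```go
package main

import (
	"errors"
	"sync"
)

// backend sketches the closed-flag approach: Close sets the flag under the
// lock, and PrepareRange checks it first under the same lock.
type backend struct {
	mu     sync.Mutex
	closed bool
}

func (b *backend) Close() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.closed = true
}

func (b *backend) PrepareRange() error {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.closed {
		return errors.New("backend already closed")
	}
	// ...start Stellar-Core, prepare the ledger buffer...
	return nil
}

func main() {
	b := &backend{}
	if err := b.PrepareRange(); err != nil {
		panic(err)
	}
	b.Close()
	if err := b.PrepareRange(); err == nil {
		panic("expected error after Close")
	}
}
```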

@tamirms
Contributor

tamirms commented Dec 9, 2020

@bartekn sounds good. I think I have an idea on how to tackle it, but I'd need a day or two to verify it. Fixing it after the beta release makes sense

}

time.Sleep(c.waitIntervalPrepareRange)
Contributor

Do we really need an explicit sleep here?

I would (at the very least) use a ticker and intertwine the check with the other channels in the select statement.

Even better, it may be possible to remove the wait and the condition entirely by using an asynchronous goroutine for fast-forwarding.

This may seem outside the scope of this PR, but it affects the shutdown wait time.
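The ticker-in-select suggestion could look roughly like this. `waitReady` and both channel names are hypothetical; the point is only that a shutdown signal interrupts the wait immediately instead of sleeping through it:

```go
package main

import "time"

// waitReady polls ready() on a ticker but aborts as soon as shutdown is
// signalled. It returns true once ready() reports true, or false if the
// shutdown channel is closed first. This sketches the reviewer's
// suggestion; it is not the PR's actual code.
func waitReady(shutdown <-chan struct{}, ready func() bool) bool {
	ticker := time.NewTicker(10 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-shutdown:
			return false // shutdown interrupts the wait immediately
		case <-ticker.C:
			if ready() {
				return true
			}
		}
	}
}

func main() {
	closed := make(chan struct{})
	close(closed)
	if waitReady(closed, func() bool { return false }) {
		panic("shutdown should abort the wait")
	}
	if !waitReady(make(chan struct{}), func() bool { return true }) {
		panic("ready condition should end the wait")
	}
}
```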

Contributor Author

We can try it out for sure! Can you create a new issue? I'm not sure we'll be able to test it before the beta.

Contributor

Sure

// Wait/fast-forward to the expected ledger or an error. We need to check
// buffer length because `GetLedger` may be blocking.
if len(c.ledgerBuffer.getChannel()) > 0 {
break
Contributor

I think it would make sense to add a context to GetLedger() which we can cancel on shutdown.

Contributor Author

I really wanted to avoid using context.WithCancel here because it doesn't provide two things:

  1. a way to wait until the cancel request has been handled, so the caller can be sure that it's safe to exit (e.g. the core process exited, files were removed, or the goroutine in a buffer returned),
  2. an easy way to pass a final error value, if any.

tomb provides mechanisms for both and because of this simplifies the entire process. I'm open to discussion; can we talk in a new issue?
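For comparison, replicating those two missing pieces on top of context.WithCancel requires wiring up a done channel and an error variable by hand, roughly like this (a hedged sketch; `runWithCancel` is a hypothetical name):

```go
package main

import "context"

// runWithCancel shows the two pieces context.WithCancel lacks out of the
// box: (1) waiting until the cancellation has actually been handled, via a
// done channel, and (2) carrying the goroutine's final error, via a
// separate variable.
func runWithCancel() error {
	ctx, cancel := context.WithCancel(context.Background())
	done := make(chan struct{})
	var finalErr error
	go func() {
		defer close(done)
		<-ctx.Done()
		// ...cleanup: kill the process, remove files...
		finalErr = nil // the cleanup's final error would be recorded here
	}()
	cancel()
	<-done // only now is it safe to assume cleanup has finished
	return finalErr
}

func main() {
	if err := runWithCancel(); err != nil {
		panic(err)
	}
}
```

tomb bundles exactly this bookkeeping behind Kill/Wait/Err.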

Contributor

Sure. I will open a new issue.

Contributor Author

@bartekn bartekn left a comment

@Shaptic you asked for some comments explaining how this code works. I recommend reading this article (I know you did), but as requested I added a few comments. As I mentioned in the other comment, the main advantage of tomb is that it allows sending a cancellation signal, propagating errors back to the caller (the original error is not overwritten, which is useful for user-initiated shutdown: you can send a Cancelled error and detect it later), and waiting for the goroutine to actually return. Obviously it's possible to build a similar solution using channels and wait groups, but tomb gives this out of the box. Read on...

Comment on lines +305 to +307
// Kill tomb with context.Canceled. Kill will be called again in start()
// when process exit is handled but the error value will not be overwritten.
r.tomb.Kill(context.Canceled)
Contributor Author

  1. As mentioned in the summary above, there are two ways the Stellar-Core process can be killed. The first option is here. This is user-initiated shutdown, and we can determine that later in the caller (CaptiveStellarCore in our case) by checking if the error (tomb.Err()) equals context.Canceled. The nice property of tomb is that the error is not overwritten, so if Kill is called again later we will still know it was a user-initiated shutdown.

Contributor

The answer to this is probably painfully obvious, but by "user-initiated shutdown" we generally mean something akin to Ctrl+C and/or a code path (e.g. calling app.Close()), right?

Comment on lines +33 to +34
c.tomb.Kill(c.cmd.Wait())
c.tomb.Done()
Contributor Author

  1. The second option is that Stellar-Core exited by itself. We pass the error to the caller using Kill again. It's possible that there is no error and Stellar-Core exited gracefully, but please note that this exit may still be unexpected in some cases, so it may still trigger an error. Also, this is one of the features of tomb that's missing in other solutions: an easy way to propagate the error back to the caller.

Comment on lines +377 to 387
processErr := c.stellarCoreRunner.getTomb().Err()
switch {
case processErr == nil && !ledgerRange.bounded:
return errors.New("stellar-core process exited unexpectedly without an error")
case processErr == nil && ledgerRange.bounded:
return nil
case processErr == context.Canceled:
return processErr
default:
return errors.Wrap(processErr, "stellar-core process exited with an error")
}
Contributor Author

  1. Now, this is where we actually handle the Stellar-Core process exit. I think all the cases are explained pretty well in the comment above. One thing I wanted to note is what I already described in the previous comment: the first case handles the situation in which the error is nil but we still return an error, because the exit was unexpected.

close(r.shutdown)
r.wg.Wait()
if r.tomb != nil {
r.tomb.Wait()
Contributor Author

  1. I just wanted to briefly describe the last property of tomb, which is waiting for the goroutine to return. Obviously you could achieve the same thing using channels or a wait group, but this works really well with the other tomb API methods.


select {
case b.c <- metaResult{meta, err}:
case <-b.tomb.Dying():
Contributor Author

  1. Finally, there are several events we can listen to. In this case, Dying fires when Kill has been called but the tomb may not be Dead yet. We can use this signal to start shutting down the goroutine.

// Range already prepared
if prepared, err := c.IsPrepared(ledgerRange); err != nil {
if prepared, err := c.isPrepared(ledgerRange); err != nil {
c.mutex.Unlock()
Contributor

@Shaptic Shaptic Dec 10, 2020

Would it be safer / cleaner to do something like

defer func() {
  if !unlocked {
    c.mutex.Unlock()
    unlocked = true
  }
}()

instead (and set unlocked=true below, before the wait loop)? Or, alternatively, moving the wait loop into its own function would let us just defer c.mutex.Unlock() directly here.

It just seems like it could be really easy for someone (like me lol) to edit this code in the future, then forget to unlock the mutex and end up in a dangerous state.
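The second alternative, hoisting the locked section into its own method so the mutex is released with a plain defer while the wait loop runs unlocked, could look like this. All names are hypothetical stand-ins for the PR's code:

```go
package main

import "sync"

// core sketches splitting PrepareRange into a locked setup phase and an
// unlocked wait phase, so the unlock cannot be missed on any return path.
type core struct {
	mu sync.Mutex
}

func (c *core) PrepareRange() error {
	if err := c.startPreparing(); err != nil { // lock held only inside
		return err
	}
	return c.waitForRange() // no lock held while waiting
}

func (c *core) startPreparing() error {
	c.mu.Lock()
	defer c.mu.Unlock() // released on every return path automatically
	// ...check state, start Stellar-Core, set up buffers...
	return nil
}

func (c *core) waitForRange() error {
	// ...the wait loop from the original code would live here...
	return nil
}

func main() {
	c := &core{}
	if err := c.PrepareRange(); err != nil {
		panic(err)
	}
	// The mutex is free again; a second call must not deadlock.
	if err := c.PrepareRange(); err != nil {
		panic(err)
	}
}
```

This also addresses the earlier concern that Close() blocks on the lock for the whole duration of PrepareRange(), since the lock is only held during setup.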

@bartekn
Contributor Author

bartekn commented Dec 14, 2020

Closing in favour of: #3278.

@bartekn bartekn closed this Dec 14, 2020
bartekn added a commit that referenced this pull request Oct 21, 2021
…evious instance termination (#4020)

Add code to `CaptiveCoreBackend.startPreparingRange` that ensures the
previously started Stellar-Core instance is not running: check if
`getProcessExitError` returns `true`, which means the Stellar-Core process has
fully terminated. This prevents a situation in which a new instance is started
and clashes with the previous one.

The existing code contains a bug, likely introduced in 0f2d08b. The context
returned by `stellarCoreRunner.context()` is cancelled in
`stellarCoreRunner.close()`, which initiates the termination process. At the
same time, `CaptiveCoreBackend.PrepareRange()` internally calls
`CaptiveCoreBackend.isClosed()`, whose return value depends on the
`stellarCoreRunner` context being cancelled. This is wrong because Stellar-Core
may still not be closed even when the aforementioned context is cancelled: it
can still be closing, so the process can still be running.

Because of this the following chain of events can lead to two Stellar-Core
instances running (briefly) at the same time:

1. The Stellar-Core instance is upgraded, triggering `fileWatcher` to call
   `stellarCoreRunner.close()`, which cancels `stellarCoreRunner.context()`.
2. In another goroutine, `CaptiveBackend.IsPrepared()` is called, which returns
   `false` because `stellarCoreRunner.context()` is cancelled, and then calls
   `CaptiveBackend.PrepareRange()` to restart Stellar-Core. `PrepareRange()`
   also checks if `stellarCoreRunner.context()` is cancelled (it is, but the
   Stellar-Core process can still be running its shutdown procedure) and then
   attempts to start a new instance.

This commit is really a quick fix. The code before 0f2d08b was simpler because
it called `Kill()` on the process, so "terminating" and "terminated" were
exactly the same state. After 0f2d08b there are two events associated with a
Stellar-Core process (as above). Because of this the code requires a larger
refactoring. We may reconsider using the `tomb` package I tried in #3258,
which was later closed in favour of #3278.