Audit ingestion lock critical section #3474
The horizon ingestion system interacts with 3 components to perform I/O operations:

History archives: There are no timeouts set on the requests to the history archives, so a request could hang for a prolonged period of time while the ingestion lock is held. Unfortunately, it's not easy to configure timeouts for history archive requests because the primary operation is streaming large files from S3. Although the lack of timeouts on history archive streaming is problematic, history archive ingestion only happens in rare circumstances (when the horizon db is empty or when we bump the version of the ingestion system), typically when starting the horizon ingestion service. If you run more than one horizon cluster, you can introduce upgrades on the inactive cluster and, once history archive ingestion is complete, swap that cluster in so it becomes active.

Ledger backend: The ledger backend is the source of ledgers fed into the horizon ingestion system and can be configured as one of two implementations: DatabaseBackend or CaptiveCoreBackend. The DatabaseBackend obtains new ledgers by querying the Stellar Core DB. The CaptiveCoreBackend obtains new ledgers by running Stellar Core as a subprocess and sharing them with horizon via a filesystem pipe. There are no timeouts set on the DatabaseBackend operations, but it should be relatively easy to add them. The CaptiveCoreBackend does not require timeouts because it loads new ledgers in a separate goroutine from the main ingestion thread: when the goroutine has finished reading a ledger from the filesystem pipe, it pushes it onto a channel shared with the ingestion system. This design works well for us because it allows us to release the ingestion lock if there are no ledgers available in the buffer.

Horizon DB: The output of the horizon ingestion system is stored in postgres. During ingestion, horizon executes batches of updates which populate postgres with the latest data derived from the incoming ledger. If the database connection for a horizon instance is closed, postgres should roll back the ingestion transaction and release the ingestion lock, and another node in the horizon cluster will reattempt to ingest the latest ledger. However, a database connection may not be closed cleanly when a machine crashes or reboots. If the horizon node holding the ingestion lock crashes or reboots, the lock will not be released and ingestion won't be able to fail over to the other nodes in the horizon cluster. To prevent this scenario we can configure the keep alive settings of the postgres server to detect dead client connections; once a dead client is detected, postgres should roll back the ingestion transaction and thereby release the ingestion lock. The other case is that the postgres machine itself crashes or reboots. In this scenario, ingestion is completely blocked until postgres becomes available again, and once postgres is back up the horizon nodes might still think their old connections are valid and be unable to resume ingestion. The ideal solution to this problem would be to configure keep alive settings on the horizon side to detect when connections to postgres are dead.
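A minimal sketch of the buffering pattern described for the CaptiveCoreBackend, using simplified, illustrative types and channel names rather than Horizon's actual implementation:

```go
package main

import (
	"fmt"
	"time"
)

// LedgerCloseMeta stands in for the real ledger type; everything here is a
// simplified illustration, not Horizon's actual code.
type LedgerCloseMeta struct {
	Sequence uint32
}

// readLedgers runs in its own goroutine: blocking reads from the (simulated)
// filesystem pipe never stall the main ingestion thread.
func readLedgers(pipe <-chan LedgerCloseMeta, buffer chan<- LedgerCloseMeta) {
	for ledger := range pipe {
		buffer <- ledger
	}
	close(buffer)
}

// tryGetLedger does a non-blocking check: if no ledger is buffered, it
// returns immediately so the caller can release the ingestion lock.
func tryGetLedger(buffer <-chan LedgerCloseMeta) (LedgerCloseMeta, bool) {
	select {
	case ledger, ok := <-buffer:
		return ledger, ok
	default:
		return LedgerCloseMeta{}, false
	}
}

func main() {
	pipe := make(chan LedgerCloseMeta, 1)
	buffer := make(chan LedgerCloseMeta, 16)
	go readLedgers(pipe, buffer)

	pipe <- LedgerCloseMeta{Sequence: 42}
	time.Sleep(10 * time.Millisecond) // give the reader goroutine time to buffer the ledger

	if ledger, ok := tryGetLedger(buffer); ok {
		fmt.Println("ingesting ledger", ledger.Sequence)
	} else {
		fmt.Println("no ledger buffered, releasing ingestion lock")
	}
}
```

The non-blocking select is what lets the ingestion loop give up the lock instead of blocking on the backend. For the Horizon DB case above, the server-side keep alive settings referred to are postgres's tcp_keepalives_idle, tcp_keepalives_interval, and tcp_keepalives_count parameters.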
As an alternative to configuring keep alive settings on the horizon side, we could add timeouts to each of the db queries and statements executed by horizon when it has the ingestion lock (see #2570). This would mitigate the scenario where the postgres machine reboots.
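A rough sketch of what such a per-statement timeout could look like with database/sql and a context deadline; execWithTimeout and the 30-second budget are hypothetical, not existing Horizon code:

```go
package ingest

import (
	"context"
	"database/sql"
	"time"
)

// execWithTimeout is a hypothetical helper: it bounds one ingestion statement
// with a deadline so a hung postgres connection cannot hold the ingestion
// lock indefinitely. The 30s budget is illustrative.
func execWithTimeout(tx *sql.Tx, query string, args ...interface{}) error {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	if _, err := tx.ExecContext(ctx, query, args...); err != nil {
		// A context.DeadlineExceeded error means the statement timed out;
		// rolling back releases the ingestion lock held by this transaction.
		_ = tx.Rollback()
		return err
	}
	return nil
}
```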
Great analysis, thanks @tamirms!
I agree we can ignore this case, also because ledger-by-ledger ingestion is stopped while state ingestion is performed. When it comes to the Horizon DB:
I was also thinking about two issues not listed here:
To sum it up, I think in the short term we should probably:
@bartekn what about configuring tcp keep alive on the horizon side at the operating system level? The only downside I see is that it requires configuration outside of horizon. If that's a blocker, we could have a goroutine which periodically calls Ping() on the db connection.
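A minimal sketch of such a goroutine, assuming a *sql.DB handle and an illustrative interval:

```go
package ingest

import (
	"database/sql"
	"log"
	"time"
)

// keepAlive is a hypothetical helper: it pings the database on a fixed
// interval so dead connections are noticed without OS-level keepalive config.
func keepAlive(db *sql.DB, interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-ticker.C:
			if err := db.Ping(); err != nil {
				log.Printf("postgres connection appears dead: %v", err)
			}
		case <-stop:
			return
		}
	}
}
```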
Regarding the history archive timeouts, if the file size can be obtained in advance, we could dynamically adjust the timeout based on it (although maybe we cannot rely on that in general).
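A rough sketch of that idea, assuming the archive's file size is known up front; the floor and minimum throughput figures are illustrative:

```go
package archive

import "time"

// archiveTimeout is a hypothetical helper: given the size of the file to be
// streamed, it derives a timeout from a fixed floor plus the time the
// download would take at a minimum acceptable throughput (here 1 MiB/s).
func archiveTimeout(sizeBytes int64) time.Duration {
	const (
		floor         = 30 * time.Second
		minThroughput = 1 << 20 // bytes per second
	)
	return floor + time.Duration(sizeBytes/minThroughput)*time.Second
}
```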
This is a good idea for an internal network but, as you said, it requires some config outside Horizon which isn't ideal for anyone else. It will also affect all other services running on the host.
Good idea. I just wonder if, in case of connection issues, Ping() could also hang.
We can add a timeout to the ping operation via a context deadline
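For reference, a sketch of a ping bounded by a context deadline (the 5-second budget is illustrative):

```go
package ingest

import (
	"context"
	"database/sql"
	"time"
)

// pingWithTimeout is a hypothetical helper: PingContext fails with
// context.DeadlineExceeded if postgres does not answer within the budget,
// instead of blocking indefinitely on a dead connection.
func pingWithTimeout(db *sql.DB) error {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	return db.PingContext(ctx)
}
```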
Nice! Maybe we should do that instead of #2570 then?
I think that makes sense
@bartekn I just thought of one possible issue with the ping approach. My concern is if postgres reboots without closing its connection but does so quickly enough to come back online and respond to the ping.
Horizon supports a mode of distributed ingestion where a `SELECT FOR UPDATE` lock is used to facilitate leader election. Once a Horizon instance has the lock, it attempts to ingest the next available sequence from stellar core. Afterwards, the horizon instance releases the lock and the next round of leader election begins.

We should audit the code path that is executed when Horizon has the `SELECT FOR UPDATE` lock. In particular, we should make sure that all asynchronous or I/O bound operations have appropriate timeouts. Otherwise, such an operation could block indefinitely, causing Horizon ingestion to stall.

We may exempt operations on the Horizon db from having timeouts because there might be cases where we are ingesting very large ledgers and we expect those operations to be time consuming.
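As a sketch of the leader-election pattern described above, with an illustrative table and key name rather than Horizon's actual schema:

```go
package ingest

import (
	"context"
	"database/sql"
)

// acquireIngestionLock is a hypothetical sketch of the SELECT FOR UPDATE
// pattern: the row lock is held for the lifetime of the returned transaction,
// so only one Horizon instance at a time can proceed with ingestion.
func acquireIngestionLock(ctx context.Context, db *sql.DB) (*sql.Tx, error) {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return nil, err
	}

	var value string
	// Blocks until no other transaction holds the lock on this row.
	err = tx.QueryRowContext(ctx,
		"SELECT value FROM ingestion_lock WHERE key = $1 FOR UPDATE",
		"leader").Scan(&value)
	if err != nil {
		_ = tx.Rollback()
		return nil, err
	}

	// The caller ingests the next ledger, then commits or rolls back to
	// release the lock and allow the next round of leader election.
	return tx, nil
}
```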
See #2570