
litestream recommendation may lead to crashes? #4860

Closed
ZmnSCPxj opened this issue Oct 12, 2021 · 3 comments · Fixed by #4867

Comments

@ZmnSCPxj
Collaborator

In BACKUP.md we have a recommendation to use litestream. However, I snuck a look at the official litestream online docs:

Litestream works by effectively taking over the checkpointing process. It starts a long-running read transaction to prevent any other process from checkpointing and restarting the WAL file.

In filesystem terms, that "read transaction" is an advisory shared lock on the database file. However, when the SQLite instance running inside lightningd wants to write to the database file (and we write all the time; you should see the blinkenlights on the SSD of my Pi), it demands an advisory exclusive lock.

SQLite will auto-retry the advisory lock for some time (I believe we actually explicitly set this timeout somewhere...? need to dig). However, after a while it will give up with "database is locked", and then lightningd kills itself.
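For illustration, here is a minimal sketch of that failure mode, using Python's stdlib sqlite3 bindings rather than lightningd's actual C code: one connection holds a long-lived write lock, and a second connection's write gives up with "database is locked" once its busy timeout expires.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")

# Connection A plays the part of the long-running lock holder.
a = sqlite3.connect(path, isolation_level=None)
a.execute("CREATE TABLE t (x INTEGER)")
a.execute("BEGIN IMMEDIATE")  # grab and hold the write lock

# Connection B plays the part of lightningd's writer; `timeout` is
# SQLite's busy timeout (how long it auto-retries a busy lock).
b = sqlite3.connect(path, timeout=0.1, isolation_level=None)
err = None
try:
    b.execute("INSERT INTO t VALUES (1)")
except sqlite3.OperationalError as e:
    err = e
    print(e)  # "database is locked"
```

The scenario in the issue differs only in who holds the lock and for how long; the give-up-after-timeout behavior is the same.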

Note that in BACKUP.md we recommend against using sqlite3 .dump or VACUUM INTO (see the last paragraph). Litestream is actually similar to an efficient .dump (it only copies (part of) the WAL file instead of dumping the entire database), but this is a difference in quantity, not quality: the backup process read-locks the file, we write-lock very often, and on slow HDDs or high-load systems the timeout on the write-lock retry can be reached. If we recommend against sqlite3 .dump, I think we should recommend against litestream as well (or, if the quantity is low enough, at the very least put up warnings, and move litestream to after the "copy the file" section).

@cdecker
Member

cdecker commented Oct 15, 2021

Hm, ok. I was actually wondering how it does replication, and someone said it uses the WAL and does not lock the DB. If it crashes despite following the setup instructions in the document (enabling WAL mode), we should remove that recommendation altogether.

@ZmnSCPxj
Collaborator Author

ZmnSCPxj commented Oct 15, 2021

I strongly suspect #4857 root cause is litestream replication; let us wait for response there from OP to see if patching helps.

Note that we now have a 5-second timer on sqlite3_busy_timeout, which will prevent crashing from litestream in most cases (#4857 uses a version which predates this! #4857 (comment)). However, in high load cases or very slow media (or both!) 5 seconds may be too short still. See #4867
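As a sketch of what that knob looks like from the application side (Python stdlib sqlite3 here, not the actual lightningd C change), the same value set by `sqlite3_busy_timeout()` is also exposed as `PRAGMA busy_timeout`, in milliseconds:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# How many milliseconds SQLite keeps retrying a busy lock
# before giving up with SQLITE_BUSY ("database is locked").
conn.execute("PRAGMA busy_timeout = 60000")  # 60 seconds
timeout_ms = conn.execute("PRAGMA busy_timeout").fetchone()[0]
print(timeout_ms)  # 60000
```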

@ZmnSCPxj
Collaborator Author

ZmnSCPxj commented Oct 15, 2021

Anyway, to clarify: litestream has to lock the main database file. SQLite uses the WAL file as a FIFO ring buffer, meaning it could start overwriting old log entries before litestream can read them if the main database file is not locked. As noted in the actual litestream documentation, it uses a long-running read transaction (which is, in fact, a shared advisory lock on the file) in order to prevent the WAL file from wrapping around before litestream can read the older log entries.
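That behavior is easy to reproduce with a sketch (stdlib sqlite3 again, standing in for litestream): in WAL mode an open read transaction does not block writers, but it does block a checkpoint from restarting the WAL, which is exactly the lock litestream relies on.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "wal.db")
# timeout=0: report busy immediately instead of retrying for 5 s.
w = sqlite3.connect(path, timeout=0, isolation_level=None)
w.execute("PRAGMA journal_mode=WAL")
w.execute("CREATE TABLE t (x INTEGER)")

# A long-running read transaction, standing in for litestream's.
r = sqlite3.connect(path, isolation_level=None)
r.execute("BEGIN")
r.execute("SELECT count(*) FROM t").fetchone()

# Writers are NOT blocked by the reader in WAL mode...
w.execute("INSERT INTO t VALUES (1)")

# ...but restarting the WAL is: the checkpoint reports it was blocked,
# because the reader's snapshot predates the newest WAL frames.
blocked, wal_frames, copied = w.execute(
    "PRAGMA wal_checkpoint(RESTART)").fetchone()
print(blocked)  # 1 = the reader prevented the WAL restart
```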

The advantage of using the WAL file is that litestream only needs to lock for very short periods of time (relatively short, anyway: it is still a "long-running" read transaction that locks for entire hundreds of milliseconds, when most read transactions take less than a hundred milliseconds tops), because it only needs to read very short WAL entries. Something like sqlite3 .dump or VACUUM INTO locks for several seconds because it has to read far more data from the database, so relative to those solutions, litestream is better.
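For contrast, a sketch of the heavyweight end of that spectrum (filenames here are illustrative): `VACUUM INTO` rebuilds a complete copy of the database inside a single read transaction, so its lock is held for however long it takes to read everything.

```python
import os
import sqlite3
import tempfile

src = os.path.join(tempfile.mkdtemp(), "src.db")
dst = src + ".backup"

conn = sqlite3.connect(src)
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(1000)])
conn.commit()

# Copies the WHOLE database under one read transaction; the larger
# the database, the longer the shared lock is held (SQLite >= 3.27).
conn.execute("VACUUM INTO ?", (dst,))

rows = sqlite3.connect(dst).execute("SELECT count(*) FROM t").fetchone()[0]
print(rows)  # 1000
```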

ZmnSCPxj added a commit to ZmnSCPxj/lightning that referenced this issue Oct 16, 2021
Closes: ElementsProject#4860

ChangeLog-Added: With `sqlite3` db backend we now use a 60-second busy timer, to allow backup processes like `litestream` to operate safely.
ZmnSCPxj added a commit that referenced this issue Oct 17, 2021
Closes: #4860

ChangeLog-Added: With `sqlite3` db backend we now use a 60-second busy timer, to allow backup processes like `litestream` to operate safely.