doc/BACKUP.md: Document backup strategies for lightningd. #4207

Merged
merged 1 commit into from
Dec 15, 2020

Contributor

@ZmnSCPxj ZmnSCPxj commented Nov 17, 2020

Requested by @d4amenace here: #4200 (comment)

@cdecker to finish up the document re: PostgreSQL. I remember somebody made a medium article on C-Lightning PostgreSQL but I did not save the link...

Changelog-None

@ZmnSCPxj ZmnSCPxj requested a review from cdecker November 17, 2020 13:15
@jsarenik
Contributor

Let me link also #4181 for reference.

Thank you @ZmnSCPxj for working on it!

@ZmnSCPxj
Contributor Author

Clarified location of $LIGHTNINGDIR.

@ZmnSCPxj
Contributor Author

Corrected misspelling of lighningd.

Contributor

@jsarenik jsarenik left a comment

Wonderful documentation (except the BTRFS ad :-P)
Thank you!

doc/BACKUP.md Outdated

* Attempt to recover using the other backup options below first.
Any one of them will be better than this backup option.
* Recovering by this method ***MUST*** always be the ***last*** resort.
Contributor

Change to something like "Use this method ONLY as a last resort" (goes well with "as long as you:" above)

Contributor Author

Went with "Recover by this method ONLY as a last resort"

doc/BACKUP.md Outdated
BTRFS would probably work better if you were purchasing an entire set
of new storage devices to set up a new node.

On BSD you can use a ZFS RAID-Z setup, which is probably better than BTRFS
Contributor

The general advice (like first paragraph in this section also) could be moved together, BTRFS could be turned into its own sub-section so that it is optically skippable (sp?).

BTW this review comes from a point of view of a non-native speaker. So take it with a grain of salt.

Contributor Author

Okay. I guess we can list software methods of RAID-1, such as:

  • mdadm
  • BTRFS
  • ZFS

@ZmnSCPxj
Contributor Author

Updated as per @jsarenik feedback.

@cdecker
Member

cdecker commented Nov 18, 2020

I think you're referring to @gabridome's excellent tutorial here: https://github.com/gabridome/docs/blob/master/c-lightning_with_postgresql_reliability.md

@ZmnSCPxj
Contributor Author

Add autodefrag to BTRFS instructions, minor wording tweak, expand "PostgreSQL" section.

@darosior
Contributor

darosior commented Nov 19, 2020

I think it would be worth linking it from https://lightning.readthedocs.io/FAQ.html#how-to-backup-my-wallet

@ZmnSCPxj
Contributor Author

Mention the SQLITE3 backup API, add link from FAQ.md to BACKUP.md

@ZmnSCPxj
Contributor Author

Mention using the devid in BTRFS.

@ZmnSCPxj ZmnSCPxj mentioned this pull request Nov 20, 2020
doc/BACKUP.md Outdated

This creates a consistent snapshot of the database, sampled in a
transaction, that is assured to be openable later by `sqlite3`.
The operation of the `lightningd` process will be paused while `sqlite3`
Contributor Author

@ZmnSCPxj ZmnSCPxj Nov 21, 2020

THIS IS FACTUALLY INCORRECT.

I did a stress test where a program continuously makes many small transactions to update a database. When you do something like .backup 'backup.sqlite3' or VACUUM INTO 'backup.sqlite3'; in a separate sqlite3, then either the main program gets an SQLITE_BUSY, or the separate sqlite3 returns a "Database is locked" error.

I suspect that an SQLITE_BUSY will cause lightningd to crash, so using a separate process to back up is probably strongly not recommended.

Doing this with my running lightningd does not cause this problem, but probably only because my stress-test program keeps doing queries continuously, whereas a good amount of time lightningd just sits there waiting for something interesting to happen. But race conditions can exist so we should not recommend this in our backup strategy document!

We can do:

  • Expose an sqlite3_backup command which performs the VACUUM INTO query inside lightningd.
  • Call into sqlite3_busy_timeout with a "reasonable" large value, say 5000 (5 seconds). (Equivalently, call PRAGMA busy_timeout = 5000; query at the same time we do PRAGMA foreign_keys=on;). Then, even if the .backup or VACUUM INTO takes up to 5 seconds to back up, lightningd will just wait.

The latter solution is easier, but there is always the possibility that the backup process will take longer than whatever timeout we select. The former solution is more reliable and there is no timeout, but note that if, say, the target location is slow (e.g. an NFS mount), then lightningd will suspend indefinitely.
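
The effect of a busy timeout can be sketched with Python's standard `sqlite3` module. This is only an illustration under stated assumptions, not lightningd's C code: the `timeout` argument of `sqlite3.connect` plays the role of `sqlite3_busy_timeout`, the helper names (`try_write`, `hold_write_lock`, `demo`) are made up for this example, and the "backup process" is simplified to a connection that holds an exclusive lock for a short while.

```python
import os
import sqlite3
import tempfile
import threading
import time


def try_write(path, busy_timeout_s):
    """Attempt one small write; return False if the database was locked."""
    # timeout= sets SQLite's busy timeout for this connection (in seconds).
    conn = sqlite3.connect(path, timeout=busy_timeout_s, isolation_level=None)
    try:
        conn.execute("INSERT INTO t(v) VALUES (1)")
        return True
    except sqlite3.OperationalError as err:
        if "locked" in str(err):
            return False
        raise
    finally:
        conn.close()


def hold_write_lock(path, hold_s, ready):
    """Stand-in for a separate backup process that locks the file for hold_s."""
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute("BEGIN EXCLUSIVE")
    ready.set()  # tell the main thread that the lock is now held
    time.sleep(hold_s)
    conn.execute("COMMIT")
    conn.close()


def demo():
    """Write once with no busy timeout and once with a 5-second timeout."""
    path = os.path.join(tempfile.mkdtemp(), "demo.sqlite3")
    setup = sqlite3.connect(path)
    setup.execute("CREATE TABLE t(v INTEGER)")
    setup.commit()
    setup.close()

    results = {}
    for label, timeout_s in (("no_timeout", 0.0), ("with_timeout", 5.0)):
        ready = threading.Event()
        holder = threading.Thread(target=hold_write_lock, args=(path, 0.3, ready))
        holder.start()
        ready.wait()
        results[label] = try_write(path, timeout_s)
        holder.join()
    return results
```

Without a timeout the writer fails immediately while the lock is held; with a timeout longer than the lock is held, the writer simply waits. As noted above, this only helps if the backup finishes within the chosen timeout.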

Thoughts?

Contributor

Thank you @ZmnSCPxj for ideas and investigating possible issues with running backup in separate sqlite3 process!

Exposing the sqlite3_backup command would be great! It sounds like a very nice and clear way to do backup.

@ZmnSCPxj
Contributor Author

Remove the factually incorrect text mentioned in #4207 (comment)

@ZmnSCPxj
Contributor Author

Teach how to quickly check if a backup database is corrupted.
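
One way such a quick check could look, sketched with Python's standard `sqlite3` module and SQLite's `PRAGMA integrity_check`. This is an assumption about what a reasonable check is, not necessarily the exact procedure the document ended up describing; the function name `backup_looks_intact` is made up for this example.

```python
import sqlite3
from pathlib import Path


def backup_looks_intact(db_path):
    """Return True if SQLite's own integrity check reports no problems."""
    # Open read-only so that checking a backup can never modify it.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    try:
        conn = sqlite3.connect(uri, uri=True)
    except sqlite3.Error:
        return False  # e.g. the file does not exist
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    except sqlite3.DatabaseError:
        return False  # e.g. "file is not a database"
    finally:
        conn.close()
    # A healthy database yields exactly one row containing "ok".
    return rows == [("ok",)]
```

A passing check only says the file is structurally sound; it does not say the snapshot is recent enough to be safe to restore from.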

doc/BACKUP.md Outdated
lose all funds in open channels.

However, again, note that a "no backups #reckless" strategy leads to
*definite* loss of funds, so you should still prefer *this* strategy rather
Contributor

Again, what if the other user closes the channels? Wouldn't you get the money back to your wallet?

Contributor

@gabridome gabridome Nov 24, 2020

As I understand from the text, you can recover your funds only if your peer:

a) Uses option_dataloss_protect correctly (not just pretending to, in order to grab your funds). Otherwise, you simply no longer own the private keys of the address where the funds are sent. Also: "If the peer does not support this option, then the entire channel funds will be revoked by the peer."
b) Force-closes the channel with you, if and when they decide to.

That seems clear to me, so I wouldn't change it.

Member

Just to add a bit more complexity: if we use option_upfront_shutdown_script we may actually get the funds back to our wallet if the peer closes, because that removes the tweak to the their_unilateral/to_us transaction.

Contributor

@manreo manreo left a comment

Good info

@ZmnSCPxj
Contributor Author

Okay, we can actually implement an sqlite3-backup-snapshot in a plugin, I think.

Basically, lightningd performs transactions this way:

  • BEGIN; to the sqlite3 connection.
  • Send queries to the connection. Write queries are stored in a log maintained by the db_write hook.
  • Just before committing, trigger the db_write hook.
    • If it returns something other than {"result":"continue"}, exit the entire program, which implicitly rolls back the transaction.
  • COMMIT; to the sqlite3 connection.

If so, a plugin that hooks into db_write can delay the COMMIT; query by the simple act of not responding immediately. And the write(s) to the database file are triggered by COMMIT;; without the COMMIT; everything is in the -journal file.
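
The ordering described above can be simulated in a few lines of Python. This is a toy model under stated assumptions, not lightningd code: `run_transaction`, `open_db` and `HookRejected` are names invented for the example, the hook is an ordinary Python callable rather than a plugin, and an exception stands in for lightningd exiting.

```python
import sqlite3


class HookRejected(Exception):
    """Raised where lightningd would instead exit the whole program."""


def open_db(path=":memory:"):
    # isolation_level=None stops the sqlite3 module from issuing its own BEGIN.
    return sqlite3.connect(path, isolation_level=None)


def run_transaction(conn, statements, db_write_hook):
    """Run statements in one transaction, asking the hook before COMMIT."""
    write_log = []
    conn.execute("BEGIN")
    for sql in statements:
        conn.execute(sql)
        write_log.append(sql)  # log of write queries handed to the hook
    # The hook may take as long as it likes; COMMIT waits for its answer.
    response = db_write_hook(write_log)
    if response != {"result": "continue"}:
        conn.execute("ROLLBACK")  # explicit here; implicit when a process exits
        raise HookRejected(response)
    conn.execute("COMMIT")
```

Because nothing is committed until the hook answers, a hook that is slow to respond delays the commit, which is the property the snapshot idea relies on.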

So it looks to me that we can have a C plugin which hooks into db_write, whose db_write handler just responds with {"result":"continue"}, and which offers a sqlite3-snapshot command. The handler of sqlite3-snapshot just uses synchronous code (which implicitly causes any db_write hook to block) to copy the lightningd.sqlite3 file elsewhere. After it does the copy it responds and returns to the mainloop, letting any pending db_write through (it could "simply" unblock after the copy and then do VACUUM on the copy, possibly also deleting entries in non-vital tables like invoices and non-pending payments, to reduce the size of the snapshot). The plugin can check whether lightningd is on sqlite3 by checking wallet in listconfigs. I probably need to write some stress-test experimental code to simulate this.

In WAL mode we need to copy the -wal file as well. It is possible to set your lightningd.sqlite3 to WAL mode by stopping lightningd, running sqlite3 lightningd.sqlite3, issuing PRAGMA journal_mode=WAL;, then exiting sqlite3 with .exit and restarting lightningd. You can revert with PRAGMA journal_mode=DELETE;
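
For reference, the same journal-mode switch can be sketched from Python's standard `sqlite3` module. The helper name `set_journal_mode` is made up for this example, and, as with the CLI steps above, it assumes lightningd is stopped while it runs.

```python
import sqlite3

ALLOWED_MODES = {"WAL", "DELETE"}


def set_journal_mode(db_path, mode):
    """Switch the journal mode and return the mode SQLite reports back."""
    mode = mode.upper()
    if mode not in ALLOWED_MODES:
        raise ValueError(f"unsupported journal mode: {mode}")
    conn = sqlite3.connect(db_path)
    try:
        # The pragma answers with the mode now in effect, in lower case.
        (reported,) = conn.execute(f"PRAGMA journal_mode={mode}").fetchone()
        return reported
    finally:
        conn.close()
```

WAL mode is stored in the database file itself, so it persists across restarts until it is reverted.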

@ZmnSCPxj
Contributor Author

Okay, we can actually implement an sqlite3-backup-snapshot in a plugin, I think.

Not actually, because the doc/PLUGINS.md specifically says that a plugin that hooks into db_write should not handle any other hooks or commands. Since we have this in our doc, we should really not violate it ourselves, as otherwise future contributors might provide code that assumes no decent plugin will handle db_write and another command. So it would still be best to either add a decent busy timeout to our db connection so a third-party can snapshot the file in a separate sqlite3 connection, or implement something inside the lightningd process to do the snapshot.

@d4amenace

....... or implement something inside the lightningd process to do the snapshot.

This actually made sense to me. There could be a configurable timeout period in the conf file to denote the period of snapshots. At each interval, writing to the db would be paused so that corrupted snapshots are avoided. There would be a standardized snapshot folder in the user's .lightning folder, which would be the most all-encompassing solution of all the implementations.

@ZmnSCPxj
Contributor Author

There could be a configurable timeout period in the conf file to denote the period of snapshots.

All that lightningd has to expose at the most basic level would be the ability to make one snapshot at the current time. Regular snapshots like what you want can be done by a separate program; compare autocleaninvoices, which has the scheduled invoice cleaning in a plugin, not in lightningd (lightningd just exposes a command to clean old invoices using a single db query, and the auto-cleaning is done by the plugin).

In particular, such a regular snapshotting would be vastly inferior to db_write:

  • During times when there are lots of updates to your channel state, your regular snapshotting would miss them. If the node crashes at that point, you risk losing funds.
  • During times when there is nothing happening, your regular snapshotting will copy files over and over, wasting space and time on substantially the same data.
    • If you want something "smart" like say "oh just save the changes between this snapshot and the previous one", that is what db_write already is, plus it triggers at each update, so just use db_write, via the existing backup.py plugin, already.

Regular snapshotting of the db say once or twice a day would be a good "backup of a backup" --- your primary backup should still be a db_write plugin OR a PostgreSQL cluster OR at the absolute bare minimum some kind of RAID filesystem setup, and preferably all of them. Recovering from a snapshot would be the sort of thing you would do as a last resort, only-use-this-if-Godzilla-attacks recovery in case all your other backups are broken. And you would want that regular snapshotting to, say, be uploaded somewhere else, like encrypted on some cloud server or at least on a computer you own located in the house of a friend. Since that probably involves some kind of custom arrangement, it would not be easy to standardize such an interface inside lightningd; so a single-shot snapshot would be enough in C-Lightning, I think, and you build your custom arrangement on top with crontab.
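
A single-shot snapshot that crontab could invoke might look like the sketch below. It rests on assumptions that are not part of lightningd: the function names and file-naming scheme are invented for this example, it shells out to GNU `cp --reflink=always` (so it needs a filesystem with reflink support such as BTRFS), and whether a reflink copy of a live database is an adequate snapshot for a given setup is left as an assumption. Encryption and upload to another machine are deliberately left out.

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def snapshot_command(db_path, snapshot_dir, now):
    """Build the cp invocation for one timestamped snapshot (nothing runs)."""
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    target = Path(snapshot_dir) / f"lightningd.sqlite3.{stamp}"
    # --reflink=always fails instead of falling back to a slow byte copy.
    return ["cp", "--reflink=always", str(db_path), str(target)]


def take_snapshot(db_path, snapshot_dir):
    """Take one snapshot now; scheduling is left to crontab or similar."""
    cmd = snapshot_command(db_path, snapshot_dir, datetime.now(timezone.utc))
    subprocess.run(cmd, check=True)
    return cmd[-1]  # path of the snapshot that was written
```

Keeping the scheduling outside the function mirrors the point above: the node only needs a one-shot operation, and each operator builds their own arrangement on top.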

@gabridome
Contributor

gabridome commented Nov 24, 2020

Requested by @d4amenace here: #4200 (comment)

@cdecker to finish up the document re: PostgreSQL. I remember somebody made a medium article on C-Lightning PostgreSQL but I did not save the link...

Available for any question, comment or help.

@ZmnSCPxj
Contributor Author

@gabridome does the PostgreSQL section I made make sense?

following command gives you a path:

pg_config --includedir

Contributor

@gabridome gabridome Nov 25, 2020

I didn't know that. When I built it, everything worked out of the box. Pretty useful to know.
I probably already had the library installed.
I would also add this part to a specific document about Lightning with Postgres (which I haven't found).
I will add this to my guide as well.

Contributor Author

Okay, maybe the Lightning+PostgreSQL can be a separate PR?

Contributor

The document you mean?
Something like doc/postgresQL.md?

Anyway, this part of the backup strategies seems right to me. It could at least point to a more extended document in the doc directory. Has @fiatjaf thought about that?

Contributor Author

Yes, the document I mean.

You should use the same PostgreSQL version of `libpq-dev` as what you run
on your cluster, which probably means running the same distribution on
your cluster.

Contributor

Personally, I have added the Postgres .deb repository on the cluster machines; this way I can keep the versions aligned.
I didn't encounter any problems, but maybe it is not wise to depart from the Postgres version shipped with the Debian distribution. What do you guys think?

Contributor Author

I would have used the same distro throughout just to reduce inter-version problems, which is why I suggested the above. Mixing deb repositories has gotten me into trouble before when I wanted to upgrade my OS (but upgrading an OS is always fraught, sometimes it is best to just have two small OS partitions and alternate installing OS's between them rather than upgrading an existing install...)

doc/BACKUP.md Outdated
(though you should probably do some more double-checking and tire-kicking
in the "Connect to the database" stage you resume at, such as checking if
`listpeers` still lists the same channels as you had, and so on).

Contributor

Wise suggestion IMO.

Debian Testing ("bullseye") uses PostgreSQL 13.0 as of this writing.
PostgreSQL 12 had a non-trivial change in the way the restore operation is
done for replication.
You should use the same PostgreSQL version of `libpq-dev` as what you run
Contributor

I encountered the problem myself. I would add a recommendation to always read the official PostgreSQL synchronous replication guide for the specific version you are using.

Contributor Author

Okay, I will add that.

Member

Agreed, there is little point in replicating the official doc here, it may just end up with stale information, requiring updates from us. If there is an authoritative source, link to it.

[guide by @gabridome][gabridomeguide].

[gabridomeguide]: https://github.com/gabridome/docs/blob/master/c-lightning_with_postgresql_reliability.md

Contributor

The guide is not maintained as the versions of Postgres evolve. Please always check also the Postgres Guide about synchronous replication for your specific version.

@gabridome
Contributor

@gabridome does the PostgreSQL section I made make sense?

Absolutely. Thank you for mentioning my guide.

@ZmnSCPxj
Contributor Author

Rebased, emphasized checking your PostgreSQL version, explicated more about tire-kicking after SQLITE3->PG conversion.

@ZmnSCPxj
Contributor Author

ZmnSCPxj commented Dec 1, 2020

No ACKs? People seem to want this. Is there any issue or quibble that prevents this from being merged?

@gabridome
Contributor

No ACKs? People seem to want this. Is there any issue or quibble that prevents this from being merged?

FWIW conceptACK.

I loudly applaud this incredible work on the most important topic in LN, which has been delayed for too long.

I frankly hope that these strategies will become more and more user-friendly as time passes. Until then, the paradox is that those least aware of the problem are the ones most in need of a good solution...

@gabridome
Contributor

Hi,
there's a new possibility for the backup of hsm_secret file.
I find it useful because it converts to xpriv format, so you can import into one descriptor based wallet in Bitcoin Core or wherever xpriv are accepted.
https://github.com/domegabri/lightning-secret

@ZmnSCPxj
Contributor Author

@gabridome how do we recover the hsm_secret from the xprv? The program seems to only convert from hsm_secret to xprv.

@gabridome
Contributor

gabridome commented Dec 11, 2020 via email

Member

@cdecker cdecker left a comment

Sorry for the delay reviewing this. It looks quite good, but reordering might be good, to emphasize which mechanisms are preferred, and for whom it might be worth it:

  • Backup hsm_secret: static, for all users
  • Backup plugins for end-users
  • Postgresql replication for enterprises
  • File-backup: if no other option, and with appropriate warnings
    • Backup while offline
    • Hot backups while the node is running

doc/BACKUP.md Outdated

But in Lightning, since *you* are the only one storing all your
financial information, you ***cannot*** recover this financial
information anywhere else.
Member

Suggested change:
- information anywhere else.
+ information from anywhere else.

doc/BACKUP.md Outdated
This creates an initial copy of the database at the NFS mount.
* Add these settings to your `lightningd` configuration:
* `important-plugin=/path/to/backup.py`
* `backup-destination=file:///path/to/nfs/mount`
Member

This is no longer needed, because we now write a lock file into the $LIGHTNINGDIR in order to be functional right-away, and avoid accidentally changing the backup location.

@svewa

svewa commented Dec 13, 2020

Seeing the discussion on how to backup the sqlite file, would it not be best to not filesystem-copy it, but sqlite3 .dump it? https://www.sqlitetutorial.net/sqlite-dump/ or would this not guarantee consistency?

Contributor

@darosior darosior left a comment

Really nice! I now have a reference to point people to instead of handwaving the different possibilities each time I'm asked :)

Just a small comment re the db migration tool, which I'm not sure is safe.

@darosior
Contributor

darosior commented Dec 14, 2020

Hi,
there's a new possibility for the backup of hsm_secret file.
I find it useful because it converts to xpriv format, so you can import into one descriptor based wallet in Bitcoin Core or wherever xpriv are accepted.
https://github.com/domegabri/lightning-secret

FWIW, it was already part of hsmtool since last release (EDIT: and i noticed Riccardo Casatta did a new one last week too ! 😭) ..
Descriptor are really neat but ya footgun in our case, and yea, as Zmn points out you cannot recover the hsm_secret from the xpriv.

@ZmnSCPxj
Contributor Author

ZmnSCPxj commented Dec 14, 2020

Seeing the discussion on how to backup the sqlite file, would it not be best to not filesystem-copy it, but sqlite3 .dump it? https://www.sqlitetutorial.net/sqlite-dump/ or would this not guarantee consistency?

It locks the file while doing the dump. The VACUUM INTO query is similar. If lightningd accesses the database while your backup process is .dumping or VACUUM INTOing, it will crash lightningd, which is bad. Because it is a race condition, you can be running this in production and completely unaware of the issue, and later accidentally crash lightningd --- I only discovered this issue in extensive mock-testing (and already had a .dump backup crontab in the background, which I have since replaced with a cp --reflink=always)

ChangeLog-Added: Document: `doc/BACKUP.md` describes how to back up your C-lightning node.
@ZmnSCPxj
Contributor Author

Rebased, reordered sections as per @cdecker, changed to use # for section headings, various changes as per @cdecker , remove link to /fiatjaf/mcdlsp , warn against .dump and VACUUM INTO, mention use of multiple USB flash disks for RAID1.

Contributor

@darosior darosior left a comment

ACK 948489a

Thanks for taking the time to write this up.

@cdecker cdecker merged commit 0e326dd into ElementsProject:master Dec 15, 2020
@cdecker
Member

cdecker commented Dec 15, 2020

Excellent work @everyone, this was very much needed, and thanks in particular to @ZmnSCPxj for taking the initiative 👍

Even if you have one of the better options above, you might still want to do
this as a worst-case fallback, as long as you:

* Attempt to recover using the other backup options below first.
Contributor Author

s/below/above/

LOL I did not fix it up after rearranging the text!

@ZmnSCPxj ZmnSCPxj deleted the doc-backup branch December 16, 2020 04:38
@domegabri

domegabri commented Dec 17, 2020

Hi,
there's a new possibility for the backup of hsm_secret file.
I find it useful because it converts to xpriv format, so you can import into one descriptor based wallet in Bitcoin Core or wherever xpriv are accepted.
https://github.com/domegabri/lightning-secret

FWIW, it was already part of hsmtool since last release (EDIT: and i noticed Riccardo Casatta did a new one last week too ! ) ..
Descriptor are really neat but ya footgun in our case, and yea, as Zmn points out you cannot recover the hsm_secret from the xpriv.

Hi @darosior, I think the last release allows dumping the public version of the descriptors. But I needed the xpriv because I had a corrupt db and wanted to scan for coins and recover them.

@darosior
Contributor

@domegabri oh right we removed it in later stage as it was a nice footgun! #4171 (comment)


9 participants