Checkpoints 'spoiled' when used to resume crawls #277
Comments
@ato points out that enabling copy-on-write might avoid the cost of a full copy. But after discussion we agreed we should implement the simplest possible fix first, and check it works, before looking into that as an optimisation.
Example results. From the original checkpoint:

(yeah yeah it's running as root! sue me! ;-) )

From the same folder after using the checkpoint (but then killing the crawl):

i.e. the last file has grown by 1036733 bytes, about a MB.
@ato points out the relevant code in BdbModule (heritrix3/commons/src/main/java/org/archive/bdb/BdbModule.java, lines 291 to 296 in aa705be).

So, perhaps this doesn't work, or perhaps some other activity is managing to interact with the frontier while this is going on? (See heritrix3/commons/src/main/java/org/archive/bdb/BdbModule.java, lines 233 to 238 in aa705be.)
There was no later …

If these exceptions were getting silently swallowed, that might allow the system to silently fail to get to the freeze bit (heritrix3/commons/src/main/java/org/archive/bdb/BdbModule.java, lines 239 to 243 in aa705be).

But Spring is calling …
Ah, I think I've got something. Right, looking at the BDB-JE code, I noticed that … i.e. BDB-JE does not 'roll' the files so that the current set will no longer change; it just omits the latest file and gives you the set that can no longer change.

n.b. This doesn't actually contradict the startBackup documentation, but it's easy to misinterpret it as doing something it doesn't.

UPDATE: Hmm, the startBackup code should call … @ato points out ENV_RECOVERY_FORCE_NEW_FILE, which is available in more recent BDB-JE versions. Perhaps this indicates issues with flipping the files on recovery?
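If we did upgrade, a minimal sketch of how that parameter might be applied when re-opening the environment is below. The constant is the one @ato mentions; its exact behaviour and availability depend on the BDB-JE version in use, and the "state" directory is just a placeholder, so treat this as an assumption rather than tested code.

```java
import java.io.File;

import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;

// Hedged sketch: on newer BDB-JE releases this parameter asks recovery to
// start a brand-new .jdb file when the environment is opened, so the last
// file restored from a checkpoint should never be appended to again.
// Not available in the old JE version Heritrix3 currently ships with.
public class ForceNewFileOnRecovery {
    public static void main(String[] args) throws Exception {
        EnvironmentConfig cfg = new EnvironmentConfig();
        cfg.setAllowCreate(true);
        cfg.setConfigParam(EnvironmentConfig.ENV_RECOVERY_FORCE_NEW_FILE, "true");

        Environment env = new Environment(new File("state"), cfg);
        // ... resume the crawl as normal ...
        env.close();
    }
}
```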
Related docs:
It's hard to tell what's going on with our old version, and whether this is the actual cause. But, it does indicate that a sensible alternative option would be to upgrade BDB-JE to >= 6.3, which is something we want to do anyway.
Well, superficially at least, the upgrade seems fairly straightforward: https://github.com/ukwa/heritrix3/tree/upgrade-bdb-je

One test was using …
We've seen occasional mysterious issues when resuming crawls from checkpoints multiple times. Close inspection of the behaviour when resuming the crawl indicates that checkpoints can only reliably be used once.
The code uses the DbBackup helper to manage backups. This depends on calling startBackup, the docs for which note:
i.e. BDB-JE makes sure the backup is consistent, and after that those files should no longer be altered (BDB is an append-only DB system). The backup documentation implies this flushed, sync'd consistency is necessary for the backup to work.
(Note that an H3 checkpoint is not the same thing as a BDB-JE checkpoint - the former is a point-in-time backup, while the latter is a flush/sync operation).
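For reference, here is a minimal sketch of the DbBackup protocol described above. It is not Heritrix3's actual checkpoint code; the environment setup and directory name are hypothetical, and only the start/copy/end shape of the API is the point.

```java
import java.io.File;

import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import com.sleepycat.je.util.DbBackup;

// Sketch only: startBackup() flushes/syncs and fixes the set of .jdb files
// in the backup, which are then safe to copy until endBackup() releases them.
public class DbBackupSketch {
    public static void main(String[] args) throws Exception {
        EnvironmentConfig cfg = new EnvironmentConfig();
        cfg.setAllowCreate(true);
        Environment env = new Environment(new File("state"), cfg);

        DbBackup backup = new DbBackup(env);
        backup.startBackup();
        try {
            // These files are guaranteed not to change while the backup is open.
            for (String jdbFile : backup.getLogFilesInBackupSet()) {
                System.out.println("would copy into checkpoint: " + jdbFile);
            }
        } finally {
            backup.endBackup();
        }
        env.close();
    }
}
```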
However, when resuming a crawl from a checkpoint, rather than copying the checkpoint files (as recommended by the documentation), Heritrix3 uses hard links (and cleans out any other state files not part of the checkpoint).
This causes an issue because, having resumed a crawl, I noticed that the last .jdb file in the checkpoint was being changed! From the backup behaviour, we might expect that existing files would not be changed, but in fact when resuming a crawl, the system proceeds by appending data to the last .jdb file. As this file is a hard link, this activity also changes the contents of the checkpoint. Furthermore, if we are resuming the crawl from one older checkpoint among many, all subsequent checkpoints are also modified.
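To make the hard-link behaviour concrete, here is a small stand-alone illustration. The file names are made up and the append is simulated with a plain write rather than BDB-JE itself; it only demonstrates that both paths refer to the same underlying data.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Illustration only: both names refer to the same underlying file, so data
// appended through the "live" state path also grows the checkpoint's copy.
public class HardLinkAppendDemo {
    public static void main(String[] args) throws Exception {
        Path checkpointFile = Paths.get("checkpoint-0001/00000042.jdb"); // made-up name
        Path liveFile = Paths.get("state/00000042.jdb");                 // made-up name

        Files.createDirectories(liveFile.getParent());
        Files.createLink(liveFile, checkpointFile); // what resuming from a checkpoint does

        long before = Files.size(checkpointFile);
        // Simulate BDB-JE appending to the last log file after resume:
        Files.write(liveFile, new byte[1024], StandardOpenOption.APPEND);
        long after = Files.size(checkpointFile);

        System.out.printf("checkpoint file grew by %d bytes%n", after - before);
    }
}
```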
As an example, we recently attempted to resume from a checkpoint, hit some difficulties and had to force-kill the crawler. After this, we attempted to re-start from that checkpoint, and hit some very strange behaviour that hung the crawler. (Fortunately, we happen to have a backup of those checkpoints!)
To avoid this, we could actually copy the last log file rather than make a hard link back to the checkpoint. Alternatively, it may be possible to call start/stop backup immediately upon restoring the DB, which should prevent existing files being appended to (assuming no race conditions).
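A rough sketch of the first option follows. It assumes a hypothetical helper that receives the checkpoint's .jdb files already sorted in log-file order; this is not the actual Heritrix3 restore code, just the shape of the proposed change.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

// Sketch of the proposed fix: hard-link every .jdb file from the checkpoint
// except the last one, which is copied so that any appends made after the
// crawl resumes cannot reach back into the checkpoint.
public class CheckpointRestoreSketch {

    // jdbFiles must be sorted in log-file order; the last entry is the file
    // BDB-JE may continue appending to after recovery.
    static void restore(List<Path> jdbFiles, Path stateDir) throws IOException {
        Files.createDirectories(stateDir);
        for (int i = 0; i < jdbFiles.size(); i++) {
            Path src = jdbFiles.get(i);
            Path dst = stateDir.resolve(src.getFileName());
            if (i == jdbFiles.size() - 1) {
                Files.copy(src, dst, StandardCopyOption.COPY_ATTRIBUTES); // real copy: safe to append to
            } else {
                Files.createLink(dst, src); // older files are immutable, so links are fine
            }
        }
    }
}
```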