Ensure valid checkpoints can be created when recovering from errors #392

Closed
anjackson opened this issue May 28, 2021 · 2 comments

@anjackson
Collaborator

We occasionally hit problems during checkpoint writing. These are mostly caused by a non-checkpoint log file being missing when the checkpoint is attempted, because of some earlier issue (not 100% clear what). When you then try to checkpoint, this happens:

SEVERE: org.archive.crawler.framework.CheckpointService checkpointFailed  Checkpoint failed [Fri May 28 10:28:03 GMT 2021]
java.io.IOException: Unable to move /heritrix/output/frequent-npld/20210519154706/logs/crawl.log to /heritrix/output/frequent-npld/20210519154706/logs/crawl.log.cp00032-20210528102802
        at org.archive.io.GenerationFileHandler.rotate(GenerationFileHandler.java:127)
        at org.archive.crawler.reporting.BufferedCrawlerLoggerModule.rotateLogFiles(BufferedCrawlerLoggerModule.java:331)
        at org.archive.crawler.reporting.BufferedCrawlerLoggerModule.doCheckpoint(BufferedCrawlerLoggerModule.java:393)
        at org.archive.crawler.framework.CheckpointService.requestCrawlCheckpoint(CheckpointService.java:285)
...

But the checkpoint partially completes, so if you fix the problem (e.g. by adding an empty log file), and try to re-checkpoint, you get:

INFO: org.archive.crawler.framework.CheckpointService requestCrawlCheckpoint no progress since last checkpoint; ignoring [Fri May 28 10:29:13 GMT 2021]
no progress since last checkpoint; ignoring

Which is all well and good, except that no coherent/consistent checkpoint has been written, so it's impossible to resume the crawl.

That check is implemented here:

// prevent redundant auto-checkpoints when crawler paused or stopping
if (controller.isPaused() || controller.getState().equals(CrawlController.State.STOPPING)) {
    if (controller.getStatisticsTracker().getSnapshot().sameProgressAs(lastCheckpointSnapshot)) {
        LOGGER.info("no progress since last checkpoint; ignoring");
        System.err.println("no progress since last checkpoint; ignoring");
        return null;
    }
}

This could be addressed by making it possible to ignore the lastCheckpointSnapshot and force a checkpoint. Or, we could ensure that the lastCheckpointSnapshot field only gets updated after the checkpoint has completed successfully.

It is plausible that you might want to force a checkpoint even if no progress has been made, e.g. because it involved resolving an issue with the crawl state itself. But this seems like a rare exception.
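
As a rough illustration of the "force" option, the skip decision could take an extra flag. This is only a sketch: the class `CheckpointGuard`, the method `shouldSkip` and the `force` parameter are all hypothetical and do not exist in Heritrix; the two other booleans stand in for the paused/stopping test and the `sameProgressAs` comparison quoted above.

```java
/** Hypothetical helper; the force flag is an assumption, not an existing Heritrix option. */
class CheckpointGuard {
    /**
     * Decides whether a checkpoint request should be ignored.
     *
     * @param pausedOrStopping stand-in for the isPaused()/STOPPING test
     * @param sameProgress     stand-in for sameProgressAs(lastCheckpointSnapshot)
     * @param force            hypothetical override requested by the operator
     */
    static boolean shouldSkip(boolean pausedOrStopping, boolean sameProgress, boolean force) {
        if (force) {
            return false; // a forced checkpoint is never skipped
        }
        return pausedOrStopping && sameProgress;
    }
}
```

With `force` set to false this behaves like the existing check, so the flag would only matter in the rare case described above.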

@anjackson
Collaborator Author

Looks like this line sets the lastCheckpointSnapshot, and it's in the finally block so it always gets called:

lastCheckpointSnapshot = controller.getStatisticsTracker().getSnapshot();

It should be the last thing in the try clause.
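
A minimal sketch of that change, using a simplified stand-in rather than the real CheckpointService: `CheckpointSketch`, `CheckpointStep`, `recordProgress` and the `long` progress counter are all hypothetical names for illustration, and the paused/stopping condition from the original check is left out to keep it short.

```java
import java.io.IOException;

/** Hypothetical, simplified stand-in for CheckpointService; not Heritrix code. */
class CheckpointSketch {
    /** A checkpoint step that may fail, e.g. log rotation. */
    interface CheckpointStep {
        void run() throws IOException;
    }

    private long progress = 0;                  // stand-in for the statistics snapshot
    private Long lastCheckpointSnapshot = null; // null until a checkpoint has succeeded

    void recordProgress() {
        progress++;
    }

    /** Returns true if a checkpoint was written, false if it was skipped or failed. */
    boolean requestCheckpoint(CheckpointStep step) {
        if (lastCheckpointSnapshot != null
                && lastCheckpointSnapshot.longValue() == progress) {
            return false; // no progress since the last successful checkpoint
        }
        try {
            step.run();
            // Last statement in the try block: only reached if nothing above threw.
            lastCheckpointSnapshot = progress;
            return true;
        } catch (IOException e) {
            // Snapshot left untouched, so a retry is not ignored as "no progress".
            return false;
        }
    }
}
```

In this sketch a failed attempt leaves the snapshot alone, so a retry after fixing the underlying problem runs the checkpoint again instead of being ignored. With the assignment in a finally block, the retry would have been skipped.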

anjackson added a commit that referenced this issue May 28, 2021
Only update last checkpoint stats if the checkpoint completed, for #392.
@anjackson
Collaborator Author

The stats are now only written if the checkpoint executes without throwing an exception, which is as good as it's likely to get for now.
