Ensure valid checkpoints can be created when recovering from errors #392

anjackson · 2021-05-28T11:34:18Z

We hit occasional problems during checkpoint writing. These are mostly caused by a non-checkpoint log file being missing when a checkpoint is attempted, because of some earlier issue (not 100% clear what). When you then try to checkpoint, this happens:

SEVERE: org.archive.crawler.framework.CheckpointService checkpointFailed  Checkpoint failed [Fri May 28 10:28:03 GMT 2021]
java.io.IOException: Unable to move /heritrix/output/frequent-npld/20210519154706/logs/crawl.log to /heritrix/output/frequent-npld/20210519154706/logs/crawl.log.cp00032-20210528102802
        at org.archive.io.GenerationFileHandler.rotate(GenerationFileHandler.java:127)
        at org.archive.crawler.reporting.BufferedCrawlerLoggerModule.rotateLogFiles(BufferedCrawlerLoggerModule.java:331)
        at org.archive.crawler.reporting.BufferedCrawlerLoggerModule.doCheckpoint(BufferedCrawlerLoggerModule.java:393)
        at org.archive.crawler.framework.CheckpointService.requestCrawlCheckpoint(CheckpointService.java:285)
...

But the checkpoint partially completes, so if you fix the problem (e.g. by adding an empty log file), and try to re-checkpoint, you get:

INFO: org.archive.crawler.framework.CheckpointService requestCrawlCheckpoint no progress since last checkpoint; ignoring [Fri May 28 10:29:13 GMT 2021]
no progress since last checkpoint; ignoring

Which is all well and good, except that no coherent/consistent checkpoint has been written, so it's impossible to resume the crawl.

That check is implemented here:

heritrix3/engine/src/main/java/org/archive/crawler/framework/CheckpointService.java

Lines 252 to 259 in 37ce8d6

    
           // prevent redundant auto-checkpoints when crawler paused or stopping 
        
           if(controller.isPaused() || controller.getState().equals(CrawlController.State.STOPPING)) { 
        
               if (controller.getStatisticsTracker().getSnapshot().sameProgressAs(lastCheckpointSnapshot)) { 
        
                   LOGGER.info("no progress since last checkpoint; ignoring"); 
        
                   System.err.println("no progress since last checkpoint; ignoring"); 
        
                   return null; 
        
               } 
        
           }

This could be addressed by making it possible to ignore the lastCheckpointSnapshot and force a checkpoint. Or, we could ensure that the lastCheckpointSnapshot field only gets updated after the checkpoint has completed successfully.

It is plausible that you might want to force a checkpoint even if no progress has been made, e.g. because it involved resolving an issue with the crawl state itself. But this seems like a rare exception.
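To make the "force" option concrete, here is a minimal sketch of the decision logic only. It is not Heritrix code: the class `ForcedCheckpointSketch`, the method `shouldCheckpoint` and the `force` parameter are all hypothetical names, and the two booleans stand in for the paused/stopping test and the `sameProgressAs(lastCheckpointSnapshot)` test quoted above.

```java
// Hypothetical sketch, not part of Heritrix: models the skip check with an
// optional "force" override.
class ForcedCheckpointSketch {

    /**
     * @param pausedOrStopping stand-in for controller.isPaused() || state == STOPPING
     * @param sameProgress     stand-in for sameProgressAs(lastCheckpointSnapshot)
     * @param force            hypothetical override requested by the operator
     * @return true if the checkpoint should go ahead
     */
    static boolean shouldCheckpoint(boolean pausedOrStopping, boolean sameProgress, boolean force) {
        if (force) {
            return true; // operator explicitly asked for a checkpoint
        }
        // Existing behaviour: skip only when paused/stopping AND nothing changed.
        return !(pausedOrStopping && sameProgress);
    }

    public static void main(String[] args) {
        System.out.println(shouldCheckpoint(true, true, false)); // false: redundant, skipped
        System.out.println(shouldCheckpoint(true, true, true));  // true: forced
        System.out.println(shouldCheckpoint(false, true, false)); // true: crawler is running
    }
}
```

Given that forcing is expected to be a rare exception, an override like this would presumably only be worth adding if fixing the snapshot update turns out not to be enough.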


anjackson · 2021-05-28T11:37:48Z

Looks like this line sets the lastCheckpointSnapshot, and it's in the finally block, so it always gets called:

heritrix3/engine/src/main/java/org/archive/crawler/framework/CheckpointService.java

Line 316 in 37ce8d6

lastCheckpointSnapshot = controller.getStatisticsTracker().getSnapshot();

It should be the last thing in the try clause.
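A self-contained sketch of that change, under stated assumptions: this is not the real `CheckpointService`. The class `CheckpointSnapshotSketch`, its fields and `writeCheckpoint` are hypothetical stand-ins, with a plain `long` standing in for the statistics snapshot. It only illustrates why moving the assignment from `finally` to the end of `try` lets a retry go through after a failed attempt.

```java
import java.io.IOException;

// Hypothetical sketch, not Heritrix code: the "last checkpoint" marker is
// updated as the final statement of the try block, so a failed checkpoint
// leaves it untouched and a retry is not rejected as "no progress".
class CheckpointSnapshotSketch {
    private long progress = 0;                // stand-in for the current stats snapshot
    private long lastCheckpointProgress = -1; // stand-in for lastCheckpointSnapshot

    /** Stand-in for the real checkpoint work; fails on demand. */
    private void writeCheckpoint(boolean fail) throws IOException {
        if (fail) {
            throw new IOException("Unable to move crawl.log");
        }
    }

    /** Returns true if a checkpoint was written, false if skipped or failed. */
    boolean requestCheckpoint(boolean fail) {
        if (progress == lastCheckpointProgress) {
            return false; // no progress since last successful checkpoint
        }
        try {
            writeCheckpoint(fail);
            // Last statement in the try: only reached if nothing threw.
            lastCheckpointProgress = progress;
            return true;
        } catch (IOException e) {
            return false; // marker deliberately left unchanged
        } finally {
            // Cleanup only; the marker is no longer updated here.
        }
    }

    public static void main(String[] args) {
        CheckpointSnapshotSketch service = new CheckpointSnapshotSketch();
        System.out.println(service.requestCheckpoint(true));  // false: checkpoint failed
        System.out.println(service.requestCheckpoint(false)); // true: retry is allowed
        System.out.println(service.requestCheckpoint(false)); // false: now genuinely redundant
    }
}
```

With the assignment in `finally` instead, the second call would have been skipped, which matches the "no progress since last checkpoint; ignoring" behaviour described above.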

anjackson · 2021-06-07T10:14:35Z

The stats are now only written if the checkpoint executes without throwing an exception, which is as good as it's likely to get for now.

anjackson added a commit to ukwa/heritrix3 that referenced this issue May 28, 2021

Only update last checkpoint stats if the checkpoint completed, for in…

5caa83d

…ternetarchive#392.

anjackson added a commit that referenced this issue May 28, 2021

Merge pull request #393 from ukwa/checkpoint-success-stats-392

169b3cc

Only update last checkpoint stats if the checkpoint completed, for #392.

anjackson closed this as completed Jun 7, 2021

anjackson mentioned this issue Aug 6, 2021

Intermittent problems with log rotation #426

Open
