Proper use of Checkpoint to handle a SIGTERM or SIGINT #3286

drewoldag · 2024-09-24T05:39:29Z

A computing cluster where I'll be training some models with ignite uses a "condo model" such that I can request access to any resources not currently being used, but if the owner of those resources needs them, my jobs will be stopped within about 10-15 seconds.

This feels like the right place to use Checkpointing, but I'm not well enough versed in ignite to know how to shut down the trainer as part of my signal handler. Any advice would be very much appreciated!

My code looks something like this:

class Trainer():
    def __init__(self, config):
        signal.signal(signal.SIGINT, self.signal_handler)

    def run(self):
        """Run the training process for a given model and data loader.
        """
        self.model = model_cls(config=self.config, shape=data_loader.shape())

        # Create trainer, a pytorch-ignite `Engine` object
        self.trainer = self._create_trainer(self.model)

        self.trainer.add_event_handler(Events.EPOCH_COMPLETED, self.checkpoint_handler())

        # Run the training process
        self.trainer.run(dist_data_loader, max_epochs=self.config["model"]["epochs"])

    def checkpoint_handler(self):
        to_save = {
            'model': self.model,
            'optimizer': self.model.optimizer,
            'trainer': self.trainer
        }

        logger.info("Creating checkpoint.")
        return Checkpoint(
            to_save,
            DiskSaver(Path("./checkpoints"), require_empty=False),
            n_saved=1,
            global_step_transform=global_step_from_engine(self.trainer),
        )

    ### I believe that this is the area where I could use the most help
    def signal_handler(self, sig, frame):
        if sig == signal.SIGINT:
            logger.info("SIGINT received, creating checkpoint and exiting.")
            self.checkpoint_handler()
        self.trainer.terminate()
        sys.exit(0)


    def _create_trainer(self, model):
        # Get currently available device for training, and set the model to use it
        device = idist.device()
        # logger.info(f"Training on device: {device}")
        model = idist.auto_model(model)

        # Extract `train_step` from model, which can be wrapped after idist.auto_model(...)
        if type(model) == torch.nn.parallel.DistributedDataParallel:
            inner_train_step = model.module.train_step
        elif type(model) == torch.nn.parallel.DataParallel:
            inner_train_step = model.module.train_step
        else:
            inner_train_step = model.train_step

        # Wrap the `train_step` so that batch data is moved to the appropriate device
        def train_step(engine, batch):
            batch = batch.to(device) if isinstance(batch, torch.Tensor) else tuple(i.to(device) for i in batch)

            return inner_train_step(batch)

        # Create the ignite `Engine` object
        trainer = Engine(train_step)
        return trainer

What I expected was that I could call the run method of the Trainer class, press Ctrl-C at some point during the training run, and get a checkpoint file.

Unfortunately that doesn't work. Checkpoints are produced at the end of each epoch as expected, but when I press Ctrl-C, no checkpoint is written. Instead I see the log message "Creating checkpoint" printed multiple times (once per dataloader worker, plus one) and usually, but not always, a stack trace like the following.

^C[2024-09-23 22:21:55,233 fibad.train:INFO] SIGINT received, creating checkpoint and exiting.
[2024-09-23 22:21:55,233 fibad.train:INFO] SIGINT received, creating checkpoint and exiting.
[2024-09-23 22:21:55,233 fibad.train:INFO] Creating checkpoint.
[2024-09-23 22:21:55,233 fibad.train:INFO] Checkpoint directory: checkpoints
[2024-09-23 22:21:55,234 fibad.train:INFO] Creating checkpoint.
[2024-09-23 22:21:55,234 fibad.train:INFO] Checkpoint directory: checkpoints
Exception in thread Thread-2 (_pin_memory_loop):
Traceback (most recent call last):
  File "/home/drew/miniconda3/envs/fibad/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
    self.run()
  File "/home/drew/miniconda3/envs/fibad/lib/python3.12/threading.py", line 1010, in run
    self._target(*self._args, **self._kwargs)
  File "/home/drew/miniconda3/envs/fibad/lib/python3.12/site-packages/torch/utils/data/_utils/pin_memory.py", line 55, in _pin_memory_loop
    do_one_step()
  File "/home/drew/miniconda3/envs/fibad/lib/python3.12/site-packages/torch/utils/data/_utils/pin_memory.py", line 32, in do_one_step
    r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drew/miniconda3/envs/fibad/lib/python3.12/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drew/miniconda3/envs/fibad/lib/python3.12/site-packages/torch/multiprocessing/reductions.py", line 496, in rebuild_storage_fd
    fd = df.detach()
         ^^^^^^^^^^^
  File "/home/drew/miniconda3/envs/fibad/lib/python3.12/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drew/miniconda3/envs/fibad/lib/python3.12/multiprocessing/resource_sharer.py", line 86, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drew/miniconda3/envs/fibad/lib/python3.12/multiprocessing/connection.py", line 519, in Client
    c = SocketClient(address)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/home/drew/miniconda3/envs/fibad/lib/python3.12/multiprocessing/connection.py", line 647, in SocketClient
    s.connect(address)
FileNotFoundError: [Errno 2] No such file or directory
Engine run is terminating due to exception: 0

I'm assuming that there must be an elegant way to shut down the engine, and I thought that trainer.terminate() was the way to do it, but this doesn't seem correct.

Am I just completely misusing this functionality? At the end of the day, I could just use the EPOCH_COMPLETED event to trigger the creation of a checkpoint, but it would be nice to be able to resume in the middle of an epoch if we're evicted from the hardware we're training on.

vfdev-5 · 2024-09-24T09:24:59Z

Maintainer

Hi @drewoldag, thanks for asking. If I understood correctly, here is a working example of what you would like to do:

import signal

import torch.nn as nn
from ignite.engine import Engine, Events
from ignite.handlers import Checkpoint, global_step_from_engine


train_data = range(10)
max_epochs = 5


model = nn.Linear(10, 10)


def train_step(engine, batch):
    # Dummy training step: it only reports progress
    print(f"{engine.state.epoch} / {engine.state.max_epochs} | {engine.state.iteration} - batch: {batch}", flush=True)

trainer = Engine(train_step)

to_save = {
    "model": model,
}

# Constructing the Checkpoint writes nothing; the object has to be called with an engine
checkpoint = Checkpoint(
    to_save,
    "./checkpoints",
    n_saved=1,
    global_step_transform=global_step_from_engine(trainer)
)


# Simulate Ctrl-C at the start of iteration 23
@trainer.on(Events.ITERATION_STARTED(once=23))
def send_signal():
    signal.raise_signal(signal.SIGINT)


def terminate_training_and_checkpoint(*args, **kwargs):
    print("Call checkpoint object to save the model etc")
    checkpoint(trainer)
    print("Terminate training")
    trainer.terminate()


signal.signal(signal.SIGINT, terminate_training_and_checkpoint)


trainer.run(train_data, max_epochs=max_epochs)

Output:

1 / 5 | 1 - batch: 0
1 / 5 | 2 - batch: 1
1 / 5 | 3 - batch: 2
1 / 5 | 4 - batch: 3
1 / 5 | 5 - batch: 4
1 / 5 | 6 - batch: 5
1 / 5 | 7 - batch: 6
1 / 5 | 8 - batch: 7
1 / 5 | 9 - batch: 8
1 / 5 | 10 - batch: 9
2 / 5 | 11 - batch: 0
2 / 5 | 12 - batch: 1
2 / 5 | 13 - batch: 2
2 / 5 | 14 - batch: 3
2 / 5 | 15 - batch: 4
2 / 5 | 16 - batch: 5
2 / 5 | 17 - batch: 6
2 / 5 | 18 - batch: 7
2 / 5 | 19 - batch: 8
2 / 5 | 20 - batch: 9
3 / 5 | 21 - batch: 0
3 / 5 | 22 - batch: 1
Call checkpoint object to save the model etc
Terminate training

State:
	iteration: 23
	epoch: 3
	epoch_length: 10
	max_epochs: 5
	output: <class 'NoneType'>
	batch: 2
	metrics: <class 'dict'>
	dataloader: <class 'range'>
	seed: <class 'NoneType'>
	times: <class 'dict'>

ls ./checkpoints

model_23.pt

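The main problem in your code is that the signal handler calls checkpoint_handler(), which only constructs a new Checkpoint object and never saves anything. To actually write a checkpoint, keep the Checkpoint instance around and call it with the trainer as its argument: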

class Trainer():
    def __init__(self, config):
        signal.signal(signal.SIGINT, self.signal_handler)

    def run(self):
        """Run the training process for a given model and data loader.
        """
        self.model = model_cls(config=self.config, shape=data_loader.shape())

        # Create trainer, a pytorch-ignite `Engine` object
        self.trainer = self._create_trainer(self.model)

        self.checkpointer = self.checkpoint_handler()
        self.trainer.add_event_handler(Events.EPOCH_COMPLETED, self.checkpointer)

        # Run the training process
        self.trainer.run(dist_data_loader, max_epochs=self.config["model"]["epochs"])

    def checkpoint_handler(self):
        to_save = {
            'model': self.model,
            'optimizer': self.model.optimizer,
            'trainer': self.trainer
        }

        logger.info("Creating checkpoint.")
        return Checkpoint(
            to_save,
            DiskSaver(Path("./checkpoints"), require_empty=False),
            n_saved=1,
            global_step_transform=global_step_from_engine(self.trainer),
        )

    def signal_handler(self, sig, frame):
        if sig == signal.SIGINT:
            logger.info("SIGINT received, creating checkpoint and exiting.")
            self.checkpointer(self.trainer)
        self.trainer.terminate()
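        # sys.exit(0) is not needed: terminate() makes trainer.run() return by itself,
        # whereas sys.exit(0) raises SystemExit inside the running engine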

Hope this helps

2 replies

drewoldag Sep 24, 2024
Author

Great, thank you for the help here, I'll take a look at your suggestions and fold them into my code. I really appreciate it!

vfdev-5 Sep 24, 2024
Maintainer

Sounds good. If something in the answer is unclear, or if you have other questions, do not hesitate to open a new discussion or issue, or join our Discord server: https://pytorch-ignite.ai/chat
