Set the state device dependent on the Accelerator device on multi-GPU #1220
Conversation
The documentation is not available anymore as the PR was closed or merged.
Mmm, if we add a new flag to the load method, we should make it default smartly. Also not sure if that new flag needs to be a string since it has two states (apart from unset): CPU or device. So maybe an optional bool would suffice?
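A rough sketch of the two shapes being weighed here, purely for illustration (the parameter names `map_location` and `load_optimizer_on_device` are placeholders, not the final API):

```python
from typing import Optional

# Option A: a tri-state string flag, defaulted "smartly" (see the diff below).
def load_state(self, input_dir: str, map_location: Optional[str] = None, **load_model_func_kwargs):
    if map_location is None:
        map_location = "cpu" if self.num_processes < 2 else "on_device"
    ...

# Option B: an optional bool, since apart from "unset" there are only two states.
def load_state(self, input_dir: str, load_optimizer_on_device: Optional[bool] = None, **load_model_func_kwargs):
    if load_optimizer_on_device is None:
        load_optimizer_on_device = self.num_processes > 1
    ...
```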
src/accelerate/accelerator.py
Outdated
@@ -2385,8 +2385,17 @@ def load_state(self, input_dir: str, **load_model_func_kwargs):
        for hook in self._load_model_state_pre_hook.values():
            hook(models, input_dir)

        optimizer_map_location = "cpu" if self.num_processes < 2 else self.device
You need something special for TPUs (but maybe TPUs don't use that path?)
Since we're thinking of having this handled with a bool instead, getting the device from `PartialState.device` is now what works. The `AcceleratorState`/`PartialState` already has the right device needed to do the move:
https://github.com/huggingface/accelerate/blob/main/src/accelerate/state.py#L110
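For reference, a minimal standalone sketch of reading the device from the shared state (assumes the process group has already been initialized, e.g. under `accelerate launch`):

```python
from accelerate.state import PartialState

state = PartialState()  # singleton; shares whatever the Accelerator already set up
map_location = "cpu" if state.num_processes < 2 else state.device
print(map_location)  # e.g. cuda:<local_rank> in a multi-GPU launch, "cpu" otherwise
```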
src/accelerate/checkpointing.py
Outdated
    if optimizer_map_location is None:
        optimizer_map_location = "cpu"
    elif optimizer_map_location == "on_device":
I still think we should default to `"on_device"` for distributed training on GPU; otherwise we require `num_processes` times the optimizer state to be available in CPU RAM.
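A sketch of the defaulting rule this argues for (where exactly the check lives, and the use of `DistributedType` to restrict it to GPU runs, are assumptions):

```python
from accelerate.state import PartialState
from accelerate.utils import DistributedType

def default_optimizer_map_location() -> str:
    state = PartialState()
    # Load straight onto each process' GPU in distributed runs so we don't hold
    # num_processes copies of the optimizer state in CPU RAM at the same time.
    if state.num_processes > 1 and state.distributed_type == DistributedType.MULTI_GPU:
        return "on_device"
    return "cpu"
```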
src/accelerate/checkpointing.py
Outdated
        if map_location != "cpu":
            models[i].to(map_location)
        models[i].load_state_dict(torch.load(input_model_file, map_location=map_location), **load_model_func_kwargs)
PyTorch will load the optimizer state based on the mapping to the model's parameters, so the model needs to be on the `map_location` first (if it's not CPU) for it to work.
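A standalone illustration of that PyTorch behaviour, independent of Accelerate (assumes a CUDA device is available):

```python
import torch

model = torch.nn.Linear(4, 4)
optimizer = torch.optim.Adam(model.parameters())
model(torch.randn(2, 4)).sum().backward()
optimizer.step()  # creates optimizer state on CPU
saved = optimizer.state_dict()

# Move the model first, then load: Optimizer.load_state_dict casts the restored
# state tensors to the device of the matching parameters.
model.to("cuda")
optimizer = torch.optim.Adam(model.parameters())
optimizer.load_state_dict(saved)

first_param = next(model.parameters())
print(optimizer.state[first_param]["exp_avg"].device)  # cuda:0, following the parameter
```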
I'd let the error pop by itself. If the models are not on the right device, there should be an error.
Thanks for iterating, left a couple more comments.
src/accelerate/checkpointing.py
Outdated
        load_model_func_kwargs (`dict`, *optional*):
            Additional arguments that can be passed to the model's `load_state_dict` method.
    """
    if map_location not in [None, "cpu", "on_device"]:
        raise TypeError(
            "Unsupported optimizer map location passed, please choose one of `None`, `cpu`, or `on_device`"
Put the quotes around the strings here please.
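Presumably this asks for the string values themselves to be quoted in the message, roughly like this (a sketch of the wording, not the final text; `_check_map_location` is just a wrapper for illustration):

```python
def _check_map_location(map_location):
    # Quote "cpu" and "on_device" so they read as string values in the error message.
    if map_location not in [None, "cpu", "on_device"]:
        raise TypeError(
            'Unsupported optimizer map location passed, please choose one of `None`, `"cpu"` or `"on_device"`'
        )
```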
Solves #1210 by setting the optimizer state to `accelerator.device` when `num_processes > 1` and calling `load_state`.
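A usage sketch of the scenario this changes (run under `accelerate launch` with more than one process to hit the multi-GPU branch; the internals follow the discussion above, not necessarily the merged code verbatim):

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Linear(4, 4)
optimizer = torch.optim.Adam(model.parameters())
model, optimizer = accelerator.prepare(model, optimizer)

accelerator.save_state("checkpoint")
# Previously load_state always mapped the optimizer state to CPU first; with this
# change, multi-process GPU runs (num_processes > 1) load it onto accelerator.device,
# avoiding num_processes copies of the optimizer state in CPU RAM.
accelerator.load_state("checkpoint")
```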