
Deepcopy on Accelerator to return self #1694

Merged
muellerzr merged 3 commits into main from deepcopy on Jul 11, 2023
Conversation

@muellerzr (Collaborator) commented Jul 7, 2023

Summary of Slack thread:

A skorch user had some issues using accelerate. To cut a long story short, the general setup is multi-GPU training on a single node. During evaluation, when accelerator.gather_for_metrics is called, the excess samples are not being truncated. I was digging through the code and saw the check here:

if self.use_distributed:
    if self.gradient_state.remainder == -1:
        logger.info(
            "The used dataset had no length, returning gathered tensors. You should drop the remainder yourself."
        )
        return tensor
    try:
        # Then see if we're on the last batch of our eval dataloader
        if self.gradient_state.end_of_dataloader and self.gradient_state.remainder > 0:
            # Last batch needs to be truncated on distributed systems as it contains additional samples
            def _adjust_samples(tensor):
                return tensor[: self.gradient_state.remainder]

            return recursively_apply(_adjust_samples, tensor)

The problem is that accelerator.gradient_state never thinks that end_of_dataloader is True, and remainder is always -1, so naturally the condition is never triggered. Strangely, when I inspect the gradient_state of the DataLoader instance, it correctly determines end_of_dataloader and remainder. If I do a hacky accelerator.gradient_state = dataloader.gradient_state, the issue goes away.
I suspect that the two gradient states should not diverge, and that the prepare step is somehow incorrect, leading to this. Any idea whether my suspicion is correct, and is there anything specific I need to pay attention to during the prepare step?
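For concreteness, the hacky workaround mentioned above looks roughly like this (variable names are illustrative, not taken from the attached example):

    model, optimizer, eval_loader = accelerator.prepare(model, optimizer, eval_loader)

    # Re-point the accelerator at the gradient state that the prepared
    # dataloader actually updates while it is iterated.
    accelerator.gradient_state = eval_loader.gradient_state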

Post discussion:

I found a potentially simple solution to the problem, but I wanted to check first if you believe it works correctly. The solution is to add a __deepcopy__ method to the Accelerator class, so basically:

class MyAccelerator(Accelerator):
    def __deepcopy__(self, memo):
        cls = type(self)
        instance = cls()  # <= this ignores non-default args for now
        return instance

I'm not sure what deepcopy does by default that's different, but this solves the problem. Do you think this is safe to use, and would it be possible to add a __deepcopy__ method to Accelerator?
I'm attaching a minimal example to reproduce the error and the fix (it still contains sklearn stuff, but skorch is completely removed, so it should be mostly straightforward). Without the fix on line 120, I get the error described above, with the gradient states diverging.
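For context on what default deepcopy does differently: accelerate's GradientState shares its attributes across all instances through a class-level dict (the borg pattern), and copy.deepcopy duplicates that dict, so the copy stops seeing updates made through the original. A minimal self-contained sketch of the mechanism (Borg is a stand-in here, not accelerate's actual class):

    import copy

    class Borg:
        # Stand-in for accelerate's GradientState: all instances share
        # one attribute dict, so updating any instance updates them all.
        _shared_state = {}

        def __init__(self):
            self.__dict__ = Borg._shared_state

    a, b = Borg(), Borg()
    a.end_of_dataloader = False

    c = copy.deepcopy(a)         # the copy gets its own, duplicated dict
    b.end_of_dataloader = True

    print(a.end_of_dataloader)   # True  -- a and b still share state
    print(c.end_of_dataloader)   # False -- the deep copy has diverged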

Solution:

The proposed solution (which works) is to give Accelerator a __deepcopy__ method that returns self. Considering we only want to initialize an accelerator object once (except under very extreme circumstances), returning self here is what we want: after deepcopying, the ID of everything is exactly the same, so the "copy" is just another reference to the same object with everything intact.

Even if this somehow breaks what downstream users are doing, it's good to have this here, because users shouldn't really be deepcopying the Accelerator object anyway; that adds headaches with clearing memory, stored references, etc.
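A minimal sketch of the approach (the exact code in the merged commit may differ slightly):

    class Accelerator:
        ...

        def __deepcopy__(self, memo):
            # Returning self makes copy.deepcopy(accelerator) a no-op:
            # the "copy" is just another reference to the same instance,
            # so the shared gradient state can never diverge.
            return self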

cc @BenjaminBossan

Alternative:

Another, much more isolated alternative than overriding all of deepcopy is to use __sklearn_clone__, which is specific to sklearn, should do the same thing, and shouldn't break existing code; or skorch could apply its own Accelerator subclass that adds this, removing the issue from accelerate directly.
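A sketch of that more isolated alternative, assuming scikit-learn >= 1.3 (which added the __sklearn_clone__ protocol); MyAccelerator is a hypothetical skorch-side subclass, not part of either library:

    from accelerate import Accelerator

    class MyAccelerator(Accelerator):
        def __sklearn_clone__(self):
            # sklearn.base.clone() calls this hook when it exists instead
            # of reconstructing the object from its parameters, so cloning
            # an estimator holding this accelerator reuses the instance.
            return self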

@muellerzr requested a review from sgugger July 7, 2023 15:34
@HuggingFaceDocBuilderDev commented Jul 7, 2023

The documentation is not available anymore as the PR was closed or merged.

@sgugger (Collaborator) left a comment

Works for me! I don't think this should break anything, but we can keep the second solution in mind if it ends up breaking another user's stuff.

src/accelerate/accelerator.py (review thread, resolved)
@muellerzr merged commit a4c2654 into main Jul 11, 2023
26 checks passed
@muellerzr deleted the deepcopy branch July 11, 2023 18:14