Should I do 'accelerator.prepare()' again after 'accelerator.unwrap_model()'? #948

Closed
DtYXs opened this issue Dec 29, 2022 · 11 comments

@DtYXs

DtYXs commented Dec 29, 2022

For multi-gpu training:
model = accelerator.prepare(model)
I want to save ckpt during training. Then I do this:
model = accelerator.unwrap_model(model)
pipeline = Pipeline(model=model)
pipeline.save_pretrained(...)
And I want to continue training. Should I do model = accelerator.prepare(model) again after saving?

@pacman100
Contributor

Hello,

below is the code structure to follow:

model = accelerator.prepare(model)

for epoch in range(num_epochs):
    # train loop
    
    # eval loop
    
    # saving checkpoint
    unwrapped_model = accelerator.unwrap_model(model)
    
    pipeline = Pipeline(model=unwrapped_model)
    pipeline.save_pretrained(...)

    # alternatively you can save using `unwrapped_model.save_pretrained(...)`

Let us know if that solves your query. For such queries in future, please use https://discuss.huggingface.co/c/accelerate/18
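
As an illustration only (not part of the original reply): a minimal runnable sketch of that structure, using a toy torch.nn.Linear model and torch.save in place of the Pipeline / save_pretrained calls, which depend on your actual setup. Note that prepare() is called once and is not repeated after saving.

import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()

# toy model and data so the loop actually runs; swap in your real model/pipeline
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataset = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
dataloader = DataLoader(dataset, batch_size=8)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

num_epochs = 2
for epoch in range(num_epochs):
    model.train()
    for inputs, labels in dataloader:
        loss = torch.nn.functional.cross_entropy(model(inputs), labels)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()

    # checkpoint at the end of each epoch; training simply continues afterwards
    accelerator.wait_for_everyone()
    if accelerator.is_main_process:
        unwrapped_model = accelerator.unwrap_model(model)
        torch.save(unwrapped_model.state_dict(), f"checkpoint-epoch-{epoch}.pt")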

@DtYXs
Author

DtYXs commented Dec 29, 2022

Thanks for your helpful reply.
Does the operation that produces unwrapped_model affect model itself?
Actually, in my code, if I continue training the original model after doing this, there are problems. I want to know whether unwrapped_model is completely independent of model, and if not, how I can save a checkpoint multiple times without breaking the training process, rather than saving only once after training has finished.
In short, I want to save checkpoints multiple times during training, but accelerator.unwrap_model(model) does not seem to work well for that.

@pacman100
Contributor

Hello @DtYXs, can you please provide a minimal reproducible example of the issue that you are facing?

@miquelmarti

miquelmarti commented Jan 18, 2023

I'm having problems with this too. With fp16 mixed precision and no distributed training, after I use unwrap_model to save a checkpoint at the end of an epoch (running close to max memory usage), I get an OOM immediately on the next forward pass of training. If I don't unwrap, there is no OOM. I'm looking into it, but I'm pretty sure the culprit is around here:

original_forward = model.__dict__.pop("_original_forward", None)

unwrap_model seems to affect the passed model in place, removing the autocast (and fp32 output conversion) wrapper from its forward pass.
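
A quick way to observe this in-place effect (my own check, not from the thread; it assumes a GPU, since fp16 mixed precision requires one, and the accelerate version discussed here, where the wrapper is removed by default):

import torch
from accelerate import Accelerator

accelerator = Accelerator(mixed_precision="fp16")
model = accelerator.prepare(torch.nn.Linear(4, 4))

forward_before = model.forward
unwrapped = accelerator.unwrap_model(model)  # default strips the autocast/fp32 wrapper
forward_after = model.forward

# Prints False if the wrapper was removed in place: the original model object now has
# a different forward, so later training steps run without the mixed-precision wrapper.
print(forward_before is forward_after)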

@sgugger
Collaborator

sgugger commented Jan 18, 2023

Yes, it does; this was added recently by @muellerzr.
Maybe we should add a flag to unwrap_model so the user can choose whether the model hooks are removed or not.

@miquelmarti

One would expect them to be removed on the returned object, but never on the object you pass to the function, no?

@sgugger
Collaborator

sgugger commented Jan 18, 2023

It's tricky since a model is a reference type, and we can't really create a whole copy of the model.

@muellerzr
Collaborator

@miquelmarti what Sylvain already describes is actually supported. Just do:

model = accelerator.unwrap_model(model, keep_fp32_wrapper=True)

Will this work for you? :)
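
Dropping that into the kind of loop sketched earlier in this thread (illustrative only, with a plain state_dict save standing in for save_pretrained), the per-epoch checkpoint becomes:

    # inside the epoch loop, after the train/eval steps
    accelerator.wait_for_everyone()
    if accelerator.is_main_process:
        # keep_fp32_wrapper=True returns the underlying model for saving but leaves the
        # mixed-precision forward wrapper on `model`, so training continues unchanged
        unwrapped_model = accelerator.unwrap_model(model, keep_fp32_wrapper=True)
        torch.save(unwrapped_model.state_dict(), f"checkpoint-epoch-{epoch}.pt")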

@miquelmarti

miquelmarti commented Jan 19, 2023

I understand, and yes, this flag does work for me. However, the behaviour before the change was equivalent to setting the flag to True, and IMO that should still be the default so that unwrapping the model does not affect subsequent calls to forward. I think train -> checkpoint the unwrapped model -> keep training is probably the most common use of unwrap_model. If there is a good reason for it not to default to True, a warning in the documentation would probably save people who are upgrading, or who have expectations similar to mine, some debugging time :')

EDIT: I guess #858 is the reason the hooks are removed by default. If I keep the hook via the flag, I hit the same problem described in that issue when I do intermediate checkpoints; if I don't, training cannot continue in the same way. There should be a way to allow for both.
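
One possible compromise, sketched under the assumption that the wrapper only patches forward and not the weights (so state_dict-based checkpoints are unaffected by keeping it):

# intermediate checkpoints: keep the wrapper so training continues unchanged
unwrapped = accelerator.unwrap_model(model, keep_fp32_wrapper=True)
torch.save(unwrapped.state_dict(), f"checkpoint-epoch-{epoch}.pt")

# final export after training: drop the wrapper so the returned model is a plain module
final_model = accelerator.unwrap_model(model)  # keep_fp32_wrapper defaults to False in this version
torch.save(final_model.state_dict(), "final.pt")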

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@JeS24

JeS24 commented Jul 15, 2024

@muellerzr @sgugger Was any consensus reached on what to do here? I am facing the same dilemma as @miquelmarti and would prefer that unwrapping and checkpointing the model (using save_pretrained()) does not impact its ability to continue training.
