RuntimeError: CUDA driver error: an illegal memory access was encountered #15
Comments
It turns out that FSDP requires a slightly different approach when initializing after sharding. I'll open a PR to fix this on Monday.
@daviswer Did we revisit this thread? Just checking whether this issue is still present.
I think #18 changed the calculus for how we were planning to handle this, and after that it never got revisited. Not sure if the issue is still relevant.
@daviswer Yes. That PR helps this issue by moving So this will have to be fixed eventually.
So the way we'd want to do this is to add the (various)
But we need to make sure it keeps playing nicely with the
@daviswer Technically, since you named it Can you prepare a small validation code snippet (to be called after the FSDP call) to validate that the model is initialized as expected? E.g. it should pass with the current code, and it should not pass with the current code but removing
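One possible shape for such a validation check, as a minimal sketch and not the snippet that was eventually written for this repo: walk the model's parameters and flag any that contain non-finite values or whose multi-dimensional weights are constant, which is what memory that was never initialized (or was left zeroed) tends to look like. `find_suspect_params` and `min_std` are hypothetical names introduced here. The check assumes parameters are visible in their original shapes; under FSDP's flattened parameters only the finiteness test would apply.

```python
from typing import List

import torch
import torch.nn as nn


def find_suspect_params(model: nn.Module, min_std: float = 1e-8) -> List[str]:
    """Return names of parameters that look uninitialized (hypothetical helper).

    A parameter is flagged if it holds NaN/Inf values, or if it has more than
    one dimension and is constant. 1-D parameters (biases, norm scales) are
    allowed to be constant because zeros/ones are legitimate initial values.
    """
    suspects = []
    for name, param in model.named_parameters():
        data = param.detach().float()
        if data.numel() == 0:
            # A rank may hold an empty shard; there is nothing to check.
            continue
        if not torch.isfinite(data).all():
            suspects.append(name)
        elif data.dim() > 1 and data.numel() > 1 and data.std() < min_std:
            suspects.append(name)
    return suspects


if __name__ == "__main__":
    # Plain CPU module used for illustration only; no FSDP wrapping here.
    model = nn.Linear(4, 3)
    bad = find_suspect_params(model)
    assert not bad, f"parameters look uninitialized: {bad}"
```

A heuristic like this cannot prove the init matches the intended distribution; it only catches the obvious failure modes.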
OK, I opened a branch of fms main:
@daviswer Great. I will start working on this.
Closing this one in favor of the new issue: #64
@daviswer It seems that calling model.reset_parameters() after the FSDP call raises the following error.
Can you take a look?
Moving it before the FSDP call doesn't trigger this error.