infer_auto_device_map inefficiently allocates GPU memory for models with imbalanced module sizes #3041
Comments
I'm sorry if my original issue was unclear. This is my first bug report. I have modified it; hopefully, it's easier to understand now. I will explain the problem in a few sentences.
Thanks for reporting this issue. I agree that there looks to be room for improvement in order to allow as many modules as possible to be loaded on the fastest device. To me, this looks like a knapsack problem, so finding an optimal solution could become quite interesting. But I'll let @muellerzr and @SunMarc, who have more background knowledge, comment on this.
Hey @Nech-C, thanks for the detailed report! You have a very good understanding of the situation. This could indeed be improved.
Reserving room for the largest layer is required in case we perform cpu/disk offloading, as we need to be able to bring the largest offloaded layer onto the GPU. One way to solve that is to create the device_map without the hypothesis that we will have offloaded layers; if we end up with offloaded layers, we redo the calculation with that hypothesis. Another solution would be to check whether the memory of the model is smaller than the memory of the GPUs; if that's the case, we do the calculation without the hypothesis. However, we might still face issues with unbalanced models.
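To make the first idea concrete, here is a toy, self-contained sketch of the two-pass approach operating on a plain dict of module sizes instead of a real model. The function names and the greedy first-fit placement are illustrative assumptions, not accelerate's actual implementation.

```python
# Toy sketch of the "redo the calculation" idea; not accelerate code.
# Modules are a name -> size-in-bytes dict; devices are 0 (GPU) and "cpu".
def greedy_assign(module_sizes, gpu_budget, reserve_bytes=0):
    """First-fit onto the GPU while keeping `reserve_bytes` free on it."""
    device_map, used = {}, 0
    for name, size in module_sizes.items():
        if used + size + reserve_bytes <= gpu_budget:
            device_map[name] = 0
            used += size
        else:
            device_map[name] = "cpu"
    return device_map

def assign_two_pass(module_sizes, gpu_budget):
    # Stand-in for the largest layer size (here simply the largest module).
    largest = max(module_sizes.values())
    # First pass: assume nothing will be offloaded, so reserve nothing.
    device_map = greedy_assign(module_sizes, gpu_budget, reserve_bytes=0)
    # If some modules were offloaded after all, redo with the reservation.
    if any(device == "cpu" for device in device_map.values()):
        device_map = greedy_assign(module_sizes, gpu_budget, reserve_bytes=largest)
    return device_map
```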
We can improve that part indeed. Since most transformers have balanced modules, it was working fine. For example, we could still consider coming back to the previous device if it has at least 10% of its space available. The reasoning behind moving to the next device was to limit movement across devices, as this makes inference slower: 1->2->3 and not 1->2->1->2->3. If you are up to the challenge, feel free to open a PR to fix those two points! I can have a look later!
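In the same spirit, here is a toy sketch of the second idea: instead of never returning to a device once we have moved past it, a module may land on an earlier device that still has a reasonable share of its capacity free (10% here, matching the figure above). Again, the names and placement strategy are illustrative only, not accelerate's implementation.

```python
# Toy sketch, not accelerate code. `budgets` maps device -> capacity in
# bytes, ordered from fastest to slowest.
def assign_with_fallback(module_sizes, budgets, slack=0.10):
    used = {device: 0 for device in budgets}
    device_map = {}
    for name, size in module_sizes.items():
        for device, capacity in budgets.items():
            free = capacity - used[device]
            # Revisit earlier devices as long as the module fits and the
            # device still has at least `slack` (10%) of its capacity free.
            if size <= free and free >= slack * capacity:
                device_map[name] = device
                used[device] += size
                break
        else:
            device_map[name] = "disk"  # nothing fit anywhere
    return device_map
```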
@SunMarc Thank you so much for your detailed response! I appreciate your insights into the problem. I'd love to take on this challenge. It might take me a little time to get up to speed with the library, but I'm excited to give it a try. Can I reach out with any questions as I work on this?
Nice, thanks for helping! Yes, feel free to ask any questions!
Hi @SunMarc, I've been digging into the code, and this is more complicated than I first thought. I agree that conditionally calculating the device_map is the way to go.

PR no. 1 (quick fix):
PR no. 2 (optimization):

Does this approach sound good to you? If you agree, I can start working on the first PR soon. Let me know if you have any suggestions or concerns about this plan!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info

Information

Tasks

- A `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)

Reproduction

Steps to Reproduce

1. Load a Segformer model.
2. Define `max_memory` by splitting the model size between the GPU and the CPU with a 0.7 split ratio (one of the ratios used in the tests).
3. Call `infer_auto_device_map` with `no_split_module_classes=[]` and the defined `max_memory`.

Code example
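The original snippet was not captured in this report, so the code below is a hedged reconstruction of the steps above; the Segformer checkpoint name and the 0.7 split ratio are assumptions based on the description, not necessarily the exact values used.

```python
from accelerate import infer_auto_device_map
from accelerate.utils import compute_module_sizes
from transformers import SegformerForSemanticSegmentation

# Checkpoint chosen only for illustration; the report just says "Segformer".
model = SegformerForSemanticSegmentation.from_pretrained(
    "nvidia/segformer-b0-finetuned-ade-512-512"
)

# Total model size in bytes, used to derive a synthetic max_memory budget.
model_size = compute_module_sizes(model)[""]

# Give the GPU 70% of the model size (the 0.7 split ratio) and let the CPU
# absorb the rest.
max_memory = {0: int(model_size * 0.7), "cpu": model_size * 2}

device_map = infer_auto_device_map(
    model, max_memory=max_memory, no_split_module_classes=[]
)
print(device_map)
```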
Output:
As shown above, no module is allocated to the GPU, even though it has enough space and there are modules small enough to fit on it.
Expected behavior
The function should allocate modules to the GPU if possible. When it does allocate modules to the GPU, it should efficiently utilize the space in the GPU for a model with imbalanced module sizes.
Here is a simple breakdown of the module sizes of the Segformer model:
When we increase the max_memory for the GPU by raising the split ratio to 0.9 (another ratio used in the tests), some modules are allocated to the GPU, but the allocation is still inefficient:
The space allocated to the GPU is significantly less than the defined max_memory for the GPU for both the 0.7 and 0.9 split cases.
After looking into the `infer_auto_device_map` function, I believe the logic might not be working as intended for models with highly imbalanced module sizes like Segformer:

While trying to determine whether to allocate a `module` to the current device, the function reserves space for the largest layer on the current main device. In other words, the current device needs to have more memory than the size of `module` plus the size of the largest layer just for `module` to be allocated on it (see the sketch below). For Segformer, where the decode_head (1,107,984 bytes) is significantly larger than the other layers, this approach may be too conservative, leaving little room for other layers on the GPU.

The module is allocated to the current device when the condition is met. Otherwise, the function tries to split the module or, if it cannot be split, moves on to the next device. However, once it moves to the next device (i.e., the CPU), it never goes back to the GPU, even if there is available space. This could explain why smaller modules aren't being allocated to the GPU after the decode_head is moved to the CPU.
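For reference, here is a simplified paraphrase of that check, with illustrative variable names and numbers; the real `infer_auto_device_map` logic is more involved (module splitting, tied weights, buffers, etc.).

```python
# Simplified paraphrase of the allocation decision described above.
def fits_on_current_device(module_size, largest_layer_size,
                           current_memory_used, device_max_memory):
    # The module is only placed on the current device if it fits together
    # with a reservation for the largest layer of the model.
    return (current_memory_used + module_size + largest_layer_size
            <= device_max_memory)

# With the Segformer numbers from this report, a small module can be
# rejected from the GPU because ~1.1 MB is reserved for decode_head.
print(fits_on_current_device(
    module_size=150_000,                # hypothetical small module
    largest_layer_size=1_107_984,       # decode_head size reported above
    current_memory_used=0,
    device_max_memory=1_200_000,        # hypothetical GPU budget
))  # -> False, even though the module alone would easily fit
```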
I encountered this issue while working to enable `device_map='auto'` for some models in the Transformers library. Offload tests for those models fail because the entire model is allocated to the CPU or disk. I have reported this problem in this issue. Since I am unfamiliar with this library, I don't know if this is the expected behavior of the function. Thank you for reading this!