Your current environment
The problem reproduces when we stress-test vLLM model inference on NVIDIA A100 or H800 GPUs.
🐛 Describe the bug
In the start_sync function, the writes to the start and end signals are volatile, but that does not guarantee that NVCC will keep these two operations in program order. According to the CUDA Memory Consistency Model manual, the order of two volatile operations (ld.volatile or st.volatile) is only guaranteed not to change when they access the same memory location. The start and end signals are not even on the same cache line. Please see the CUDA Memory Consistency Model.
If the operations on line 135 and line 140 are reordered, another device's write that sets end to 1 can happen before this device's write that resets end to 0. The reset then overwrites the other device's signal, and the while loop on line 165 never exits, which is a deadlock.
To solve this problem, strengthen the ordering dependency between the operations on line 135 and line 140, for example:
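One possible way to enforce that ordering, as a minimal sketch rather than the actual vLLM code, is to place a system-scope fence between the two stores. The `Signal` and `RankSignals` layouts, the constants `kMaxBlocks` and `kMaxRanks`, and the function name `start_sync_fenced` are hypothetical stand-ins chosen for illustration.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-ins for illustration; not the actual vLLM definitions.
constexpr int kMaxBlocks = 36;
constexpr int kMaxRanks = 8;

struct Signal {
  unsigned int start[kMaxBlocks][kMaxRanks];
  unsigned int end[kMaxBlocks][kMaxRanks];
};

struct RankSignals {
  volatile Signal* signals[kMaxRanks];
};

template <int ngpus>
__device__ __forceinline__ void start_sync_fenced(const RankSignals& sg,
                                                  volatile Signal* self_sg,
                                                  int rank) {
  if (threadIdx.x < ngpus) {
    // Reset this rank's end flag for the next round.
    self_sg->end[blockIdx.x][threadIdx.x] = 0;

    // System-scope fence: writes issued by this thread before the fence are
    // observed by peer devices as occurring before writes issued after it.
    __threadfence_system();

    // Only now tell the peer that this rank has started.
    sg.signals[threadIdx.x]->start[blockIdx.x][rank] = 1;

    // Wait until the peer has written its start flag into our signal buffer.
    while (!self_sg->start[blockIdx.x][threadIdx.x]) {
    }
  }
  __syncthreads();
}
```

The fence adds some latency to every synchronization, so it trades a little speed for a guaranteed order between the reset of end and the publication of start.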
Another way to solve it is like this:
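A further option, sketched here under assumptions and not necessarily the approach originally proposed, is to publish the start flag with a release store and poll it with an acquire load at system scope, so that the earlier reset of end is ordered before the start signal becomes visible. The helper names `st_flag_release`, `ld_flag_acquire`, and `signal_and_wait`, and the `FlagType` alias, are hypothetical; the PTX instructions require compute capability 7.0 or newer.

```cuda
#include <cuda_runtime.h>

// Hypothetical flag type for illustration.
using FlagType = unsigned int;

// Release store at system scope: earlier writes by this thread are ordered
// before this store for any thread that observes it with an acquire load.
__device__ __forceinline__ void st_flag_release(FlagType* addr, FlagType v) {
  asm volatile("st.release.sys.global.u32 [%1], %0;"
               :
               : "r"(v), "l"(addr)
               : "memory");
}

// Acquire load at system scope: pairs with the release store above.
__device__ __forceinline__ FlagType ld_flag_acquire(FlagType* addr) {
  FlagType v;
  asm volatile("ld.acquire.sys.global.u32 %0, [%1];"
               : "=r"(v)
               : "l"(addr)
               : "memory");
  return v;
}

// Hypothetical single-peer version of the start handshake.
__device__ __forceinline__ void signal_and_wait(volatile FlagType* self_end,
                                                FlagType* peer_start,
                                                FlagType* self_start) {
  // Reset our own end flag for the next round.
  *self_end = 0;
  // Publish the start signal; the release orders the reset before it.
  st_flag_release(peer_start, 1);
  // Spin until the peer's start signal arrives in our buffer.
  while (ld_flag_acquire(self_start) == 0) {
  }
}
```

Compared with a full fence, release and acquire attach the ordering to the flag accesses themselves, which keeps the intent visible at the point where the signal is written and read.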