
[Bug]: The end_sync operation inside the cross_device_reduce_2stage kernel sometimes deadlocks because the end signal it waits for never arrives. #5866

Open
JiantaoXu opened this issue Jun 26, 2024 · 2 comments
Labels: bug (Something isn't working), stale

Comments


JiantaoXu commented Jun 26, 2024

Your current environment

The problem reproduces when we stress test vLLM model inference on NVIDIA A100 or H800 GPUs.

🐛 Describe the bug

In the start_sync function, although the writes to the start and end signals are volatile, that by itself does not guarantee that NVCC will preserve the order of the two stores. According to the CUDA Memory Consistency Model documentation, two volatile memory operations are only guaranteed to keep their program order when they access the same memory location, and the start and end signals are not even on the same cache line.
Once the stores on line 135 and line 140 are swapped, another device's write setting this device's end flag to 1 can land before this device's own write resetting end to 0; the reset then overwrites the 1, and the while loop on line 165 spins forever, deadlocking the kernel.
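
For reference, here is a minimal sketch of the start_sync/end_sync handshake this report describes. The SketchSignal layout, the function names, and the SKETCH_MAX_* constants are illustrative assumptions and not the actual vLLM kernel code; only the per-field accesses mirror the snippets in this report.

#define SKETCH_MAX_BLOCKS 36
#define SKETCH_MAX_RANKS 8

// Hypothetical per-rank signal block; the real layout in vLLM may differ.
struct SketchSignal {
  volatile int start[SKETCH_MAX_BLOCKS][SKETCH_MAX_RANKS];
  volatile int end[SKETCH_MAX_BLOCKS][SKETCH_MAX_RANKS];
};

__device__ void sketch_start_sync(SketchSignal* self_sg, SketchSignal** peer_sgs,
                                  int rank, int world_size) {
  if ((int)threadIdx.x < world_size) {
    // (line 135) reset my end slot for the upcoming end_sync
    self_sg->end[blockIdx.x][threadIdx.x] = 0;
    // (line 140) tell peer `threadIdx.x` that this rank has started
    peer_sgs[threadIdx.x]->start[blockIdx.x][rank] = 1;
    // If NVCC emits the two stores above in the opposite order, a fast peer can
    // see start == 1, finish its reduce stage, and set my end slot to 1 before
    // the delayed store of 0 lands; the 0 then wipes out that 1.
    while (!self_sg->start[blockIdx.x][threadIdx.x]) { /* wait for peer start */ }
  }
  __syncthreads();
}

__device__ void sketch_end_sync(SketchSignal* self_sg, SketchSignal** peer_sgs,
                                int rank, int world_size) {
  __syncthreads();
  if ((int)threadIdx.x < world_size) {
    // tell peer `threadIdx.x` that this rank has finished
    peer_sgs[threadIdx.x]->end[blockIdx.x][rank] = 1;
    // (line 165) wait for every peer's end signal; this loop spins forever once
    // a peer's 1 has been overwritten by the stale 0 described above
    while (!self_sg->end[blockIdx.x][threadIdx.x]) { /* may never complete */ }
  }
  __syncthreads();
}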
To solve this problem, strengthen the ordering between the operations on line 135 and line 140, for example:

self_sg->end[blockIdx.x][threadIdx.x] = 0;              // line 135: reset my end flag
__threadfence_block();                                  // fence: keeps NVCC from reordering the two stores
sg.signals[threadIdx.x]->start[blockIdx.x][rank] = 1;   // line 140: publish the start flag
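
Note that __threadfence_block() is a block-scope fence; it is enough to keep the compiler from swapping the two stores, but if the concern is also the order in which other GPUs observe the writes, the system-scope __threadfence_system() would be the more conservative choice.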

Another way to solve it looks like this:

self_sg->end[blockIdx.x][threadIdx.x] = 0;
// read back the end slot so the start store carries a data dependency on it
sg.signals[threadIdx.x]->start[blockIdx.x][rank] = self_sg->end[blockIdx.x][threadIdx.x] + 1;
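
The idea in the second variant is that, per the same-address guarantee cited above, the read of self_sg->end[blockIdx.x][threadIdx.x] cannot move above the preceding store of 0 to that location, and the store to start depends on the value read, so the two writes keep their intended order.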


@youkaichao
Member

@JiantaoXu sorry for the late response. I think #8558 should fix it?


github-actions bot commented Jan 9, 2025

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions bot added the stale label on Jan 9, 2025