[Bug]: mismatch between multimodal tokens and placeholders for Llava-Next (4 GPUs) #8421
Can you post the images in the batch that is causing the error?

Strangely, even if I wrap the Is this expected?

I think if there is an internal failure inside the model, the whole vLLM engine needs to be restarted. You can try to narrow down the batch number that causes the error and post the corresponding images.
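One way to narrow down the offending batch is to wrap the inference call so the failing batch's payload is dumped to disk before re-raising. This is just a sketch — `infer_fn` stands in for the reporter's `infer(vllm_engine, batch)` call, and the batch layout is an assumption:

```python
import pickle
from pathlib import Path


def run_with_batch_capture(infer_fn, batches, dump_dir="failed_batches"):
    """Run infer_fn on each batch; on failure, pickle the batch for inspection."""
    Path(dump_dir).mkdir(exist_ok=True)
    results = []
    for idx, batch in enumerate(batches):
        try:
            results.append(infer_fn(batch))
        except Exception as exc:
            dump_path = Path(dump_dir) / f"batch_{idx:05d}.pkl"
            with open(dump_path, "wb") as f:
                pickle.dump(batch, f)
            print(f"Batch {idx} failed ({exc!r}); payload saved to {dump_path}")
            raise  # stop here so the failing batch index is unambiguous
    return results
```

The saved pickle can then be reloaded to inspect (and share) the exact images and prompts that triggered the error.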
Hmm, quite strangely, it doesn't happen when using a single GPU. Does that sound similar?

I haven't heard of such issues resulting from using multiple GPUs. Another thing you can try is to increase

It is already at 32k.

It would greatly help debugging if you could identify which batch is consistently causing this error.

Yeah, I am trying. I am unable to find any outputs I am capturing with

A few things you can try:

Can you check whether #8496 fixes the issue for you?

Thanks, will try when I get a moment.

Seems to be working.
Will close this issue as #8496 seems to be working beautifully. I wanted to ask a silly question, hence not opening a new issue. I have the following simple script:

```python
import os
import queue
from concurrent.futures import ThreadPoolExecutor

import fire
from tqdm import tqdm

from data_processing import initialize_dataloader
from model import load_vllm_engine, infer
from utils import save_results


def main(
    data_path: str,
    batch_size: int = 48,
    dataloader_num_workers: int = 8,
    output_dir: str = "sample_outputs",
    max_tokens: int = 120,
    detect_watermarks: bool = False,
):
    vllm_engine, sampling_params = load_vllm_engine(max_tokens=max_tokens)
    dataloader = initialize_dataloader(
        data_path=data_path,
        batch_size=batch_size,
        dataloader_num_workers=dataloader_num_workers,
        output_dir=output_dir,
        detect_watermarks=detect_watermarks,
    )

    output_queue = queue.Queue()
    save_thread = ThreadPoolExecutor(max_workers=dataloader_num_workers)
    os.makedirs(output_dir, exist_ok=True)
    save_future = save_thread.submit(save_results, output_queue, output_dir)

    try:
        print("Starting the generation process.")
        for batch in tqdm(dataloader):
            batch["sampling_params"] = sampling_params
            try:
                outputs = infer(vllm_engine, batch)
                if outputs is not None:
                    original_captions = batch["original_captions"]
                    img_bytes = batch["img_bytes"]
                    img_hashes = batch["img_hashes"]
                    output_queue.put((original_captions, outputs, img_bytes, img_hashes))
            except Exception:
                # Skip batches that fail inference; consider logging the
                # exception instead of silently continuing.
                continue
    finally:
        output_queue.put(None)  # sentinel telling save_results to stop
        save_thread.shutdown(wait=True)
        save_future.result()

    print("All processes completed. Captions generation and saving done.")


if __name__ == "__main__":
    fire.Fire(main)
```

Once it finishes execution on multiple GPUs successfully, I get:

```
All processes completed. Captions generation and saving done.
ERROR 09-19 02:37:04 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 404188 died, exit code: -15
INFO 09-19 02:37:04 multiproc_worker_utils.py:123] Killing local vLLM worker processes
[rank0]:[W919 02:37:09.827306477 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
/fsx/sayak/vllm/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
```

The warnings could likely be ignored, but I see an "ERROR". Should
These are cleanup errors that can be safely ignored.
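One thing worth trying for a cleaner exit is to drop the engine reference and force a garbage-collection pass before the interpreter starts tearing itself down, so finalizers run while CUDA/IPC state is still alive. Whether this actually silences the `exit code: -15` message depends on the vLLM version; the pattern itself is plain Python, sketched here with a dummy stand-in for the engine:

```python
import gc
import weakref


class DummyEngine:
    """Hypothetical stand-in for the vLLM engine object; the real teardown
    would happen in the engine's finalizers when it is collected."""
    pass


engine = DummyEngine()
probe = weakref.ref(engine)  # lets us observe when the object is gone

# Explicit teardown at the end of main(), before interpreter shutdown:
del engine    # drop the last reference
gc.collect()  # run finalizers now, deterministically

assert probe() is None  # the engine was collected; its finalizers have run
```

The point of the explicit `gc.collect()` is to avoid relying on interpreter-exit ordering, which is when the "shared CUDA tensors" and leaked `shared_memory` warnings tend to appear.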
Your current environment
The output of `python collect_env.py`
Running my original script through SLURM, which is why the output above doesn't list any GPUs. I am on 4 H100s.
Model Input Dumps
No response
🐛 Describe the bug
Similar to #7996, I am running into a mismatch between multimodal tokens and placeholders when using Llava-NeXT:
All the code is here:
https://github.com/sayakpaul/simple-image-recaptioning
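For anyone hitting the same mismatch, a quick client-side sanity check is to compare the number of image placeholders in each prompt against the number of images attached to it. This is only a sketch — the `<image>` literal is an assumption and the actual placeholder varies by model and processor:

```python
def check_placeholders(prompt: str, num_images: int, placeholder: str = "<image>") -> bool:
    """Return True when the prompt contains exactly one placeholder per image."""
    return prompt.count(placeholder) == num_images


# One image attached, one placeholder -> consistent.
assert check_placeholders("USER: <image>\nDescribe this photo.", num_images=1)

# Two images attached but only one placeholder -> the kind of inconsistency
# that can surface server-side as a token/placeholder mismatch.
assert not check_placeholders("USER: <image>\nCompare the two.", num_images=2)
```

Running such a check before submitting requests makes it easier to tell client-side prompt bugs apart from genuine engine-side issues like the one fixed in #8496.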
This is how I launch it:
Before submitting a new issue...