
[Core] add an option to log every function call for debugging hang/crash in distributed inference #4079

Merged
merged 21 commits into vllm-project:main from trace_frame
Apr 18, 2024

Conversation

@youkaichao (Member) commented Apr 15, 2024:

We often receive issue reports from users whose programs hang or crash in certain scenarios, and these are very difficult to debug. This PR adds an option, enabled by running with export VLLM_TRACE_FUNCTION=1, that logs every function call in vLLM. This way we can see the last function executed and the call stack right before the program hangs or crashes, which makes it much easier to tell where the bug is.
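
A minimal sketch of the mechanism, assuming a Python-level trace hook installed via sys.settrace; the names and log format here are illustrative, not the exact vLLM implementation:

import datetime
import os
import sys
import threading

def enable_trace_function_call(log_file_path: str) -> None:
    # Append one line per Python call/return event to log_file_path.
    def _trace(frame, event, arg):
        if event in ("call", "return"):
            with open(log_file_path, "a") as f:
                code = frame.f_code
                f.write(f"{datetime.datetime.now()} {event} "
                        f"{code.co_name} in {code.co_filename}:{frame.f_lineno}\n")
        return _trace  # keep tracing nested frames
    sys.settrace(_trace)

# Gate on the environment variable, as described above.
if os.environ.get("VLLM_TRACE_FUNCTION") == "1":
    enable_trace_function_call(
        f"trace_pid_{os.getpid()}_thread_{threading.get_ident()}.log")

The last frames written before a hang or crash point directly at the offending call.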

Sometimes the bug happens in unexpected places: for example, #4027 found that a hang was caused by slow reads from an S3 bucket, and #3916 found that a core dump was due to a corrupted libnccl.so.

Hopefully this PR will help with debugging in the future. It might also help #4019.

TODO

  • Ideally we should have some launch_id, as described in [RFC]: Interface and Abstraction for Distributed Inference Environment #3587, and place these logs under /tmp/launch_id/. Users could then zip the whole directory so that we can easily debug it. (This is VLLM_INSTANCE_ID, whose default is vllm-instance-{random_uuid()}; see the sketch after this list.)
  • Update the issue templates to instruct users to use this option when they report bugs about hangs or crashes.
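
A sketch of the instance-id item above (hedged; variable names beyond VLLM_INSTANCE_ID are illustrative):

import os
import uuid

# Default matches the description above: vllm-instance-{random_uuid()}.
instance_id = os.environ.get("VLLM_INSTANCE_ID",
                             f"vllm-instance-{uuid.uuid4()}")
log_dir = os.path.join("/tmp", instance_id)
os.makedirs(log_dir, exist_ok=True)
# Each process/thread writes its trace log under log_dir, so users can zip
# the single directory and attach it to a bug report.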

@youkaichao (Member, Author) commented:

Currently this functionality is only enabled for RayGPUExecutor, the only tensor-parallel executor.

We could enable it for all executors, but I would like to hear opinions on whether the effort is worthwhile.

@youkaichao (Member, Author) commented:

I'm going to set this as a release blocker for v0.4.1, because we will introduce a vLLM-managed NCCL library in this release. That is somewhat tricky, and we need more debugging functionality.

@simon-mo mentioned this pull request on Apr 18, 2024.
Review threads:
vllm/utils.py (resolved)
vllm/worker/worker_base.py, four threads (outdated, resolved)

Quoted fragment from vllm/worker/worker_base.py (the start of the enclosing expression is cut off in the diff view):

(f"VLLM_TRACE_FUNCTION_for_process_{os.getpid()}"
 f"_thread_{threading.get_ident()}_"
 f"at_{datetime.datetime.now()}.log").replace(" ", "_"))
os.makedirs(os.path.dirname(log_path), exist_ok=True)
A Collaborator commented on this snippet:

this should be moved into enable_trace_function_call

@youkaichao (Member, Author) replied:

I separated and simplified the logic in enable_trace_function_call so that it can be tested in a standalone way. The caller is responsible for creating the log file path.
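
A sketch of why that split helps testing: with the path supplied by the caller, a standalone test only needs a temp file. Hedged: the tracer below is a minimal stand-in for the real enable_trace_function_call, and the test name is illustrative.

import os
import sys
import tempfile

def enable_trace_function_call(log_file_path: str) -> None:
    # Minimal stand-in: log the name of every called function.
    def _trace(frame, event, arg):
        if event == "call":
            with open(log_file_path, "a") as f:
                f.write(frame.f_code.co_name + "\n")
        return _trace
    sys.settrace(_trace)

def test_trace_function_call():
    path = os.path.join(tempfile.mkdtemp(), "trace.log")  # caller builds the path
    enable_trace_function_call(path)
    def traced():
        pass
    traced()
    sys.settrace(None)  # stop tracing before inspecting the log
    assert os.path.getsize(path) > 0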

@youkaichao (Member, Author) commented:

@simon-mo thanks for the quick and detailed review!

@simon-mo merged commit 8a7a3e4 into vllm-project:main on Apr 18, 2024, with 44 of 46 checks passed.
@youkaichao deleted the trace_frame branch on April 18, 2024 at 23:19.
The commit was subsequently picked into downstream forks (each commit referencing vllm-project#4079 and co-authored by Simon Mo <simon.mo@hey.com>):
xjpang/vllm (Apr 19, 2024)
neuralmagic/nm-vllm (Apr 21, 2024 and Apr 26, 2024)
alexeykondrat/ci-vllm (May 1, 2024)
z103cb/opendatahub_vllm (May 7, 2024)
Temirulan/vllm-whisper (Sep 6, 2024)