[Model] Add support for DBRX #3660
Conversation
@megha95 Thanks for submitting the PR! Very excited about the new release. Left minor comments on some stylistic issues. PTAL.
LGTM! Thanks for submitting the PR! Very excited to see what people will build on top of DBRX.
Doesn't this need tiktoken to be installed? I guess that's not pushed to the docker image yet? Could the docker image be re-built?
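For anyone hitting a missing-tokenizer error before the image is rebuilt, a minimal sketch of the usual workaround in a pip-managed environment (not something verified in this thread):
```
# Install the tokenizer dependency that DBRX's tokenizer code pulls in
pip install tiktoken
```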
Hi guys, is there an example command that can be used to deploy an API server for DBRX-Base with vLLM? I have been trying with `CUDA_VISIBLE_DEVICES=1,2,3,4 python -m vllm.entrypoints.openai.api_server --model databricks/dbrx-base --tensor-parallel-size 4 --host localhost --port 12345 --gpu-memory-utilization 0.9`, but it seems to be asking for 1 EB of memory, with the error message `torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate more than 1EB memory.`
I had it running with TGI fine and can't remember if I got it working with
vLLM, but I did prepare this template:
https://runpod.io/console/deploy?template=bi8ao1ztys&ref=jmfkcdio , which
uses:
```
--model databricks/dbrx-instruct --max-model-len 4096 --port 8000
--trust-remote-code
```
It does strike me that it may be missing `--gpus all`, though.
If you're hitting memory issues, maybe consider trying `--dtype half`, to see if that forces bfloat16 or float16?
BTW, for fast model download, the latest vLLM docker image (and indeed requirements.txt in the main repo) includes hf_transfer, so if you set `HF_HUB_ENABLE_HF_TRANSFER=1` you'll get a much faster download via Rust.
I see. That makes sense if you set `--max-model-len` to 4096; I was using a 32k context length. On the topic of precision, vLLM automatically uses FP16 for FP16 models and BF16 for BF16 models, if I am not wrong.
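If you would rather not rely on that auto-detection, the dtype can be pinned explicitly with vLLM's standard `--dtype` flag; a small sketch (the other flag values here are only illustrative):
```
# Force bfloat16 instead of letting vLLM infer the dtype from the checkpoint config
python -m vllm.entrypoints.openai.api_server \
  --model databricks/dbrx-base \
  --tensor-parallel-size 4 \
  --dtype bfloat16 \
  --trust-remote-code
```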
Hi, update: I found out what the issue is. It is due to this warning: because custom allreduce is disabled, the model did not manage to load. The solution is to add
OK, wow, thanks, that's pretty specific.
This PR adds support for DBRX. DBRX is a Mixture-of-Experts (MoE) model trained by Databricks, with 132B total parameters and 36B active parameters. More details about the model can be found in the DBRX Technical Blog.
Model weights can be found in the HF repo:
This PR is currently based off an older commit because the latest main has some issues. As a result, there are some minor merge conflicts that will be corrected soon.
Meanwhile, to run DBRX with vLLM, this PR can be used. It has been tested on NVIDIA A100 and H100 systems.
Note: Given that the model has 132B total parameters, it is suggested to use a minimum of 4x 80GB GPUs to run 16-bit inference. Try increasing `gpu_memory_utilization` if you are running on 4 GPUs.
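As a concrete illustration of that note, a launch command along these lines matches the setups discussed above; the exact values, in particular the `--gpu-memory-utilization` bump and the context length, are only a suggestion rather than a configuration tested in this PR:
```
# 4-way tensor parallelism across 80GB GPUs; raise --gpu-memory-utilization
# above the 0.9 default if the MoE weights need more headroom.
python -m vllm.entrypoints.openai.api_server \
  --model databricks/dbrx-instruct \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 4096 \
  --trust-remote-code
```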