Due to the limited memory capacity of a single GPU, it may be impossible to load the entire MiniCPMV model (the model weights account for 18 GiB) onto one device for inference (assume one gpu only has 12 GiB or 16 GiB GPU memory). To address this limitation, multi-GPU inference can be employed, where the model's layers are distributed across multiple GPUs.
A minimal modification method can be used to achieve this distribution, ensuring that the layers are assigned to different GPUs with minimal changes to the original model structure.
To implement this, we utilize features provided by accelerate
library.
Install all requirements of MiniCPM-Llama3-V-2_5, additionally, you also need to install accelerate
.
pip install accelerate
Now we consider a demo for two GPUs, where each GPU has 16 GiB GPU memory.
- Import necessary libraries.
from PIL import Image
import torch
from transformers import AutoConfig, AutoModel, AutoTokenizer
from accelerate import init_empty_weights, infer_auto_device_map, load_checkpoint_in_model, dispatch_model
- Download model weights.
MODEL_PATH = '/local/path/to/MiniCPM-Llama3-V-2_5' # you can download in advance or use `openbmb/MiniCPM-Llama3-V-2_5`
- Determine the distribution of layers on multiple GPUs.
max_memory_each_gpu = '10GiB' # Define the maximum memory to use on each gpu, here we suggest using a balanced value, because the weight is not everything, the intermediate activation value also uses GPU memory (10GiB < 16GiB)
gpu_device_ids = [0, 1] # Define which gpu to use (now we have two GPUs, each has 16GiB memory)
no_split_module_classes = ["LlamaDecoderLayer"]
max_memory = {
device_id: max_memory_each_gpu for device_id in gpu_device_ids
}
config = AutoConfig.from_pretrained(
MODEL_PATH,
trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
MODEL_PATH,
trust_remote_code=True
)
with init_empty_weights():
model = AutoModel.from_config(
config,
torch_dtype=torch.float16,
trust_remote_code=True
)
device_map = infer_auto_device_map(
model,
max_memory=max_memory, no_split_module_classes=no_split_module_classes
)
print("auto determined device_map", device_map)
# Here we want to make sure the input and output layer are all on the first gpu to avoid any modifications to original inference script.
device_map["llm.model.embed_tokens"] = 0
device_map["llm.model.layers.0"] = 0
device_map["llm.lm_head"] = 0
device_map["vpm"] = 0
device_map["resampler"] = 0
print("modified device_map", device_map)
You may see this output:
modified device_map OrderedDict([('llm.model.embed_tokens', 0), ('llm.model.layers.0', 0), ('llm.model.layers.1', 0), ('llm.model.layers.2', 0), ('llm.model.layers.3', 0), ('llm.model.layers.4', 0), ('llm.model.layers.5', 0), ('llm.model.layers.6', 0), ('llm.model.layers.7', 0), ('llm.model.layers.8', 0), ('llm.model.layers.9', 0), ('llm.model.layers.10', 0), ('llm.model.layers.11', 0), ('llm.model.layers.12', 0), ('llm.model.layers.13', 0), ('llm.model.layers.14', 0), ('llm.model.layers.15', 0), ('llm.model.layers.16', 1), ('llm.model.layers.17', 1), ('llm.model.layers.18', 1), ('llm.model.layers.19', 1), ('llm.model.layers.20', 1), ('llm.model.layers.21', 1), ('llm.model.layers.22', 1), ('llm.model.layers.23', 1), ('llm.model.layers.24', 1), ('llm.model.layers.25', 1), ('llm.model.layers.26', 1), ('llm.model.layers.27', 1), ('llm.model.layers.28', 1), ('llm.model.layers.29', 1), ('llm.model.layers.30', 1), ('llm.model.layers.31', 1), ('llm.model.norm', 1), ('llm.lm_head', 0), ('vpm', 0), ('resampler', 0)])
- Next, use the
device_map
to dispatch the model layers to corresponding gpus.
load_checkpoint_in_model(
model,
MODEL_PATH,
device_map=device_map)
model = dispatch_model(
model,
device_map=device_map
)
torch.set_grad_enabled(False)
model.eval()
- Chat!
image_path = '/local/path/to/test.png'
response = model.chat(
image=Image.open(image_path).convert("RGB"),
msgs=[
{
"role": "user",
"content": "guess what I am doing?"
}
],
tokenizer=tokenizer
)
print(response)
In this case the OOM (CUDA out of memory) problem may be eliminated. We have tested that:
- it works well for
3000
text input tokens and1000
text output tokens. - it works well for a high-resolution input image.
It is similar to the previous example, but you may consider modifying these two variables.
max_memory_each_gpu = '10GiB' # Define the maximum memory to use on each gpu, here we suggest using a balanced value, because the weight is not everything, the intermediate activation value also uses GPU memory (10GiB < 16GiB)
gpu_device_ids = [0, 1, ...] # Define which gpu to use (now we have two GPUs, each has 16GiB memory)
You may use the following shell script to monitor the memory usage during inference, if there is OOM, try to reduce max_memory_each_gpu
, to make memory pressure more balanced across all gpus.
watch -n1 nvidia-smi