Trainer class: using the Accelerate launcher with Deepspeed #25356

Closed
2 of 4 tasks
nebrelbug opened this issue Aug 7, 2023 · 4 comments
@nebrelbug
Contributor

System Info

  • transformers version: 4.32.0.dev0
  • Platform: Linux-3.10.0-1160.92.1.el7.x86_64-x86_64-with-glibc2.29
  • Python version: 3.8.10
  • Huggingface_hub version: 0.16.4
  • Safetensors version: 0.3.1
  • Accelerate version: 0.22.0.dev0
  • Accelerate config: - compute_environment: LOCAL_MACHINE
    - distributed_type: DEEPSPEED
    - mixed_precision: fp16
    - use_cpu: False
    - debug: False
    - num_processes: 1
    - machine_rank: 0
    - num_machines: 1
    - rdzv_backend: static
    - same_network: True
    - main_training_function: main
    - deepspeed_config: {'gradient_accumulation_steps': 1, 'offload_optimizer_device': 'cpu', 'offload_param_device': 'cpu', 'zero3_init_flag': True, 'zero3_save_16bit_model': True, 'zero_stage': 3}
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env: []
  • PyTorch version (GPU?): 2.0.1+cu118 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: distributed using DeepSpeed

Who can help?

@ArthurZucker, @sgugger, @younesbelkada

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I've written a very simple training loop using the HuggingFace Trainer class to finetune LLaMA. Here's the code:

loop.py

from transformers import LlamaForCausalLM, LlamaTokenizer, Trainer, TrainingArguments
from utils.dataloader_example import load_data

MODEL_PATH = "/.../llama-30b-hf"

tokenizer = LlamaTokenizer.from_pretrained(MODEL_PATH, legacy=False)
tokenizer.pad_token = tokenizer.eos_token

model = LlamaForCausalLM.from_pretrained(MODEL_PATH)

train_dataset, eval_dataset = load_data(tokenizer)

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    learning_rate=2e-5,
    logging_steps=10,
    fp16=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

trainer.train()
trainer.evaluate()

model.save_pretrained("/.../finetunes/llama-7b-tinyllama")
tokenizer.save_pretrained("/.../finetunes/llama-7b-tinyllama")

utils/dataloader_example.py

from torch.utils.data import Dataset
import json

with open("utils/alpaca_data.json", "r") as f:
    alpaca_data = json.load(f)

alpaca_data = [item for item in alpaca_data if len(item["input"]) == 0]

eval_mark = int(len(alpaca_data) * 0.8)

class StringDataset(Dataset):
    def __init__(self, string_list, tokenizer, max_sequence_length):
        self.string_list = string_list
        self.tokenizer = tokenizer
        self.max_sequence_length = max_sequence_length

    def __len__(self):
        return len(self.string_list)

    def __getitem__(self, idx):
        text = self.string_list[idx]
        tokens = self.tokenizer(
            text,
            padding="max_length",
            truncation=True,
            max_length=self.max_sequence_length,
            return_tensors="pt",
        )

        tokens["input_ids"] = tokens["input_ids"].squeeze()
        tokens["labels"] = tokens["input_ids"]
        tokens["attention_mask"] = tokens["attention_mask"].squeeze()
        return tokens

def process_data(data):
    return [
        """
### Instruction:
{instruction}

### Response:
{response}
""".format(
            instruction=item["instruction"], response=item["output"]
        ).strip()
        for item in data
    ]

training_data = process_data(alpaca_data[:eval_mark])
eval_data = process_data(alpaca_data[eval_mark:])

# Create datasets
def load_data(tokenizer):
    train_dataset = StringDataset(training_data, tokenizer, max_sequence_length=200)
    eval_dataset = StringDataset(eval_data, tokenizer, max_sequence_length=200)

    return train_dataset, eval_dataset

I can train smaller models, like LLaMA 7B, without DeepSpeed. But to train LLaMA 30B, I've been trying to use DeepSpeed ZeRO-3 with the Accelerate launcher.

Here's my accelerate config:

compute_environment: LOCAL_MACHINE
deepspeed_config:
 deepspeed_config_file: '/home/bgubler7/.cache/huggingface/accelerate/ds_config.json'
 zero3_init_flag: true
distributed_type: DEEPSPEED
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
# mixed_precision: fp16
num_machines: 1
num_processes: 1
use_cpu: false

And my DeepSpeed config:

{
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto",
            "total_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "sub_group_size": 1e9,
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": "auto"
    },
    "gradient_accumulation_steps": 1,
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
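
For reference, the Trainer is expected to fill in the "auto" entries above from TrainingArguments when it initializes DeepSpeed. Here's a rough sketch of the mapping I'm assuming (the batch-size value is the Trainer default, since loop.py doesn't set it explicitly):

from transformers import TrainingArguments

# Assumed mapping of the "auto" fields in ds_config.json to TrainingArguments:
#   optimizer.params.lr            <- learning_rate                (2e-5 in loop.py)
#   optimizer.params.weight_decay  <- weight_decay                 (default 0.0)
#   gradient_clipping              <- max_grad_norm                (default 1.0)
#   train_micro_batch_size_per_gpu <- per_device_train_batch_size  (default 8)
#   train_batch_size               <- per_device_train_batch_size
#                                     * gradient_accumulation_steps * num_processes
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    learning_rate=2e-5,             # -> optimizer.params.lr
    per_device_train_batch_size=8,  # -> train_micro_batch_size_per_gpu
    logging_steps=10,
    fp16=True,                      # matches the "fp16" block above
)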

When I run the code using accelerate launch loop.py, it seems to use the CPUs for model loading. The node I'm running on has 8 GPUs.

Unfortunately, after the checkpoint shards have loaded, only one of the GPUs begins to fill up. This eventually results in a CUDA out of memory error. Am I configuring DeepSpeed incorrectly? I copied and pasted the configuration from the HuggingFace documentation.

Expected behavior

I'd expect that the 30B model would load, with parameters and optimizer offloaded to the CPUs. Then all GPUs would be utilized to some extent during the training loop.

@sgugger
Collaborator

sgugger commented Aug 7, 2023

cc @pacman100

@pacman100
Contributor

Hello @nebrelbug, please update the accelerate config to correctly use 8 GPUs as shown below:

compute_environment: LOCAL_MACHINE
deepspeed_config:
 deepspeed_config_file: '/home/bgubler7/.cache/huggingface/accelerate/ds_config.json'
 zero3_init_flag: true
distributed_type: DEEPSPEED
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
# mixed_precision: fp16
num_machines: 1
- num_processes: 1
+ num_processes: 8
use_cpu: false
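
The same override can also be passed on the command line instead of editing the file (assuming the YAML above is the config that accelerate picks up by default):

accelerate launch --num_processes 8 loop.py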

@nebrelbug
Contributor Author

@pacman100 I updated my config and ran the code again. This time, all the GPUs filled up, but I'm still running into a CUDA out of memory error.

torch.cuda.OutOfMemoryError    : self.__all_gather_params(params_to_fetch, forward)CUDA out of memory. Tried to allocate 228.00 MiB (GPU 5; 79.15 GiB total capacity; 74.79 GiB already allocated; 28.44 MiB free; 77.21 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Am I configuring something wrong with fp16 or offload? I'm on a node with 8 A100 GPUs -- I believe I should be able to train even a 65B model, as long as I use half-precision.

@pacman100
Contributor

Hello @nebrelbug, you need to use gradient checkpointing when training such a large model, because the activations aren't offloaded and they take up a lot of GPU memory for long sequences. To further increase throughput, use Flash Attention 2 as well; see the sketch below.
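
A minimal sketch of what that could look like in loop.py (gradient_checkpointing is a standard TrainingArguments flag; the Flash Attention 2 switch is shown only as a commented-out assumption, since the exact argument depends on the installed transformers version):

import torch
from transformers import LlamaForCausalLM, TrainingArguments

# Recompute activations during the backward pass instead of storing them all,
# trading extra compute for a large reduction in activation memory.
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    learning_rate=2e-5,
    logging_steps=10,
    fp16=True,
    gradient_checkpointing=True,
)

# Optional, assumed: newer transformers releases accept an attn_implementation
# argument for Flash Attention 2 (requires the flash-attn package), e.g.:
# model = LlamaForCausalLM.from_pretrained(
#     MODEL_PATH, attn_implementation="flash_attention_2", torch_dtype=torch.float16
# )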
