Trainer class: using the Accelerate launcher with Deepspeed #25356

Closed
2 of 4 tasks
nebrelbug opened this issue Aug 7, 2023 · 4 comments
@nebrelbug
Contributor

System Info

  • transformers version: 4.32.0.dev0
  • Platform: Linux-3.10.0-1160.92.1.el7.x86_64-x86_64-with-glibc2.29
  • Python version: 3.8.10
  • Huggingface_hub version: 0.16.4
  • Safetensors version: 0.3.1
  • Accelerate version: 0.22.0.dev0
  • Accelerate config: - compute_environment: LOCAL_MACHINE
    - distributed_type: DEEPSPEED
    - mixed_precision: fp16
    - use_cpu: False
    - debug: False
    - num_processes: 1
    - machine_rank: 0
    - num_machines: 1
    - rdzv_backend: static
    - same_network: True
    - main_training_function: main
    - deepspeed_config: {'gradient_accumulation_steps': 1, 'offload_optimizer_device': 'cpu', 'offload_param_device': 'cpu', 'zero3_init_flag': True, 'zero3_save_16bit_model': True, 'zero_stage': 3}
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env: []
  • PyTorch version (GPU?): 2.0.1+cu118 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: distributed using DeepSpeed

Who can help?

@ArthurZucker, @sgugger, @younesbelkada

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I've written a very simple training loop using the HuggingFace Trainer class to finetune LLaMA. Here's the code:

loop.py

from transformers import LlamaForCausalLM, LlamaTokenizer, Trainer, TrainingArguments
from utils.dataloader_example import load_data

MODEL_PATH = "/.../llama-30b-hf"

tokenizer = LlamaTokenizer.from_pretrained(MODEL_PATH, legacy=False)
tokenizer.pad_token = tokenizer.eos_token

model = LlamaForCausalLM.from_pretrained(MODEL_PATH)

train_dataset, eval_dataset = load_data(tokenizer)

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    learning_rate=2e-5,
    logging_steps=10,
    fp16=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

trainer.train()
trainer.evaluate()

model.save_pretrained("/.../finetunes/llama-7b-tinyllama")
tokenizer.save_pretrained("/.../finetunes/llama-7b-tinyllama")

utils/dataloader_example.py

from torch.utils.data import Dataset
import json

with open("utils/alpaca_data.json", "r") as f:
    alpaca_data = json.load(f)

alpaca_data = [item for item in alpaca_data if len(item["input"]) == 0]

eval_mark = int(len(alpaca_data) * 0.8)

class StringDataset(Dataset):
    def __init__(self, string_list, tokenizer, max_sequence_length):
        self.string_list = string_list
        self.tokenizer = tokenizer
        self.max_sequence_length = max_sequence_length

    def __len__(self):
        return len(self.string_list)

    def __getitem__(self, idx):
        text = self.string_list[idx]
        tokens = self.tokenizer(
            text,
            padding="max_length",
            truncation=True,
            max_length=self.max_sequence_length,
            return_tensors="pt",
        )

        tokens["input_ids"] = tokens["input_ids"].squeeze()
        tokens["labels"] = tokens["input_ids"]
        tokens["attention_mask"] = tokens["attention_mask"].squeeze()
        return tokens

def process_data(data):
    return [
        """
### Instruction:
{instruction}

### Response:
{response}
""".format(
            instruction=item["instruction"], response=item["output"]
        ).strip()
        for item in data
    ]

training_data = process_data(alpaca_data[:eval_mark])
eval_data = process_data(alpaca_data[eval_mark:])

# Create datasets
def load_data(tokenizer):
    train_dataset = StringDataset(training_data, tokenizer, max_sequence_length=200)
    eval_dataset = StringDataset(eval_data, tokenizer, max_sequence_length=200)

    return train_dataset, eval_dataset

I can train smaller models, like LLaMA 7B, without DeepSpeed. But to train LLaMA 30B, I've been trying to use DeepSpeed ZeRO-3 with the Accelerate launcher.

Here's my accelerate config:

compute_environment: LOCAL_MACHINE
deepspeed_config:
 deepspeed_config_file: '/home/bgubler7/.cache/huggingface/accelerate/ds_config.json'
 zero3_init_flag: true
distributed_type: DEEPSPEED
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
# mixed_precision: fp16
num_machines: 1
num_processes: 1
use_cpu: false

And my DeepSpeed config:

{
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto",
            "total_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "sub_group_size": 1e9,
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": "auto"
    },
    "gradient_accumulation_steps": 1,
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
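
For reference, the Trainer is expected to fill in the "auto" entries above from TrainingArguments when it initializes DeepSpeed. Here's a rough sketch of the mapping I'm assuming (the batch-size value is the Trainer default, since loop.py doesn't set it explicitly):

from transformers import TrainingArguments

# Assumed mapping of the "auto" fields in ds_config.json to TrainingArguments:
#   optimizer.params.lr            <- learning_rate                (2e-5 in loop.py)
#   optimizer.params.weight_decay  <- weight_decay                 (default 0.0)
#   gradient_clipping              <- max_grad_norm                (default 1.0)
#   train_micro_batch_size_per_gpu <- per_device_train_batch_size  (default 8)
#   train_batch_size               <- per_device_train_batch_size
#                                     * gradient_accumulation_steps * num_processes
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    learning_rate=2e-5,             # -> optimizer.params.lr
    per_device_train_batch_size=8,  # -> train_micro_batch_size_per_gpu
    logging_steps=10,
    fp16=True,                      # matches the "fp16" block above
)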

When I run the code using accelerate launch loop.py, it seems to use the CPUs for model loading. The node I'm running on has 8 GPUs.

Unfortunately, after the checkpoint shards have loaded, only one of the GPUs begins to fill up. This eventually results in a CUDA out of memory error. Am I configuring DeepSpeed incorrectly? I copied and pasted the configuration from the HuggingFace documentation.

Expected behavior

I'd expect that the 30B model would load, with parameters and optimizer offloaded to the CPUs. Then all GPUs would be utilized to some extent during the training loop.

@sgugger
Collaborator

sgugger commented Aug 7, 2023

cc @pacman100

@pacman100
Contributor

Hello @nebrelbug, please update the accelerate config to correctly use 8 GPUs as shown below:

compute_environment: LOCAL_MACHINE
deepspeed_config:
 deepspeed_config_file: '/home/bgubler7/.cache/huggingface/accelerate/ds_config.json'
 zero3_init_flag: true
distributed_type: DEEPSPEED
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
# mixed_precision: fp16
num_machines: 1
- num_processes: 1
+ num_processes: 8
use_cpu: false
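
The same override can also be passed on the command line instead of editing the file (assuming the YAML above is the config that accelerate picks up by default):

accelerate launch --num_processes 8 loop.py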

@nebrelbug
Contributor Author

@pacman100 I updated my config and ran the code again. This time, all the GPUs filled up, but I'm still running into a CUDA out of memory error.

torch.cuda.OutOfMemoryError    : self.__all_gather_params(params_to_fetch, forward)CUDA out of memory. Tried to allocate 228.00 MiB (GPU 5; 79.15 GiB total capacity; 74.79 GiB already allocated; 28.44 MiB free; 77.21 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Am I configuring something wrong with fp16 or offload? I'm on a node with 8 A100 GPUs -- I believe I should be able to train even a 65B model, as long as I use half-precision.

@pacman100
Contributor

Hello @nebrelbug, you need to use gradient checkpointing when training such a large model, because the activations aren't offloaded and they take up a lot of GPU memory for long sequences. To further increase throughput, use Flash Attention 2 as well; see the sketch below.
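
A minimal sketch of what that could look like in loop.py (gradient_checkpointing is a standard TrainingArguments flag; the Flash Attention 2 switch is shown only as a commented-out assumption, since the exact argument depends on the installed transformers version):

import torch
from transformers import LlamaForCausalLM, TrainingArguments

# Recompute activations during the backward pass instead of storing them all,
# trading extra compute for a large reduction in activation memory.
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    learning_rate=2e-5,
    logging_steps=10,
    fp16=True,
    gradient_checkpointing=True,
)

# Optional, assumed: newer transformers releases accept an attn_implementation
# argument for Flash Attention 2 (requires the flash-attn package), e.g.:
# model = LlamaForCausalLM.from_pretrained(
#     MODEL_PATH, attn_implementation="flash_attention_2", torch_dtype=torch.float16
# )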
