Inappropriate check when wrapping layers for FSDP #2947

Closed · 2 of 4 tasks
fc-jian opened this issue Jul 19, 2024 · 3 comments

Comments

fc-jian commented Jul 19, 2024

System Info

- `Accelerate` version: 0.32.1
- Platform: Linux-3.10.0-1127.el7.x86_64-x86_64-with-glibc2.17
- `accelerate` bash location: /home/user/miniforge3/envs/torch231/bin/accelerate
- Python version: 3.12.4
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.3.1 (False)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 186.87 GB
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: FSDP
        - mixed_precision: bf16
        - use_cpu: False
        - debug: False
        - num_processes: 8
        - machine_rank: 0
        - num_machines: 1
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - enable_cpu_affinity: False
        - fsdp_config: {'fsdp_activation_checkpointing': False, 'fsdp_auto_wrap_policy': 'TRANSFORMER_BASED_WRAP', 'fsdp_backward_prefetch': 'BACKWARD_PRE', 'fsdp_cpu_ram_efficient_loading': True, 'fsdp_forward_prefetch': False, 'fsdp_offload_params': False, 'fsdp_sharding_strategy': 'HYBRID_SHARD_ZERO2', 'fsdp_state_dict_type': 'SHARDED_STATE_DICT', 'fsdp_sync_module_states': True, 'fsdp_use_orig_params': True}
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

In the following code from utils/dataclasses.py:

    def set_auto_wrap_policy(self, model):
        from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy, transformer_auto_wrap_policy

        default_transformer_cls_names_to_wrap = (
            ",".join(model._no_split_modules) if getattr(model, "_no_split_modules", None) is not None else ""
        )
        if self.auto_wrap_policy is None:
            auto_wrap_policy = os.environ.get("FSDP_AUTO_WRAP_POLICY", "NO_WRAP")
            if auto_wrap_policy == FSDP_AUTO_WRAP_POLICY[0]:
                transformer_cls_names_to_wrap = os.environ.get(
                    "FSDP_TRANSFORMER_CLS_TO_WRAP", default_transformer_cls_names_to_wrap
                ).split(",")
                transformer_cls_to_wrap = set()
                for layer_class in transformer_cls_names_to_wrap:
                    transformer_cls = get_module_class_from_name(model, layer_class)
                    if transformer_cls is None:
                        raise Exception("Could not find the transformer layer class to wrap in the model.")
        ...

This requires that every layer class listed in model._no_split_modules actually appear in the model. However, several transformers models have variants that do not contain all of the layer types listed in _no_split_modules (which is usually defined on the XXXPreTrainedModel class). As a result, the Exception("Could not find the transformer layer class to wrap in the model.") is raised even when the model contains some, but not all, of the layers listed in model._no_split_modules.

I ran into this while working with EsmModel in transformers: EsmPreTrainedModel lists EsmFoldTriangularSelfAttentionBlock in _no_split_modules, but not all ESM models contain that layer.
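
To make the failure mode concrete, here is a minimal, self-contained sketch (the class names are illustrative, not the real ESM modules; the import path follows the workaround below): when _no_split_modules lists a class name that never occurs in the instantiated model, get_module_class_from_name returns None and the check above raises.

    import torch.nn as nn
    from accelerate.utils.dataclasses import get_module_class_from_name

    class BlockA(nn.Module):
        def __init__(self):
            super().__init__()
            self.linear = nn.Linear(4, 4)

    class ToyVariant(nn.Module):
        # Mirrors a _no_split_modules definition on a shared base class:
        # "MissingBlock" is declared but never instantiated in this variant.
        _no_split_modules = ["BlockA", "MissingBlock"]

        def __init__(self):
            super().__init__()
            self.layers = nn.ModuleList([BlockA() for _ in range(2)])

    model = ToyVariant()
    print(get_module_class_from_name(model, "BlockA"))        # <class '__main__.BlockA'>
    print(get_module_class_from_name(model, "MissingBlock"))  # None -> triggers the exception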

Expected behavior

I added the following code before constructing the Trainer to drop the entries in _no_split_modules that are not present in the model, and everything works:

    from accelerate.utils.dataclasses import get_module_class_from_name

    _update_wrap_layers = []
    for layer in model._no_split_modules:
        if get_module_class_from_name(model, layer) is not None:
            _update_wrap_layers.append(layer)
    model._no_split_modules = _update_wrap_layers

I am not sure whether this should be fixed in accelerate or in the transformers models, so I did not open a PR directly.
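
For reference, one possible relaxation of the check in set_auto_wrap_policy, sketched here purely as an illustration (the helper name and placement are my own, and this is not necessarily what any existing PR implements): skip the names that do not resolve in the model and raise only if none of them do.

    from accelerate.utils.dataclasses import get_module_class_from_name

    def resolve_transformer_cls_to_wrap(model, transformer_cls_names_to_wrap):
        # Hypothetical helper: keep only the classes that actually occur in this model variant.
        transformer_cls_to_wrap = set()
        for layer_class in transformer_cls_names_to_wrap:
            transformer_cls = get_module_class_from_name(model, layer_class)
            if transformer_cls is not None:
                transformer_cls_to_wrap.add(transformer_cls)
        if not transformer_cls_to_wrap:
            raise ValueError("Could not find any transformer layer class to wrap in the model.")
        return transformer_cls_to_wrap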

@iAaronLau

Thanks @fc-jian,

I also found that FSDP_TRANSFORMER_CLS_TO_WRAP must not be defined in the environment (e.g. via ~/.cache/huggingface/accelerate/default_config.yaml), or the workaround above will not take effect.

This is because transformer_cls_names_to_wrap is overridden by FSDP_TRANSFORMER_CLS_TO_WRAP at:

  transformer_cls_names_to_wrap = os.environ.get(
      "FSDP_TRANSFORMER_CLS_TO_WRAP", default_transformer_cls_names_to_wrap
  ).split(",")

where default_transformer_cls_names_to_wrap only serves as a fallback, so it is ignored whenever FSDP_TRANSFORMER_CLS_TO_WRAP is set.
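
In that situation, one way to let the model-level workaround apply (my own suggestion, assuming the launcher has exported the config value into the process environment) is to drop the override early in the training script:

    import os

    # Remove the env override so os.environ.get(...) falls back to the
    # default derived from model._no_split_modules.
    os.environ.pop("FSDP_TRANSFORMER_CLS_TO_WRAP", None)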

@muellerzr
Collaborator

Hi all, I think #2998 might assist with this, if I’m understanding right? :)

github-actions bot commented

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this as completed Oct 7, 2024