neox Flash attn #31

arnocandel · 2023-04-11T23:50:24Z

Enabling Flash attention
GPTLMHeadModel(
  (transformer): GPTModel(
    (embeddings): GPT2Embeddings(
      (word_embeddings): Embedding(50400, 4096)
    )
    (layers): ModuleList(
      (0-27): 28 x ParallelBlock(
        (mixer): MHA(
          (rotary_emb): RotaryEmbedding()
          (Wqkv): FusedDense(in_features=4096, out_features=12288, bias=False)
          (inner_attn): FlashSelfAttention()
          (inner_cross_attn): FlashCrossAttention()
          (out_proj): FusedDense(in_features=4096, out_features=4096, bias=False)
        )
        (dropout1): Dropout(p=0.0, inplace=False)
        (norm1): LayerNorm((4096,), eps=1e-05, elementwise_affine=True)
        (mlp): FusedMLP(
          (fc1): Linear(in_features=4096, out_features=16384, bias=True)
          (fc2): Linear(in_features=16384, out_features=4096, bias=True)
        )
        (dropout2): Dropout(p=0.0, inplace=False)
      )
    )
    (drop_f): Dropout(p=0.0, inplace=False)
    (ln_f): LayerNorm((4096,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=4096, out_features=50400, bias=True)
)

arnocandel · 2023-04-12T00:01:05Z

Running as is:
CUDA_VISIBLE_DEVICES=0 WORLD_SIZE=1 torchrun --nproc_per_node=1 --nnodes=1 finetune.py --data_path=merged_shuffled_OIG_87f6a1e788.json --num_epochs=0.2 --base_model=EleutherAI/gpt-j-6B --prompt_type=plain --data_mix_in_path=None --micro_batch_size=2 --batch_size=32 --cutoff_len=2048 --run_id=11 --flash_attention=True
fails with:
TypeError: GPTLMHeadModel.forward() got an unexpected keyword argument 'attention_mask'

Borrowing the approach from here:
https://github.com/h2oai/h2o-llm/blob/864fd5fd61bb2ab574b14eb6146bcfd003cffba0/finetune.py#L324-L330
leads to other errors.
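
The TypeError above comes from the training loop passing a keyword that this forward() does not declare. A minimal sketch of filtering a batch down to the keywords a callable accepts, for illustration only: `filter_kwargs_for` and the stand-in `forward` are hypothetical names, not part of finetune.py or flash-attn, and this alone may well not be enough to make training work.

```python
import inspect


def filter_kwargs_for(fn, kwargs):
    """Return only the entries of `kwargs` that `fn` can take by keyword."""
    params = inspect.signature(fn).parameters
    # A **kwargs parameter means every keyword is accepted.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(kwargs)
    by_keyword = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    return {
        name: value
        for name, value in kwargs.items()
        if name in params and params[name].kind in by_keyword
    }


# Hypothetical stand-in for a forward() without an attention_mask parameter.
def forward(input_ids, position_ids=None):
    return len(input_ids)


batch = {"input_ids": [1, 2, 3], "attention_mask": [1, 1, 1]}
accepted = filter_kwargs_for(forward, batch)
dropped = sorted(set(batch) - set(accepted))
result = forward(**accepted)
```

Note that dropping attention_mask means padded positions are no longer masked out, so this is only reasonable when batches carry no padding or padding is handled some other way.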

arnocandel · 2023-04-20T21:27:33Z

Install GPT-NEOX

source ~/.bashrc.mamba
mamba create -n gptneox
conda activate gptneox
mamba install python=3.8 -y
mamba install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia -y
cd gpt-neox/
pip install -r requirements/requirements.txt
mamba install cudatoolkit-dev=11.7 cudatoolkit=11.7 -c conda-forge -c nvidia -y
unset CUDA_HOME
python ./megatron/fused_kernels/setup.py install
pip install -r ./requirements/requirements-flashattention.txt
cd ..
git clone https://github.com/EleutherAI/DeeperSpeed.git
cd DeeperSpeed
./install.sh
python prepare_data.py -d ./data
wget --cut-dirs=5 -nH -r --no-parent --reject "index.html*" https://the-eye.eu/public/AI/models/GPT-NeoX-20B/slim_weights/ -P 20B_checkpoints

Now one can train, fine-tune, and run inference with Flash attention by changing the neox config file to set the attention type to flash (the attention_config entry in the diff below).

diff --git a/configs/20B.yml b/configs/20B.yml
index 6595919..52dfbfb 100644
--- a/configs/20B.yml
+++ b/configs/20B.yml
@@ -14,12 +14,13 @@
   # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages
   # across the node boundaries )
   "pipe-parallel-size": 4,
-  "model-parallel-size": 2,
+  "model-parallel-size": 1,
 
   # model settings
   "num-layers": 44,
   "hidden-size": 6144,
   "num-attention-heads": 64,
+  "attention_config": [[["flash"], 44]],
   "seq-length": 2048,
   "max-position-embeddings": 2048,
   "norm": "layernorm",

The change to model-parallel-size is to use one pipeline per GPU, which is required to satisfy deepy.py.
Run generation like:

./deepy.py generate.py ./configs/20B.yml
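
For the attention_config entry in the config diff above, the compact form reads as a list of [pattern, repeat] pairs that gets flattened to one attention type per layer. The following is a sketch of that expansion under this reading of the format, not the gpt-neox implementation; `expand_attention_config` is a hypothetical helper name.

```python
def expand_attention_config(attention_config, num_layers):
    """Flatten [[pattern, repeat], ...] into one attention type per layer."""
    expanded = []
    for pattern, repeat in attention_config:
        # Repeat the whole pattern `repeat` times, e.g. ["flash"] * 44.
        expanded.extend(list(pattern) * repeat)
    if len(expanded) != num_layers:
        raise ValueError(
            f"attention_config covers {len(expanded)} layers, expected {num_layers}"
        )
    return expanded


# The value from the diff: every one of the 44 layers uses flash attention.
layers = expand_attention_config([[["flash"], 44]], num_layers=44)

# A mixed pattern, purely illustrative: alternate two types over 4 layers.
mixed = expand_attention_config([[["global", "flash"], 2]], num_layers=4)
```

The length check is the useful part in practice: the repeat count has to agree with num-layers (44 for the 20B config), so changing one without the other should fail early.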

Use fast attention with LLaMA from the Vicuna/FastChat repo:

Special transformers hash

Patch1

Patch2


arnocandel · 2023-05-11T21:14:31Z

Flash attention now native in Torch 2.0.1 for float16.
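
A minimal sketch of what that looks like, assuming the comment refers to `torch.nn.functional.scaled_dot_product_attention` (available since PyTorch 2.0). PyTorch picks the backend itself, so whether the flash kernel is actually used depends on the GPU, dtype and inputs; on CPU this falls back to another path and float32. The tensor sizes are arbitrary.

```python
import math

import torch
import torch.nn.functional as F

# Use float16 on GPU (where a fused kernel may be selected), float32 on CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

torch.manual_seed(0)
batch, heads, seq_len, head_dim = 2, 4, 16, 32
q = torch.randn(batch, heads, seq_len, head_dim, device=device, dtype=dtype)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Built-in attention with a causal mask, as used for decoder-only models.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Reference: softmax(Q K^T / sqrt(d)) V with the same causal mask, in float32.
qf, kf, vf = q.float(), k.float(), v.float()
scores = qf @ kf.transpose(-2, -1) / math.sqrt(head_dim)
causal = torch.ones(seq_len, seq_len, dtype=torch.bool, device=device).tril()
scores = scores.masked_fill(~causal, float("-inf"))
ref = torch.softmax(scores, dim=-1) @ vf

max_err = (out.float() - ref).abs().max().item()
```

Comparing against the plain formulation is a cheap way to confirm that swapping in the built-in call does not change the results beyond float16 rounding.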

pseudotensor force-pushed the main branch 7 times, most recently from edfa2ad to 1187a5c Compare April 21, 2023 11:35

pseudotensor changed the title ~~Flash attn~~ neox Flash attn Apr 29, 2023

arnocandel added 7 commits May 10, 2023 21:00

Add Flash attention code from https://github.com/HazyResearch/flash-attention/blob/main/tests/models/test_gpt_neox.py

8a1f5ce

Add instructions to install Apex.

cec2ed6

Update instructions for flash attention

6e163e2

WIP - nothing working yet. Disable mix_in by default.

ff4307f

Remove manual install of flash-attn.

92335ed

Upgrade requirements, fixes sm80 issue.

bd1b009

Rebase, rename llama_flash_attn -> flash_attn.

af4a98b

arnocandel force-pushed the flash-attn branch from 864fd5f to af4a98b Compare May 11, 2023 06:07

arnocandel added 4 commits May 10, 2023 23:46

WIP. Add back custom install for flash-attn.

b9dcb7d

Cleanup.

ee79bfa

Revert name change.

e286ec3

Revert more changes.

cc43cc5

arnocandel marked this pull request as ready for review May 11, 2023 21:14

arnocandel merged commit 1e1540e into main May 11, 2023

arnocandel mentioned this pull request May 12, 2023

Speed investigation on A100 for flash attention on/off #128

Closed
