build: install additional fms-acceleration plugins #350

Merged: 6 commits, Sep 26, 2024
README.md: 15 changes (13 additions & 2 deletions)
@@ -647,10 +647,10 @@ The list of configurations for various `fms_acceleration` plugins:
- [quantized_lora_config](./tuning/config/acceleration_configs/quantized_lora_config.py): For quantized 4bit LoRA training
- `--auto_gptq`: 4bit GPTQ-LoRA with AutoGPTQ
- `--bnb_qlora`: 4bit QLoRA with bitsandbytes
- [fused_ops_and_kernels](./tuning/config/acceleration_configs/fused_ops_and_kernels.py) (experimental):
- [fused_ops_and_kernels](./tuning/config/acceleration_configs/fused_ops_and_kernels.py):
- `--fused_lora`: fused lora for more efficient LoRA training.
- `--fast_kernels`: fast cross-entropy, rope, rms loss kernels.
- [attention_and_distributed_packing](./tuning/config/acceleration_configs/attention_and_distributed_packing.py) (experimental):
- [attention_and_distributed_packing](./tuning/config/acceleration_configs/attention_and_distributed_packing.py):
- `--padding_free`: technique to process multiple examples in single batch without adding padding tokens that waste compute.
- `--multipack`: technique for *multi-gpu training* to balance out number of tokens processed in each device, to minimize waiting time.

@@ -663,6 +663,7 @@ Notes:
- pass `--fast_kernels True True True` for full finetuning/LoRA
- pass `--fast_kernels True True True --auto_gptq triton_v2 --fused_lora auto_gptq True` for GPTQ-LoRA
- pass `--fast_kernels True True True --bitsandbytes nf4 --fused_lora bitsandbytes True` for QLoRA
- Note the list of supported models [here](https://github.com/foundation-model-stack/fms-acceleration/blob/main/plugins/fused-ops-and-kernels/README.md#supported-models).
* Notes on Padding Free
- works for both *single* and *multi-gpu*.
- works on both *pretokenized* and *untokenized* datasets.
@@ -671,6 +672,16 @@ Notes:
- works only for *multi-gpu*.
- currently only includes the version of *multipack* optimized for linear attention implementations like *flash-attn* (see the launch sketch after these notes).

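As a rough sketch only, the padding-free and multipack settings from these notes might be combined in a two-GPU run as follows; the `accelerate launch` setup, the `tuning/sft_trainer.py` entry point, and the model/data/output placeholders are assumptions here, while `--padding_free huggingface` and `--multipack 16` match the values shown in the JSON example further below.

```sh
# Hypothetical two-GPU launch: the entry point and the $MODEL_PATH/$DATA_PATH/
# $OUTPUT_DIR placeholders are assumptions; only --padding_free and --multipack
# take the values documented in the README notes above.
accelerate launch --num_processes 2 tuning/sft_trainer.py \
  --model_name_or_path "$MODEL_PATH" \
  --training_data_path "$DATA_PATH" \
  --output_dir "$OUTPUT_DIR" \
  --padding_free huggingface \
  --multipack 16
```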
Collaborator:
In line 653 we list `- attention_and_distributed_packing (experimental)`; we have marked it as experimental, but we are talking about releasing it to product with OpenShift 2.14. Is it still experimental, or ready for release? @fabianlim @anhuong

@anhuong (Collaborator, Author), Sep 25, 2024:
From an earlier conversation with Fabian, I believe I can mark these as ready (no longer experimental) in this PR as well. Will wait on @fabianlim to review.

Collaborator:
Padding free is already upstreamed to HF main. InstructLab is using multipack, and it has been tested for up to about 500K samples in the dataset. Beyond that, I am not aware of the speed performance of multipack, as it runs through the lengths of every example before the start of each epoch.

Collaborator (Author):
Is there any issue with including these new plugins in the product if the fused-ops-and-kernels plugin uses an Apache 2.0 license but contains code extracted from unsloth?

Collaborator:
@anhuong yes, that is a good point, thanks for bringing it up.

  • unsloth is Apache 2.0, but we were disturbed by those "comments" peppered in the code.
  • we only extracted part of the unsloth code, and we did the extraction on a version that existed before those "comments" appeared (as far as we could tell).
  • all extracted portions contained the relevant license notice headers credited to the owners of unsloth.

Beyond what we have done, I am not knowledgeable enough to say what is permissible and what is not; this needs someone knowledgeable in these matters to review it.

The peft plugin also contains a triton-only extraction of the ModelCloud fork of AutoGPTQ (see https://github.com/foundation-model-stack/fms-acceleration/tree/main/plugins/accelerated-peft#gptq-loras-autogptq---current-implementation-vs-legacy-implementation). The fork is released as Apache 2.0.

@wynterl, Sep 26, 2024:
@anhuong The code scan should pass with no issues regarding the inclusion of the new plugins, and as noted by @fabianlim, unsloth is Apache 2.0.

Note: To pass the above flags via a JSON config, each flag expects a mixed-type list, so the values must be given as a list. For example:

```json
{
"fast_kernels": [true, true, true],
"padding_free": ["huggingface"],
"multipack": [16],
"auto_gptq": ["triton_v2"]
}
```

Activate `TRANSFORMERS_VERBOSITY=info` to see the huggingface trainer printouts and verify that `AccelerationFramework` is activated!
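For instance, a single-GPU QLoRA invocation wiring the flags above together with the verbosity setting could look roughly like the sketch below; the entry point, the placeholder paths, and `--peft_method lora` are assumptions, while the acceleration flags follow the QLoRA combination from the notes above.

```sh
# Hypothetical single-GPU QLoRA run: the entry point, the placeholders, and
# --peft_method are assumptions; the acceleration flags mirror the
# "--fast_kernels True True True --bitsandbytes nf4 --fused_lora bitsandbytes True"
# combination documented above, and TRANSFORMERS_VERBOSITY=info surfaces the
# AccelerationFramework messages in the trainer logs.
TRANSFORMERS_VERBOSITY=info python tuning/sft_trainer.py \
  --model_name_or_path "$MODEL_PATH" \
  --training_data_path "$DATA_PATH" \
  --output_dir "$OUTPUT_DIR" \
  --peft_method lora \
  --fast_kernels True True True \
  --bitsandbytes nf4 \
  --fused_lora bitsandbytes True
```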

build/Dockerfile: 5 changes (5 additions & 0 deletions)
@@ -137,9 +137,14 @@ RUN --mount=type=cache,target=/home/${USER}/.cache/pip,uid=${USER_UID} \
python -m pip install --user "$(head bdist_name)" && \
python -m pip install --user "$(head bdist_name)[flash-attn]"

# fms_acceleration_peft = PEFT-training, e.g., 4bit QLoRA
# fms_acceleration_foak = Fused LoRA and triton kernels
# fms_acceleration_aadp = Padding-Free Flash Attention Computation
RUN if [[ "${ENABLE_FMS_ACCELERATION}" == "true" ]]; then \
python -m pip install --user "$(head bdist_name)[fms-accel]"; \
python -m fms_acceleration.cli install fms_acceleration_peft; \
python -m fms_acceleration.cli install fms_acceleration_foak; \
python -m fms_acceleration.cli install fms_acceleration_aadp; \
fi

RUN if [[ "${ENABLE_AIM}" == "true" ]]; then \
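To build an image with these plugins enabled, a build command might look roughly like the sketch below; the image tag is a placeholder, and it is assumed that `ENABLE_FMS_ACCELERATION` is exposed as a build `ARG` (it is referenced in the `RUN` conditional above).

```sh
# Hypothetical image build: the tag is a placeholder, and ENABLE_FMS_ACCELERATION
# is assumed to be a build ARG consumed by the RUN conditional shown above.
docker build -f build/Dockerfile \
  --build-arg ENABLE_FMS_ACCELERATION=true \
  -t fms-hf-tuning:accel .
```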
@@ -103,7 +103,7 @@ class AccelerationFrameworkConfig:
PaddingFree,
ConfigAnnotation(
path="training.attention",
experimental=True,
experimental=False,
required_packages=["aadp"],
),
] = None
@@ -112,7 +112,7 @@ class AccelerationFrameworkConfig:
MultiPack,
ConfigAnnotation(
path="training.dataloader",
experimental=True,
experimental=False,
required_packages=["aadp"],
),
] = None