
feat: Add deps to evaluate qLora tuned model #312

Merged

merged 12 commits from qlora_dev_deps into foundation-model-stack:main on Sep 16, 2024

Conversation

@aluu317 aluu317 (Collaborator) commented Aug 23, 2024

Description of the change

This PR:

  • adds a new subpackage gptq that includes the auto_gptq and optimum dependencies needed to load a quantized model, mostly for running inference against it.
  • updates run_inference.py to properly load a quantized model (see the sketch after this list).
  • updates the Dockerfile with an additional flag that enables gptq and installs the subpackage mentioned above.
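
A rough sketch of the loading path this adds (the function name and structure below are illustrative, not the exact run_inference.py code):

import os

from transformers import AutoModelForCausalLM, GPTQConfig

def load_model(model_path: str):
    # GPTQ-quantized checkpoints ship a quantize_config.json next to the weights.
    is_quantized = os.path.exists(os.path.join(model_path, "quantize_config.json"))

    # For quantized checkpoints, ask HF to run 4-bit GPTQ inference with the
    # exllama v2 kernel (this needs the auto_gptq and optimum extras added here);
    # otherwise load the model as usual.
    quantization_config = (
        GPTQConfig(bits=4, exllama_config={"version": 2}) if is_quantized else None
    )
    return AutoModelForCausalLM.from_pretrained(
        model_path,
        quantization_config=quantization_config,
    )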

Related issue number

How to verify the PR

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass

@aluu317 aluu317 force-pushed the qlora_dev_deps branch 3 times, most recently from 643b4db to af0b22e on August 26, 2024 17:58
@aluu317 aluu317 changed the title from "Add support to load qLora tuned model" to "feat: Add support to load qLora tuned model" on Aug 26, 2024
@aluu317 aluu317 force-pushed the qlora_dev_deps branch 4 times, most recently from 23c71a2 to 6bde694 on August 30, 2024 14:30
@aluu317 aluu317 changed the title from "feat: Add support to load qLora tuned model" to "feat: Add deps to evaluate qLora tuned model" on Aug 30, 2024
@aluu317 aluu317 marked this pull request as ready for review September 4, 2024 18:16
@anhuong anhuong requested review from fabianlim and removed request for alex-jw-brooks September 5, 2024 21:35
@anhuong anhuong (Collaborator) left a comment

Thanks Angel! Can you describe what testing you have done with this? And I left some questions that would be good to pose to Fabian and Aaron

pyproject.toml Outdated
@@ -45,6 +45,7 @@ dev = ["wheel>=0.42.0,<1.0", "packaging>=23.2,<25", "ninja>=1.11.1.1,<2.0", "sci
flash-attn = ["flash-attn>=2.5.3,<3.0"]
aim = ["aim>=3.19.0,<4.0"]
fms-accel = ["fms-acceleration>=0.1"]
gptq = ["auto_gptq>0.4.2", "optimum>=1.15.0"]
Collaborator
Let's call this gptq-dev instead, to note that these dependencies are not needed at training time but afterwards. Adding a note in the documentation that they are needed for HF loading and inference would also be useful.

build/Dockerfile Outdated
Comment on lines 149 to 152
RUN if [[ "${ENABLE_GPTQ}" == "true" ]]; then \
python -m pip install --user "$(head bdist_name)[gptq]"; \
fi

Collaborator

Although this is a nice flag to have, I don't think we need it, as users won't be enabling it. Since this is similar to a dev dependency, it's something users would have to install manually themselves if they want to use it. It doesn't hurt to have it here, but I wanted to note how it's different from aim and fms-acceleration.

)
if is_quantized:
    gptq_config = GPTQConfig(bits=4, exllama_config={"version": 2})
Collaborator

Can you describe what this is doing? It's loading with 4-bit GPTQ, but what is the exllama_config version? It looks like it's an exllama kernel?

Collaborator

Yes, this should be loading 4-bit GPTQ with exllama.

Collaborator

@aluu317 This should also be added so that we can load a base or fine-tuned quantized model for inference, not just adapters. This logic should also exist in the else case.

@aluu317 aluu317 (Collaborator Author) Sep 11, 2024

Done. I added the case for loading a base model and tested it with the following command, which did not throw an error:

python run_inference.py --text "This is a text" --use_flash_attn --model /testing/models/granite-34b-gptq

I'm not sure whether we can fine-tune a quantized model and then run inference on it; maybe we will have that support later?
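
For reference, a rough sketch of how the two loading cases might look (the helper name, the adapter_path argument, and the use of PeftModel here are assumptions for illustration, not the actual run_inference.py code):

from typing import Optional

from peft import PeftModel
from transformers import AutoModelForCausalLM, GPTQConfig

def load_for_inference(base_model_name_or_path: str, adapter_path: Optional[str], is_quantized: bool):
    # The quantization override is built the same way in both branches.
    gptq_config = (
        GPTQConfig(bits=4, exllama_config={"version": 2}) if is_quantized else None
    )
    base = AutoModelForCausalLM.from_pretrained(
        base_model_name_or_path, quantization_config=gptq_config
    )
    if adapter_path is None:
        # Base-model case, which the command above exercises: use the
        # (possibly quantized) checkpoint directly, with no adapter.
        return base
    # Tuned-adapter case: attach the PEFT adapter on top of the base model.
    return PeftModel.from_pretrained(base, adapter_path)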

Collaborator

Agreed, I'm not sure about fine-tuning quantized models... but it's nice to be able to run inference against base models.

attn_implementation="flash_attention_2"
if use_flash_attn
else None,
device_map="cuda:0",
Collaborator

Is this specifying GPU 0? Can this just be "cuda"? Why would we specify the GPU? Also, can quantized models be loaded on CPUs?

Collaborator

Yes, I think it's better to just set this to cuda. If you specifically set cuda:0, the model might only be loaded onto a single GPU, even if the machine has multiple GPUs.

Alternatively, you might want to consider device_map="auto", which will automatically fill up all the GPUs: https://huggingface.co/docs/accelerate/en/concept_guides/big_model_inference

I don't think quantized models can be loaded on CPUs without special libraries (like llama.cpp) that are not part of HF.

Collaborator Author

@fabianlim When I used device_map="auto", I saw this error from torch:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument weight in method wrapper_CUDA__native_layer_norm)

Am I missing something else?

@anhuong "cuda" works, updated!

if use_flash_attn
else None,
device_map="cuda:0",
torch_dtype=torch.float16 if use_flash_attn else None,
Collaborator

Does this always have to be torch.float16, and is it related to flash-attn being set? I know Fabian and Aaron set float16; is this something needed for loading quantized models?

I know in the fms-acceleration docs:

When setting --auto_gptq triton_v2 plus note to also pass --torch_dtype float16 and --fp16, or an exception will be raised. This is because these kernels only support this dtype.

Is this also true here for loading the model?

@achew010 achew010 (Contributor) Sep 6, 2024

@anhuong We are in the process of making a new plugin release, and we have re-opened an investigation into the exact conditions under which this is needed. We may have the flexibility to also tune with auto_gptq triton_v2 in bfloat16; we will report back as soon as we find out.

In the meantime, we have tested float16 extensively and can confirm the above setting works, but it would be nice if this restriction could also be relaxed.

@achew010 achew010 (Contributor) Sep 6, 2024

Since we don't use fms-acceleration for loading tuned models for evaluation, using the kernels available from HF/AutoGPTQ directly requires the dtype to be compatible with whichever kernel you use. In this case, exllama is an fp16 kernel and requires the dtype to be float16.

@fabianlim fabianlim (Collaborator) Sep 6, 2024

@anhuong sorry, I think our responses are a bit confusing:

  1. In the fms-accel docs, we set --auto_gptq triton_v2 because we are using different kernels for training (i.e., not exllama as in run_inference.py).
  2. When using --auto_gptq triton_v2, we specified in the docs to use float16, but this is under investigation and could be relaxed.
  3. When using exllama, you must use float16.
  4. However, the trained checkpoint, if loaded in some other inference framework (e.g., vLLM), can be loaded in other dtypes as long as they are supported.

To conclude, the dtype depends on which inference framework you load the checkpoint in. Here run_inference.py loads with exllama, so you must respect the dtype it supports.

Collaborator

Thank you, these details are very helpful for understanding this.

Collaborator

Based on the details above, it sounds like this should always be set to float16 even if flash-attn is not used, @aluu317.

Collaborator Author

Ahh yup, updated!

Comment on lines 187 to 190
is_quantized = os.path.exists(
    os.path.join(base_model_name_or_path, "quantize_config.json")
)
if is_quantized:
Collaborator

I would be interested to know whether these files are common across other quantized models and whether the configuration set below works for other quantized tuning techniques.

@achew010 achew010 (Contributor) Sep 6, 2024

The config here is specific to GPTQ, for setting the faster exllamav2 kernel. HF's from_pretrained will accept a quantization_config argument built from any of these configs, so it might be better to generalize this in the future.

Collaborator

@achew010 If they do not use anything other than GPTQ, then this is fine.
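
One possible direction for that generalization, sketched here with an assumed helper that reads the quant_method declared in the checkpoint's config.json (this is not part of this PR):

import json
import os

from transformers import AutoModelForCausalLM, GPTQConfig

def build_quantization_override(model_path: str):
    # Only GPTQ is handled explicitly, to switch on the exllama v2 kernel;
    # other quantization methods fall back to whatever the checkpoint's own
    # config.json already declares.
    config_path = os.path.join(model_path, "config.json")
    if not os.path.exists(config_path):
        return None
    with open(config_path, encoding="utf-8") as f:
        quant_cfg = json.load(f).get("quantization_config")
    if quant_cfg and quant_cfg.get("quant_method") == "gptq":
        return GPTQConfig(bits=quant_cfg.get("bits", 4), exllama_config={"version": 2})
    return None

model_path = "/testing/models/granite-34b-gptq"  # example path from the test command above
model = AutoModelForCausalLM.from_pretrained(
    model_path, quantization_config=build_quantization_override(model_path)
)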

@anhuong anhuong (Collaborator) left a comment

Changes look good! The only question I have is that I think torch_dtype=torch.float16 if use_flash_attn else None should just be torch_dtype=torch.float16 for the quantized-model loading case; please correct me if I'm wrong.
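
A sketch of that suggestion, pulling together the device_map and dtype points from the threads above (approximate shape, not the exact run_inference.py code):

import torch
from transformers import AutoModelForCausalLM, GPTQConfig

def load(model_path: str, is_quantized: bool, use_flash_attn: bool):
    return AutoModelForCausalLM.from_pretrained(
        model_path,
        attn_implementation="flash_attention_2" if use_flash_attn else None,
        device_map="cuda",
        # The exllama kernel is fp16-only, so quantized loading needs float16
        # regardless of whether flash-attn is enabled.
        torch_dtype=torch.float16 if (is_quantized or use_flash_attn) else None,
        quantization_config=(
            GPTQConfig(bits=4, exllama_config={"version": 2}) if is_quantized else None
        ),
    )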

@anhuong anhuong (Collaborator) left a comment

Thanks Angel!

@anhuong anhuong merged commit 5dd5494 into foundation-model-stack:main Sep 16, 2024
7 checks passed