
Bayesian optimization tool for mixed precision quantization #694

Merged: 15 commits into main, Aug 27, 2024

Conversation

Hanxian97
Contributor

@Hanxian97 Hanxian97 commented Aug 16, 2024

Summary:
This is a prototype that uses Bayesian Optimization to automatically decide mixed-precision quantization configurations, for Hanxian Huang's intern project (milestone 3).

Test Plan:
python BO_acc_modelsize.py
python BO_acc_throughput.py


pytorch-bot bot commented Aug 16, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/694

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 5e87c30 with merge base b523f9f:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@Hanxian97 Hanxian97 marked this pull request as draft August 16, 2024 21:56
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 16, 2024

quantize_(
    model.to(device=device),
    intN_weight_only(n=bit_width, group_size=groupsize),
)
Contributor

nit: we could use #672 as well, which would make the model size calculation easy: https://github.com/jerryzh168/ao/blob/e06c05a432c74dea395ce40dd8454d25fed7fac2/test/dtypes/test_uintx.py#L122, although I feel we could have an option to remove zero_point as well.

Comment on lines 93 to 119
def define_parameter_list():
    # define the search space
    parameter_choices_list = []
    for i in [2, 3, 4]:
        for j in [32, 64]:
            parameter_choices_list.append(str(i) + "_" + str(j))

    for i in [5, 6, 8]:
        for j in [32, 64, 128, 256]:
            parameter_choices_list.append(str(i) + "_" + str(j))

    # define the search space for all layers
    parameters_list = []
    # skip the first 3 and last 2 layers
    for i in range(3, 30):
        parameters_list.append(
            {
                "name": f".{i}.",
                "type": "choice",
                "value_type": "str",
                "values": parameter_choices_list,
                "is_ordered": False,
                "sort_values": False,
            }
        )

    return parameters_list


So this is rather problematic in that all parameters here are categorical (unordered discrete) parameters.

This means that the underlying surrogate model will not be able to exploit closeness in the parameter space, since everything is string-encoded. That is, we won't be able to use the information that we expect the model performance and size of a (2, 32) setting to be closer to that of a (3, 32) setting than to that of an (8, 32) setting. This information is completely lost to our surrogate models due to the string encoding.

Instead, you should just define two parameters for each layer, one bit_width_{i} parameter with values [2, 3, 4, 5, 6, 7, 8] and one groupsize_{i} parameter with values [32, 64, 128, 256].
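For reference, a minimal sketch of the suggested two-parameters-per-layer search space, assuming the same Ax-style parameter dicts as in the snippet above; the bitwidth.{i}. / groupsize.{i}. names and the exact value sets are illustrative, not the PR's final code:

def define_parameter_list():
    # one ordered bit-width and one ordered group-size parameter per layer,
    # still skipping the first 3 and last 2 layers
    parameters_list = []
    for i in range(3, 30):
        parameters_list.append(
            {
                "name": f"bitwidth.{i}.",
                "type": "choice",
                "value_type": "int",
                "values": [2, 3, 4, 5, 6, 8],
                "is_ordered": True,
                "sort_values": False,
            }
        )
        parameters_list.append(
            {
                "name": f"groupsize.{i}.",
                "type": "choice",
                "value_type": "int",
                "values": [32, 64, 128, 256],
                "is_ordered": True,
                "sort_values": False,
            }
        )
    return parameters_list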

Contributor Author

Thanks! updated the code accordingly

Contributor

@andrewor14 andrewor14 left a comment

Thanks Hanxian. It looks great overall. My comments are mostly around usability and readability. In your next PR, please highlight the results you got so far in the README.

@@ -29,7 +29,7 @@ def apply_intN_weight_only_quant_asym(weight):
     quant_max = 2**n-1
     eps = 1e-6
     preserve_zero = True
-    zero_point_dtype = torch.int64
+    zero_point_dtype = torch.int32
Contributor

what's the reason for this?

Contributor Author

From the results, I found that int32 is sufficient for uint2/3/5/6, so this saves some model size.
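For context, a rough illustration of the saving, assuming one zero point is stored per quantization group (illustrative arithmetic only):

# back-of-the-envelope zero-point storage overhead per weight
for group_size in [32, 64, 128, 256]:
    print(f"group_size={group_size}: int64 adds {64 / group_size} bits/weight, int32 adds {32 / group_size} bits/weight")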

if __name__ == '__main__':

    import argparse
    parser = argparse.ArgumentParser(description='Your CLI description.')
Contributor

Need to update this placeholder description (I noticed the same in all the mixed precision scripts).

Contributor Author

updated all


print("------Finish BO------")
for h in history:
print(h)
Contributor

would be good to save this in a file somewhere, and maybe let the user specify the path of this file as an arg option

Contributor

What's the format of history here? Does it make sense to structure this as a CSV?

Contributor Author

Made it save to a CSV file at a user-specified path.
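A minimal sketch of what that can look like, assuming history holds (eval_results, config) tuples as in the loop quoted below, and that the output path comes from a CLI argument (the save_history_to_csv name is an assumption):

import csv

def save_history_to_csv(history, output_path):
    # history entries are (eval_results, config) tuples
    with open(output_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["eval_results", "config"])
        for eval_results, config in history:
            writer.writerow([eval_results, config])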

for i in range(len(gpu_list)):
    current_trial_id, config, eval_results = return_dict[i]
    history.append((eval_results, config))
    ax_client.complete_trial(trial_index=current_trial_id,raw_data=eval_results,)
Contributor

nit: need space after each comma

Contributor Author

added space

print(best_parameters, values)

print("------history------")
print(ax_client.generation_strategy.trials_as_df)
Contributor

How many trials do we usually get here? It probably makes sense to save this to a file somewhere too, and just print summary statistics such as the best config, the number of trials needed to reach the best config, etc.

Contributor Author

Currently the history is saved to a CSV, and the best config over the whole history is printed.
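A sketch of persisting the trial history and printing a short summary, assuming an Ax AxClient as in the snippets above (the history_output_path variable is an assumption):

# dump the full trial history and print summary statistics (illustrative)
df = ax_client.generation_strategy.trials_as_df
df.to_csv(history_output_path, index=False)

best_parameters, values = ax_client.get_best_parameters()
print(f"num trials: {len(df)}")
print(f"best config: {best_parameters}")
print(f"best objective value: {values}")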

import argparse
parser = argparse.ArgumentParser(description='Your CLI description.')
parser.add_argument('--device', type=str, default="cuda", help='Device to use for evaluation')
parser.add_argument('--checkpoint', type=str, default="/home/hanxianhuang/ao/torchao/quantization/prototype/mixed_precision/checkpoints/meta-llama/Meta-Llama-3-8B", help='Path to load model')
Contributor

should probably just default to /tmp/Meta-Llama-3-8B

Contributor Author

changed the default path

parser.add_argument('--limit', type=int, default=None, help='Number of eval samples to evaluate')
parser.add_argument('--num_initial', type=int, default=50, help='Number of initial points sampled by sensitivity scores')
parser.add_argument('--num_trials', type=int, default=150, help='Number of trials to run BO')
parser.add_argument('--model_size_constraint', type=float, default=6.0, help='The model size constraint for BO')
Contributor

need to specify the unit (GB)

Contributor Author

added the unit

parser.add_argument('--device', type=str, default="cuda", help='Device to use for evaluation')
parser.add_argument('--checkpoint', type=str, default="/home/hanxianhuang/ao/torchao/quantization/prototype/mixed_precision/checkpoints/meta-llama/Meta-Llama-3-8B", help='Path to load model')
parser.add_argument('--limit', type=int, default=None, help='Number of eval samples to evaluate')
parser.add_argument('--num_initial', type=int, default=50, help='Number of initial points sampled by sensitivity scores')
Contributor

it's not clear what these two args do from their names (limit and num_initial). Can we call these something more specific?

Contributor Author

renamed the two args
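For illustration, the renamed flags could look something like this; the exact names below are assumptions (only --limit and --num_initial appear in the original, and num_BO_initial_samples shows up as a variable later in the script):

# hypothetical renamed arguments; names are illustrative only
parser.add_argument('--num_eval_samples', type=int, default=None,
                    help='Number of eval samples used to score each trial (was --limit)')
parser.add_argument('--num_BO_initial_samples', type=int, default=50,
                    help='Number of initial trials sampled from sensitivity scores (was --num_initial)')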


return parameters_list

# add initial search points based on the sensitivity score, need to automate this part
Contributor

Yeah agree with this. Please feel free to address this in a future PR since this is an optional step, but we should add a TODO to reformat this into a config file the user can specify
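One possible shape for such a user-specified config file, assuming JSON and the per-layer parameter names sketched earlier; AxClient.attach_trial is the standard Ax way to seed manually chosen points (file name and keys are illustrative):

import json

# initial_points.json (illustrative):
# [
#   {"bitwidth.3.": 8, "groupsize.3.": 32, "bitwidth.4.": 5, "groupsize.4.": 64},
#   {"bitwidth.3.": 6, "groupsize.3.": 64, "bitwidth.4.": 4, "groupsize.4.": 128}
# ]
def attach_initial_points(ax_client, config_path):
    with open(config_path) as f:
        initial_points = json.load(f)
    trial_indices = []
    for params in initial_points:
        # each attached trial still has to be evaluated and completed by the caller
        _, trial_index = ax_client.attach_trial(parameters=params)
        trial_indices.append(trial_index)
    return trial_indices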


return result["results"]["wikitext"]["word_perplexity,none"]

def cal_model_size(model, fqn_to_config):
Contributor

This looks pretty specific to llama3. Is there an easy way to make this more general? Otherwise not a big deal

Contributor Author

@Hanxian97 Hanxian97 Aug 27, 2024

It currently works for transformers and ViTs when quantizing linear layers (attention and MLP layers); other layers are counted as 16-bit (since the quantize_ API requires bf16 input).
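A rough sketch of how such a calculation can work under those assumptions; the fqn_to_config layout (layer-name substrings mapped to (bit_width, group_size)) and the scale/zero-point bookkeeping are simplifications for illustration:

import torch

def cal_model_size_gb(model, fqn_to_config):
    # illustrative estimate: quantized linear weights at their configured
    # bit width, everything else counted as 16-bit
    total_bits = 0
    for name, param in model.named_parameters():
        config = next((cfg for fqn, cfg in fqn_to_config.items() if fqn in name), None)
        if config is not None and name.endswith("weight"):
            bit_width, group_size = config
            num_groups = param.numel() // group_size
            # packed weights plus one bf16 scale and one int32 zero point per group
            total_bits += param.numel() * bit_width + num_groups * (16 + 32)
        else:
            total_bits += param.numel() * 16
    return total_bits / 8 / 1024**3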

default_device = 'cuda' if torch.cuda.is_available() else 'cpu'

# quantize a model based on a given quantization configuration
def quantize_by_fqn_to_config(model, device, fqn_to_config):
Contributor

Please refactor these two scripts to reuse as much code as possible! I think this (and many other functions) are duplicated right now

Contributor Author

Made some of them into utils functions.
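As an illustration of the kind of helper both scripts can share, a sketch built on the quantize_ / intN_weight_only calls quoted earlier; the import path for intN_weight_only and the fqn_to_config layout are assumptions:

import torch
from torchao.quantization import quantize_
# intN_weight_only is the prototype helper quoted above; this import path is assumed
from naive_intNwo import intN_weight_only

def quantize_by_fqn_to_config(model, device, fqn_to_config):
    # apply a per-layer (bit_width, group_size) config to every matching linear layer
    for fqn, (bit_width, group_size) in fqn_to_config.items():
        quantize_(
            model.to(device=device),
            intN_weight_only(n=bit_width, group_size=group_size),
            filter_fn=lambda module, module_fqn, target=fqn: isinstance(module, torch.nn.Linear) and target in module_fqn,
        )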

@@ -0,0 +1,419 @@
import sys
Contributor

Should we call this BO_model_size.py to be consistent with the other script (BO_inference_speed.py), since both are technically optimizing for accuracy?

Contributor Author

renamed to BO_acc_modelsize.py and BO_acc_throughput.py

name = "test_quantize_BO",
objectives = {"cal_PPL": ObjectiveProperties(minimize=True)},
choose_generation_strategy_kwargs = {
"num_BO_initial_samplesization_trials": num_BO_initial_samples, # the number of trials to build generation strategy


This doesn't look like a correct kwarg? Did you mean num_initialization_trials?

Contributor Author

fixed it!
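For reference, a sketch of the corrected call, assuming the parameter list defined earlier and the script's own num_BO_initial_samples variable:

from ax.service.ax_client import AxClient, ObjectiveProperties

ax_client = AxClient()
ax_client.create_experiment(
    name="test_quantize_BO",
    parameters=parameters_list,
    objectives={"cal_PPL": ObjectiveProperties(minimize=True)},
    choose_generation_strategy_kwargs={
        # number of quasi-random initialization trials before the BO model takes over
        "num_initialization_trials": num_BO_initial_samples,
    },
)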

Comment on lines 280 to 282
# Wait for all processes to finish
for p in processes:
    p.join()


outside the scope of this PR, but in the future we may want to consider a pattern here that completes a trial + generates a new config + launches a new trial whenever one of the running processes is complete. This is particularly important if there is large heterogeneity in the evaluation times between trials.
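For the record, a rough sketch of that complete-and-relaunch pattern, assuming one evaluation process per GPU and a worker eval_fn(gpu_id, trial_index, params, return_dict) like the ones in these scripts; all names here are illustrative:

import multiprocessing as mp
import time

def run_bo_async(ax_client, gpu_list, eval_fn, num_trials):
    # keep every GPU busy: complete a trial and launch a new one as soon as
    # any worker process finishes, instead of joining the whole batch
    manager = mp.Manager()
    return_dict = manager.dict()   # trial_index -> eval_results
    running = {}                   # gpu_id -> (process, trial_index)
    launched = completed = 0

    while completed < num_trials:
        # launch a new trial on every idle GPU
        for gpu_id in gpu_list:
            if gpu_id not in running and launched < num_trials:
                params, trial_index = ax_client.get_next_trial()
                p = mp.Process(target=eval_fn, args=(gpu_id, trial_index, params, return_dict))
                p.start()
                running[gpu_id] = (p, trial_index)
                launched += 1
        # complete whichever trials have finished, freeing their GPUs
        for gpu_id, (p, trial_index) in list(running.items()):
            if not p.is_alive():
                p.join()
                ax_client.complete_trial(trial_index=trial_index, raw_data=return_dict[trial_index])
                del running[gpu_id]
                completed += 1
        time.sleep(10)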

Contributor

@andrewor14 andrewor14 left a comment

Thanks!

@Hanxian97 Hanxian97 marked this pull request as ready for review August 27, 2024 17:25
@andrewor14
Contributor

Thanks! I'm merging this. Please address the TODOs in future PRs.

@andrewor14 andrewor14 merged commit 31ff4d9 into main Aug 27, 2024
16 checks passed