Bayesian optimization tool for mixed precision quantization #694
Conversation
✅ No failures as of commit 5e87c30 with merge base b523f9f. See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/694. Note: links to docs will display an error until the docs builds have been completed. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
quantize_(
    model.to(device=device),
    intN_weight_only(n=bit_width, group_size=groupsize),
nit: we could use #672 here as well; that would make the model size calculation easy: https://github.com/jerryzh168/ao/blob/e06c05a432c74dea395ce40dd8454d25fed7fac2/test/dtypes/test_uintx.py#L122. I also feel we could have an option to remove the zero_point.
def define_parameter_list():
    # define the search space
    parameter_choices_list = []
    for i in [2, 3, 4]:
        for j in [32, 64]:
            parameter_choices_list.append(str(i) + "_" + str(j))

    for i in [5, 6, 8]:
        for j in [32, 64, 128, 256]:
            parameter_choices_list.append(str(i) + "_" + str(j))

    # define the search space for all layers
    parameters_list = []
    # skip the first 3 and last 2 layers
    for i in range(3, 30):
        parameters_list.append(
            {
                "name": f".{i}.",
                "type": "choice",
                "value_type": "str",
                "values": parameter_choices_list,
                "is_ordered": False,
                "sort_values": False,
            }
        )

    return parameters_list
So this is rather problematic in that all parameters here are categorical (unordered discrete) parameters. This means that the underlying surrogate model will not be able to exploit closeness in the parameter space, since everything is string encoded. That is, we won't be able to use the information that we expect the model performance and size of a (2, 32) setting to be closer to that of a (3, 32) setting than to that of an (8, 32) setting. This information is completely lost to our surrogate models due to the string encoding. Instead, you should just define two parameters for each layer: one bit_width_{i} parameter with values [2, 3, 4, 5, 6, 7, 8] and one groupsize_{i} parameter with values [32, 64, 128, 256].
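To make the suggestion concrete, a minimal sketch of the two-parameter encoding in the Ax parameter format used above (the names follow the comment; this is illustrative, not the code that was actually merged):

def define_parameter_list():
    # One ordered bit-width parameter and one ordered group-size parameter per
    # layer, so the surrogate model can exploit closeness in the search space.
    parameters_list = []
    # skip the first 3 and last 2 layers, as in the original script
    for i in range(3, 30):
        parameters_list.append(
            {
                "name": f"bit_width_{i}",
                "type": "choice",
                "value_type": "int",
                "values": [2, 3, 4, 5, 6, 7, 8],
                "is_ordered": True,
                "sort_values": False,
            }
        )
        parameters_list.append(
            {
                "name": f"groupsize_{i}",
                "type": "choice",
                "value_type": "int",
                "values": [32, 64, 128, 256],
                "is_ordered": True,
                "sort_values": False,
            }
        )
    return parameters_list

One thing this sketch drops is the original coupling (bit widths 2-4 were only paired with group sizes 32/64); if that restriction matters, it would have to be re-imposed, e.g. via a parameter constraint or by filtering configs during evaluation.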
Thanks! updated the code accordingly
Thanks Hanxian. It looks great overall. My comments are mostly around usability and readability. In your next PR, please highlight the results you got so far in the README.
@@ -29,7 +29,7 @@ def apply_intN_weight_only_quant_asym(weight):
     quant_max = 2**n-1
     eps = 1e-6
     preserve_zero = True
-    zero_point_dtype = torch.int64
+    zero_point_dtype = torch.int32
what's the reason for this?
I found from the results that int32 is sufficient for uint2/3/5/6, so this saves some model size.
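For context, the per-weight overhead of the zero_point dtype is easy to estimate (rough illustrative arithmetic, not numbers from the PR):

# One zero_point per group of weights: overhead in bits per weight = zp_bits / group_size.
for zp_bits in (64, 32):               # torch.int64 vs torch.int32
    for group_size in (32, 64, 128, 256):
        print(f"zero_point {zp_bits}-bit, group_size {group_size}: "
              f"+{zp_bits / group_size:.2f} bits per weight")
# e.g. at group_size=32, int64 adds 2.0 bits/weight vs 1.0 for int32, which is
# noticeable when the quantized weights themselves are only 2-4 bits.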
if __name__ == '__main__':

    import argparse
    parser = argparse.ArgumentParser(description='Your CLI description.')
Need to update this description (I also noticed the same placeholder in all the mixed precision scripts).
updated all
print("------Finish BO------")
for h in history:
    print(h)
It would be good to save this to a file somewhere, and maybe let the user specify the path of this file as an arg option.
What's the format of history here? Does it make sense to structure this as a CSV?
Made it save to a CSV file at a user-specified path.
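A minimal sketch of what that could look like (the flag name and the eval_results keys are assumptions, not necessarily what the script uses):

import csv

# added to the existing argparse parser in the script
parser.add_argument('--history_output', type=str, default="BO_acc_modelsize_history.csv",
                    help='Path of the CSV file to save all BO trial history')

def write_history_to_csv(history, output_path):
    # history is a list of (eval_results, config) tuples collected during the BO loop
    with open(output_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["cal_PPL", "model_size", "quant_config"])
        for eval_results, config in history:
            writer.writerow([eval_results["cal_PPL"], eval_results["model_size"], config])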
for i in range(len(gpu_list)):
    current_trial_id, config, eval_results = return_dict[i]
    history.append((eval_results, config))
    ax_client.complete_trial(trial_index=current_trial_id,raw_data=eval_results,)
nit: need space after each comma
added space
print(best_parameters, values)

print("------history------")
print(ax_client.generation_strategy.trials_as_df)
How many trials do we usually get here? Probably makes sense to save this as a file somewhere too, and just print the summary statistics, such as the best config, the number of trials to reach the best config, etc.
Currently the history is saved to a CSV file, and the best config over all the history is printed.
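For illustration, one way that could look (the output filename is hypothetical; trials_as_df is a pandas DataFrame and get_best_parameters is the standard AxClient call):

# Persist the full trial table and report the best configuration.
trials_df = ax_client.generation_strategy.trials_as_df
trials_df.to_csv("BO_trials_history.csv", index=False)

best_parameters, values = ax_client.get_best_parameters()
print("Total trials:", len(trials_df))
print("Best config:", best_parameters)
print("Best objective values:", values)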
import argparse
parser = argparse.ArgumentParser(description='Your CLI description.')
parser.add_argument('--device', type=str, default="cuda", help='Device to use for evaluation')
parser.add_argument('--checkpoint', type=str, default="/home/hanxianhuang/ao/torchao/quantization/prototype/mixed_precision/checkpoints/meta-llama/Meta-Llama-3-8B", help='Path to load model')
should probably just default to /tmp/Meta-Llama-3-8B
changed the default path
parser.add_argument('--limit', type=int, default=None, help='Number of eval samples to evaluate')
parser.add_argument('--num_initial', type=int, default=50, help='Number of initial points sampled by sensitivity scores')
parser.add_argument('--num_trials', type=int, default=150, help='Number of trials to run BO')
parser.add_argument('--model_size_constraint', type=float, default=6.0, help='The model size constraint for BO')
need to specify the unit (GB)
added the unit
parser.add_argument('--device', type=str, default="cuda", help='Device to use for evaluation')
parser.add_argument('--checkpoint', type=str, default="/home/hanxianhuang/ao/torchao/quantization/prototype/mixed_precision/checkpoints/meta-llama/Meta-Llama-3-8B", help='Path to load model')
parser.add_argument('--limit', type=int, default=None, help='Number of eval samples to evaluate')
parser.add_argument('--num_initial', type=int, default=50, help='Number of initial points sampled by sensitivity scores')
It's not clear what these two args do from their names (limit and num_initial). Can we call these something more specific?
renamed the two args
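For illustration, more self-describing names could look like this (these exact flag names are hypothetical, not necessarily the ones adopted in the PR):

parser.add_argument('--num_eval_samples', type=int, default=None,
                    help='Number of eval samples to use (None = evaluate on the full set)')
parser.add_argument('--num_BO_initial_samples', type=int, default=50,
                    help='Number of initial BO points sampled from the sensitivity scores')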
    return parameters_list

# add initial search points based on the sensitivity score; need to automate this part
Yeah, agreed. Please feel free to address this in a future PR since this is an optional step, but we should add a TODO to reformat this into a config file the user can specify.
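As a hedged sketch of that TODO, the sensitivity-based points could be read from a user-specified JSON file and attached as initial trials (the file format and helper name are assumptions; attach_trial is the standard AxClient method):

import json

def attach_initial_points(ax_client, config_path):
    # config_path points to a JSON list of parameter dicts, e.g.
    # [{"bit_width_3": 4, "groupsize_3": 64, ...}, ...]
    with open(config_path) as f:
        initial_points = json.load(f)
    for parameters in initial_points:
        _, trial_index = ax_client.attach_trial(parameters=parameters)
        # each attached trial still needs to be evaluated and completed by the caller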
    return result["results"]["wikitext"]["word_perplexity,none"]

def cal_model_size(model, fqn_to_config):
This looks pretty specific to llama3. Is there an easy way to make this more general? Otherwise not a big deal
Made it a TODO. Jerry pointed out #672 and https://github.com/jerryzh168/ao/blob/e06c05a432c74dea395ce40dd8454d25fed7fac2/test/dtypes/test_uintx.py#L122.
But it currently works for transformers and ViTs when quantizing linear layers (attention and MLP layers); other layers are counted as 16-bit (since the quantize_ API requires bf16 input).
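For reference, a rough model-agnostic version of that size estimate might look like the sketch below. The fqn_to_config format (name -> (bit_width, group_size)) and the bf16-scale / int32-zero_point overhead terms are assumptions for illustration, not the PR's actual implementation:

import torch

def cal_model_size_gb(model, fqn_to_config):
    # Quantized nn.Linear weights are counted at their assigned bit width, plus a
    # bf16 scale and an int32 zero_point per group; everything else is counted as 16-bit.
    total_bits = 0
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear) and name in fqn_to_config:
            bit_width, group_size = fqn_to_config[name]
            numel = module.weight.numel()
            num_groups = numel // group_size
            total_bits += numel * bit_width
            total_bits += num_groups * (16 + 32)  # per-group scale + zero_point
            if module.bias is not None:
                total_bits += module.bias.numel() * 16
        else:
            # direct parameters of non-quantized modules, counted once
            for p in module.parameters(recurse=False):
                total_bits += p.numel() * 16
    return total_bits / 8 / (1024 ** 3)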
default_device = 'cuda' if torch.cuda.is_available() else 'cpu'

# quantize a model based on a given quantization configuration
def quantize_by_fqn_to_config(model, device, fqn_to_config):
Please refactor these two scripts to reuse as much code as possible! I think this function, and many others, are duplicated right now.
Made some of them into utils functions.
@@ -0,0 +1,419 @@
import sys
Should we call this BO_model_size.py to be consistent with the other script (BO_inference_speed.py)? Technically both are optimizing for accuracy.
renamed to BO_acc_modelsize.py and BO_acc_throughput.py
    name = "test_quantize_BO",
    objectives = {"cal_PPL": ObjectiveProperties(minimize=True)},
    choose_generation_strategy_kwargs = {
        "num_BO_initial_samplesization_trials": num_BO_initial_samples,  # the number of trials to build generation strategy
This doesn't look like a correct kwarg? Did you mean num_initialization_trials?
fixed it!
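For reference, a sketch of the corrected call with the real kwarg. Names like parameters_list and num_BO_initial_samples come from the surrounding script; this is illustrative rather than a verbatim copy of the fix:

from ax.service.ax_client import AxClient
from ax.service.utils.instantiation import ObjectiveProperties

ax_client = AxClient()
ax_client.create_experiment(
    name="test_quantize_BO",
    parameters=parameters_list,
    objectives={"cal_PPL": ObjectiveProperties(minimize=True)},
    choose_generation_strategy_kwargs={
        # number of quasi-random initialization trials before the BO model takes over
        "num_initialization_trials": num_BO_initial_samples,
    },
)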
# Wait for all processes to finish
for p in processes:
    p.join()
Outside the scope of this PR, but in the future we may want to consider a pattern here that completes a trial, generates a new config, and launches a new trial whenever one of the running processes finishes. This is particularly important if there is large heterogeneity in evaluation times between trials.
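To make that pattern concrete, a hedged sketch of an asynchronous loop: one evaluation in flight per GPU, and whenever one finishes, the trial is completed and a new one launched. eval_in_subprocess(config, gpu_id) is a hypothetical stand-in for the PR's per-GPU worker:

from concurrent.futures import ProcessPoolExecutor, FIRST_COMPLETED, wait

def run_async(ax_client, gpu_list, num_trials, eval_in_subprocess):
    finished = 0
    with ProcessPoolExecutor(max_workers=len(gpu_list)) as pool:
        # launch one trial per GPU to start
        running = {}
        for gpu_id in gpu_list:
            config, trial_index = ax_client.get_next_trial()
            running[pool.submit(eval_in_subprocess, config, gpu_id)] = (trial_index, gpu_id)
        while running and finished < num_trials:
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for fut in done:
                trial_index, gpu_id = running.pop(fut)
                ax_client.complete_trial(trial_index=trial_index, raw_data=fut.result())
                finished += 1
                # immediately refill the freed GPU if the trial budget allows
                if finished + len(running) < num_trials:
                    config, next_index = ax_client.get_next_trial()
                    running[pool.submit(eval_in_subprocess, config, gpu_id)] = (next_index, gpu_id)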
Thanks!
Thanks! I'm merging this. Please address the TODOs in future PRs.
`torch.zero` -> `torch.zeros`, as the former does not exist. Also, catch only `RuntimeError` to avoid masking such typos in the future.
Summary:
This is a prototype for using Bayesian optimization to automatically decide mixed-precision quantization configurations, for Hanxian Huang's intern project milestone 3.
Test Plan:
python BO_acc_modelsize.py
python BO_acc_throughput.py