[QST] Scale factors and benchmarks #2

Closed
jeromeku opened this issue Jun 19, 2024 · 30 comments

Comments

@jeromeku

Great paper and thanks for open sourcing the code.

A couple questions:

  1. Is the benchmarking code in section 4 of the paper available (GEMM, FastFP16toInt8)?
  2. In the per-group W4A8 kernel, why is there a need for an additional channel-wise scale factor in FusedDequantQuant? I.e., the Int4 weights are dequantized to FP16 using group-wise scale factors, then quantized to Int8 using an additional channel-wise scale then fed to Int8 GEMM. In contrast, in the channel-wise W4A8 kernel, the Int4 weights are directly converted to Int8 then fed to Int8 GEMM.
@HandH1998
Owner

@jeromeku
Reply to your questions:

  1. For the w4a8 GEMM benchmark, you can try bench_w4a8.py in my repo https://github.com/HandH1998/marlin/tree/w4a8. For the FastFP16toInt8 benchmark, I put an older version of the GEMM code in this gist: https://gist.github.com/HandH1998/b96922e0a0ab7da769fd93e34ffb068a. It is the baseline that uses the traditional instruction to convert FP16 to INT8. You can drop it into https://github.com/HandH1998/marlin/tree/w4a8 and run the benchmark.
  2. One weight channel carries multiple per-group scales, which are not directly compatible with a standard GEMM, so we have to convert INT4 to FP16 and then to INT8. With a per-channel scale, the scale can be factored out of the GEMM as s_a * A * W * s_w, so the more complicated per-group conversion is not needed.
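
To illustrate the point about scales, here is a minimal pure-Python sketch. It is not the QQQ kernel; every value and name (`a_q`, `w_q`, `s_g`, `group_size`) is made up for illustration. It shows that a single per-channel scale can be pulled out of the integer dot product, while per-group scales change along the reduction axis and cannot be applied as one rescale after a standard integer GEMM.

```python
# Illustration only: one token (row of A) times one output channel (column of W).

def dot(xs, ys):
    """Plain dot product, standing in for one output element of a GEMM."""
    return sum(x * y for x, y in zip(xs, ys))

a_q = [3, -7, 12, 5]   # quantized activations (INT8 range), made-up values
w_q = [1, -2, 4, -8]   # quantized weights (INT4 range), made-up values
s_a = 0.5              # per-token activation scale
s_w = 0.25             # per-channel weight scale

# Per-channel: the scale is constant along the reduction axis, so it factors out.
reference = dot([s_a * a for a in a_q], [s_w * w for w in w_q])
factored = s_a * s_w * dot(a_q, w_q)   # integer dot product, one rescale after

# Per-group: the scale changes every `group_size` elements of the reduction axis.
group_size = 2
s_g = [0.25, 0.5]      # one scale per group, made-up values
w_deq = [s_g[k // group_size] * w for k, w in enumerate(w_q)]
grouped_reference = dot([s_a * a for a in a_q], w_deq)

# Each group needs its own partial sum before its scale can be applied.
partials = [
    dot(a_q[g * group_size:(g + 1) * group_size],
        w_q[g * group_size:(g + 1) * group_size])
    for g in range(len(s_g))
]
grouped_from_partials = s_a * sum(s * p for s, p in zip(s_g, partials))

print(reference, factored)                       # equal
print(grouped_reference, grouped_from_partials)  # equal
# No single group scale reproduces the grouped result from the full dot product.
print([s_a * s * dot(a_q, w_q) for s in s_g])
```

The per-group case is why the weights are first brought to a common scale per channel before being fed to the INT8 GEMM, as described in the answer above.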

@jeromeku
Author

jeromeku commented Jun 20, 2024

@HandH1998

Many thanks for the response!

Do you have the script used to test against other methods? Especially interested in reproducing the results against QoQ.

Also can't seem to find the FastINT4toINT8 conversion function when converting from int4 -> int8.

@HandH1998
Owner

You can reproduce the QQQ results by following the Usage section of Readme.md. For the FastINT4toINT8 conversion, see Section 3.3.1 of our paper. It simply performs a left shift by 4 bits to convert INT4 to INT8, in this line: https://github.com/HandH1998/QQQ/blob/49f06e0b47c606ca2c5558ade0805b0609d57a8f/csrc/qqq_gemm.cu#L540.
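
As a rough illustration of the left-shift idea (a sketch in Python, not the CUDA code at the linked line), the snippet below assumes the 4-bit weight is stored in two's complement. The helper names `int4_nibble_to_int8` and `signed_int4` are hypothetical. Shifting the nibble into the high half of a byte yields an INT8 value that is 16 times the INT4 value, so the factor of 16 has to be compensated elsewhere, for example in the weight scale (an assumption of this sketch).

```python
def signed_int4(nibble: int) -> int:
    """Interpret the low 4 bits as a two's-complement signed value in [-8, 7]."""
    nibble &= 0xF
    return nibble - 16 if nibble >= 8 else nibble


def int4_nibble_to_int8(nibble: int) -> int:
    """Shift the nibble left by 4 bits and reinterpret the byte as signed INT8."""
    byte = (nibble & 0xF) << 4
    return byte - 256 if byte >= 128 else byte


# The shifted value equals 16 * (signed INT4 value) for every possible nibble.
for n in range(16):
    print(n, signed_int4(n), int4_nibble_to_int8(n))
```

Because the sign bit of the nibble lands in the sign bit of the byte, no separate sign extension is needed.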

@brisker

brisker commented Jun 25, 2024

@HandH1998
Does QQQ use dynamic or static quantization for activations? (You only mention that it is per-token quantization.)

@HandH1998
Owner

@brisker dynamic quantization

@brisker

brisker commented Jun 25, 2024

@HandH1998
I noticed you compared your accuracy with QServe, but QServe is w4a8 with kv4, while QQQ seems to use an fp16 KV cache. Is this comparison fair?

@HandH1998
Owner

@brisker As QServe doesn't offer a precision of w4a8f16, we directly compare QQQ with QServe using w4a8kv4. On the other hand, QServe employs various techniques to mitigate the impact of kv4. According to their paper, SmoothAttention reduces perplexity by 0.05 without adding system overhead. Progressive group quantization further improves perplexity by an additional 0.02, with only a negligible increase in dequantization overhead. Lastly, activation-aware channel reordering enhances perplexity by 0.03. As illustrated in the following figure, the ablation study shows kv4 only increases perplexity by 0.04 compared to kv8 with these techniques. As we know, kv8 can deliver performance almost identical to fp16 kv cache, so the impact of kv4 is negligible.
image

@brisker

brisker commented Jun 26, 2024

@HandH1998
The speedup of QQQ w4a8g128 over marlin w4a16g128 seems very limited. I think this may be due to QQQ's fp16 KV cache. Any plan to try QQQ w4a8g128-kv8?

@HandH1998
Owner

HandH1998 commented Jun 26, 2024

We think the speedup of QQQ w4a8g128 is limited by the high dtype conversion overhead between FP16 and INT8, as shown in the following picture. QQQ only focuses on weight quantization, and we don't plan to develop w4a8g128-kv8. Replacing the FP16 KV cache with KV8 can increase compute throughput at large batch sizes, but it is not effective at small batch sizes. If you want to try QQQ with a low-bit KV cache, we recommend our vLLM PR, which provides an fp8 KV cache.
image

@AniZpZ
Collaborator

AniZpZ commented Jun 26, 2024

Thank you for your advice!
Currently, prefill speed is more essential for most inference cases, while KV cache quantization lifts decode speed. KV8 has now been well solved, and you are welcome to combine QQQ with KV cache quantization methods!

@brisker

brisker commented Jun 27, 2024

@AniZpZ
@HandH1998
In the figure of your paper, there is w8a8 inference speed. Is this w8a8 inference speed tested on vllm? Which version of vllm?
Besides, why is w8a8 even slower than fp16 in your figure?
image

@HandH1998
Owner

HandH1998 commented Jun 27, 2024

@brisker We developed a new version based on this PR to support dynamic activation per-token quantization. We think the online activation quantization will introduce additional overhead, resulting in slower inference speed compared to FP16 at smaller batch sizes. However, as the batch size increases, the scenario becomes compute-bound, and w8a8 is likely to outperform other quantization methods.
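
The batch-size argument can be made concrete with a back-of-the-envelope arithmetic-intensity estimate. The sketch below is an illustration under a deliberately simplified traffic model: each operand is read or written once, and caches and the cost of online quantization are ignored. The function name `arithmetic_intensity` and the 4096 x 4096 layer shape are assumptions, not values from the paper.

```python
def arithmetic_intensity(m, k, n, weight_bits, act_bits, out_bits=16):
    """FLOPs per byte moved for an (m x k) @ (k x n) GEMM.

    Simplified model: weights, activations and outputs are each moved once.
    """
    flops = 2 * m * k * n
    bytes_moved = (k * n * weight_bits + m * k * act_bits + m * n * out_bits) / 8
    return flops / bytes_moved


k = n = 4096  # hypothetical layer shape
for m in (1, 16, 256, 4096):  # m plays the role of batch size (tokens per step)
    fp16 = arithmetic_intensity(m, k, n, weight_bits=16, act_bits=16)
    w8a8 = arithmetic_intensity(m, k, n, weight_bits=8, act_bits=8)
    print(f"m={m:5d}  fp16: {fp16:8.1f} FLOP/B  w8a8: {w8a8:8.1f} FLOP/B")
```

At small m the weight traffic dominates and the intensity is low, so the GEMM is memory-bound and any extra per-token work is relatively expensive. As m grows, the intensity rises and the faster INT8 math matters more.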

@brisker

brisker commented Jun 27, 2024

@HandH1998
And the fp16 speed in the figure is the vLLM fp16 speed (already using paged attention and other acceleration methods), not the Hugging Face PyTorch inference speed, right?

@HandH1998
Owner

Yes.

@brisker

brisker commented Jun 29, 2024

@HandH1998
@AniZpZ

  1. In the PR you mentioned, how do I save the model in the corresponding w4a8 format to test the w4a8 GEMM? Is it identical to the gptq-marlin w4 storage format?
  2. I use the default code and configs in this repo, except that I commented out these two lines (https://github.com/HandH1998/QQQ/blob/main/examples/quant_model.py#L70 and https://github.com/HandH1998/QQQ/blob/main/examples/quant_model.py#L61; otherwise I get a NaN loss), then quantized Llama2-7B to get the quantized model. I use something like this to evaluate w4a8 and fp16 inference speed:
    import time
    import torch
    from transformers import AutoModelForCausalLM

    kwargs = {"torch_dtype": torch.float16, "device_map": "auto", "attn_implementation": "eager"}
    fp16_model = AutoModelForCausalLM.from_pretrained(
        args.model_path, trust_remote_code=True, **kwargs
    )
    model = fp16_model  # or the w4a8 quantized model generated by QQQ
    # `args` and `inputs` (the tokenized prompt) come from the rest of the script.
    time1 = time.time()
    output_ids = model.generate(**inputs, max_new_tokens=args.max_new_tokens)
    time2 = time.time()
    print(f"decoding time: {time2 - time1}")

But the w4a8 inference time is nearly double that of fp16. Is there a bug in this repo? (The NaN loss during w4a8 quantization is also odd.)

fp16 decoding time: 3.2025535106658936
w4a8 decoding time: 5.649582147598267

@HandH1998
Owner

@brisker
Response to your questions:
1. We use examples/quant_model.py to export the model in the w4a8 format. The corresponding code for this format is in QQQ/gptq/qlinear/qlinear_marlin.py. Note that this format is not identical to the gptq-marlin format.
2. For the NaN issue, try changing calibrate_path in https://github.com/HandH1998/QQQ/blob/main/quant_config/llama/w4a8.yaml to your pile dataset directory.
The evaluation script you used is similar to our examples/test_model.py, which only uses the w4a8 GEMM without other optimizations such as kernel fusion. To get the speedup you are looking for, use our vLLM PR. This repository focuses on exporting quantized models and evaluating accuracy rather than on speeding up inference directly.

@brisker

brisker commented Jul 1, 2024

@HandH1998

  1. I already use my own pile dataset directory.
  2. Even without kernel fusion, paged attention, etc., why is the w4a8 GEMM slower than fp16?

@HandH1998
Owner

@brisker

  1. Could you provide a detailed log for the issue?
  2. In our evaluation, the w4a8 GEMM itself is always faster than fp16. However, activations are quantized online, and in this repo that step is implemented with simple torch operations (https://github.com/HandH1998/QQQ/blob/49f06e0b47c606ca2c5558ade0805b0609d57a8f/QQQ/gptq/qlinear/qlinear_marlin.py#L245-L249). When the batch size is small, this significantly slows down inference.
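
For readers unfamiliar with online (dynamic) per-token activation quantization, here is a minimal pure-Python sketch of the idea. It is not the code at the linked lines, which uses torch operations. The function name `quantize_per_token` and the symmetric [-127, 127] range are assumptions made for illustration.

```python
def quantize_per_token(x):
    """Symmetric dynamic INT8 quantization with one scale per row (token).

    x is a list of rows of floats. Returns (quantized rows, per-row scales).
    """
    q_rows, scales = [], []
    for row in x:
        max_abs = max(abs(v) for v in row)
        # The scale is computed from the data at inference time ("dynamic").
        scale = max_abs / 127 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


x = [[0.5, -1.0, 0.25], [10.0, 0.0, -5.0]]  # two tokens, made-up values
q, s = quantize_per_token(x)
for row, q_row, scale in zip(x, q, s):
    dequantized = [v * scale for v in q_row]
    print(q_row, scale, dequantized)
```

The per-row max reduction, division and rounding all run before every GEMM call, which is the extra work that weighs most when the GEMM itself is small.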

@brisker

brisker commented Jul 1, 2024

  1. Here is the log:

(QQQ) root@train-nndf-vllm-2-0:/data1/QQQ-main# python examples/quant_model.py --model_path /dataset/LM-public/LLM/Llama-2-7b --tokenizer_path /dataset/LM-public/LLM/Llama-2-7b --batch_size 8 --dtype float16 --quant_config quant_config/llama/w4a8.yaml --save_path ./debug
/usr/local/miniconda3/envs/QQQ/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
/usr/local/miniconda3/envs/QQQ/lib/python3.10/site-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
/usr/local/miniconda3/envs/QQQ/lib/python3.10/site-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set max_memory in to a higher value to use more memory (at your own risk).
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████| 2/2 [00:19<00:00, 9.54s/it]
the quantization config is {'a_qconfig': {'quantizer': 'TokenFixedFakeQuantize', 'observer': 'MinMaxObserver', 'bit': 8, 'symmetric': True, 'ch_axis': 0, 'disable_down_proj': False}, 'w_qconfig': {'quantizer': 'FixedQuantize', 'observer': 'MinMaxObserver', 'bit': 4, 'symmetric': True, 'ch_axis': 0}, 'calibrate': 128, 'calibrate_path': '/share/LLM/data/pile/val.jsonl.zst', 'is_remove_padding': True, 'gptq': {'dataset': 'wikitext2', 'sym': True, 'groupsize': -1, 'mse': False, 'act_order': True, 'percdamp': 0.01, 'nsamples': 128, 'wbits': 4, 'static_groups': True}, 'max_length': 2048, 'migrate': False}
begin building calibration data!
Saving the dataset (1/1 shards): 100%|█████████████████████████████████| 128/128 [00:00<00:00, 14583.73 examples/s]
prepare fp input and output
begin smooth!
Enable observer and Enable quantize for fake_quant
*** Calibrate ***
the original min range is -4.671875, the original max range is 4.97265625
the module type is qkv
the data type is torch.float16, the device is cuda:0
the activation range is -4.67, 4.97
the weight range is -0.82, 0.72
/data1/QQQ-main/QQQ/smooth/quantization/observer.py:147: UserWarning: _aminmax is deprecated as of PyTorch 1.11 and will be removed in a future release. Use aminmax instead. This warning will only appear once per process. (Triggered internally at ../aten/src/ATen/native/TensorCompare.cpp:677.)
min_val_cur, max_val_cur = torch._aminmax(y, 1)
0.04 loss at iter 10
0.04 loss at iter 20
0.04 loss at iter 30
0.04 loss at iter 40
0.04 loss at iter 50
0.04 loss at iter 60
0.03 loss at iter 70
0.03 loss at iter 80
0.03 loss at iter 90
0.05 loss at iter 100
the best scale is 6.78, best min range is -0.73, best max range is 0.73
the range of weight becomes -4.65, 4.55
the original min range is -1.2626953125, the original max range is 1.9306640625
the module type is o_proj
the data type is torch.float16, the device is cuda:0
the activation range is -1.26, 1.93
the weight range is -0.45, 0.42
0.01 loss at iter 10
0.01 loss at iter 20
0.01 loss at iter 30
0.01 loss at iter 40
0.01 loss at iter 50
0.01 loss at iter 60
0.01 loss at iter 70
0.01 loss at iter 80
0.02 loss at iter 90
0.03 loss at iter 100
the best scale is 1.26, best min range is -1.26, best max range is 1.53
the range of weight becomes -0.57, 0.42
the original min range is -2.8515625, the original max range is 1.7607421875
the module type is up_and_gate
the data type is torch.float16, the device is cuda:0
the activation range is -2.85, 1.76
the weight range is -0.89, 0.73
0.00 loss at iter 10
0.01 loss at iter 20
0.01 loss at iter 30
0.01 loss at iter 40
0.01 loss at iter 50
0.01 loss at iter 60
0.01 loss at iter 70
0.01 loss at iter 80
0.02 loss at iter 90
0.07 loss at iter 100
the best scale is 1.05, best min range is -2.71, best max range is 1.76
the range of weight becomes -0.89, 0.73
the original min range is -1.818359375, the original max range is 1.552734375
the module type is down_proj
the data type is torch.float16, the device is cuda:0
the activation range is -1.82, 1.55
the weight range is -0.51, 0.53
0.02 loss at iter 10
0.02 loss at iter 20
0.02 loss at iter 30
0.02 loss at iter 40
0.02 loss at iter 50
0.02 loss at iter 60
0.02 loss at iter 70
0.03 loss at iter 80
0.05 loss at iter 90
0.16 loss at iter 100
the best scale is 1.45, best min range is -1.25, best max range is 1.25
the range of weight becomes -0.51, 0.53
the original min range is -4.109375, the original max range is 9.1171875
the module type is qkv
the data type is torch.float16, the device is cuda:0
the activation range is -4.11, 9.12
the weight range is -0.52, 0.57
0.20 loss at iter 10
0.19 loss at iter 20
0.18 loss at iter 30
0.17 loss at iter 40
0.16 loss at iter 50
0.17 loss at iter 60
0.18 loss at iter 70
0.21 loss at iter 80
0.26 loss at iter 90
0.45 loss at iter 100
the best scale is 2.10, best min range is -4.11, best max range is 4.34
the range of weight becomes -1.09, 1.15
the original min range is -1.306640625, the original max range is 1.1962890625
the module type is o_proj
the data type is torch.float16, the device is cuda:0
the activation range is -1.31, 1.20
the weight range is -0.49, 0.59
0.03 loss at iter 10
0.03 loss at iter 20
0.03 loss at iter 30
0.03 loss at iter 40
0.03 loss at iter 50
0.04 loss at iter 60
0.05 loss at iter 70
0.06 loss at iter 80
0.10 loss at iter 90
0.17 loss at iter 100
the best scale is 1.17, best min range is -1.11, best max range is 1.11
the range of weight becomes -0.49, 0.59
the original min range is -1.1123046875, the original max range is 2.19140625
the module type is up_and_gate
the data type is torch.float16, the device is cuda:0
the activation range is -1.11, 2.19
the weight range is -1.09, 0.98
0.14 loss at iter 10
0.23 loss at iter 20
0.43 loss at iter 30
0.75 loss at iter 40
1.58 loss at iter 50
3.47 loss at iter 60
13.44 loss at iter 70
13.95 loss at iter 80
47.61 loss at iter 90
371.34 loss at iter 100
the best scale is 1.11, best min range is -1.01, best max range is 1.98
the range of weight becomes -1.09, 0.98
the original min range is -109.625, the original max range is 1452.0
the module type is down_proj
the data type is torch.float16, the device is cuda:0
the activation range is -109.62, 1452.00
the weight range is -1.42, 1.56
19.31 loss at iter 10
19.29 loss at iter 20
19.21 loss at iter 30
19.20 loss at iter 40
19.15 loss at iter 50
19.11 loss at iter 60
18.87 loss at iter 70
18.88 loss at iter 80
18.85 loss at iter 90
18.83 loss at iter 100
18.83 loss at iter 110
18.81 loss at iter 120
18.79 loss at iter 130
18.66 loss at iter 140
18.59 loss at iter 150
18.57 loss at iter 160
18.56 loss at iter 170
18.60 loss at iter 180
18.17 loss at iter 190
18.11 loss at iter 200
18.09 loss at iter 210
18.08 loss at iter 220
18.09 loss at iter 230
18.02 loss at iter 240
18.02 loss at iter 250
18.00 loss at iter 260
17.80 loss at iter 270
17.82 loss at iter 280
17.73 loss at iter 290
17.70 loss at iter 300
17.69 loss at iter 310
17.65 loss at iter 320
17.65 loss at iter 330
17.71 loss at iter 340
17.60 loss at iter 350
17.68 loss at iter 360
17.45 loss at iter 370
17.39 loss at iter 380
17.43 loss at iter 390
18.08 loss at iter 400
17.91 loss at iter 410
17.90 loss at iter 420
17.54 loss at iter 430
17.51 loss at iter 440
17.55 loss at iter 450
17.53 loss at iter 460
17.46 loss at iter 470
17.35 loss at iter 480
17.40 loss at iter 490
17.48 loss at iter 500
17.32 loss at iter 510
17.12 loss at iter 520
17.15 loss at iter 530
17.22 loss at iter 540
17.11 loss at iter 550
17.10 loss at iter 560
16.96 loss at iter 570
16.92 loss at iter 580
16.93 loss at iter 590
16.92 loss at iter 600
16.86 loss at iter 610
16.87 loss at iter 620
16.85 loss at iter 630
16.86 loss at iter 640
16.96 loss at iter 650
16.77 loss at iter 660
16.82 loss at iter 670
16.84 loss at iter 680
16.85 loss at iter 690
16.87 loss at iter 700
16.90 loss at iter 710
16.71 loss at iter 720
16.72 loss at iter 730
16.74 loss at iter 740
16.56 loss at iter 750
16.77 loss at iter 760
16.81 loss at iter 770
16.81 loss at iter 780
16.92 loss at iter 790
17.09 loss at iter 800
17.43 loss at iter 810
17.39 loss at iter 820
17.56 loss at iter 830
17.30 loss at iter 840
17.64 loss at iter 850
17.97 loss at iter 860
18.04 loss at iter 870
18.03 loss at iter 880
18.00 loss at iter 890
17.96 loss at iter 900
18.04 loss at iter 910
18.06 loss at iter 920
18.13 loss at iter 930
18.10 loss at iter 940
18.15 loss at iter 950
17.72 loss at iter 960
17.87 loss at iter 970
17.96 loss at iter 980
17.86 loss at iter 990
17.88 loss at iter 1000
16.05 loss at iter 1010
15.97 loss at iter 1020
15.65 loss at iter 1030
15.52 loss at iter 1040
15.37 loss at iter 1050
15.22 loss at iter 1060
15.10 loss at iter 1070
14.97 loss at iter 1080
15.02 loss at iter 1090
14.89 loss at iter 1100
14.81 loss at iter 1110
14.71 loss at iter 1120
14.64 loss at iter 1130
14.24 loss at iter 1140
14.19 loss at iter 1150
14.12 loss at iter 1160
14.00 loss at iter 1170
14.03 loss at iter 1180
14.00 loss at iter 1190
13.89 loss at iter 1200
14.00 loss at iter 1210
14.04 loss at iter 1220
13.96 loss at iter 1230
14.04 loss at iter 1240
14.01 loss at iter 1250
14.17 loss at iter 1260
14.24 loss at iter 1270
14.43 loss at iter 1280
14.89 loss at iter 1290
14.16 loss at iter 1300
14.34 loss at iter 1310
14.12 loss at iter 1320
13.97 loss at iter 1330
13.83 loss at iter 1340
13.75 loss at iter 1350
13.66 loss at iter 1360
13.94 loss at iter 1370
13.37 loss at iter 1380
12.74 loss at iter 1390
12.77 loss at iter 1400
12.70 loss at iter 1410
12.44 loss at iter 1420
12.31 loss at iter 1430
12.08 loss at iter 1440
12.08 loss at iter 1450
11.65 loss at iter 1460
11.61 loss at iter 1470
11.26 loss at iter 1480
11.13 loss at iter 1490
10.96 loss at iter 1500
10.88 loss at iter 1510
10.40 loss at iter 1520
10.25 loss at iter 1530
10.16 loss at iter 1540
10.09 loss at iter 1550
9.97 loss at iter 1560
9.94 loss at iter 1570
9.77 loss at iter 1580
9.69 loss at iter 1590
9.75 loss at iter 1600
9.81 loss at iter 1610
9.84 loss at iter 1620
9.82 loss at iter 1630
9.89 loss at iter 1640
9.71 loss at iter 1650
9.67 loss at iter 1660
9.88 loss at iter 1670
10.07 loss at iter 1680
10.40 loss at iter 1690
10.23 loss at iter 1700
10.82 loss at iter 1710
11.30 loss at iter 1720
11.37 loss at iter 1730
11.62 loss at iter 1740
12.11 loss at iter 1750
11.31 loss at iter 1760
11.65 loss at iter 1770
11.50 loss at iter 1780
11.46 loss at iter 1790
11.12 loss at iter 1800
10.95 loss at iter 1810
10.45 loss at iter 1820
10.76 loss at iter 1830
10.38 loss at iter 1840
10.07 loss at iter 1850
9.66 loss at iter 1860
9.43 loss at iter 1870
9.29 loss at iter 1880
8.91 loss at iter 1890
8.83 loss at iter 1900
8.61 loss at iter 1910
8.41 loss at iter 1920
8.28 loss at iter 1930
8.21 loss at iter 1940
8.14 loss at iter 1950
8.02 loss at iter 1960
7.68 loss at iter 1970
7.71 loss at iter 1980
7.70 loss at iter 1990
7.31 loss at iter 2000
7.31 loss at iter 2010
7.35 loss at iter 2020
7.42 loss at iter 2030
7.56 loss at iter 2040
7.61 loss at iter 2050
7.71 loss at iter 2060
7.97 loss at iter 2070
7.99 loss at iter 2080
7.89 loss at iter 2090
7.77 loss at iter 2100
8.28 loss at iter 2110
8.20 loss at iter 2120
9.00 loss at iter 2130
9.12 loss at iter 2140
9.86 loss at iter 2150
9.86 loss at iter 2160
11.05 loss at iter 2170
11.17 loss at iter 2180
11.67 loss at iter 2190
13.12 loss at iter 2200
13.49 loss at iter 2210
11.77 loss at iter 2220
14.86 loss at iter 2230
14.36 loss at iter 2240
14.59 loss at iter 2250
13.53 loss at iter 2260
13.64 loss at iter 2270
11.91 loss at iter 2280
11.82 loss at iter 2290
10.49 loss at iter 2300
10.62 loss at iter 2310
9.08 loss at iter 2320
9.15 loss at iter 2330
7.60 loss at iter 2340
7.64 loss at iter 2350
6.77 loss at iter 2360
6.67 loss at iter 2370
5.96 loss at iter 2380
5.55 loss at iter 2390
5.62 loss at iter 2400
5.53 loss at iter 2410
4.78 loss at iter 2420
4.47 loss at iter 2430
3.99 loss at iter 2440
3.95 loss at iter 2450
3.86 loss at iter 2460
3.57 loss at iter 2470
3.59 loss at iter 2480
3.86 loss at iter 2490
3.87 loss at iter 2500
4.49 loss at iter 2510
4.50 loss at iter 2520
5.85 loss at iter 2530
6.07 loss at iter 2540
8.18 loss at iter 2550
8.93 loss at iter 2560
12.43 loss at iter 2570
13.23 loss at iter 2580
17.98 loss at iter 2590
19.02 loss at iter 2600
26.04 loss at iter 2610
27.92 loss at iter 2620
37.66 loss at iter 2630
41.27 loss at iter 2640
53.93 loss at iter 2650
59.08 loss at iter 2660
77.32 loss at iter 2670
94.08 loss at iter 2680
85.81 loss at iter 2690
85.34 loss at iter 2700
85.82 loss at iter 2710
85.09 loss at iter 2720
85.00 loss at iter 2730
85.21 loss at iter 2740
85.14 loss at iter 2750
84.86 loss at iter 2760
85.27 loss at iter 2770
85.26 loss at iter 2780
84.75 loss at iter 2790
85.16 loss at iter 2800
84.67 loss at iter 2810
84.61 loss at iter 2820
85.41 loss at iter 2830
84.99 loss at iter 2840
84.44 loss at iter 2850
84.72 loss at iter 2860
85.42 loss at iter 2870
85.07 loss at iter 2880
84.55 loss at iter 2890
85.82 loss at iter 2900
the best scale is 6.94, best min range is -109.62, best max range is 209.12
the range of weight becomes -6.40, 10.84
the original min range is -6.1796875, the original max range is 8.6953125
the module type is qkv
the data type is torch.float16, the device is cuda:0
the activation range is -6.18, 8.70
the weight range is -0.75, 1.10
0.22 loss at iter 10
0.23 loss at iter 20
0.20 loss at iter 30
0.18 loss at iter 40
0.19 loss at iter 50
0.16 loss at iter 60
0.16 loss at iter 70
0.32 loss at iter 80
1.15 loss at iter 90
1.68 loss at iter 100
the best scale is 2.46, best min range is -3.54, best max range is 3.54
the range of weight becomes -0.75, 1.10
the original min range is -0.80224609375, the original max range is 1.009765625
the module type is o_proj
the data type is torch.float16, the device is cuda:0
the activation range is -0.80, 1.01
the weight range is -0.61, 0.52
0.04 loss at iter 10
0.04 loss at iter 20
0.04 loss at iter 30
0.04 loss at iter 40
0.04 loss at iter 50
0.04 loss at iter 60
0.05 loss at iter 70
0.08 loss at iter 80
0.12 loss at iter 90
0.19 loss at iter 100
the best scale is 1.00, best min range is -0.80, best max range is 1.01
the range of weight becomes -0.61, 0.52
the original min range is -4.04296875, the original max range is 5.6875
the module type is up_and_gate
the data type is torch.float16, the device is cuda:0
the activation range is -4.04, 5.69
the weight range is -0.53, 0.46
0.05 loss at iter 10
0.05 loss at iter 20
0.05 loss at iter 30
0.05 loss at iter 40
0.05 loss at iter 50
0.05 loss at iter 60
0.05 loss at iter 70
0.06 loss at iter 80
0.10 loss at iter 90
0.27 loss at iter 100
the best scale is 1.21, best min range is -4.04, best max range is 4.68
the range of weight becomes -0.53, 0.46
the original min range is -2.1640625, the original max range is 4.15625
the module type is down_proj
the data type is torch.float16, the device is cuda:0
the activation range is -2.16, 4.16
the weight range is -0.67, 0.90
0.09 loss at iter 10
0.09 loss at iter 20
0.09 loss at iter 30
0.09 loss at iter 40
0.09 loss at iter 50
0.09 loss at iter 60
0.09 loss at iter 70
0.10 loss at iter 80
0.20 loss at iter 90
1.11 loss at iter 100
the best scale is 1.16, best min range is -2.16, best max range is 3.59
the range of weight becomes -0.67, 1.04
the original min range is -9.0234375, the original max range is 9.4296875
the module type is qkv
the data type is torch.float16, the device is cuda:0
the activation range is -9.02, 9.43
the weight range is -0.69, 0.78
1.29 loss at iter 10
1.18 loss at iter 20
0.97 loss at iter 30
0.84 loss at iter 40
0.83 loss at iter 50
0.85 loss at iter 60
1.02 loss at iter 70
1.87 loss at iter 80
3.75 loss at iter 90
3.77 loss at iter 100
the best scale is 1.55, best min range is -6.07, best max range is 6.07
the range of weight becomes -0.69, 0.78
the original min range is -1.43359375, the original max range is 1.94140625
the module type is o_proj
the data type is torch.float16, the device is cuda:0
the activation range is -1.43, 1.94
the weight range is -0.54, 0.47
0.07 loss at iter 10
0.07 loss at iter 20
0.07 loss at iter 30
0.07 loss at iter 40
0.07 loss at iter 50
0.07 loss at iter 60
0.08 loss at iter 70
0.13 loss at iter 80
0.25 loss at iter 90
0.37 loss at iter 100
the best scale is 1.59, best min range is -1.22, best max range is 1.22
the range of weight becomes -0.54, 0.47
the original min range is -3.515625, the original max range is 2.857421875
the module type is up_and_gate
the data type is torch.float16, the device is cuda:0
the activation range is -3.52, 2.86
the weight range is -0.45, 0.58
0.12 loss at iter 10
0.12 loss at iter 20
0.12 loss at iter 30
0.12 loss at iter 40
0.12 loss at iter 50
0.12 loss at iter 60
0.14 loss at iter 70
0.20 loss at iter 80
0.45 loss at iter 90
0.46 loss at iter 100
the best scale is 1.12, best min range is -3.14, best max range is 2.86
the range of weight becomes -0.45, 0.58
the original min range is -3.2734375, the original max range is 4.48828125
the module type is down_proj
the data type is torch.float16, the device is cuda:0
the activation range is -3.27, 4.49
the weight range is -0.67, 0.68
0.17 loss at iter 10
0.17 loss at iter 20
0.17 loss at iter 30
0.17 loss at iter 40
0.17 loss at iter 50
0.17 loss at iter 60
0.18 loss at iter 70
0.21 loss at iter 80
0.49 loss at iter 90
2.17 loss at iter 100
the best scale is 1.23, best min range is -3.27, best max range is 3.65
the range of weight becomes -0.67, 0.68
the original min range is -9.84375, the original max range is 10.0078125
the module type is qkv
the data type is torch.float16, the device is cuda:0
the activation range is -9.84, 10.01
the weight range is -0.68, 0.64
0.66 loss at iter 10
0.71 loss at iter 20
0.73 loss at iter 30
0.78 loss at iter 40
0.75 loss at iter 50
0.78 loss at iter 60
1.20 loss at iter 70
2.03 loss at iter 80
2.48 loss at iter 90
2.52 loss at iter 100
the best scale is 1.13, best min range is -8.82, best max range is 8.82
the range of weight becomes -0.68, 0.64
the original min range is -1.4287109375, the original max range is 1.7275390625
the module type is o_proj
the data type is torch.float16, the device is cuda:0
the activation range is -1.43, 1.73
the weight range is -0.62, 0.53
0.09 loss at iter 10
0.09 loss at iter 20
0.09 loss at iter 30
0.09 loss at iter 40
0.09 loss at iter 50
0.10 loss at iter 60
0.13 loss at iter 70
0.21 loss at iter 80
0.35 loss at iter 90
0.50 loss at iter 100
the best scale is 1.45, best min range is -1.19, best max range is 1.19
the range of weight becomes -0.62, 0.53
the original min range is -5.93359375, the original max range is 7.40234375
the module type is up_and_gate
the data type is torch.float16, the device is cuda:0
the activation range is -5.93, 7.40
the weight range is -0.52, 0.45
0.24 loss at iter 10
0.24 loss at iter 20
0.24 loss at iter 30
0.24 loss at iter 40
0.24 loss at iter 50
0.24 loss at iter 60
0.25 loss at iter 70
0.30 loss at iter 80
0.68 loss at iter 90
1.43 loss at iter 100
the best scale is 1.44, best min range is -5.14, best max range is 5.14
the range of weight becomes -0.52, 0.45
the original min range is -6.1328125, the original max range is 9.65625
the module type is down_proj
the data type is torch.float16, the device is cuda:0
the activation range is -6.13, 9.66
the weight range is -0.98, 0.81
0.38 loss at iter 10
0.38 loss at iter 20
0.38 loss at iter 30
0.38 loss at iter 40
0.39 loss at iter 50
0.39 loss at iter 60
0.42 loss at iter 70
0.52 loss at iter 80
1.07 loss at iter 90
5.76 loss at iter 100
the best scale is 1.33, best min range is -6.13, best max range is 7.27
the range of weight becomes -0.98, 0.81
the original min range is -10.0703125, the original max range is 9.28125
the module type is qkv
the data type is torch.float16, the device is cuda:0
the activation range is -10.07, 9.28
the weight range is -0.59, 0.64
0.95 loss at iter 10
0.94 loss at iter 20
0.82 loss at iter 30
0.88 loss at iter 40
0.84 loss at iter 50
1.03 loss at iter 60
1.75 loss at iter 70
5.10 loss at iter 80
9.43 loss at iter 90
9.15 loss at iter 100
the best scale is 1.36, best min range is -7.38, best max range is 7.38
the range of weight becomes -0.59, 0.64
the original min range is -1.796875, the original max range is 1.6865234375
the module type is o_proj
the data type is torch.float16, the device is cuda:0
the activation range is -1.80, 1.69
the weight range is -0.53, 0.45
0.21 loss at iter 10
0.21 loss at iter 20
0.20 loss at iter 30
0.21 loss at iter 40
0.21 loss at iter 50
0.23 loss at iter 60
0.28 loss at iter 70
0.44 loss at iter 80
0.67 loss at iter 90
0.83 loss at iter 100
the best scale is 1.41, best min range is -1.27, best max range is 1.27
the range of weight becomes -0.53, 0.45
the original min range is -5.375, the original max range is 3.220703125
the module type is up_and_gate
the data type is torch.float16, the device is cuda:0
the activation range is -5.38, 3.22
the weight range is -0.49, 0.42
0.34 loss at iter 10
0.34 loss at iter 20
0.34 loss at iter 30
0.34 loss at iter 40
0.35 loss at iter 50
0.36 loss at iter 60
0.41 loss at iter 70
0.60 loss at iter 80
1.50 loss at iter 90
1.73 loss at iter 100
the best scale is 1.00, best min range is -5.38, best max range is 3.22
the range of weight becomes -0.49, 0.42
the original min range is -5.71484375, the original max range is 3.98046875
the module type is down_proj
the data type is torch.float16, the device is cuda:0
the activation range is -5.71, 3.98
the weight range is -0.73, 0.56
0.45 loss at iter 10
0.46 loss at iter 20
0.46 loss at iter 30
0.45 loss at iter 40
0.46 loss at iter 50
0.46 loss at iter 60
0.47 loss at iter 70
0.66 loss at iter 80
1.74 loss at iter 90
6.30 loss at iter 100
the best scale is 1.04, best min range is -5.49, best max range is 3.98
the range of weight becomes -0.73, 0.56
the original min range is -10.90625, the original max range is 11.6640625
the module type is qkv
the data type is torch.float16, the device is cuda:0
the activation range is -10.91, 11.66
the weight range is -0.58, 0.61
2.16 loss at iter 10
2.02 loss at iter 20
1.97 loss at iter 30
2.06 loss at iter 40
2.07 loss at iter 50
2.27 loss at iter 60
4.16 loss at iter 70
11.41 loss at iter 80
20.70 loss at iter 90
19.58 loss at iter 100
the best scale is 1.42, best min range is -8.20, best max range is 8.20
the range of weight becomes -0.58, 0.61
the original min range is -1.9697265625, the original max range is 2.201171875
the module type is o_proj
the data type is torch.float16, the device is cuda:0
the activation range is -1.97, 2.20
the weight range is -0.61, 0.54
0.29 loss at iter 10
0.29 loss at iter 20
0.29 loss at iter 30
0.29 loss at iter 40
0.30 loss at iter 50
0.35 loss at iter 60
0.49 loss at iter 70
0.83 loss at iter 80
1.27 loss at iter 90
1.53 loss at iter 100
the best scale is 1.36, best min range is -1.61, best max range is 1.61
the range of weight becomes -0.61, 0.54
the original min range is -5.7421875, the original max range is 3.126953125
the module type is up_and_gate
the data type is torch.float16, the device is cuda:0
the activation range is -5.74, 3.13
the weight range is -0.37, 0.57
0.54 loss at iter 10
0.55 loss at iter 20
0.56 loss at iter 30
0.57 loss at iter 40
0.60 loss at iter 50
0.65 loss at iter 60
0.76 loss at iter 70
1.10 loss at iter 80
2.54 loss at iter 90
2.78 loss at iter 100
the best scale is 1.00, best min range is -5.74, best max range is 3.13
the range of weight becomes -0.37, 0.57
the original min range is -4.27734375, the original max range is 9.6640625
the module type is down_proj
the data type is torch.float16, the device is cuda:0
the activation range is -4.28, 9.66
the weight range is -0.92, 0.73
0.77 loss at iter 10
0.77 loss at iter 20
0.77 loss at iter 30
0.77 loss at iter 40
0.77 loss at iter 50
0.77 loss at iter 60
0.76 loss at iter 70
0.81 loss at iter 80
1.84 loss at iter 90
11.07 loss at iter 100
the best scale is 2.59, best min range is -3.73, best max range is 3.73
the range of weight becomes -1.62, 0.73
the original min range is -12.5703125, the original max range is 12.5390625
the module type is qkv
the data type is torch.float16, the device is cuda:0
the activation range is -12.57, 12.54
the weight range is -0.56, 0.76
2.81 loss at iter 10
2.81 loss at iter 20
2.81 loss at iter 30
2.93 loss at iter 40
3.35 loss at iter 50
3.73 loss at iter 60
6.53 loss at iter 70
17.48 loss at iter 80
26.53 loss at iter 90
24.18 loss at iter 100
the best scale is 1.28, best min range is -9.84, best max range is 9.83
the range of weight becomes -0.56, 0.76
the original min range is -2.30859375, the original max range is 2.181640625
the module type is o_proj
the data type is torch.float16, the device is cuda:0
the activation range is -2.31, 2.18
the weight range is -0.56, 0.41
0.35 loss at iter 10
0.35 loss at iter 20
0.35 loss at iter 30
0.35 loss at iter 40
0.36 loss at iter 50
0.39 loss at iter 60
0.53 loss at iter 70
0.89 loss at iter 80
1.31 loss at iter 90
1.39 loss at iter 100
the best scale is 1.27, best min range is -1.82, best max range is 1.82
the range of weight becomes -0.56, 0.41
the original min range is -6.68359375, the original max range is 4.1953125
the module type is up_and_gate
the data type is torch.float16, the device is cuda:0
the activation range is -6.68, 4.20
the weight range is -0.46, 0.43
0.75 loss at iter 10
0.75 loss at iter 20
0.76 loss at iter 30
0.78 loss at iter 40
0.80 loss at iter 50
0.86 loss at iter 60
1.00 loss at iter 70
1.47 loss at iter 80
3.50 loss at iter 90
4.15 loss at iter 100
the best scale is 1.00, best min range is -6.68, best max range is 4.20
the range of weight becomes -0.46, 0.43
the original min range is -9.2578125, the original max range is 3.947265625
the module type is down_proj
the data type is torch.float16, the device is cuda:0
the activation range is -9.26, 3.95
the weight range is -0.95, 0.71
0.92 loss at iter 10
0.92 loss at iter 20
0.92 loss at iter 30
0.92 loss at iter 40
0.92 loss at iter 50
0.92 loss at iter 60
0.92 loss at iter 70
1.01 loss at iter 80
2.38 loss at iter 90
11.73 loss at iter 100
the best scale is 3.36, best min range is -2.76, best max range is 2.76
the range of weight becomes -0.95, 2.38
the original min range is -19.21875, the original max range is 14.71875
the module type is qkv
the data type is torch.float16, the device is cuda:0
the activation range is -19.22, 14.72
the weight range is -0.56, 0.47
2.47 loss at iter 10
2.53 loss at iter 20
2.58 loss at iter 30
2.20 loss at iter 40
2.32 loss at iter 50
2.65 loss at iter 60
3.17 loss at iter 70
6.92 loss at iter 80
21.16 loss at iter 90
20.45 loss at iter 100
the best scale is 1.78, best min range is -10.81, best max range is 10.80
the range of weight becomes -0.56, 0.47
the original min range is -1.6796875, the original max range is 2.33984375
the module type is o_proj
the data type is torch.float16, the device is cuda:0
the activation range is -1.68, 2.34
the weight range is -0.61, 0.75
0.46 loss at iter 10
0.45 loss at iter 20
0.46 loss at iter 30
0.46 loss at iter 40
0.47 loss at iter 50
0.53 loss at iter 60
0.71 loss at iter 70
1.15 loss at iter 80
1.63 loss at iter 90
1.75 loss at iter 100
the best scale is 1.19, best min range is -1.68, best max range is 1.96
the range of weight becomes -0.61, 0.75
the original min range is -8.796875, the original max range is 4.89453125
the module type is up_and_gate
the data type is torch.float16, the device is cuda:0
the activation range is -8.80, 4.89
the weight range is -0.44, 0.65
0.87 loss at iter 10
0.87 loss at iter 20
0.87 loss at iter 30
0.88 loss at iter 40
0.90 loss at iter 50
0.92 loss at iter 60
1.01 loss at iter 70
1.36 loss at iter 80
3.20 loss at iter 90
5.43 loss at iter 100
the best scale is 1.00, best min range is -8.80, best max range is 4.89
the range of weight becomes -0.44, 0.65
the original min range is -5.6171875, the original max range is 16.0625
the module type is down_proj
the data type is torch.float16, the device is cuda:0
the activation range is -5.62, 16.06
the weight range is -0.80, 0.76
1.05 loss at iter 10
1.04 loss at iter 20
1.04 loss at iter 30
1.04 loss at iter 40
1.03 loss at iter 50
1.03 loss at iter 60
1.04 loss at iter 70
1.04 loss at iter 80
1.38 loss at iter 90
13.04 loss at iter 100
the best scale is 1.99, best min range is -5.62, best max range is 8.08
the range of weight becomes -1.36, 0.76
the original min range is -24.40625, the original max range is 13.28125
the module type is qkv
the data type is torch.float16, the device is cuda:0
the activation range is -24.41, 13.28
the weight range is -0.78, 0.44
3.13 loss at iter 10
3.19 loss at iter 20
3.28 loss at iter 30
3.43 loss at iter 40
3.49 loss at iter 50
3.68 loss at iter 60
5.07 loss at iter 70
7.32 loss at iter 80
20.72 loss at iter 90
24.92 loss at iter 100
the best scale is 1.00, best min range is -24.41, best max range is 13.28
the range of weight becomes -0.78, 0.44
the original min range is -1.9365234375, the original max range is 2.53125
the module type is o_proj
the data type is torch.float16, the device is cuda:0
the activation range is -1.94, 2.53
the weight range is -0.55, 0.51
0.72 loss at iter 10
0.72 loss at iter 20
0.72 loss at iter 30
0.72 loss at iter 40
0.72 loss at iter 50
0.78 loss at iter 60
1.02 loss at iter 70
1.56 loss at iter 80
2.06 loss at iter 90
2.16 loss at iter 100
the best scale is 1.60, best min range is -1.58, best max range is 1.58
the range of weight becomes -0.55, 0.51
the original min range is -11.296875, the original max range is 4.82421875
the module type is up_and_gate
the data type is torch.float16, the device is cuda:0
the activation range is -11.30, 4.82
the weight range is -0.38, 0.61
1.01 loss at iter 10
1.02 loss at iter 20
1.03 loss at iter 30
1.04 loss at iter 40
1.04 loss at iter 50
1.06 loss at iter 60
1.13 loss at iter 70
1.39 loss at iter 80
2.78 loss at iter 90
6.27 loss at iter 100
the best scale is 1.01, best min range is -11.19, best max range is 4.82
the range of weight becomes -0.38, 0.62
the original min range is -9.5, the original max range is 6.79296875
the module type is down_proj
the data type is torch.float16, the device is cuda:0
the activation range is -9.50, 6.79
the weight range is -0.66, 0.64
1.22 loss at iter 10
1.22 loss at iter 20
1.22 loss at iter 30
1.22 loss at iter 40
1.22 loss at iter 50
1.23 loss at iter 60
1.25 loss at iter 70
1.40 loss at iter 80
3.46 loss at iter 90
14.49 loss at iter 100
the best scale is 1.15, best min range is -8.28, best max range is 6.79
the range of weight becomes -0.66, 0.73
the original min range is -32.71875, the original max range is 14.6328125
the module type is qkv
the data type is torch.float16, the device is cuda:0
the activation range is -32.72, 14.63
the weight range is -0.46, 0.46
3.82 loss at iter 10
3.88 loss at iter 20
4.10 loss at iter 30
4.27 loss at iter 40
4.81 loss at iter 50
5.66 loss at iter 60
6.44 loss at iter 70
9.22 loss at iter 80
19.09 loss at iter 90
32.53 loss at iter 100
the best scale is 1.00, best min range is -32.72, best max range is 14.63
the range of weight becomes -0.46, 0.46
the original min range is -2.173828125, the original max range is 2.109375
the module type is o_proj
the data type is torch.float16, the device is cuda:0
the activation range is -2.17, 2.11
the weight range is -0.41, 0.40
0.95 loss at iter 10
0.95 loss at iter 20
0.95 loss at iter 30
0.96 loss at iter 40
1.00 loss at iter 50
1.14 loss at iter 60
1.50 loss at iter 70
2.06 loss at iter 80
2.38 loss at iter 90
2.41 loss at iter 100
the best scale is 1.21, best min range is -1.80, best max range is 1.80
the range of weight becomes -0.41, 0.40
the original min range is -6.20703125, the original max range is 5.11328125
the module type is up_and_gate
the data type is torch.float16, the device is cuda:0
the activation range is -6.21, 5.11
the weight range is -0.53, 0.37
1.10 loss at iter 10
1.10 loss at iter 20
1.11 loss at iter 30
1.13 loss at iter 40
1.16 loss at iter 50
1.29 loss at iter 60
1.63 loss at iter 70
2.67 loss at iter 80
6.38 loss at iter 90
6.59 loss at iter 100
the best scale is 1.00, best min range is -6.21, best max range is 5.11
the range of weight becomes -0.53, 0.37
the original min range is -6.12109375, the original max range is 3.884765625
the module type is down_proj
the data type is torch.float16, the device is cuda:0
the activation range is -6.12, 3.88
the weight range is -0.75, 1.02
1.53 loss at iter 10
1.53 loss at iter 20
1.53 loss at iter 30
1.53 loss at iter 40
1.54 loss at iter 50
1.63 loss at iter 60
2.03 loss at iter 70
3.97 loss at iter 80
10.29 loss at iter 90
19.83 loss at iter 100
the best scale is 1.38, best min range is -4.43, best max range is 3.88
the range of weight becomes -0.75, 1.02
the original min range is -26.453125, the original max range is 16.953125
the module type is qkv
the data type is torch.float16, the device is cuda:0
the activation range is -26.45, 16.95
the weight range is -0.42, 0.61
4.37 loss at iter 10
4.50 loss at iter 20
4.77 loss at iter 30
5.13 loss at iter 40
5.60 loss at iter 50
6.11 loss at iter 60
7.31 loss at iter 70
11.28 loss at iter 80
28.66 loss at iter 90
30.25 loss at iter 100
the best scale is 1.00, best min range is -26.45, best max range is 16.95
the range of weight becomes -0.42, 0.61
the original min range is -2.017578125, the original max range is 2.94921875
the module type is o_proj
the data type is torch.float16, the device is cuda:0
the activation range is -2.02, 2.95
the weight range is -0.54, 0.58
1.16 loss at iter 10
1.16 loss at iter 20
1.16 loss at iter 30
1.16 loss at iter 40
1.16 loss at iter 50
1.19 loss at iter 60
1.38 loss at iter 70
2.10 loss at iter 80
3.01 loss at iter 90
3.23 loss at iter 100
the best scale is 1.25, best min range is -2.02, best max range is 2.35
the range of weight becomes -0.54, 0.58
the original min range is -6.36328125, the original max range is 5.1484375
the module type is up_and_gate
the data type is torch.float16, the device is cuda:0
the activation range is -6.36, 5.15
the weight range is -0.42, 0.42
1.25 loss at iter 10
1.25 loss at iter 20
1.25 loss at iter 30
1.27 loss at iter 40
1.31 loss at iter 50
1.45 loss at iter 60
1.82 loss at iter 70
3.01 loss at iter 80
7.02 loss at iter 90
7.19 loss at iter 100
the best scale is 1.00, best min range is -6.36, best max range is 5.15
the range of weight becomes -0.42, 0.42
the original min range is -5.87109375, the original max range is 4.9765625
the module type is down_proj
the data type is torch.float16, the device is cuda:0
the activation range is -5.87, 4.98
the weight range is -0.77, 0.64
1.66 loss at iter 10
1.66 loss at iter 20
1.66 loss at iter 30
1.67 loss at iter 40
1.71 loss at iter 50
1.87 loss at iter 60
2.58 loss at iter 70
4.79 loss at iter 80
11.95 loss at iter 90
20.90 loss at iter 100
the best scale is 1.10, best min range is -5.36, best max range is 4.98
the range of weight becomes -0.77, 0.64
the original min range is -27.03125, the original max range is 16.484375
the module type is qkv
the data type is torch.float16, the device is cuda:0
the activation range is -27.03, 16.48
the weight range is -0.46, 0.52
4.27 loss at iter 10
4.42 loss at iter 20
4.66 loss at iter 30
5.00 loss at iter 40
5.60 loss at iter 50
6.73 loss at iter 60
8.74 loss at iter 70
13.56 loss at iter 80
36.13 loss at iter 90
39.00 loss at iter 100
the best scale is 1.00, best min range is -27.03, best max range is 16.48
the range of weight becomes -0.46, 0.52
the original min range is -2.162109375, the original max range is 3.068359375
the module type is o_proj
the data type is torch.float16, the device is cuda:0
the activation range is -2.16, 3.07
the weight range is -0.52, 0.50
1.26 loss at iter 10
1.26 loss at iter 20
1.26 loss at iter 30
1.27 loss at iter 40
1.28 loss at iter 50
1.34 loss at iter 60
1.70 loss at iter 70
2.62 loss at iter 80
3.47 loss at iter 90
3.63 loss at iter 100
the best scale is 1.25, best min range is -2.16, best max range is 2.45
the range of weight becomes -0.52, 0.50
the original min range is -7.73046875, the original max range is 5.015625
the module type is up_and_gate
the data type is torch.float16, the device is cuda:0
the activation range is -7.73, 5.02
the weight range is -0.34, 0.36
1.40 loss at iter 10
1.40 loss at iter 20
1.41 loss at iter 30
1.42 loss at iter 40
1.45 loss at iter 50
1.53 loss at iter 60
1.80 loss at iter 70
2.75 loss at iter 80
7.04 loss at iter 90
8.32 loss at iter 100
the best scale is 1.00, best min range is -7.73, best max range is 5.02
the range of weight becomes -0.34, 0.36
the original min range is -6.23828125, the original max range is 6.5625
the module type is down_proj
the data type is torch.float16, the device is cuda:0
the activation range is -6.24, 6.56
the weight range is -0.59, 0.65
1.98 loss at iter 10
1.98 loss at iter 20
1.98 loss at iter 30
1.99 loss at iter 40
2.06 loss at iter 50
2.40 loss at iter 60
3.54 loss at iter 70
6.75 loss at iter 80
16.81 loss at iter 90
27.26 loss at iter 100
the best scale is 1.36, best min range is -4.82, best max range is 4.82
the range of weight becomes -0.59, 0.65
the original min range is -22.578125, the original max range is 14.5859375
the module type is qkv
the data type is torch.float16, the device is cuda:0
the activation range is -22.58, 14.59
the weight range is -0.42, 0.68
4.39 loss at iter 10
4.48 loss at iter 20
4.58 loss at iter 30
4.67 loss at iter 40
4.62 loss at iter 50
5.31 loss at iter 60
6.50 loss at iter 70
12.50 loss at iter 80
37.71 loss at iter 90
40.32 loss at iter 100
the best scale is 1.00, best min range is -22.58, best max range is 14.59
the range of weight becomes -0.42, 0.68
the original min range is -2.564453125, the original max range is 2.240234375
the module type is o_proj
the data type is torch.float16, the device is cuda:0
the activation range is -2.56, 2.24
the weight range is -0.46, 0.51
1.36 loss at iter 10
1.36 loss at iter 20
1.36 loss at iter 30
1.36 loss at iter 40
1.41 loss at iter 50
1.62 loss at iter 60
2.23 loss at iter 70
3.07 loss at iter 80
3.55 loss at iter 90
3.60 loss at iter 100
the best scale is 1.44, best min range is -1.78, best max range is 1.78
the range of weight becomes -0.46, 0.51
the original min range is -9.375, the original max range is 4.8046875
the module type is up_and_gate
the data type is torch.float16, the device is cuda:0
the activation range is -9.38, 4.80
the weight range is -0.43, 0.42
1.82 loss at iter 10
1.81 loss at iter 20
1.85 loss at iter 30
1.88 loss at iter 40
1.99 loss at iter 50
2.15 loss at iter 60
2.35 loss at iter 70
3.16 loss at iter 80
7.54 loss at iter 90
11.00 loss at iter 100
the best scale is 1.00, best min range is -9.38, best max range is 4.80
the range of weight becomes -0.43, 0.42
the original min range is -8.6328125, the original max range is 8.0234375
the module type is down_proj
the data type is torch.float16, the device is cuda:0
the activation range is -8.63, 8.02
the weight range is -0.61, 1.07
2.79 loss at iter 10
2.79 loss at iter 20
2.79 loss at iter 30
2.79 loss at iter 40
2.87 loss at iter 50
3.24 loss at iter 60
4.67 loss at iter 70
8.71 loss at iter 80
22.14 loss at iter 90
37.65 loss at iter 100
the best scale is 1.48, best min range is -5.82, best max range is 5.82
the range of weight becomes -0.61, 1.07
the original min range is -23.25, the original max range is 15.125
the module type is qkv
the data type is torch.float16, the device is cuda:0
the activation range is -23.25, 15.12
the weight range is -0.57, 0.54
4.86 loss at iter 10
5.09 loss at iter 20
5.31 loss at iter 30
5.62 loss at iter 40
6.01 loss at iter 50
6.08 loss at iter 60
7.79 loss at iter 70
14.39 loss at iter 80
38.35 loss at iter 90
43.31 loss at iter 100
the best scale is 1.03, best min range is -22.56, best max range is 15.12
the range of weight becomes -0.57, 0.54
the original min range is -2.455078125, the original max range is 1.9814453125
the module type is o_proj
the data type is torch.float16, the device is cuda:0
the activation range is -2.46, 1.98
the weight range is -0.60, 0.54
1.55 loss at iter 10
1.55 loss at iter 20
1.55 loss at iter 30
1.54 loss at iter 40
1.57 loss at iter 50
1.76 loss at iter 60
2.38 loss at iter 70
3.01 loss at iter 80
3.15 loss at iter 90
3.15 loss at iter 100
the best scale is 1.65, best min range is -1.49, best max range is 1.49
the range of weight becomes -0.60, 0.54
the original min range is -8.171875, the original max range is 4.671875
the module type is up_and_gate
the data type is torch.float16, the device is cuda:0
the activation range is -8.17, 4.67
the weight range is -0.32, 0.48
1.89 loss at iter 10
1.90 loss at iter 20
1.91 loss at iter 30
1.93 loss at iter 40
1.97 loss at iter 50
2.09 loss at iter 60
2.50 loss at iter 70
3.89 loss at iter 80
10.16 loss at iter 90
11.61 loss at iter 100
the best scale is 1.01, best min range is -8.09, best max range is 4.67
the range of weight becomes -0.32, 0.48
the original min range is -9.8203125, the original max range is 11.1796875
the module type is down_proj
the data type is torch.float16, the device is cuda:0
the activation range is -9.82, 11.18
the weight range is -0.58, 0.66
3.16 loss at iter 10
3.16 loss at iter 20
3.15 loss at iter 30
3.15 loss at iter 40
3.20 loss at iter 50
3.51 loss at iter 60
4.88 loss at iter 70
9.20 loss at iter 80
24.24 loss at iter 90
45.37 loss at iter 100
the best scale is 1.51, best min range is -7.42, best max range is 7.42
the range of weight becomes -0.58, 0.66
the original min range is -21.796875, the original max range is 16.140625
the module type is qkv
the data type is torch.float16, the device is cuda:0
the activation range is -21.80, 16.14
the weight range is -0.56, 0.58
6.98 loss at iter 10
7.10 loss at iter 20
7.32 loss at iter 30
7.08 loss at iter 40
7.34 loss at iter 50
7.52 loss at iter 60
10.07 loss at iter 70
18.99 loss at iter 80
48.24 loss at iter 90
49.61 loss at iter 100
the best scale is 1.47, best min range is -14.85, best max range is 14.86
the range of weight becomes -0.56, 0.58
the original min range is -3.126953125, the original max range is 3.376953125
the module type is o_proj
the data type is torch.float16, the device is cuda:0
the activation range is -3.13, 3.38
the weight range is -0.49, 0.38
1.97 loss at iter 10
1.96 loss at iter 20
1.95 loss at iter 30
1.96 loss at iter 40
2.02 loss at iter 50
2.28 loss at iter 60
3.17 loss at iter 70
4.82 loss at iter 80
5.96 loss at iter 90
6.04 loss at iter 100
the best scale is 1.45, best min range is -2.33, best max range is 2.33
the range of weight becomes -0.49, 0.38
the original min range is -9.3984375, the original max range is 5.07421875
the module type is up_and_gate
the data type is torch.float16, the device is cuda:0
the activation range is -9.40, 5.07
the weight range is -0.31, 0.44
2.47 loss at iter 10
2.48 loss at iter 20
2.49 loss at iter 30
2.52 loss at iter 40
2.58 loss at iter 50
2.72 loss at iter 60
3.15 loss at iter 70
4.90 loss at iter 80
12.69 loss at iter 90
16.30 loss at iter 100
the best scale is 1.00, best min range is -9.40, best max range is 5.07
the range of weight becomes -0.31, 0.44
the original min range is -11.640625, the original max range is 8.7890625
the module type is down_proj
the data type is torch.float16, the device is cuda:0
the activation range is -11.64, 8.79
the weight range is -1.06, 0.63
4.32 loss at iter 10
4.32 loss at iter 20
4.31 loss at iter 30
4.32 loss at iter 40
4.38 loss at iter 50
4.84 loss at iter 60
6.84 loss at iter 70
12.73 loss at iter 80
32.62 loss at iter 90
55.35 loss at iter 100
the best scale is 1.56, best min range is -7.48, best max range is 7.48
the range of weight becomes -1.06, 0.63
the original min range is -20.953125, the original max range is 16.3125
the module type is qkv
the data type is torch.float16, the device is cuda:0
the activation range is -20.95, 16.31
the weight range is -1.00, 0.76
7.49 loss at iter 10
7.55 loss at iter 20
7.39 loss at iter 30
7.73 loss at iter 40
8.25 loss at iter 50
8.70 loss at iter 60
11.27 loss at iter 70
26.01 loss at iter 80
59.67 loss at iter 90
63.64 loss at iter 100
the best scale is 1.41, best min range is -14.91, best max range is 14.91
the range of weight becomes -1.00, 0.76
the original min range is -2.671875, the original max range is 3.009765625
the module type is o_proj
the data type is torch.float16, the device is cuda:0
the activation range is -2.67, 3.01
the weight range is -0.43, 0.39
2.38 loss at iter 10
2.37 loss at iter 20
2.36 loss at iter 30
2.37 loss at iter 40
2.47 loss at iter 50
2.89 loss at iter 60
4.00 loss at iter 70
5.09 loss at iter 80
5.52 loss at iter 90
5.57 loss at iter 100
the best scale is 1.45, best min range is -2.08, best max range is 2.08
the range of weight becomes -0.43, 0.39
the original min range is -9.234375, the original max range is 9.015625
the module type is up_and_gate
the data type is torch.float16, the device is cuda:0
the activation range is -9.23, 9.02
the weight range is -0.47, 0.37
3.23 loss at iter 10
3.24 loss at iter 20
3.27 loss at iter 30
3.31 loss at iter 40
3.41 loss at iter 50
3.61 loss at iter 60
4.35 loss at iter 70
6.65 loss at iter 80
17.53 loss at iter 90
20.01 loss at iter 100
the best scale is 1.03, best min range is -8.96, best max range is 8.96
the range of weight becomes -0.47, 0.37
the original min range is -15.53125, the original max range is 17.171875
the module type is down_proj
the data type is torch.float16, the device is cuda:0
the activation range is -15.53, 17.17
the weight range is -0.77, 0.47
5.87 loss at iter 10
5.88 loss at iter 20
5.92 loss at iter 30
5.98 loss at iter 40
6.13 loss at iter 50
6.59 loss at iter 60
8.55 loss at iter 70
15.44 loss at iter 80
40.77 loss at iter 90
80.62 loss at iter 100
the best scale is 1.03, best min range is -15.53, best max range is 16.66
the range of weight becomes -0.77, 0.47
the original min range is -15.9375, the original max range is 15.6015625
the module type is qkv
the data type is torch.float16, the device is cuda:0
the activation range is -15.94, 15.60
the weight range is -0.80, 0.95
7.22 loss at iter 10
7.61 loss at iter 20
7.65 loss at iter 30
7.27 loss at iter 40
8.81 loss at iter 50
11.42 loss at iter 60
27.57 loss at iter 70
73.74 loss at iter 80
98.04 loss at iter 90
93.47 loss at iter 100
the best scale is 1.17, best min range is -13.56, best max range is 13.56
the range of weight becomes -0.80, 0.95
the original min range is -2.77734375, the original max range is 2.994140625
the module type is o_proj
the data type is torch.float16, the device is cuda:0
the activation range is -2.78, 2.99
the weight range is -0.49, 0.50
1.88 loss at iter 10
1.87 loss at iter 20
1.87 loss at iter 30
1.86 loss at iter 40
1.93 loss at iter 50
2.28 loss at iter 60
3.10 loss at iter 70
3.89 loss at iter 80
4.18 loss at iter 90
4.21 loss at iter 100
the best scale is 1.58, best min range is -1.90, best max range is 1.90
the range of weight becomes -0.49, 0.50
the original min range is -8.6484375, the original max range is 4.97265625
the module type is up_and_gate
the data type is torch.float16, the device is cuda:0
the activation range is -8.65, 4.97
the weight range is -0.40, 0.34
3.34 loss at iter 10
3.35 loss at iter 20
3.37 loss at iter 30
3.40 loss at iter 40
3.48 loss at iter 50
3.71 loss at iter 60
4.37 loss at iter 70
6.65 loss at iter 80
16.03 loss at iter 90
16.36 loss at iter 100
the best scale is 1.00, best min range is -8.65, best max range is 4.97
the range of weight becomes -0.40, 0.34
the original min range is -14.765625, the original max range is 16.078125
the module type is down_proj
the data type is torch.float16, the device is cuda:0
the activation range is -14.77, 16.08
the weight range is -1.15, 0.55
6.12 loss at iter 10
6.12 loss at iter 20
6.12 loss at iter 30
6.12 loss at iter 40
6.17 loss at iter 50
6.67 loss at iter 60
9.09 loss at iter 70
16.98 loss at iter 80
44.55 loss at iter 90
81.86 loss at iter 100
the best scale is 1.42, best min range is -11.28, best max range is 11.28
the range of weight becomes -1.15, 0.55
the original min range is -18.96875, the original max range is 16.875
the module type is qkv
the data type is torch.float16, the device is cuda:0
the activation range is -18.97, 16.88
the weight range is -0.77, 0.68
17.57 loss at iter 10
16.84 loss at iter 20
17.56 loss at iter 30
15.99 loss at iter 40
15.99 loss at iter 50
17.78 loss at iter 60
25.08 loss at iter 70
80.55 loss at iter 80
136.00 loss at iter 90
129.57 loss at iter 100
the best scale is 1.84, best min range is -10.29, best max range is 10.29
the range of weight becomes -0.77, 0.68
the original min range is -3.53515625, the original max range is 4.0078125
the module type is o_proj
the data type is torch.float16, the device is cuda:0
the activation range is -3.54, 4.01
the weight range is -0.56, 0.57
2.40 loss at iter 10
2.39 loss at iter 20
2.39 loss at iter 30
2.38 loss at iter 40
2.42 loss at iter 50
2.75 loss at iter 60
3.88 loss at iter 70
5.49 loss at iter 80
6.34 loss at iter 90
6.40 loss at iter 100
the best scale is 1.64, best min range is -2.45, best max range is 2.45
the range of weight becomes -0.56, 0.57
the original min range is -8.4765625, the original max range is 5.10546875
the module type is up_and_gate
the data type is torch.float16, the device is cuda:0
the activation range is -8.48, 5.11
the weight range is -0.36, 0.37
4.52 loss at iter 10
4.53 loss at iter 20
4.55 loss at iter 30
4.60 loss at iter 40
4.74 loss at iter 50
5.17 loss at iter 60
6.25 loss at iter 70
9.95 loss at iter 80
20.72 loss at iter 90
20.77 loss at iter 100
the best scale is 1.00, best min range is -8.48, best max range is 5.11
the range of weight becomes -0.36, 0.37
the original min range is -16.421875, the original max range is 17.453125
the module type is down_proj
the data type is torch.float16, the device is cuda:0
the activation range is -16.42, 17.45
the weight range is -1.10, 0.66
8.14 loss at iter 10
8.12 loss at iter 20
8.20 loss at iter 30
8.25 loss at iter 40
8.52 loss at iter 50
9.89 loss at iter 60
14.24 loss at iter 70
26.02 loss at iter 80
64.42 loss at iter 90
104.72 loss at iter 100
the best scale is 1.30, best min range is -13.46, best max range is 13.46
the range of weight becomes -1.10, 0.66
the original min range is -15.453125, the original max range is 15.421875
the module type is qkv
the data type is torch.float16, the device is cuda:0
the activation range is -15.45, 15.42
the weight range is -0.72, 0.67
11.40 loss at iter 10
11.39 loss at iter 20
11.09 loss at iter 30
10.95 loss at iter 40
11.83 loss at iter 50
15.87 loss at iter 60
38.89 loss at iter 70
97.36 loss at iter 80
115.34 loss at iter 90
113.28 loss at iter 100
the best scale is 1.66, best min range is -9.31, best max range is 9.31
the range of weight becomes -0.72, 0.67
the original min range is -3.615234375, the original max range is 2.96875
the module type is o_proj
the data type is torch.float16, the device is cuda:0
the activation range is -3.62, 2.97
the weight range is -0.60, 0.68
1.89 loss at iter 10
1.89 loss at iter 20
1.87 loss at iter 30
1.86 loss at iter 40
1.90 loss at iter 50
2.17 loss at iter 60
3.03 loss at iter 70
4.02 loss at iter 80
4.36 loss at iter 90
4.37 loss at iter 100
the best scale is 1.66, best min range is -2.17, best max range is 2.17
the range of weight becomes -0.60, 0.68
the original min range is -8.828125, the original max range is 8.4921875
the module type is up_and_gate
the data type is torch.float16, the device is cuda:0
the activation range is -8.83, 8.49
the weight range is -0.57, 0.49
4.85 loss at iter 10
4.86 loss at iter 20
4.90 loss at iter 30
4.99 loss at iter 40
5.26 loss at iter 50
5.78 loss at iter 60
7.04 loss at iter 70
10.88 loss at iter 80
23.37 loss at iter 90
23.44 loss at iter 100
the best scale is 1.12, best min range is -7.87, best max range is 7.87
the range of weight becomes -0.57, 0.49
the original min range is -15.515625, the original max range is 21.625
the module type is down_proj
the data type is torch.float16, the device is cuda:0
the activation range is -15.52, 21.62
the weight range is -0.97, 0.68
8.36 loss at iter 10
8.36 loss at iter 20
8.38 loss at iter 30
8.45 loss at iter 40
8.59 loss at iter 50
9.28 loss at iter 60
12.02 loss at iter 70
21.68 loss at iter 80
56.77 loss at iter 90
112.42 loss at iter 100
the best scale is 1.12, best min range is -15.52, best max range is 19.25
the range of weight becomes -0.97, 0.68
the original min range is -14.8359375, the original max range is 15.703125
the module type is qkv
the data type is torch.float16, the device is cuda:0
the activation range is -14.84, 15.70
the weight range is -1.05, 0.88
11.63 loss at iter 10
11.73 loss at iter 20
10.97 loss at iter 30
11.28 loss at iter 40
12.19 loss at iter 50
14.08 loss at iter 60
39.56 loss at iter 70
101.92 loss at iter 80
129.06 loss at iter 90
127.76 loss at iter 100
the best scale is 1.81, best min range is -8.68, best max range is 8.68
the range of weight becomes -1.05, 0.88
the original min range is -5.03515625, the original max range is 2.736328125
the module type is o_proj
the data type is torch.float16, the device is cuda:0
the activation range is -5.04, 2.74
the weight range is -0.58, 0.57
2.18 loss at iter 10
2.18 loss at iter 20
2.18 loss at iter 30
2.18 loss at iter 40
2.19 loss at iter 50
2.24 loss at iter 60
2.63 loss at iter 70
3.88 loss at iter 80
5.16 loss at iter 90
5.41 loss at iter 100
the best scale is 1.42, best min range is -3.56, best max range is 2.74
the range of weight becomes -0.58, 0.57
the original min range is -7.9765625, the original max range is 5.1640625
the module type is up_and_gate
the data type is torch.float16, the device is cuda:0
the activation range is -7.98, 5.16
the weight range is -0.32, 0.31
5.93 loss at iter 10
5.94 loss at iter 20
5.95 loss at iter 30
6.02 loss at iter 40
6.23 loss at iter 50
6.71 loss at iter 60
7.99 loss at iter 70
13.86 loss at iter 80
25.12 loss at iter 90
25.12 loss at iter 100
the best scale is 1.00, best min range is -7.98, best max range is 5.16
the range of weight becomes -0.32, 0.31
the original min range is -20.828125, the original max range is 16.234375
the module type is down_proj
the data type is torch.float16, the device is cuda:0
the activation range is -20.83, 16.23
the weight range is -0.55, 1.11
10.42 loss at iter 10
10.41 loss at iter 20
10.30 loss at iter 30
10.30 loss at iter 40
10.51 loss at iter 50
11.53 loss at iter 60
15.94 loss at iter 70
29.19 loss at iter 80
75.50 loss at iter 90
131.69 loss at iter 100
the best scale is 1.47, best min range is -14.20, best max range is 14.20
the range of weight becomes -0.55, 1.11
the original min range is -14.90625, the original max range is 14.78125
the module type is qkv
the data type is torch.float16, the device is cuda:0
the activation range is -14.91, 14.78
the weight range is -0.71, 0.94
11.11 loss at iter 10
11.16 loss at iter 20
12.48 loss at iter 30
10.89 loss at iter 40
12.74 loss at iter 50
18.03 loss at iter 60
53.67 loss at iter 70
129.87 loss at iter 80
149.28 loss at iter 90
147.94 loss at iter 100
the best scale is 1.09, best min range is -13.71, best max range is 13.72
the range of weight becomes -0.71, 0.94
the original min range is -4.32421875, the original max range is 3.69921875
the module type is o_proj
the data type is torch.float16, the device is cuda:0
the activation range is -4.32, 3.70
the weight range is -0.55, 0.54
1.79 loss at iter 10
1.79 loss at iter 20
1.79 loss at iter 30
1.79 loss at iter 40
1.79 loss at iter 50
1.92 loss at iter 60
2.53 loss at iter 70
3.40 loss at iter 80
3.73 loss at iter 90
3.73 loss at iter 100
the best scale is 1.50, best min range is -2.89, best max range is 2.89
the range of weight becomes -0.55, 0.54
the original min range is -8.5078125, the original max range is 5.08203125
the module type is up_and_gate
the data type is torch.float16, the device is cuda:0
the activation range is -8.51, 5.08
the weight range is -0.30, 0.29
6.11 loss at iter 10
6.12 loss at iter 20
6.13 loss at iter 30
6.19 loss at iter 40
6.31 loss at iter 50
6.67 loss at iter 60
7.96 loss at iter 70
12.75 loss at iter 80
23.24 loss at iter 90
23.26 loss at iter 100
the best scale is 1.00, best min range is -8.51, best max range is 5.08
the range of weight becomes -0.30, 0.29
the original min range is -20.75, the original max range is 22.578125
the module type is down_proj
the data type is torch.float16, the device is cuda:0
the activation range is -20.75, 22.58
the weight range is -0.71, 0.73
9.72 loss at iter 10
9.69 loss at iter 20
9.68 loss at iter 30
9.68 loss at iter 40
9.90 loss at iter 50
11.12 loss at iter 60
15.84 loss at iter 70
30.08 loss at iter 80
79.97 loss at iter 90
143.31 loss at iter 100
the best scale is 1.49, best min range is -15.17, best max range is 15.16
the range of weight becomes -0.71, 0.73
the original min range is -14.453125, the original max range is 17.5
the module type is qkv
the data type is torch.float16, the device is cuda:0
the activation range is -14.45, 17.50
the weight range is -0.59, 0.88
22.00 loss at iter 10
21.40 loss at iter 20
22.88 loss at iter 30
20.25 loss at iter 40
21.11 loss at iter 50
22.12 loss at iter 60
47.24 loss at iter 70
159.28 loss at iter 80
213.67 loss at iter 90
205.85 loss at iter 100
the best scale is 1.66, best min range is -10.54, best max range is 10.54
the range of weight becomes -0.59, 0.88
the original min range is -5.5, the original max range is 6.7734375
the module type is o_proj
the data type is torch.float16, the device is cuda:0
the activation range is -5.50, 6.77
the weight range is -0.74, 0.77
2.90 loss at iter 10
2.91 loss at iter 20
2.90 loss at iter 30
2.91 loss at iter 40
2.91 loss at iter 50
2.97 loss at iter 60
3.35 loss at iter 70
5.13 loss at iter 80
7.45 loss at iter 90
7.63 loss at iter 100
the best scale is 1.07, best min range is -5.50, best max range is 6.30
the range of weight becomes -0.74, 0.77
the original min range is -8.1875, the original max range is 5.609375
the module type is up_and_gate
the data type is torch.float16, the device is cuda:0
the activation range is -8.19, 5.61
the weight range is -0.34, 0.37
6.53 loss at iter 10
6.54 loss at iter 20
6.56 loss at iter 30
6.61 loss at iter 40
6.74 loss at iter 50
7.20 loss at iter 60
8.74 loss at iter 70
14.69 loss at iter 80
24.33 loss at iter 90
24.36 loss at iter 100
the best scale is 1.00, best min range is -8.19, best max range is 5.61
the range of weight becomes -0.34, 0.37
the original min range is -25.0625, the original max range is 23.484375
the module type is down_proj
the data type is torch.float16, the device is cuda:0
the activation range is -25.06, 23.48
the weight range is -0.39, 0.33
10.21 loss at iter 10
10.21 loss at iter 20
10.21 loss at iter 30
10.25 loss at iter 40
10.56 loss at iter 50
11.90 loss at iter 60
16.80 loss at iter 70
31.94 loss at iter 80
87.03 loss at iter 90
161.81 loss at iter 100
the best scale is 1.39, best min range is -18.08, best max range is 18.08
the range of weight becomes -0.39, 0.33
the original min range is -14.5, the original max range is 16.03125
the module type is qkv
the data type is torch.float16, the device is cuda:0
the activation range is -14.50, 16.03
the weight range is -0.82, 0.69
22.18 loss at iter 10
22.97 loss at iter 20
24.90 loss at iter 30
27.05 loss at iter 40
24.52 loss at iter 50
29.62 loss at iter 60
71.34 loss at iter 70
179.91 loss at iter 80
208.45 loss at iter 90
205.40 loss at iter 100
the best scale is 1.04, best min range is -14.50, best max range is 15.38
the range of weight becomes -0.82, 0.69
the original min range is -6.5, the original max range is 3.759765625
the module type is o_proj
the data type is torch.float16, the device is cuda:0
the activation range is -6.50, 3.76
the weight range is -0.48, 0.48
2.61 loss at iter 10
2.61 loss at iter 20
2.62 loss at iter 30
2.63 loss at iter 40
2.67 loss at iter 50
2.81 loss at iter 60
3.44 loss at iter 70
5.65 loss at iter 80
7.39 loss at iter 90
7.50 loss at iter 100
the best scale is 1.15, best min range is -5.67, best max range is 3.76
the range of weight becomes -0.48, 0.48
the original min range is -7.75, the original max range is 6.08203125
the module type is up_and_gate
the data type is torch.float16, the device is cuda:0
the activation range is -7.75, 6.08
the weight range is -0.36, 0.42
8.01 loss at iter 10
8.01 loss at iter 20
8.01 loss at iter 30
8.07 loss at iter 40
8.27 loss at iter 50
9.27 loss at iter 60
11.89 loss at iter 70
20.00 loss at iter 80
28.12 loss at iter 90
28.08 loss at iter 100
the best scale is 1.33, best min range is -5.84, best max range is 5.84
the range of weight becomes -0.36, 0.42
the original min range is -31.5625, the original max range is 36.0625
the module type is down_proj
the data type is torch.float16, the device is cuda:0
the activation range is -31.56, 36.06
the weight range is -0.61, 1.31
12.43 loss at iter 10
12.42 loss at iter 20
12.40 loss at iter 30
12.40 loss at iter 40
12.54 loss at iter 50
13.16 loss at iter 60
16.19 loss at iter 70
28.25 loss at iter 80
79.99 loss at iter 90
195.18 loss at iter 100
the best scale is 1.54, best min range is -23.47, best max range is 23.48
the range of weight becomes -0.61, 1.31
the original min range is -14.734375, the original max range is 15.3828125
the module type is qkv
the data type is torch.float16, the device is cuda:0
the activation range is -14.73, 15.38
the weight range is -0.60, 1.03
12.14 loss at iter 10
12.49 loss at iter 20
12.91 loss at iter 30
12.45 loss at iter 40
14.85 loss at iter 50
22.68 loss at iter 60
70.38 loss at iter 70
157.74 loss at iter 80
168.66 loss at iter 90
166.85 loss at iter 100
the best scale is 1.16, best min range is -13.25, best max range is 13.24
the range of weight becomes -0.60, 1.03
the original min range is -5.03125, the original max range is 5.70703125
the module type is o_proj
the data type is torch.float16, the device is cuda:0
the activation range is -5.03, 5.71
the weight range is -0.73, 0.65
2.67 loss at iter 10
2.66 loss at iter 20
2.66 loss at iter 30
2.64 loss at iter 40
2.68 loss at iter 50
2.81 loss at iter 60
3.57 loss at iter 70
5.61 loss at iter 80
7.20 loss at iter 90
7.28 loss at iter 100
the best scale is 1.73, best min range is -3.30, best max range is 3.30
the range of weight becomes -0.73, 0.65
the original min range is -8.5078125, the original max range is 6.39453125
the module type is up_and_gate
the data type is torch.float16, the device is cuda:0
the activation range is -8.51, 6.39
the weight range is -0.35, 0.31
8.14 loss at iter 10
8.13 loss at iter 20
8.13 loss at iter 30
8.16 loss at iter 40
8.27 loss at iter 50
8.69 loss at iter 60
10.34 loss at iter 70
17.09 loss at iter 80
28.30 loss at iter 90
28.33 loss at iter 100
the best scale is 1.22, best min range is -7.00, best max range is 6.39
the range of weight becomes -0.35, 0.31
the original min range is -23.1875, the original max range is 26.84375
the module type is down_proj
the data type is torch.float16, the device is cuda:0
the activation range is -23.19, 26.84
the weight range is -0.66, 0.73
12.33 loss at iter 10
12.25 loss at iter 20
12.18 loss at iter 30
12.21 loss at iter 40
12.67 loss at iter 50
14.73 loss at iter 60
21.76 loss at iter 70
41.25 loss at iter 80
111.07 loss at iter 90
197.38 loss at iter 100
the best scale is 1.49, best min range is -18.02, best max range is 18.02
the range of weight becomes -0.66, 0.73
the original min range is -13.59375, the original max range is 13.953125
the module type is qkv
the data type is torch.float16, the device is cuda:0
the activation range is -13.59, 13.95
the weight range is -0.64, 0.58
20.61 loss at iter 10
20.83 loss at iter 20
19.56 loss at iter 30
17.19 loss at iter 40
21.61 loss at iter 50
50.43 loss at iter 60
144.57 loss at iter 70
249.55 loss at iter 80
268.50 loss at iter 90
268.79 loss at iter 100
the best scale is 1.58, best min range is -8.83, best max range is 8.83
the range of weight becomes -0.64, 0.58
the original min range is -5.00390625, the original max range is 4.2109375
the module type is o_proj
the data type is torch.float16, the device is cuda:0
the activation range is -5.00, 4.21
the weight range is -0.44, 0.65
2.32 loss at iter 10
2.32 loss at iter 20
2.32 loss at iter 30
2.33 loss at iter 40
2.39 loss at iter 50
2.71 loss at iter 60
3.77 loss at iter 70
5.33 loss at iter 80
6.07 loss at iter 90
6.09 loss at iter 100
the best scale is 1.15, best min range is -4.37, best max range is 4.21
the range of weight becomes -0.44, 0.65
the original min range is -8.6328125, the original max range is 7.19140625
the module type is up_and_gate
the data type is torch.float16, the device is cuda:0
the activation range is -8.63, 7.19
the weight range is -0.55, 0.46
10.11 loss at iter 10
10.11 loss at iter 20
10.14 loss at iter 30
10.24 loss at iter 40
10.60 loss at iter 50
11.41 loss at iter 60
14.12 loss at iter 70
24.32 loss at iter 80
37.30 loss at iter 90
37.30 loss at iter 100
the best scale is 1.01, best min range is -8.55, best max range is 7.19
the range of weight becomes -0.55, 0.46
the original min range is -28.03125, the original max range is 42.28125
the module type is down_proj
the data type is torch.float16, the device is cuda:0
the activation range is -28.03, 42.28
the weight range is -1.03, 0.32
15.27 loss at iter 10
15.27 loss at iter 20
15.27 loss at iter 30
15.29 loss at iter 40
15.41 loss at iter 50
15.99 loss at iter 60
19.01 loss at iter 70
32.20 loss at iter 80
89.95 loss at iter 90
239.02 loss at iter 100
the best scale is 1.03, best min range is -28.03, best max range is 41.03
the range of weight becomes -1.03, 0.32
the original min range is -16.328125, the original max range is 14.9765625
the module type is qkv
the data type is torch.float16, the device is cuda:0
the activation range is -16.33, 14.98
the weight range is -0.86, 0.69
15.88 loss at iter 10
14.18 loss at iter 20
14.75 loss at iter 30
13.85 loss at iter 40
14.46 loss at iter 50
20.92 loss at iter 60
63.70 loss at iter 70
182.89 loss at iter 80
224.62 loss at iter 90
223.63 loss at iter 100
the best scale is 1.61, best min range is -10.16, best max range is 10.16
the range of weight becomes -0.86, 0.69
the original min range is -5.93359375, the original max range is 6.46875
the module type is o_proj
the data type is torch.float16, the device is cuda:0
the activation range is -5.93, 6.47
the weight range is -0.69, 0.63
4.97 loss at iter 10
4.94 loss at iter 20
4.93 loss at iter 30
4.92 loss at iter 40
4.94 loss at iter 50
5.15 loss at iter 60
6.13 loss at iter 70
9.20 loss at iter 80
11.92 loss at iter 90
11.99 loss at iter 100
the best scale is 1.62, best min range is -3.99, best max range is 3.99
the range of weight becomes -0.69, 0.63
the original min range is -9.53125, the original max range is 7.609375
the module type is up_and_gate
the data type is torch.float16, the device is cuda:0
the activation range is -9.53, 7.61
the weight range is -0.41, 0.30
10.76 loss at iter 10
10.76 loss at iter 20
10.75 loss at iter 30
10.82 loss at iter 40
11.18 loss at iter 50
12.00 loss at iter 60
14.68 loss at iter 70
24.42 loss at iter 80
43.12 loss at iter 90
43.20 loss at iter 100
the best scale is 1.31, best min range is -7.27, best max range is 7.27
the range of weight becomes -0.41, 0.30
the original min range is -44.5625, the original max range is 46.65625
the module type is down_proj
the data type is torch.float16, the device is cuda:0
the activation range is -44.56, 46.66
the weight range is -0.40, 0.58
16.30 loss at iter 10
16.26 loss at iter 20
16.24 loss at iter 30
16.27 loss at iter 40
16.52 loss at iter 50
17.55 loss at iter 60
21.72 loss at iter 70
38.46 loss at iter 80
109.69 loss at iter 90
283.09 loss at iter 100
the best scale is 1.45, best min range is -32.22, best max range is 32.22
the range of weight becomes -0.40, 0.58
the original min range is -11.921875, the original max range is 13.2890625
the module type is qkv
the data type is torch.float16, the device is cuda:0
the activation range is -11.92, 13.29
the weight range is -0.79, 0.71
26.32 loss at iter 10
25.29 loss at iter 20
26.24 loss at iter 30
26.86 loss at iter 40
25.85 loss at iter 50
39.34 loss at iter 60
132.46 loss at iter 70
287.45 loss at iter 80
318.76 loss at iter 90
318.90 loss at iter 100
the best scale is 1.23, best min range is -10.78, best max range is 10.78
the range of weight becomes -0.79, 0.71
the original min range is -6.17578125, the original max range is 6.16796875
the module type is o_proj
the data type is torch.float16, the device is cuda:0
the activation range is -6.18, 6.17
the weight range is -0.56, 0.63
4.35 loss at iter 10
4.32 loss at iter 20
4.30 loss at iter 30
4.26 loss at iter 40
4.26 loss at iter 50
4.39 loss at iter 60
5.51 loss at iter 70
8.02 loss at iter 80
9.23 loss at iter 90
9.24 loss at iter 100
the best scale is 1.79, best min range is -3.44, best max range is 3.44
the range of weight becomes -0.56, 0.63
the original min range is -10.5078125, the original max range is 8.0390625
the module type is up_and_gate
the data type is torch.float16, the device is cuda:0
the activation range is -10.51, 8.04
the weight range is -0.63, 0.30
13.07 loss at iter 10
13.07 loss at iter 20
13.06 loss at iter 30
13.17 loss at iter 40
13.46 loss at iter 50
14.54 loss at iter 60
17.81 loss at iter 70
30.69 loss at iter 80
59.42 loss at iter 90
59.48 loss at iter 100
the best scale is 1.37, best min range is -7.70, best max range is 7.70
the range of weight becomes -0.63, 0.30
the original min range is -40.5, the original max range is 38.59375
the module type is down_proj
the data type is torch.float16, the device is cuda:0
the activation range is -40.50, 38.59
the weight range is -0.76, 0.50
20.59 loss at iter 10
20.55 loss at iter 20
20.53 loss at iter 30
20.55 loss at iter 40
20.89 loss at iter 50
23.04 loss at iter 60
31.54 loss at iter 70
59.09 loss at iter 80
161.67 loss at iter 90
327.46 loss at iter 100
the best scale is 1.49, best min range is -27.19, best max range is 27.19
the range of weight becomes -0.76, 0.50
the original min range is -12.546875, the original max range is 12.65625
the module type is qkv
the data type is torch.float16, the device is cuda:0
the activation range is -12.55, 12.66
the weight range is -0.77, 0.63
23.53 loss at iter 10
23.66 loss at iter 20
24.06 loss at iter 30
24.19 loss at iter 40
23.25 loss at iter 50
52.22 loss at iter 60
164.42 loss at iter 70
322.55 loss at iter 80
355.16 loss at iter 90
354.91 loss at iter 100
the best scale is 1.98, best min range is -6.38, best max range is 6.38
the range of weight becomes -0.77, 0.64
the original min range is -4.50390625, the original max range is 5.98828125
the module type is o_proj
the data type is torch.float16, the device is cuda:0
the activation range is -4.50, 5.99
the weight range is -0.78, 0.39
4.51 loss at iter 10
4.50 loss at iter 20
4.49 loss at iter 30
4.48 loss at iter 40
4.50 loss at iter 50
4.94 loss at iter 60
6.83 loss at iter 70
9.87 loss at iter 80
11.26 loss at iter 90
11.33 loss at iter 100
the best scale is 1.73, best min range is -3.46, best max range is 3.46
the range of weight becomes -0.78, 0.40
the original min range is -11.421875, the original max range is 8.328125
the module type is up_and_gate
the data type is torch.float16, the device is cuda:0
the activation range is -11.42, 8.33
the weight range is -0.40, 0.29
16.65 loss at iter 10
16.65 loss at iter 20
16.68 loss at iter 30
16.91 loss at iter 40
17.63 loss at iter 50
19.43 loss at iter 60
25.54 loss at iter 70
45.86 loss at iter 80
91.45 loss at iter 90
91.57 loss at iter 100
the best scale is 1.23, best min range is -9.27, best max range is 8.33
the range of weight becomes -0.40, 0.29
the original min range is -29.5, the original max range is 36.9375
the module type is down_proj
the data type is torch.float16, the device is cuda:0
the activation range is -29.50, 36.94
the weight range is -1.00, 0.57
26.45 loss at iter 10
26.46 loss at iter 20
26.45 loss at iter 30
26.48 loss at iter 40
26.81 loss at iter 50
28.54 loss at iter 60
37.01 loss at iter 70
68.02 loss at iter 80
183.29 loss at iter 90
361.07 loss at iter 100
the best scale is 1.14, best min range is -29.50, best max range is 32.53
the range of weight becomes -1.00, 0.57
the original min range is -12.6953125, the original max range is 12.1640625
the module type is qkv
the data type is torch.float16, the device is cuda:0
the activation range is -12.70, 12.16
the weight range is -0.74, 0.82
24.41 loss at iter 10
25.13 loss at iter 20
22.57 loss at iter 30
20.66 loss at iter 40
22.86 loss at iter 50
34.05 loss at iter 60
76.66 loss at iter 70
195.94 loss at iter 80
258.97 loss at iter 90
260.01 loss at iter 100
the best scale is 1.71, best min range is -7.41, best max range is 7.41
the range of weight becomes -0.74, 0.82
the original min range is -3.896484375, the original max range is 4.59375
the module type is o_proj
the data type is torch.float16, the device is cuda:0
the activation range is -3.90, 4.59
the weight range is -0.59, 0.63
4.05 loss at iter 10
4.05 loss at iter 20
4.04 loss at iter 30
4.09 loss at iter 40
4.17 loss at iter 50
4.90 loss at iter 60
6.82 loss at iter 70
8.64 loss at iter 80
9.26 loss at iter 90
9.27 loss at iter 100
the best scale is 1.38, best min range is -3.34, best max range is 3.34
the range of weight becomes -0.59, 0.66
the original min range is -11.8515625, the original max range is 9.1953125
the module type is up_and_gate
the data type is torch.float16, the device is cuda:0
the activation range is -11.85, 9.20
the weight range is -0.51, 0.55
25.19 loss at iter 10
25.20 loss at iter 20
24.99 loss at iter 30
25.56 loss at iter 40
27.68 loss at iter 50
31.75 loss at iter 60
42.52 loss at iter 70
71.88 loss at iter 80
139.81 loss at iter 90
140.06 loss at iter 100
the best scale is 1.40, best min range is -8.45, best max range is 8.45
the range of weight becomes -0.51, 0.55
the original min range is -28.953125, the original max range is 200.0
the module type is down_proj
the data type is torch.float16, the device is cuda:0
the activation range is -28.95, 200.00
the weight range is -0.98, 0.50
38.25 loss at iter 10
38.38 loss at iter 20
38.50 loss at iter 30
38.60 loss at iter 40
38.87 loss at iter 50
39.03 loss at iter 60
39.21 loss at iter 70
39.34 loss at iter 80
39.58 loss at iter 90
40.03 loss at iter 100
40.36 loss at iter 110
40.86 loss at iter 120
41.52 loss at iter 130
42.07 loss at iter 140
42.76 loss at iter 150
43.75 loss at iter 160
44.79 loss at iter 170
46.23 loss at iter 180
47.71 loss at iter 190
49.49 loss at iter 200
51.72 loss at iter 210
54.38 loss at iter 220
57.63 loss at iter 230
61.69 loss at iter 240
66.53 loss at iter 250
72.56 loss at iter 260
80.37 loss at iter 270
90.41 loss at iter 280
102.01 loss at iter 290
118.01 loss at iter 300
137.61 loss at iter 310
163.77 loss at iter 320
200.48 loss at iter 330
246.62 loss at iter 340
305.16 loss at iter 350
380.65 loss at iter 360
475.29 loss at iter 370
585.69 loss at iter 380
705.55 loss at iter 390
775.11 loss at iter 400
the best scale is 1.00, best min range is -28.95, best max range is 200.00
the range of weight becomes -0.98, 0.50
the original min range is -12.2421875, the original max range is 9.4765625
the module type is qkv
the data type is torch.float16, the device is cuda:0
the activation range is -12.24, 9.48
the weight range is -0.75, 0.82
49.53 loss at iter 10
49.31 loss at iter 20
47.80 loss at iter 30
50.71 loss at iter 40
47.56 loss at iter 50
63.99 loss at iter 60
171.92 loss at iter 70
339.59 loss at iter 80
409.09 loss at iter 90
409.08 loss at iter 100
the best scale is 1.95, best min range is -6.30, best max range is 6.29
the range of weight becomes -0.84, 0.97
the original min range is -6.50390625, the original max range is 9.0546875
the module type is o_proj
the data type is torch.float16, the device is cuda:0
the activation range is -6.50, 9.05
the weight range is -0.52, 0.62
6.41 loss at iter 10
6.40 loss at iter 20
6.37 loss at iter 30
6.47 loss at iter 40
6.56 loss at iter 50
7.04 loss at iter 60
8.63 loss at iter 70
14.16 loss at iter 80
22.00 loss at iter 90
23.01 loss at iter 100
the best scale is 1.40, best min range is -6.46, best max range is 6.46
the range of weight becomes -0.67, 0.62
the original min range is -12.46875, the original max range is 16.859375
the module type is up_and_gate
the data type is torch.float16, the device is cuda:0
the activation range is -12.47, 16.86
the weight range is -0.80, 0.71
1022.35 loss at iter 10
885.84 loss at iter 20
730.96 loss at iter 30
397.03 loss at iter 40
215.40 loss at iter 50
126.06 loss at iter 60
140.95 loss at iter 70
207.01 loss at iter 80
476.61 loss at iter 90
553.08 loss at iter 100
the best scale is 2.68, best min range is -6.30, best max range is 6.30
the range of weight becomes -0.80, 1.00
the original min range is -153.125, the original max range is inf
the module type is down_proj
the data type is torch.float16, the device is cuda:0
the activation range is -153.12, inf
the weight range is -1.30, 1.05
Traceback (most recent call last):
File "/data1/QQQ-main/examples/quant_model.py", line 88, in <module>
main()
File "/usr/local/miniconda3/envs/QQQ/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/data1/QQQ-main/examples/quant_model.py", line 61, in main
scale_list = smooth(model, tokenizer, q_config, args)
File "/usr/local/miniconda3/envs/QQQ/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/data1/QQQ-main/QQQ/smooth/smooth.py", line 138, in smooth
calibrate_batch(model, [fp_input[0]])
File "/usr/local/miniconda3/envs/QQQ/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/data1/QQQ-main/QQQ/smooth/smooth.py", line 94, in calibrate_batch
model(**batch)
File "/usr/local/miniconda3/envs/QQQ/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/miniconda3/envs/QQQ/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/data1/QQQ-main/QQQ/smooth/models/quant_llama.py", line 795, in forward
outputs = self.model(
File "/usr/local/miniconda3/envs/QQQ/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/miniconda3/envs/QQQ/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/data1/QQQ-main/QQQ/smooth/models/quant_llama.py", line 651, in forward
layer_outputs = decoder_layer(
File "/usr/local/miniconda3/envs/QQQ/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/miniconda3/envs/QQQ/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/data1/QQQ-main/QQQ/smooth/models/quant_llama.py", line 451, in forward
best_scale = migration(
File "/data1/QQQ-main/QQQ/smooth/quantization/migration_llama.py", line 28, in migration
migrator = search_class(act, weight, a_qconfig, w_qconfig, module_type, extra_dict)
File "/data1/QQQ-main/QQQ/smooth/quantization/migration_llama.py", line 246, in __init__
self.num = max(100, int(self.amx / 0.5))
OverflowError: cannot convert float infinity to integer

  2. So in the actual vLLM PR, is the activation quantization also a simple PyTorch operation?

image

@HandH1998
Owner

HandH1998 commented Jul 1, 2024

@brisker

  1. What version of the Transformers library are you using? I ran the same script with transformers==4.36.2 and everything worked as expected. If your Transformers version already matches, you can try another Llama model such as Llama2-13b, or another calibration dataset such as wikitext2.
  2. The code in your picture is only used for unit tests. During inference, activation quantization uses https://github.com/vllm-project/vllm/blob/614aa5120303ab09be78fb1db669da198cc43b02/csrc/quantization/compressed_tensors/int8_quant_kernels.cu#L43-L71.
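For readers who want a feel for what an int8 activation quantization step does, here is a minimal pure-Python sketch. It is not the vLLM CUDA kernel linked above and does not claim to match it; it assumes a symmetric scheme with one scale per token, and the names `quantize_int8`, `per_token_scale`, and `dequantize` are hypothetical helpers introduced only for illustration.

```python
def per_token_scale(values):
    """Assumed scale choice: map the largest magnitude in the token to 127."""
    amax = max(abs(x) for x in values)
    return amax / 127.0 if amax > 0 else 1.0


def quantize_int8(values, scale):
    """Symmetric int8 quantization: q = clamp(round(x / scale), -128, 127)."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    quantized = []
    for x in values:
        q = round(x / scale)  # Python's round() rounds halves to even
        quantized.append(max(-128, min(127, q)))  # saturate to the int8 range
    return quantized


def dequantize(q_values, scale):
    """Map int8 codes back to approximate real values."""
    return [q * scale for q in q_values]


# Toy activation values for a single token (illustrative only).
token = [-5.04, 0.0, 1.37, 2.74]
scale = per_token_scale(token)
q = quantize_int8(token, scale)
restored = dequantize(q, scale)
```

With round-to-nearest and no saturation, each restored value differs from the original by at most half a quantization step (`scale / 2`), which is why a smaller activation range gives a finer int8 grid.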

@brisker

brisker commented Jul 1, 2024

@HandH1998

  1. Here is my pip list; the transformers version is identical to yours. Is there any other difference? (Llama2-13b also fails: the loss rises steadily.) I still suspect there is a difference between the code in this repo and your local code.
Package                  Version     Editable project location
------------------------ ----------- ----------------------------------
absl-py                  2.1.0
accelerate               0.27.2
aiohttp                  3.9.5
aiosignal                1.3.1
async-timeout            4.0.3
attrs                    23.2.0
certifi                  2024.6.2
chardet                  5.2.0
charset-normalizer       3.3.2
click                    8.1.7
colorama                 0.4.6
contourpy                1.2.1
cycler                   0.12.1
DataProperty             1.0.1
datasets                 2.17.1
dill                     0.3.8
easydict                 1.13
evaluate                 0.4.2       /data1/QQQ-main/evaluate-0.4.2/src
filelock                 3.15.4
fonttools                4.53.0
frozenlist               1.4.1
fsspec                   2023.10.0
huggingface-hub          0.20.3
idna                     3.7
Jinja2                   3.1.4
joblib                   1.4.2
jsonlines                4.0.0
kiwisolver               1.4.5
lm_eval                  0.4.2       /data1/QQQ-main/lm_eval
lxml                     5.2.2
MarkupSafe               2.1.5
matplotlib               3.9.0
mbstrdecoder             1.1.3
more-itertools           10.3.0
mpmath                   1.3.0
multidict                6.0.5
multiprocess             0.70.16
networkx                 3.3
nltk                     3.8.1
numexpr                  2.10.1
numpy                    1.26.4
nvidia-cublas-cu12       12.1.3.1
nvidia-cuda-cupti-cu12   12.1.105
nvidia-cuda-nvrtc-cu12   12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12        8.9.2.26
nvidia-cufft-cu12        11.0.2.54
nvidia-curand-cu12       10.3.2.106
nvidia-cusolver-cu12     11.4.5.107
nvidia-cusparse-cu12     12.1.0.106
nvidia-nccl-cu12         2.19.3
nvidia-nvjitlink-cu12    12.5.40
nvidia-nvtx-cu12         12.1.105
packaging                24.1
pandas                   2.2.2
pathvalidate             3.2.0
peft                     0.11.1
pillow                   10.3.0
pip                      24.1.1
portalocker              2.10.0
psutil                   6.0.0
pyarrow                  16.1.0
pyarrow-hotfix           0.6
pybind11                 2.13.1
pyparsing                3.1.2
pytablewriter            1.2.0
python-dateutil          2.9.0.post0
pytz                     2024.1
PyYAML                   6.0.1
QQQ                      0.0.0       /data1/QQQ-main
regex                    2024.5.15
requests                 2.32.3
rouge_score              0.1.2
sacrebleu                2.4.2
safetensors              0.4.3
scikit-learn             1.5.0
scipy                    1.14.0
sentencepiece            0.2.0
setuptools               70.0.0
six                      1.16.0
sqlitedict               2.1.0
sympy                    1.12.1
tabledata                1.3.3
tabulate                 0.9.0
tcolorpy                 0.1.6
threadpoolctl            3.5.0
tokenizers               0.15.2
torch                    2.2.1
tqdm                     4.66.4
tqdm-multiprocess        0.0.11
transformers             4.36.2
triton                   2.2.0
typepy                   1.3.2
typing_extensions        4.12.2
tzdata                   2024.1
urllib3                  2.2.2
wheel                    0.43.0
word2number              1.1
xxhash                   3.4.1
yarl                     1.9.4
zstandard                0.22.0
  2. w4a8 is well supported in this version: https://github.com/HandH1998/vllm/tree/w4a8, which has not been merged yet, right?

@HandH1998
Owner

@brisker

  1. The other packages in your pip list don't matter. I ran the GitHub code and got the same result. I think you should get the same result too, since we set the random seed in the code. Maybe the dataset differs? Could you give me your email? I will send you the calibration dataset we are using.
    image

  2. Yes. But we modified our code according to the vLLM team's advice. If you want to reproduce the speedup reported in our paper, you can try the original vLLM w4a8 branch: https://github.com/HandH1998/vllm/tree/w4a8-fusion.

@brisker

brisker commented Jul 1, 2024

@HandH1998
Thanks for the reply! My email is hellowd@sjtu.edu.cn
The dataset I am using comes from the hansong-mit-han-lab Hugging Face Hub page. I think we may be using the same one.

@brisker

brisker commented Jul 1, 2024

@HandH1998
Using your Pile data, the loss still rises and the run ends with the same error as before. I am using an A800 GPU; I don't think that could cause a NaN loss.
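For context on the error itself: the traceback earlier in this thread fails at `self.num = max(100, int(self.amx / 0.5))`, and the preceding log line reports an activation max of `inf`. Python's `int()` cannot convert an infinite float, which produces exactly that `OverflowError`. The sketch below reproduces the arithmetic in isolation and adds an explicit finiteness check. The function name `search_steps` and its parameters are hypothetical; only the `max(100, int(amx / 0.5))` expression comes from the traceback. In float16, any value above 65504 becomes `inf`, so an overflow somewhere upstream is one possible source of such a range, though the log alone does not confirm it.

```python
import math


def search_steps(amx, step=0.5, min_steps=100):
    """Number of candidate ranges, mirroring max(100, int(amx / 0.5)).

    Raises a descriptive error instead of OverflowError when the
    activation maximum is inf or NaN.
    """
    if not math.isfinite(amx):
        raise ValueError(f"activation max is not finite: {amx}")
    return max(min_steps, int(amx / step))


# Reproduce the failure from the traceback without any guard.
try:
    int(float("inf") / 0.5)
    overflow_message = None
except OverflowError as exc:
    overflow_message = str(exc)

# Finite ranges taken from the log above.
steps_small = search_steps(46.66)   # down_proj with max range 46.66
steps_large = search_steps(200.0)   # down_proj with max range 200.00
```

Under this formula a max range of 200.0 gives 400 steps, which is consistent with the one `down_proj` log above that runs to iteration 400 while the other layers stop at 100.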

@brisker

brisker commented Jul 1, 2024

@HandH1998
I tried the w4a8-fusion branch of vLLM, but when I install it, I get this error:

-- CUDA target arches: 80-real
[0/8] Performing download step (git clone) for 'cutlass-populate'
Cloning into 'cutlass-src'...
fatal: unable to access 'https://github.com/nvidia/cutlass.git/': Failed to connect to github.com port 443: Connection refused
Cloning into 'cutlass-src'...
fatal: unable to access 'https://github.com/nvidia/cutlass.git/': Failed to connect to github.com port 443: Connection refused
Cloning into 'cutlass-src'...
fatal: unable to access 'https://github.com/nvidia/cutlass.git/': Failed to connect to github.com port 443: Connection refused
-- Had to git clone more than once: 3 times.
CMake Error at cutlass-subbuild/cutlass-populate-prefix/tmp/cutlass-populate-gitclone.cmake:39 (message):
  Failed to clone repository: 'https://github.com/nvidia/cutlass.git'


FAILED: cutlass-populate-prefix/src/cutlass-populate-stamp/cutlass-populate-download /data1/vllm-w4a8-fusion/build/temp.linux-x86_64-cpython-310/_deps/cutlass-subbuild/cutlass-populate-prefix/src/cutlass-populate-stamp/cutlass-populate-download
cd /data1/vllm-w4a8-fusion/build/temp.linux-x86_64-cpython-310/_deps && /opt/python-3.10.12/lib/python3.10/site-packages/cmake/data/bin/cmake -P /data1/vllm-w4a8-fusion/build/temp.linux-x86_64-cpython-310/_deps/cutlass-subbuild/cutlass-populate-prefix/tmp/cutlass-populate-gitclone.cmake && /opt/python-3.10.12/lib/python3.10/site-packages/cmake/data/bin/cmake -E touch /data1/vllm-w4a8-fusion/build/temp.linux-x86_64-cpython-310/_deps/cutlass-subbuild/cutlass-populate-prefix/src/cutlass-populate-stamp/cutlass-populate-download
ninja: build stopped: subcommand failed.

I currently have no access to github.com, so nothing can be downloaded during the build. I built cutlass from source myself, which succeeded, but the vllm install still fails with the same error. Any advice on this? Thanks in advance.

@HandH1998
Owner

@brisker I have never encountered this problem...

@brisker

brisker commented Jul 2, 2024

@HandH1998
So which version of cutlass are you using in w4a8-fusion branch here ??

@HandH1998
Owner

@HandH1998 So which version of cutlass are you using in w4a8-fusion branch here ??

The cutlass version can be found in CMakeLists.txt.
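
The build log above mentions `cutlass-populate` and `cutlass-subbuild`, which suggests cutlass is pulled in through CMake's FetchContent. As a rough sketch, the pinned tag could be read out of CMakeLists.txt like this. The excerpt, the placeholder tag `vX.Y.Z`, and the helper name `find_fetchcontent_tag` are illustrative assumptions, not copied from the branch.

```python
import re

# Hypothetical excerpt; the real declaration and tag live in the branch's CMakeLists.txt.
SAMPLE_CMAKELISTS = """
include(FetchContent)
FetchContent_Declare(
  cutlass
  GIT_REPOSITORY https://github.com/nvidia/cutlass.git
  GIT_TAG        vX.Y.Z
)
"""


def find_fetchcontent_tag(cmake_text, name):
    """Return the GIT_TAG of FetchContent_Declare(<name> ...), or None if absent."""
    # Capture everything between the dependency name and the closing parenthesis.
    pattern = re.compile(
        r"FetchContent_Declare\s*\(\s*" + re.escape(name) + r"\b(.*?)\)",
        re.IGNORECASE | re.DOTALL,
    )
    block = pattern.search(cmake_text)
    if block is None:
        return None
    tag = re.search(r"GIT_TAG\s+(\S+)", block.group(1))
    return tag.group(1) if tag else None


print(find_fetchcontent_tag(SAMPLE_CMAKELISTS, "cutlass"))  # prints vX.Y.Z
```

For an offline machine, CMake's standard `FETCHCONTENT_SOURCE_DIR_CUTLASS` cache variable can point FetchContent at a local checkout of that tag instead of cloning. Whether this branch's build forwards extra CMake arguments is not confirmed in this thread.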

@brisker

brisker commented Jul 2, 2024

@HandH1998
I have successfully quantized and run inference with w4a8 (per-channel w4, no grouping) in vllm, using the QQQ-quantized models and the demo you provided here

(the following speed results are directly summarized by vllm on the command line)

w4a8  Processed prompts: 100%|█████████████████████████████████████| 4/4 [00:00<00:00, 28.97it/s, Generation Speed: 463.72 toks/s]
fp16   Processed prompts: 100%|█████████████████████████████████████| 4/4 [00:00<00:00, 19.37it/s, Generation Speed: 309.96 toks/s]
  1. Is that speed as expected?
  2. With your pile data I still get NaN loss. Any further advice? (I am also wondering: even though its accuracy is wrong, is this model still a valid w4a8 model that gives a representative inference speed?)
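
For reference, the relative speedup implied by the two vllm progress lines above can be worked out directly. This is only arithmetic on the reported numbers (4 prompts, a single run), not a benchmark; the variable names are illustrative.

```python
# Throughput figures copied from the two vllm progress lines above.
w4a8_toks_per_s = 463.72
fp16_toks_per_s = 309.96
w4a8_prompts_per_s = 28.97  # the "it/s" column
fp16_prompts_per_s = 19.37

token_speedup = w4a8_toks_per_s / fp16_toks_per_s
prompt_speedup = w4a8_prompts_per_s / fp16_prompts_per_s

print(f"generation speedup: {token_speedup:.2f}x")    # 1.50x
print(f"prompt-rate speedup: {prompt_speedup:.2f}x")  # 1.50x
```

Both ratios agree at roughly 1.5x for w4a8 over fp16 on this run.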

@HandH1998
Owner

HandH1998 commented Jul 2, 2024

@brisker

  1. The speed looks normal.
  2. I have no idea what causes the NaN loss. You could try setting dtype to torch.float32 while smoothing the model, then setting dtype to 'half' before GPTQ. I am not sure this is the right way to fix the NaN issue, or how it will affect accuracy. The model's accuracy doesn't affect inference speed, i.e., the two models will run at the same speed.
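
As general background on why a wider dtype can help, here is a small sketch of how half precision can turn a large intermediate value into inf and then NaN. It uses only the standard library's binary16 packing. The activation and scale values are hypothetical, and nothing in this thread confirms that overflow is the actual cause of the NaN loss here.

```python
import struct


def to_half(x):
    """Round-trip a Python float through IEEE 754 binary16 (struct format 'e')."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct rejects values beyond the half range; a half tensor would hold inf.
        return float("inf") if x > 0 else float("-inf")


FP16_MAX = 65504.0  # largest finite binary16 value

activation = 300.0  # hypothetical outlier activation
scale = 250.0       # hypothetical smoothing factor

product = activation * scale  # 75000.0, representable in float32

print(to_half(FP16_MAX))                    # 65504.0
print(to_half(product))                     # inf
print(to_half(product) - to_half(product))  # nan, because inf - inf is undefined
```

Keeping such intermediates in float32 avoids this particular failure mode; casting back to half afterwards is only safe if the final values fit the half range.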
