
Multithreading scalability on Ernie INT8 with oneDNN and Resnet50 without MKLDNN on CPU #43215

Closed
lidanqing-intel opened this issue Jun 6, 2022 · 15 comments

@lidanqing-intel
Contributor

lidanqing-intel commented Jun 6, 2022

Paddle-deepmd multithreading performance without MKLDNN is worse than in other frameworks.

  • Reproduce paddle-test multithreading
git clone https://github.com/lidanqing-intel/deepmd-kit.git
cd deepmd-kit
git checkout paddle-test
bash compile_paddle.sh
source .bashrc
bash compile_deepmd.sh
bash compile_lammps.sh
cd setting/lmp
# single thread, single mpi and multi threads, multi mpi
bash lmp_pp.sh
  • Reproduce tf-test multithreading
git clone https://github.com/lidanqing-intel/deepmd-kit.git
cd deepmd-kit
git checkout tf-test
bash compile_tf.sh
source .bashrc
bash compile_deepmd.sh
bash compile_lammps.sh
cd setting/lmp_tf
bash lmp_tf.sh

PaddlePaddle Ernie-3.0 INT8 with MKLDNN: try to improve multithreading scalability

Paddle commit: 0d719718b308587efcb6b3547f925582a8009176
Model download: https://paddlenlp.bj.bcebos.com/models/transformers/ernie_3.0/ernie3.0_medium_inference_models.zip
After extracting the archive, there will be four files: (float32.pdmodel, float32.pdiparams) for the float32 model and (int8.pdmodel, int8.pdiparams) for the int8 quantized model.

git clone https://github.com/PaddlePaddle/PaddleNLP.git
cd PaddleNLP
pip install -r requirements.txt
python setup.py install
cd model_zoo/ernie-3.0
  • Ernie-3.0 FP32 mkldnn, 1 thread on ICX is 65.45 QPS
    python infer.py --task_name tnews --model_path /home/guest/PaddleNLP/model_zoo/ernie-3.0/ernie-3.0/float32 --perf --device cpu --num_threads 1

  • Ernie-3.0 INT8 mkldnn, 1 thread on ICX is 153.77 QPS
    python infer.py --task_name tnews --model_path /home/guest/PaddleNLP/model_zoo/ernie-3.0/ernie-3.0/int8 --perf --device cpu --num_threads 1 --enable_quantize
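
For context, here is a minimal sketch of how the CPU settings in the two infer.py invocations above map onto the Paddle Inference Python API that infer.py wraps (the model paths, input shape, and dummy inputs below are illustrative assumptions, not taken from infer.py):

import numpy as np
from paddle.inference import Config, create_predictor

# Hypothetical paths; point these at the extracted model files.
config = Config("ernie-3.0/int8.pdmodel", "ernie-3.0/int8.pdiparams")
config.disable_gpu()                        # run on CPU (--device cpu)
config.set_cpu_math_library_num_threads(1)  # equivalent of --num_threads 1
config.enable_mkldnn()                      # enable the oneDNN (MKLDNN) path
config.switch_ir_optim(True)                # graph-level optimization passes

predictor = create_predictor(config)

# Feed dummy int64 token ids; real inputs come from the tnews tokenizer.
for name in predictor.get_input_names():
    handle = predictor.get_input_handle(name)
    handle.copy_from_cpu(np.ones((1, 128), dtype="int64"))

predictor.run()
out_name = predictor.get_output_names()[0]
logits = predictor.get_output_handle(out_name).copy_to_cpu()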

@paddle-bot-old

paddle-bot-old bot commented Jun 6, 2022

Hi! We've received your issue; please be patient while we respond. We will arrange for technicians to answer your question as soon as possible. Please double-check that you have provided a clear problem description, reproduction code, environment & version, and error messages. You may also look for an answer in the official API documentation, the FAQ, historical GitHub issues, and the AI community. Have a nice day!

@lidanqing-intel lidanqing-intel changed the title CPU Ernie and deepmd multithreading issue Ernie and deepmd multithreading issue on CPU with and without MKLDNN Jun 6, 2022
@lidanqing-intel lidanqing-intel changed the title Ernie and deepmd multithreading issue on CPU with and without MKLDNN Multithreading issue on Ernie and deepmd on CPU with and without MKLDNN Jun 15, 2022
@lidanqing-intel lidanqing-intel changed the title Multithreading issue on Ernie and deepmd on CPU with and without MKLDNN Multithreading issue on Ernie and HPC deepmd on CPU with and without MKLDNN Jun 15, 2022
@lidanqing-intel lidanqing-intel changed the title Multithreading issue on Ernie and HPC deepmd on CPU with and without MKLDNN Multithreading issue on Resnet50, Ernie and HPC-deepmd on CPU with and without MKLDNN Jun 15, 2022
@jczaja
Contributor

jczaja commented Jun 27, 2022

@yaomichael , @lidanqing-intel

Just to let you know that we are working on this one. To utilize parallelization effectively there needs to be more data to parallelize over, for example batches of data (batch size > 1). We also found that some oneDNN implementations do not run as fast as they should for bigger batches, so we are looking at that now. After we narrow down the problems with big-batch execution, we will work on the actual poor multi-threading scalability.

@lidanqing-intel lidanqing-intel changed the title Multithreading issue on Resnet50, Ernie and HPC-deepmd on CPU with and without MKLDNN Multithreading issue on PaddleHPC-deepmd, Ernie and Resnet50 on CPU with and without MKLDNN Jun 29, 2022
@jczaja jczaja self-assigned this Jul 5, 2022
@jczaja
Contributor

jczaja commented Jul 5, 2022

@jiangjiajun , @lidanqing-intel , @yaomichael
So we started working on fixing the poor execution performance when batch_size is bigger than 1. We narrowed down a couple of problems and managed to fix some of them. To be more precise, it was mostly the FC (inner product) oneDNN operator's kernel that was performing poorly, and convolution for int8 was also underperforming. The first fix to FC (more to follow) is here: #44078 . Convolution was already fixed inside oneDNN, so we are waiting for the stable release (oneDNN 2.7) to introduce it to PaddlePaddle.

Now we are starting to look at the actual multi-threading issues for the supported models.

@jczaja
Contributor

jczaja commented Jul 6, 2022

@jiangjiajun , @lidanqing-intel , @yaomichael
So we merged the fixes for the poor performance of bigger batches, and I spent some time taking measurements for Resnet50 to retest the problem. Here is a visual representation:
[Figure: Resnet50 int8 scalability]
This figure presents the scalability of performance as we use more threads, for various batch sizes. For batch size 100 we get a bit over 50 FPS with one thread and a bit over 400 FPS with eight threads, which is almost 8x (near-linear scalability). For batch size 1, scalability is poorer. Overall it is not bad, i.e. I do not see any big problem for Resnet50 int8 on the processor I used (CLX). As a next step I will test other (non-convolutional) models like ERNIE int8.
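
As a side note, the "almost 8x" figure can be recomputed with a trivial helper (the FPS values are read off the chart above, so treat them as approximate):

def scaling(fps_1_thread: float, fps_n_threads: float, n: int):
    """Return (speedup, parallel efficiency) for an n-thread run."""
    speedup = fps_n_threads / fps_1_thread
    return speedup, speedup / n

# Approximate values from the Resnet50 int8, batch size 100 curve.
speedup, efficiency = scaling(50.0, 400.0, 8)
print(f"speedup ~{speedup:.1f}x, efficiency ~{efficiency:.0%}")  # ~8.0x, ~100%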

@lidanqing-intel
Contributor Author

lidanqing-intel commented Jul 7, 2022

deep_md_test_no_model.zip
You can reproduce Paddle+Deepmd (without lammps, which is clearer) with deep_md_test.zip, but the deepmd model is too big to upload here, so I sent it to you by email. If you haven't received the email, you can reach me :)

@jczaja
Contributor

jczaja commented Jul 8, 2022

@jiangjiajun , @lidanqing-intel , @yaomichael

We ran an Ernie-3.0 int8 multi-threading scalability test and here is a visual representation:
[Figure: Ernie-3.0 int8 scalability]
We can see that the performance scalability for batch size 100 is around 5.5x. I have not yet used affinity and numactl to bind resources to one socket, so perhaps we could reach 6x after adjustments, but that is still a bit far from 8x. NLP models like ERNIE are not entirely compute bound, so as a next step we will analyze whether we are memory bound, which would limit performance scalability.
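
As a rough sketch of the binding mentioned above, the CPU half of it can even be done from Python itself (the core ids assume a machine where cores 0-7 all belong to socket 0; memory binding still needs numactl or libnuma):

import os

# Set the thread count before paddle/oneDNN is imported, otherwise the
# OpenMP runtime may already be initialized with a different value.
os.environ["OMP_NUM_THREADS"] = "8"

# Pin this process (pid 0 = self) to cores 0-7, i.e. the assumed socket 0.
# This mirrors the CPU part of `numactl --cpunodebind=0`; it is a
# Linux-only call.
os.sched_setaffinity(0, set(range(8)))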

@lidanqing-intel
Contributor Author

@jczaja Thank you very much for your investigation. Together with the email about Resnet50 INT8 throughput and latency, Resnet50 INT8 scalability looks good. Hence we can remove Resnet50 INT8 from this multithreading issue and focus on Ernie-3.0 multithreading scalability.

@jczaja
Contributor

jczaja commented Jul 12, 2022

@jiangjiajun , @lidanqing-intel , @yaomichael
We dug deeper into the poor multi-threaded performance scalability of ernie-3.0 int8, i.e. we checked how execution time decreases as more threads are used. Here is a visual representation:
[Figure: ERNIE int8, batch size 100, per-operator scalability]
If there were good performance scalability, the plots of all ops would look like FC's; and indeed all of them have a similar shape, apart from matmul.
Here is a zoomed-in picture (with FC removed):
[Figure: ERNIE int8, batch size 100, per-operator scalability without FC]
So we can clearly see that matmul execution time does not decrease with more threads the way it does for the other ops.
We looked into the code, and the problem is that a batched matmul is executed as separate operations, because when we implemented the matmul op, oneDNN's matmul did not yet support batched operations. Batched matmul support has since been added to oneDNN.
So as a next step we will reimplement matmul to support batched processing. This should give more room for better scalability.
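
To illustrate the pattern being described (a numpy sketch of the semantics, not the oneDNN integration itself): unrolling a batched matmul into per-sample calls leaves each call with too little work to spread across threads, while a single batched call lets the backend parallelize over the batch dimension too.

import numpy as np

batch, m, k, n = 100, 64, 64, 64
a = np.random.rand(batch, m, k).astype(np.float32)
b = np.random.rand(batch, k, n).astype(np.float32)

# Old pattern: the batch is unrolled into `batch` independent matmuls,
# so each individual call is too small to parallelize well.
out_loop = np.stack([a[i] @ b[i] for i in range(batch)])

# New pattern: one batched matmul call over the whole [batch, m, k] tensor.
out_batched = np.matmul(a, b)

assert np.allclose(out_loop, out_batched, atol=1e-5)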

@lidanqing-intel lidanqing-intel changed the title Multithreading issue on PaddleHPC-deepmd, Ernie and Resnet50 on CPU with and without MKLDNN Multithreading scalability on Ernie INT8 with oneDNN and PaddleHPC-deepmd, Resnet50 without MKLDNN on CPU Jul 12, 2022
@jczaja
Contributor

jczaja commented Jul 19, 2022

@jiangjiajun , @lidanqing-intel , @yaomichael
Just to let you know where we are. We are still reimplementing oneDNN matmul so that it is capable of running batched matmul in parallel. Due to the complexity of the current (develop) codebase, we will first enable matmul fp32 (also used in Ernie-3.0 int8) to have batched matmul operations, and then as a second step we will extend matmul int8 in the same way. Currently we are still working on batched execution for matmul fp32.

@jczaja
Contributor

jczaja commented Aug 2, 2022

@jiangjiajun , @lidanqing-intel , @yaomichael
Some update to let you know where we are. We have implemented the first set of changes to speed up matmul (#44640) and are now working on similar changes for matmul int8. Here is a picture comparing the performance (ernie-3.0 int8) of the improved matmul fp32 (WW31) against the codebase without the changes (WW29):

[Figure: ernie-3.0 int8 scalability, WW29 vs WW31]

We can see that with more threads there is a clear improvement (the green line is above the purple line). Overall multi-threading performance scalability for ernie-3.0 int8 is around ~5.9x. When matmul int8 is improved in a similar way, there will be an additional performance gain.

@lidanqing-intel lidanqing-intel changed the title Multithreading scalability on Ernie INT8 with oneDNN and PaddleHPC-deepmd, Resnet50 without MKLDNN on CPU Multithreading scalability on Ernie INT8 with oneDNN Resnet50 without MKLDNN on CPU Aug 8, 2022
@lidanqing-intel lidanqing-intel changed the title Multithreading scalability on Ernie INT8 with oneDNN Resnet50 without MKLDNN on CPU Multithreading scalability on Ernie INT8 with oneDNN and Resnet50 without MKLDNN on CPU Aug 8, 2022
@lidanqing-intel
Contributor Author

I removed Paddle-Deepmd from this issue, because the PaddleHPC multithreading issue has been converted into a separate task:
Enabling Paddle-Deepmd using FP32 mkldnn.

@jczaja
Contributor

jczaja commented Aug 9, 2022

@jiangjiajun , @lidanqing-intel , @yaomichael , @wozna , @AleksanderStankiewicz

Some update on where we are with improving Ernie-3.0 int8 multi-threading scalability. We finished improving the parallelization of matmul int8, and the overall (whole-model) multi-threading performance scalability is now around ~6.6x (improved from 5.9x to 6.6x). Here is a corresponding picture:
[Figure: Ernie-3.0 int8 on CLX, oneDNN 2.6, NUMA-bound, scalability for 1-8 threads]

Next Steps:
We will do some more profiling and break the results down to the operator level to see if there is anything we can do to speed things up even more. As NLP models tend to be memory bound, at some point we will hit a memory bandwidth limitation.
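
One way to sanity-check the memory-bound hypothesis is to compare the model's achieved data movement against a practical bandwidth ceiling; a rough STREAM-style estimate can be measured as below (the array size and the three-arrays-touched accounting are assumptions of this sketch):

import time
import numpy as np

n = 50_000_000  # ~200 MB per float32 array, far larger than any CPU cache
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)
c = np.empty_like(a)

start = time.perf_counter()
np.add(a, b, out=c)          # one streaming pass: read a, read b, write c
elapsed = time.perf_counter() - start

bytes_moved = 3 * a.nbytes   # ignores write-allocate traffic on c
print(f"effective bandwidth ~{bytes_moved / elapsed / 1e9:.1f} GB/s")
# numpy runs this on a single core, so it approximates one core's streaming
# bandwidth; all eight inference threads share the socket-level total.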

@jczaja
Contributor

jczaja commented Aug 11, 2022

@jiangjiajun , @yaomichael , @AleksanderStankiewicz

To improve beyond 6.6x for Ernie-3.0 we need to optimize (both single-threaded and multi-threaded) a CPU operator: lookup_table_v2. This operator is not optimized for single-threaded execution and does not take advantage of multi-threading capabilities. We will not improve further without having it optimized.

------------------------- Event Summary -------------------------

Event                       Calls   Total     Min.       Max.      Ave.      Ratio
thread0::fc                 4636    12935.3   0.031992   26.2104   2.7902    0.711151
thread0::elementwise_add    2562    1285.18   0.172503   5.55736   0.501633  0.070656
thread0::lookup_table_v2    488     1006.65   1.34375    21.8321   2.06281   0.055343
thread0::matmul             1464    961.73    0.26922    6.0957    0.656919  0.0528734

If you have suggestions on speeding up this operator then please share.
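
For reference, lookup_table_v2 is essentially an embedding-row gather; a minimal sketch of its semantics (shapes and names here are illustrative, not Paddle's kernel) shows why it should parallelize well:

import numpy as np

vocab, hidden = 30_000, 768
table = np.random.rand(vocab, hidden).astype(np.float32)
ids = np.random.randint(0, vocab, size=(100, 128))  # [batch, seq_len]

# Semantics: gather one table row per token id -> [batch, seq_len, hidden].
out = table[ids]

# Each output row is an independent copy of `hidden` floats, so an
# optimized kernel could split the flattened id list across threads and
# use wide vectorized copies per row.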

@jczaja
Contributor

jczaja commented Aug 30, 2022

@jiangjiajun , @yaomichael , @AleksanderStankiewicz

Some update on this investigation. When the investigation of ERNIE-3.0 started here, only runtime quantization (i.e. --enable_quantize) was working, so that scenario was analyzed. In the meantime, running ernie-3.0 int8 from a saved model was fixed, and recently we switched to that use case. Here is a picture of the multi-threaded performance scalability:
[Figure: Ernie-3.0 int8 (quantized saved model), oneDNN 2.6, scalability]

So for eight threads the performance scalability at batch_size 100 is ~6.5x, which is a bit lower than previously measured, but the absolute throughput rose from ~660 QPS to ~910 QPS thanks to the recent improvements and the fact that we dropped --enable_quantize.

Regarding those improvements, there was a request that we quote the list of PRs that contributed to them:

Further improvements will become available over time, as Intel oneDNN engineers keep improving the FC & matmul operations inside oneDNN; these are now the major reason why ERNIE-3.0 int8 is not getting closer to 8x performance scalability for eight threads. On the PaddlePaddle integration side, this is the end of the work on this issue.

@yaomichael

@jiangjiajun @onecatcn @qingqing01 let's close this optimization issue for multi-threaded ernie3.0

@paddle-bot paddle-bot bot added status/close 已关闭 and removed status/new-issue 新建 labels Mar 15, 2023