
Multithreading scalability on Ernie INT8 with oneDNN and Resnet50 without MKLDNN on CPU #43215

Closed
lidanqing-intel opened this issue Jun 6, 2022 · 15 comments

@lidanqing-intel
Contributor

lidanqing-intel commented Jun 6, 2022

Paddle-deepmd multithreading performance without MKLDNN is worse than in other frameworks.

  • Reproduce paddle-test multithreading
git clone https://github.com/lidanqing-intel/deepmd-kit.git
cd deepmd-kit
git checkout paddle-test
bash compile_paddle.sh
source .bashrc
bash compile_deepmd.sh
bash compile_lammps.sh
cd setting/lmp
# single thread, single mpi and multi threads, multi mpi
bash lmp_pp.sh
  • Reproduce tf-test multithreading
git clone https://github.com/lidanqing-intel/deepmd-kit.git
cd deepmd-kit
git checkout tf-test
bash compile_tf.sh
source .bashrc
bash compile_deepmd.sh
bash compile_lammps.sh
cd setting/lmp_tf
bash lmp_tf.sh

PaddlePaddle Ernie-3.0 INT8 with MKLDNN: try to improve multithreading scalability

Paddle commit: 0d719718b308587efcb6b3547f925582a8009176
Model download: https://paddlenlp.bj.bcebos.com/models/transformers/ernie_3.0/ernie3.0_medium_inference_models.zip
After extracting the archive, there will be four files: (float32.pdmodel, float32.pdiparams) for the float32 model and (int8.pdmodel, int8.pdiparams) for the int8 quantized model.

git clone https://github.com/PaddlePaddle/PaddleNLP.git
cd PaddleNLP
pip install -r requirements.txt
python setup.py install
cd model_zoo/ernie-3.0
  • Ernie-3.0 FP32 mkldnn, 1 thread on ICX is 65.45 QPS
    python infer.py --task_name tnews --model_path /home/guest/PaddleNLP/model_zoo/ernie-3.0/ernie-3.0/float32 --perf --device cpu --num_threads 1

  • Ernie-3.0 INT8 mkldnn, 1 thread on ICX is 153.77 QPS
    python infer.py --task_name tnews --model_path /home/guest/PaddleNLP/model_zoo/ernie-3.0/ernie-3.0/int8 --perf --device cpu --num_threads 1 --enable_quantize
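
For context, here is a minimal sketch of how the CPU settings in the two infer.py invocations above map onto the Paddle Inference Python API that infer.py wraps (the model paths, input shape, and dummy inputs below are illustrative assumptions, not taken from infer.py):

import numpy as np
from paddle.inference import Config, create_predictor

# Hypothetical paths; point these at the extracted model files.
config = Config("ernie-3.0/int8.pdmodel", "ernie-3.0/int8.pdiparams")
config.disable_gpu()                        # run on CPU (--device cpu)
config.set_cpu_math_library_num_threads(1)  # equivalent of --num_threads 1
config.enable_mkldnn()                      # enable the oneDNN (MKLDNN) path
config.switch_ir_optim(True)                # graph-level optimization passes

predictor = create_predictor(config)

# Feed dummy int64 token ids; real inputs come from the tnews tokenizer.
for name in predictor.get_input_names():
    handle = predictor.get_input_handle(name)
    handle.copy_from_cpu(np.ones((1, 128), dtype="int64"))

predictor.run()
out_name = predictor.get_output_names()[0]
logits = predictor.get_output_handle(out_name).copy_to_cpu()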

@paddle-bot-old

paddle-bot-old bot commented Jun 6, 2022

Hi! We've received your issue; please be patient while we respond. We will arrange for technicians to answer your question as soon as possible. Please double-check that you have provided a clear problem description, reproduction code, environment & version, and error messages. You may also look for an answer in the official API documentation, the FAQ, historical GitHub issues, and the AI community. Have a nice day!

@lidanqing-intel lidanqing-intel changed the title CPU Ernie and deepmd multithreading issue Ernie and deepmd multithreading issue on CPU with and without MKLDNN Jun 6, 2022
@lidanqing-intel lidanqing-intel changed the title Ernie and deepmd multithreading issue on CPU with and without MKLDNN Multithreading issue on Ernie and deepmd on CPU with and without MKLDNN Jun 15, 2022
@lidanqing-intel lidanqing-intel changed the title Multithreading issue on Ernie and deepmd on CPU with and without MKLDNN Multithreading issue on Ernie and HPC deepmd on CPU with and without MKLDNN Jun 15, 2022
@lidanqing-intel lidanqing-intel changed the title Multithreading issue on Ernie and HPC deepmd on CPU with and without MKLDNN Multithreading issue on Resnet50, Ernie and HPC-deepmd on CPU with and without MKLDNN Jun 15, 2022
@jczaja
Contributor

jczaja commented Jun 27, 2022

@yaomichael , @lidanqing-intel

Just to let you know that we are working on this one. To utilize parallelization effectively there needs to be more data to parallelize over, for example batches of data (batch size > 1). We also found that some oneDNN implementations do not run as fast as they should for bigger batches, so we are looking at that now. After we narrow down the problems with big-batch execution, we will work on the actual poor multi-threading scalability.

@lidanqing-intel lidanqing-intel changed the title Multithreading issue on Resnet50, Ernie and HPC-deepmd on CPU with and without MKLDNN Multithreading issue on PaddleHPC-deepmd, Ernie and Resnet50 on CPU with and without MKLDNN Jun 29, 2022
@jczaja jczaja self-assigned this Jul 5, 2022
@jczaja
Contributor

jczaja commented Jul 5, 2022

@jiangjiajun , @lidanqing-intel , @yaomichael
So we started working on fixing the poor execution performance when batch_size is bigger than 1. We narrowed down a couple of problems and managed to fix some of them. To be more precise, it was mostly the FC (inner product) oneDNN operator's kernel that was performing poorly, and convolution for int8 was also underperforming. The first fix to FC (more to follow) is here: #44078 . Convolution was already fixed inside oneDNN, so we are waiting for the stable release (oneDNN 2.7) to introduce it to PaddlePaddle.

Now we are starting to look at the actual multi-threading issues for the supported models.

@jczaja
Contributor

jczaja commented Jul 6, 2022

@jiangjiajun , @lidanqing-intel , @yaomichael
So we merged the fixes for the poor performance of bigger batches, and I spent some time taking measurements for Resnet50 to retest the problem. Here is a visual representation:
[Figure: Resnet50 int8 scalability]
This figure presents the scalability of performance as we use more threads, for various batch sizes. For batch size 100 we get a bit over 50 FPS with one thread and a bit over 400 FPS with eight threads, which is almost 8x (near-linear scalability). For batch size 1, scalability is poorer. Overall it is not bad, i.e. I do not see any big problem for Resnet50 int8 on the processor I used (CLX). As a next step I will test other (non-convolutional) models like ERNIE int8.
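
As a side note, the "almost 8x" figure can be recomputed with a trivial helper (the FPS values are read off the chart above, so treat them as approximate):

def scaling(fps_1_thread: float, fps_n_threads: float, n: int):
    """Return (speedup, parallel efficiency) for an n-thread run."""
    speedup = fps_n_threads / fps_1_thread
    return speedup, speedup / n

# Approximate values from the Resnet50 int8, batch size 100 curve.
speedup, efficiency = scaling(50.0, 400.0, 8)
print(f"speedup ~{speedup:.1f}x, efficiency ~{efficiency:.0%}")  # ~8.0x, ~100%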

@lidanqing-intel
Contributor Author

lidanqing-intel commented Jul 7, 2022

deep_md_test_no_model.zip
You can reproduce Paddle+Deepmd (without lammps, which is clearer) with deep_md_test.zip, but the deepmd model is too big to upload here, so I sent it to you by email. If you haven't received the email, you can reach me :)

@jczaja
Contributor

jczaja commented Jul 8, 2022

@jiangjiajun , @lidanqing-intel , @yaomichael

We ran an Ernie-3.0 int8 multi-threading scalability test and here is a visual representation:
[Figure: Ernie-3.0 int8 scalability]
We can see that the performance scalability for batch size 100 is around 5.5x. I have not yet used affinity and numactl to bind resources to one socket, so perhaps we could reach 6x after adjustments, but that is still a bit far from 8x. NLP models like ERNIE are not entirely compute bound, so as a next step we will analyze whether we are memory bound, which would limit performance scalability.
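
As a rough sketch of the binding mentioned above, the CPU half of it can even be done from Python itself (the core ids assume a machine where cores 0-7 all belong to socket 0; memory binding still needs numactl or libnuma):

import os

# Set the thread count before paddle/oneDNN is imported, otherwise the
# OpenMP runtime may already be initialized with a different value.
os.environ["OMP_NUM_THREADS"] = "8"

# Pin this process (pid 0 = self) to cores 0-7, i.e. the assumed socket 0.
# This mirrors the CPU part of `numactl --cpunodebind=0`; it is a
# Linux-only call.
os.sched_setaffinity(0, set(range(8)))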

@lidanqing-intel
Contributor Author

@jczaja Thank you very much for your investigation. Together with the email about Resnet50 INT8 throughput and latency, Resnet50 INT8 scalability looks good. Hence we can remove Resnet50 INT8 from this multithreading issue and focus on Ernie-3.0 multithreading scalability.

@jczaja
Contributor

jczaja commented Jul 12, 2022

@jiangjiajun , @lidanqing-intel , @yaomichael
We dug deeper into the poor multi-threaded performance scalability of ernie-3.0 int8, i.e. we checked how execution time decreases as more threads are used. Here is a visual representation:
[Figure: ERNIE int8, batch size 100, per-operator scalability]
If there were good performance scalability, the plots of all ops would look like FC's; and indeed all of them have a similar shape, apart from matmul.
Here is a zoomed-in picture (with FC removed):
[Figure: ERNIE int8, batch size 100, per-operator scalability without FC]
So we can clearly see that matmul execution time does not decrease with more threads the way it does for the other ops.
We looked into the code, and the problem is that a batched matmul is executed as separate operations, because when we implemented the matmul op, oneDNN's matmul did not yet support batched operations. Batched matmul support has since been added to oneDNN.
So as a next step we will reimplement matmul to support batched processing. This should give more room for better scalability.
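
To illustrate the pattern being described (a numpy sketch of the semantics, not the oneDNN integration itself): unrolling a batched matmul into per-sample calls leaves each call with too little work to spread across threads, while a single batched call lets the backend parallelize over the batch dimension too.

import numpy as np

batch, m, k, n = 100, 64, 64, 64
a = np.random.rand(batch, m, k).astype(np.float32)
b = np.random.rand(batch, k, n).astype(np.float32)

# Old pattern: the batch is unrolled into `batch` independent matmuls,
# so each individual call is too small to parallelize well.
out_loop = np.stack([a[i] @ b[i] for i in range(batch)])

# New pattern: one batched matmul call over the whole [batch, m, k] tensor.
out_batched = np.matmul(a, b)

assert np.allclose(out_loop, out_batched, atol=1e-5)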

@lidanqing-intel lidanqing-intel changed the title Multithreading issue on PaddleHPC-deepmd, Ernie and Resnet50 on CPU with and without MKLDNN Multithreading scalability on Ernie INT8 with oneDNN and PaddleHPC-deepmd, Resnet50 without MKLDNN on CPU Jul 12, 2022
@jczaja
Contributor

jczaja commented Jul 19, 2022

@jiangjiajun , @lidanqing-intel , @yaomichael
Just to let you know where we are. We are still reimplementing oneDNN matmul so that it is capable of running batched matmul in parallel. Due to the complexity of the current (develop) codebase, we will first enable matmul fp32 (also used in Ernie-3.0 int8) to have batched matmul operations, and then as a second step we will extend matmul int8 in the same way. Currently we are still working on batched execution for matmul fp32.

@jczaja
Contributor

jczaja commented Aug 2, 2022

@jiangjiajun , @lidanqing-intel , @yaomichael
Some update to let you know where we are. We have implemented the first set of changes to speed up matmul (#44640) and are now working on similar changes for matmul int8. Here is a picture comparing the performance (ernie-3.0 int8) of the improved matmul fp32 (WW31) against the codebase without the changes (WW29):

[Figure: ernie-3.0 int8 scalability, WW29 vs WW31]

We can see that with more threads there is a clear improvement (the green line is above the purple line). Overall multi-threading performance scalability for ernie-3.0 int8 is around ~5.9x. When matmul int8 is improved in a similar way, there will be an additional performance gain.

@lidanqing-intel lidanqing-intel changed the title Multithreading scalability on Ernie INT8 with oneDNN and PaddleHPC-deepmd, Resnet50 without MKLDNN on CPU Multithreading scalability on Ernie INT8 with oneDNN Resnet50 without MKLDNN on CPU Aug 8, 2022
@lidanqing-intel lidanqing-intel changed the title Multithreading scalability on Ernie INT8 with oneDNN Resnet50 without MKLDNN on CPU Multithreading scalability on Ernie INT8 with oneDNN and Resnet50 without MKLDNN on CPU Aug 8, 2022
@lidanqing-intel
Contributor Author

I removed Paddle-Deepmd from this issue, because the PaddleHPC multithreading issue has been converted into a separate task:
Enabling Paddle-Deepmd using FP32 mkldnn.

@jczaja
Contributor

jczaja commented Aug 9, 2022

@jiangjiajun , @lidanqing-intel , @yaomichael , @wozna , @AleksanderStankiewicz

Some update on where we are with improving Ernie-3.0 int8 multi-threading scalability. We finished improving the parallelization of matmul int8, and the overall (whole-model) multi-threading performance scalability is now around ~6.6x (improved from 5.9x to 6.6x). Here is a corresponding picture:
[Figure: Ernie-3.0 int8 on CLX, oneDNN 2.6, NUMA-bound, scalability for 1-8 threads]

Next Steps:
We will do some more profiling and break the results down to the operator level to see if there is anything we can do to speed things up even more. As NLP models tend to be memory bound, at some point we will hit a memory bandwidth limitation.
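
One way to sanity-check the memory-bound hypothesis is to compare the model's achieved data movement against a practical bandwidth ceiling; a rough STREAM-style estimate can be measured as below (the array size and the three-arrays-touched accounting are assumptions of this sketch):

import time
import numpy as np

n = 50_000_000  # ~200 MB per float32 array, far larger than any CPU cache
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)
c = np.empty_like(a)

start = time.perf_counter()
np.add(a, b, out=c)          # one streaming pass: read a, read b, write c
elapsed = time.perf_counter() - start

bytes_moved = 3 * a.nbytes   # ignores write-allocate traffic on c
print(f"effective bandwidth ~{bytes_moved / elapsed / 1e9:.1f} GB/s")
# numpy runs this on a single core, so it approximates one core's streaming
# bandwidth; all eight inference threads share the socket-level total.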

@jczaja
Contributor

jczaja commented Aug 11, 2022

@jiangjiajun , @yaomichael , @AleksanderStankiewicz

To improve beyond 6.6x for Ernie-3.0 we need to optimize (both single-threaded and multi-threaded) a CPU operator: lookup_table_v2. This operator is not optimized for single-threaded execution and does not take advantage of multi-threading capabilities. We will not improve further without having it optimized.

------------------------- Event Summary -------------------------

Event                       Calls   Total     Min.       Max.      Ave.      Ratio
thread0::fc                 4636    12935.3   0.031992   26.2104   2.7902    0.711151
thread0::elementwise_add    2562    1285.18   0.172503   5.55736   0.501633  0.070656
thread0::lookup_table_v2    488     1006.65   1.34375    21.8321   2.06281   0.055343
thread0::matmul             1464    961.73    0.26922    6.0957    0.656919  0.0528734

If you have suggestions on speeding up this operator then please share.
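
For reference, lookup_table_v2 is essentially an embedding-row gather; a minimal sketch of its semantics (shapes and names here are illustrative, not Paddle's kernel) shows why it should parallelize well:

import numpy as np

vocab, hidden = 30_000, 768
table = np.random.rand(vocab, hidden).astype(np.float32)
ids = np.random.randint(0, vocab, size=(100, 128))  # [batch, seq_len]

# Semantics: gather one table row per token id -> [batch, seq_len, hidden].
out = table[ids]

# Each output row is an independent copy of `hidden` floats, so an
# optimized kernel could split the flattened id list across threads and
# use wide vectorized copies per row.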

@jczaja
Contributor

jczaja commented Aug 30, 2022

@jiangjiajun , @yaomichael , @AleksanderStankiewicz

Some update on this investigation. When the investigation of ERNIE-3.0 started here, only runtime quantization (i.e. --enable_quantize) was working, so that scenario was analyzed. In the meantime, running ernie-3.0 int8 from a saved model was fixed, and recently we switched to that use case. Here is a picture of the multi-threaded performance scalability:
[Figure: Ernie-3.0 int8 (quantized saved model), oneDNN 2.6, scalability]

So for eight threads the performance scalability at batch_size 100 is ~6.5x, which is a bit lower than previously measured, but the absolute throughput rose from ~660 QPS to ~910 QPS thanks to the recent improvements and the fact that we dropped --enable_quantize.

Regarding those improvements, there was a request that we quote the list of PRs that contributed to them:

Further improvements will become available over time, as Intel oneDNN engineers keep improving the FC & matmul operations inside oneDNN; these are now the major reason why ERNIE-3.0 int8 is not getting closer to 8x performance scalability for eight threads. On the PaddlePaddle integration side, this is the end of the work on this issue.

@yaomichael

@jiangjiajun @onecatcn @qingqing01 let's close this optimization issue for multi-threaded ernie3.0

@paddle-bot paddle-bot bot added status/close 已关闭 and removed status/new-issue 新建 labels Mar 15, 2023