Multithreading scalability on Ernie INT8 with oneDNN and Resnet50 without MKLDNN on CPU #43215
Hi! We have received your issue and will arrange for technical staff to answer your questions as soon as possible; please be patient. Please double-check that you have provided a clear problem description, reproduction code, environment & version, and error messages. You may also look for an answer in the official API documentation, FAQ, historical GitHub issues, and the AI community. Have a nice day!
@yaomichael, @lidanqing-intel Just to let you know that we are working on this one. To utilize parallelization effectively there needs to be more data to parallelize over, for example batches of data (batch size > 1). We also found that some oneDNN implementations do not run as fast as they should for bigger batches, so we are looking into that now. After we narrow down the problems with big-batch execution, we will work on the actual poor multi-threading scalability.
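Below is a minimal sketch (not the benchmark harness used in this issue) of how a Paddle Inference CPU predictor is typically configured so that the oneDNN kernels and intra-op threads have a batch of data to parallelize over; the model file paths and the input shape are placeholders, not taken from this issue.

```python
import numpy as np
import paddle.inference as paddle_infer

# Placeholder model files; any exported inference model is loaded the same way.
config = paddle_infer.Config("int8.pdmodel", "int8.pdiparams")
config.disable_gpu()                        # run on CPU
config.enable_mkldnn()                      # use oneDNN (MKL-DNN) kernels
config.set_cpu_math_library_num_threads(8)  # intra-op threads to scale across

predictor = paddle_infer.create_predictor(config)
input_name = predictor.get_input_names()[0]
input_handle = predictor.get_input_handle(input_name)

# Batch size > 1 gives the oneDNN kernels enough work to spread over threads.
batch = np.random.randint(0, 1000, size=(100, 128)).astype("int64")  # assumed token-id input
input_handle.reshape(list(batch.shape))
input_handle.copy_from_cpu(batch)
predictor.run()
```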
@jiangjiajun, @lidanqing-intel, @yaomichael We are now starting to look at the actual multi-threading issues for the supported models.
@jiangjiajun, @lidanqing-intel, @yaomichael
deep_md_test_no_model.zip
@jiangjiajun, @lidanqing-intel, @yaomichael We ran the Ernie-3.0 int8 multi-threading scalability test; here is a visual representation:
@jczaja Thank you very much for your investigation. Together with the email about Resnet50 INT8 throughput and latency, Resnet50 INT8 scalability is good, hence we can remove Resnet50 INT8 from the multithreading issue. Now we can focus on Ernie-3.0 multithreading scalability.
@jiangjiajun, @lidanqing-intel, @yaomichael We can see that for more threads there is a clear improvement in overall multi-threading performance (the green line is above the purple line).
I removed Paddle-Deepmd from this issue because the PaddleHPC multithreading issue has been converted to:
@jiangjiajun, @lidanqing-intel, @yaomichael, @wozna, @AleksanderStankiewicz An update on where we are with improving Ernie-3.0 int8 multi-threading scalability: we finished improving the parallelization of int8 matmul, and overall (whole-model) multi-threading scalability is now around ~6.6x (improved from 5.9 to 6.6). Here is the corresponding picture:
Next steps:
@jiangjiajun, @yaomichael, @AleksanderStankiewicz To improve beyond 6.6x for Ernie-3.0 we need to optimize (both single-threaded and multi-threaded) a CPU operator: lookup_table_v2. This operator is not optimized for single-threaded execution and does not take advantage of multi-threading capabilities; we will not improve further without having it optimized.
[Profiler Event Summary omitted; its columns were Event, Calls, Total, Min., Max., Ave., Ratio.]
If you have suggestions on speeding up this operator, please share.
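For reference, a sketch of how an operator-level Event Summary like the one quoted above can be obtained, assuming the same placeholder predictor setup as in the earlier sketch: Paddle Inference's Config.enable_profile() makes the predictor print per-operator timings when the process exits.

```python
import paddle.inference as paddle_infer

# Placeholder model files, same assumption as in the earlier sketch.
config = paddle_infer.Config("int8.pdmodel", "int8.pdiparams")
config.disable_gpu()
config.enable_mkldnn()
config.set_cpu_math_library_num_threads(8)
config.enable_profile()  # print a per-operator Event Summary (Calls/Total/Min./Max./Ave./Ratio) on exit

predictor = paddle_infer.create_predictor(config)
# ... feed inputs and call predictor.run() in a loop, then let the process exit
# so the profiler report (including the lookup_table_v2 timings) is printed.
```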
@jiangjiajun, @yaomichael, @AleksanderStankiewicz An update on this investigation. When the investigation of ERNIE-3.0 started here, only runtime quantization (i.e. --enable_quantize) was working, so that scenario was analyzed. In the meantime, running ernie-3.0 int8 from a saved model was fixed. So, for eight threads the performance scalability at batch_size 100 is ~6.5, which is a bit lower than previously measured, but the absolute QPS rose from ~660 QPS to ~910 QPS due to recent improvements and the fact that we dropped --enable_quantize. Regarding the improvements, there was a request that we quote the list of PRs that contributed to them:
Further improvements will become available over time as Intel oneDNN engineers keep improving the FC & Matmul operations inside oneDNN, as these are now the major reason why ERNIE-3.0 int8 does not get closer to 8x performance scalability for eight threads. On the PaddlePaddle integration side, this is the end of work on this issue.
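As a back-of-the-envelope summary of the figures quoted in the previous comment (8 threads, batch_size 100), the implied single-threaded throughput and parallel efficiency work out as follows; the arithmetic below only rearranges the reported numbers.

```python
# "Scalability" here is multi-threaded QPS divided by single-threaded QPS;
# the two inputs are the values reported in the comment above.
qps_8_threads = 910.0     # ~910 QPS reported for 8 threads, batch_size 100
reported_scaling = 6.5    # ~6.5x reported scalability

qps_1_thread = qps_8_threads / reported_scaling
print(f"implied single-threaded throughput: ~{qps_1_thread:.0f} QPS")   # ~140 QPS
print(f"parallel efficiency at 8 threads: {reported_scaling / 8:.0%}")  # ~81%
```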
@jiangjiajun @onecatcn @qingqing01 Let's close this optimization issue for multi-threaded Ernie-3.0.
paddle-deepmd multithreading without MKLDNN is worse than other frameworks
The Deepmd multithreading issue has been extracted into a simple demo + Paddle without lammps, which is in deep_md_test.zip. The text below only compares tensorflow and Paddle; to reproduce the deepmd + Paddle multithreading issue, you can skip the text below and go directly to Multithreading scalability on Ernie INT8 with oneDNN and Resnet50 without MKLDNN on CPU #43215 (comment)
Paddle Deepmd official website:
https://github.com/X4Science/paddle-deepmd
To more easily reproduce the multithreading issue (that paddle-deepmd multithreading is worse than tf deepmd-kit), see:
https://github.com/lidanqing-intel/deepmd-kit/blob/paddle-test/README.md
Reproduction environments:

Paddle version: eca6638
Test machine: Intel(R) Xeon(R) Platinum 8352Y CPU @ 2.20GHz (ICX)
Performance result:
Reproduce paddle-deepmd multithreading
PaddlePaddle Ernie-3.0 INT8 with MKLDNN: try to improve multithreading scalability
Paddle:
0d719718b308587efcb6b3547f925582a8009176
model download https://paddlenlp.bj.bcebos.com/models/transformers/ernie_3.0/ernie3.0_medium_inference_models.zip
After extracting the model archive, there will be 4 files, (float32.pdmodel, float32.pdiparams) and (int8.pdmodel, int8.pdiparams), which are the float32 model and the int8 quantized model.
Ernie-3.0 FP32 mkldnn, 1 thread on ICX is 65.45 QPS
python infer.py --task_name tnews --model_path /home/guest/PaddleNLP/model_zoo/ernie-3.0/ernie-3.0/float32 --perf --device cpu --num_threads 1
Ernie-3.0 INT8 mkldnn, 1 thread on ICX is 153.77 QPS
python infer.py --task_name tnews --model_path /home/guest/PaddleNLP/model_zoo/ernie-3.0/ernie-3.0/int8 --perf --device cpu --num_threads 1 --enable_quantize
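For convenience, here is a possible wrapper (not part of the original reproduction steps) for sweeping thread counts with the infer.py command shown above; the flags and the model path are copied from the INT8 command, and infer.py itself prints the QPS for each run.

```python
import subprocess

# Same model path as in the INT8 command above.
MODEL_PATH = "/home/guest/PaddleNLP/model_zoo/ernie-3.0/ernie-3.0/int8"

for num_threads in (1, 2, 4, 8):
    cmd = [
        "python", "infer.py",
        "--task_name", "tnews",
        "--model_path", MODEL_PATH,
        "--perf",
        "--device", "cpu",
        "--num_threads", str(num_threads),
        "--enable_quantize",  # per the later update, this can be dropped once the saved int8 model loads
    ]
    print(f"--- num_threads={num_threads} ---")
    subprocess.run(cmd, check=True)  # QPS is reported by infer.py itself
```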