
[Track] DeepSeek V3/R1 accuracy #3486

Open
zhyncs opened this issue Feb 11, 2025 · 3 comments

Comments

zhyncs (Member) commented Feb 11, 2025

conclusion

gsm8k and mmlu results are fully consistent with the official release

server

python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-R1 --tp 8 --trust-remote-code

gsm8k

python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319
Accuracy: 0.955
Invalid: 0.000
Latency: 109.212 s
Output throughput: 1244.611 token/s
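As a quick arithmetic sanity check on the figures above (not part of the benchmark output), the implied total output token count is simply throughput multiplied by latency:

```python
# Reported gsm8k metrics from the run above
latency_s = 109.212
throughput_tok_per_s = 1244.611

# Implied total output tokens across the 1319-question run
total_output_tokens = latency_s * throughput_tok_per_s
print(f"~{total_output_tokens:,.0f} output tokens")  # roughly 136k
```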

mmlu

bash benchmark/mmlu/download_data.sh
python3 benchmark/mmlu/bench_sglang.py --nsub 100 --ntrain 5 --parallel 2000
subject: abstract_algebra, #q:100, acc: 0.750
subject: anatomy, #q:135, acc: 0.844
subject: astronomy, #q:152, acc: 0.941
subject: business_ethics, #q:100, acc: 0.870
subject: clinical_knowledge, #q:265, acc: 0.921
subject: college_biology, #q:144, acc: 0.972
subject: college_chemistry, #q:100, acc: 0.630
subject: college_computer_science, #q:100, acc: 0.860
subject: college_mathematics, #q:100, acc: 0.770
subject: college_medicine, #q:173, acc: 0.884
subject: college_physics, #q:102, acc: 0.833
subject: computer_security, #q:100, acc: 0.880
subject: conceptual_physics, #q:235, acc: 0.928
subject: econometrics, #q:114, acc: 0.754
subject: electrical_engineering, #q:145, acc: 0.883
subject: elementary_mathematics, #q:378, acc: 0.942
subject: formal_logic, #q:126, acc: 0.794
subject: global_facts, #q:100, acc: 0.670
subject: high_school_biology, #q:310, acc: 0.955
subject: high_school_chemistry, #q:203, acc: 0.847
subject: high_school_computer_science, #q:100, acc: 0.950
subject: high_school_european_history, #q:165, acc: 0.891
subject: high_school_geography, #q:198, acc: 0.965
subject: high_school_government_and_politics, #q:193, acc: 0.990
subject: high_school_macroeconomics, #q:390, acc: 0.921
subject: high_school_mathematics, #q:270, acc: 0.756
subject: high_school_microeconomics, #q:238, acc: 0.966
subject: high_school_physics, #q:151, acc: 0.828
subject: high_school_psychology, #q:545, acc: 0.971
subject: high_school_statistics, #q:216, acc: 0.856
subject: high_school_us_history, #q:204, acc: 0.956
subject: high_school_world_history, #q:237, acc: 0.945
subject: human_aging, #q:223, acc: 0.852
subject: human_sexuality, #q:131, acc: 0.939
subject: international_law, #q:121, acc: 0.959
subject: jurisprudence, #q:108, acc: 0.917
subject: logical_fallacies, #q:163, acc: 0.920
subject: machine_learning, #q:112, acc: 0.786
subject: management, #q:103, acc: 0.932
subject: marketing, #q:234, acc: 0.949
subject: medical_genetics, #q:100, acc: 0.940
subject: miscellaneous, #q:783, acc: 0.957
subject: moral_disputes, #q:346, acc: 0.887
subject: moral_scenarios, #q:895, acc: 0.773
subject: nutrition, #q:306, acc: 0.915
subject: philosophy, #q:311, acc: 0.897
subject: prehistory, #q:324, acc: 0.935
subject: professional_accounting, #q:282, acc: 0.865
subject: professional_law, #q:1534, acc: 0.702
subject: professional_medicine, #q:272, acc: 0.949
subject: professional_psychology, #q:612, acc: 0.913
subject: public_relations, #q:110, acc: 0.836
subject: security_studies, #q:245, acc: 0.890
subject: sociology, #q:201, acc: 0.960
subject: us_foreign_policy, #q:100, acc: 0.930
subject: virology, #q:166, acc: 0.584
subject: world_religions, #q:171, acc: 0.924
Total latency: 274.759 s
Average accuracy: 0.871
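For anyone re-deriving the average from the per-subject lines above, a question-weighted (micro) average can be computed with a short sketch like this. The three subjects below are only a subset for illustration, and the benchmark script's exact aggregation may differ:

```python
import re

# A few of the per-subject result lines reported above (subset for illustration)
lines = [
    "subject: abstract_algebra, #q:100, acc: 0.750",
    "subject: anatomy, #q:135, acc: 0.844",
    "subject: astronomy, #q:152, acc: 0.941",
]

pattern = re.compile(r"subject: (\w+), #q:(\d+), acc: ([\d.]+)")

total_q = 0
total_correct = 0.0
for line in lines:
    m = pattern.match(line)
    n, acc = int(m.group(2)), float(m.group(3))
    total_q += n
    total_correct += n * acc  # weight each subject by its question count

micro_avg = total_correct / total_q
print(f"questions: {total_q}, weighted accuracy: {micro_avg:.3f}")
```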
yinfan98 (Contributor) commented Feb 11, 2025

Some accuracy results for DeepSeek-V3 on 8 * H20, cc: @zhyncs

Server

python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code --mem-fraction-static 0.9

gsm8k

python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319
Accuracy: 0.950
Invalid: 0.000
Latency: 236.747 s
Output throughput: 587.916 token/s

mmlu

bash benchmark/mmlu/download_data.sh
python3 benchmark/mmlu/bench_sglang.py --nsub 100 --ntrain 5 --parallel 2000
subject: abstract_algebra, #q:100, acc: 0.820
subject: anatomy, #q:135, acc: 0.881
subject: astronomy, #q:152, acc: 0.934
subject: business_ethics, #q:100, acc: 0.870
subject: clinical_knowledge, #q:265, acc: 0.917
subject: college_biology, #q:144, acc: 0.965
subject: college_chemistry, #q:100, acc: 0.650
subject: college_computer_science, #q:100, acc: 0.830
subject: college_mathematics, #q:100, acc: 0.800
subject: college_medicine, #q:173, acc: 0.867
subject: college_physics, #q:102, acc: 0.814
subject: computer_security, #q:100, acc: 0.890
subject: conceptual_physics, #q:235, acc: 0.949
subject: econometrics, #q:114, acc: 0.807
subject: electrical_engineering, #q:145, acc: 0.876
subject: elementary_mathematics, #q:378, acc: 0.944
subject: formal_logic, #q:126, acc: 0.810
subject: global_facts, #q:100, acc: 0.730
subject: high_school_biology, #q:310, acc: 0.958
subject: high_school_chemistry, #q:203, acc: 0.897
subject: high_school_computer_science, #q:100, acc: 0.950
subject: high_school_european_history, #q:165, acc: 0.885
subject: high_school_geography, #q:198, acc: 0.960
subject: high_school_government_and_politics, #q:193, acc: 0.990
subject: high_school_macroeconomics, #q:390, acc: 0.931
subject: high_school_mathematics, #q:270, acc: 0.752
subject: high_school_microeconomics, #q:238, acc: 0.954
subject: high_school_physics, #q:151, acc: 0.834
subject: high_school_psychology, #q:545, acc: 0.961
subject: high_school_statistics, #q:216, acc: 0.861
subject: high_school_us_history, #q:204, acc: 0.961
subject: high_school_world_history, #q:237, acc: 0.949
subject: human_aging, #q:223, acc: 0.870
subject: human_sexuality, #q:131, acc: 0.924
subject: international_law, #q:121, acc: 0.975
subject: jurisprudence, #q:108, acc: 0.907
subject: logical_fallacies, #q:163, acc: 0.914
subject: machine_learning, #q:112, acc: 0.857
subject: management, #q:103, acc: 0.961
subject: marketing, #q:234, acc: 0.962
subject: medical_genetics, #q:100, acc: 0.960
subject: miscellaneous, #q:783, acc: 0.962
subject: moral_disputes, #q:346, acc: 0.864
subject: moral_scenarios, #q:895, acc: 0.806
subject: nutrition, #q:306, acc: 0.922
subject: philosophy, #q:311, acc: 0.929
subject: prehistory, #q:324, acc: 0.935
subject: professional_accounting, #q:282, acc: 0.869
subject: professional_law, #q:1534, acc: 0.720
subject: professional_medicine, #q:272, acc: 0.952
subject: professional_psychology, #q:612, acc: 0.907
subject: public_relations, #q:110, acc: 0.809
subject: security_studies, #q:245, acc: 0.869
subject: sociology, #q:201, acc: 0.945
subject: us_foreign_policy, #q:100, acc: 0.950
subject: virology, #q:166, acc: 0.578
subject: world_religions, #q:171, acc: 0.930
Total latency: 435.171 s
Average accuracy: 0.878

ictzyqq commented Feb 17, 2025

It seems that DeepSeek-V3/R1 served with SGLang cannot reach the 88.5/90.8 MMLU accuracy claimed in the paper. How can the paper's MMLU numbers be reproduced?

yinfan98 (Contributor) commented

EP8 DeepSeek-V3 accuracy test for PR #3602, cc: @zhyncs @sleepcoo

Device

8 * H200

Server

python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code --enable-dp-attention --enable-ep-moe

gsm8k

python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319
Accuracy: 0.952
Invalid: 0.000
Latency: 154.554 s
Output throughput: 893.419 token/s
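Since all three gsm8k runs in this thread cover the same 1319 questions, their implied total output token counts (throughput × latency) should land close together; a quick cross-check on the reported numbers:

```python
# (latency_s, output_throughput_tok_per_s) for the three gsm8k runs in this thread
runs = {
    "R1 TP8": (109.212, 1244.611),
    "V3 TP8 (8 * H20)": (236.747, 587.916),
    "V3 EP8 (8 * H200)": (154.554, 893.419),
}

tokens = {name: lat * tput for name, (lat, tput) in runs.items()}
for name, tok in tokens.items():
    print(f"{name}: ~{tok:,.0f} output tokens")

# All three runs should agree within a few percent
lo, hi = min(tokens.values()), max(tokens.values())
print(f"spread: {(hi - lo) / lo:.1%}")
```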

mmlu

bash benchmark/mmlu/download_data.sh
python3 benchmark/mmlu/bench_sglang.py --nsub 100 --ntrain 5 --parallel 2000
subject: abstract_algebra, #q:100, acc: 0.790
subject: anatomy, #q:135, acc: 0.881
subject: astronomy, #q:152, acc: 0.921
subject: business_ethics, #q:100, acc: 0.880
subject: clinical_knowledge, #q:265, acc: 0.921
subject: college_biology, #q:144, acc: 0.972
subject: college_chemistry, #q:100, acc: 0.660
subject: college_computer_science, #q:100, acc: 0.850
subject: college_mathematics, #q:100, acc: 0.770
subject: college_medicine, #q:173, acc: 0.867
subject: college_physics, #q:102, acc: 0.843
subject: computer_security, #q:100, acc: 0.890
subject: conceptual_physics, #q:235, acc: 0.945
subject: econometrics, #q:114, acc: 0.789
subject: electrical_engineering, #q:145, acc: 0.883
subject: elementary_mathematics, #q:378, acc: 0.944
subject: formal_logic, #q:126, acc: 0.825
subject: global_facts, #q:100, acc: 0.690
subject: high_school_biology, #q:310, acc: 0.958
subject: high_school_chemistry, #q:203, acc: 0.887
subject: high_school_computer_science, #q:100, acc: 0.930
subject: high_school_european_history, #q:165, acc: 0.885
subject: high_school_geography, #q:198, acc: 0.955
subject: high_school_government_and_politics, #q:193, acc: 0.984
subject: high_school_macroeconomics, #q:390, acc: 0.926
subject: high_school_mathematics, #q:270, acc: 0.759
subject: high_school_microeconomics, #q:238, acc: 0.958
subject: high_school_physics, #q:151, acc: 0.834
subject: high_school_psychology, #q:545, acc: 0.960
subject: high_school_statistics, #q:216, acc: 0.847
subject: high_school_us_history, #q:204, acc: 0.961
subject: high_school_world_history, #q:237, acc: 0.949
subject: human_aging, #q:223, acc: 0.861
subject: human_sexuality, #q:131, acc: 0.924
subject: international_law, #q:121, acc: 0.975
subject: jurisprudence, #q:108, acc: 0.907
subject: logical_fallacies, #q:163, acc: 0.908
subject: machine_learning, #q:112, acc: 0.848
subject: management, #q:103, acc: 0.942
subject: marketing, #q:234, acc: 0.957
subject: medical_genetics, #q:100, acc: 0.950
subject: miscellaneous, #q:783, acc: 0.958
subject: moral_disputes, #q:346, acc: 0.873
subject: moral_scenarios, #q:895, acc: 0.800
subject: nutrition, #q:306, acc: 0.915
subject: philosophy, #q:311, acc: 0.913
subject: prehistory, #q:324, acc: 0.932
subject: professional_accounting, #q:282, acc: 0.876
subject: professional_law, #q:1534, acc: 0.716
subject: professional_medicine, #q:272, acc: 0.949
subject: professional_psychology, #q:612, acc: 0.908
subject: public_relations, #q:110, acc: 0.800
subject: security_studies, #q:245, acc: 0.882
subject: sociology, #q:201, acc: 0.950
subject: us_foreign_policy, #q:100, acc: 0.950
subject: virology, #q:166, acc: 0.578
subject: world_religions, #q:171, acc: 0.930
Total latency: 1153.812 s
Average accuracy: 0.876
