Add ExPO results to AlpacaEval #299

Merged · 1 commit · May 5, 2024

4,832 changes: 4,832 additions & 0 deletions results/SPPO-Mistral7B-PairRM-ExPO/model_outputs.json
64,039 changes: 64,039 additions & 0 deletions results/SPPO-Mistral7B-PairRM-ExPO/weighted_alpaca_eval_gpt4_turbo/annotations.json
4,832 changes: 4,832 additions & 0 deletions results/Starling-LM-7B-alpha-ExPO/model_outputs.json
60,843 changes: 60,843 additions & 0 deletions results/Starling-LM-7B-alpha-ExPO/weighted_alpaca_eval_gpt4_turbo/annotations.json
4,832 changes: 4,832 additions & 0 deletions results/Starling-LM-7B-beta-ExPO/model_outputs.json
64,088 changes: 64,088 additions & 0 deletions results/Starling-LM-7B-beta-ExPO/weighted_alpaca_eval_gpt4_turbo/annotations.json
4,832 changes: 4,832 additions & 0 deletions results/internlm2-chat-20b-ExPO/model_outputs.json
64,164 changes: 64,164 additions & 0 deletions results/internlm2-chat-20b-ExPO/weighted_alpaca_eval_gpt4_turbo/annotations.json
4,832 changes: 4,832 additions & 0 deletions results/internlm2-chat-7b-ExPO/model_outputs.json
64,069 changes: 64,069 additions & 0 deletions results/internlm2-chat-7b-ExPO/weighted_alpaca_eval_gpt4_turbo/annotations.json
4,832 changes: 4,832 additions & 0 deletions results/tulu-2-dpo-13b-ExPO/model_outputs.json
60,923 changes: 60,923 additions & 0 deletions results/tulu-2-dpo-13b-ExPO/weighted_alpaca_eval_gpt4_turbo/annotations.json
4,832 changes: 4,832 additions & 0 deletions results/tulu-2-dpo-70b-ExPO/model_outputs.json
60,873 changes: 60,873 additions & 0 deletions results/tulu-2-dpo-70b-ExPO/weighted_alpaca_eval_gpt4_turbo/annotations.json
4,832 changes: 4,832 additions & 0 deletions results/tulu-2-dpo-7b-ExPO/model_outputs.json
61,053 changes: 61,053 additions & 0 deletions results/tulu-2-dpo-7b-ExPO/weighted_alpaca_eval_gpt4_turbo/annotations.json
4,832 changes: 4,832 additions & 0 deletions results/zephyr-7b-alpha-ExPO/model_outputs.json
63,829 changes: 63,829 additions & 0 deletions results/zephyr-7b-alpha-ExPO/weighted_alpaca_eval_gpt4_turbo/annotations.json
4,832 changes: 4,832 additions & 0 deletions results/zephyr-7b-beta-ExPO/model_outputs.json
60,949 changes: 60,949 additions & 0 deletions results/zephyr-7b-beta-ExPO/weighted_alpaca_eval_gpt4_turbo/annotations.json

@@ -19,6 +19,7 @@ FsfairX-Zephyr-Chat-v0.1,35.94648644102434,1.4410058098036145,285,517,3,805,35.5
Meta-Llama-3-70B-Instruct,33.17785695886864,1.3886514096065603,266,537,2,805,33.16770186335404,minimal,1919,34.42459717459881
gpt4_0613_verbose,23.237360043453418,1.283539505582624,171,630,4,805,21.490683229813666,dev,1473,33.82126688658535
mistral-large-2402,21.43877598137888,1.2485232545097724,166,638,1,805,20.6832298136646,verified,1362,32.65207998531868
SPPO-Mistral7B-PairRM-ExPO,35.4431306716895,1.398130896602677,274,531,0,805,34.037267080745345,community,2288,31.822321960655582
Samba-CoE-v0.2-best-of-16,26.988254318335404,1.3189030000371738,201,601,3,805,25.15527950310559,community,1578,31.506544268148147
Mixtral-8x22B-Instruct-v0.1,22.21017054750302,1.2780740057417268,174,628,3,805,21.801242236024844,verified,1445,30.878810294279383
SPPO-Mistral7B-PairRM,32.2453123637764,1.3908000109577154,249,556,0,805,30.93167701863354,community,2114,30.494137965217426
@@ -31,8 +32,11 @@ mistral-medium,21.855772543652176,1.2682402187223842,164,639,2,805,20.4968944099
claude-2,17.188240356708075,1.17482825615589,131,673,1,805,16.335403726708076,verified,1069,28.155196141629148
Samba-CoE-v0.2,21.847378669267083,1.2171089783436106,159,645,1,805,19.81366459627329,community,1469,27.62426735006872
claude,16.98534361236025,1.1687959793014906,129,676,0,805,16.024844720496894,verified,1082,27.289504443727107
internlm2-chat-20b-ExPO,46.185367468861,1.4638315245977938,375,430,0,805,46.58385093167702,community,3335,27.225759480731792
Yi-34B-Chat,29.65994671879504,1.3225712597906096,219,582,4,805,27.45341614906832,verified,2123,27.19054787762733
Starling-LM-7B-beta-ExPO,29.600851847906423,1.3252049542916096,225,580,0,805,27.95031055900621,community,2215,26.411156713811028
Snorkel-Mistral-PairRM-DPO,30.220052700671644,1.3328273012530358,231,572,1,804,28.79353233830846,community,2736,26.39144645733206
tulu-2-dpo-70b-ExPO,22.980619706104978,1.3591734082562228,184,620,1,805,22.919254658385093,community,1738,25.72330817134933
claude-instant-1.2,16.12739962159006,1.1341036838301686,120,682,3,805,15.093167701863356,community,1112,25.61225902543337
dbrx-instruct,18.44834898407453,1.255388020324377,150,655,0,805,18.633540372670808,verified,1450,25.37544974044448
claude-2.1,15.733506736409938,1.120315865445773,115,688,2,805,14.409937888198757,verified,1096,25.251943886133027
@@ -47,12 +51,14 @@ Mixtral-8x7B-Instruct-v0.1_concise,13.744040154795034,1.071868299237546,105,700,
Meta-Llama-3-8B-Instruct,22.56990260938061,1.257580233106669,176,626,3,805,22.049689440993788,minimal,1899,22.918784673210016
Samba-CoE-v0.1,16.835501870062114,1.1180386124646702,124,680,1,805,15.46583850931677,community,1316,22.865837334795227
gpt-3.5-turbo-16k-0613,14.13239070746584,1.027579400264853,96,704,5,805,12.236024844720497,verified,1328,22.720189163383225
internlm2-chat-7b-ExPO,28.067817437082898,1.3159792318125112,209,595,1,805,26.02484472049689,community,2390,22.66748024879648
gpt-3.5-turbo-0613,14.09579857390062,1.0371186215049395,99,700,6,805,12.670807453416147,community,1331,22.35251298054288
gpt-3.5-turbo-1106_verbose,12.76316981026087,1.044246819212278,94,709,2,805,11.801242236024844,dev,1058,22.00093702171442
gpt4_0613_concise,9.400320574596272,0.901021275896262,71,729,5,805,9.130434782608695,dev,627,21.57799091454269
pairrm-tulu-2-70b,18.638962967441,1.1924966700012911,140,665,0,805,17.391304347826086,community,1607,21.428403975507223
tulu-2-dpo-70b,15.982854374136648,1.1457861368237434,119,683,3,805,14.96894409937888,verified,1418,21.238610038371124
Mistral-7B-ReMax-v0.1,15.999331369031056,1.1288683901451453,120,683,2,805,15.031055900621118,community,1478,20.55136770233589
Starling-LM-7B-alpha-ExPO,18.179755920362158,1.2498324795896385,148,657,0,805,18.385093167701864,community,1821,19.4741654606294
gpt-3.5-turbo-1106,9.177964561962735,0.8904117511864436,64,737,4,805,8.198757763975156,verified,796,19.30058903498905
LMCocktail-10.7B-v1,13.153430917391304,1.045719535661201,104,700,1,805,12.981366459627331,community,1203,18.950710386651053
internlm2-chat-20b-ppo,21.74915450048448,1.2443662409548863,170,632,3,805,21.30434782608696,community,2373,18.748739485433603
@@ -61,6 +67,7 @@ gpt-3.5-turbo-0301,9.622453295105588,0.9129656686751644,71,733,1,805,8.881987577
xwinlm-13b-v0.1,17.42793475019876,1.1450161466942668,129,672,4,805,16.273291925465838,community,1894,17.918937898189796
deepseek-llm-67b-chat,12.093422264919258,1.017384363293138,90,713,2,805,11.304347826086955,community,1151,17.843384089909343
gpt35_turbo_instruct,8.462446504415423,0.8724086933609648,66,735,3,804,8.395522388059701,community,1018,17.72780108286588
tulu-2-dpo-13b-ExPO,15.551405429399557,1.171485338425437,121,679,5,805,15.341614906832298,community,1649,17.591404469940848
wizardlm-70b,14.383896086782608,1.0395048912985754,106,697,2,805,13.291925465838508,community,1545,17.575060737493747
vicuna-33b-v1.3,12.705947921540371,0.999255784310268,90,711,4,805,11.428571428571429,verified,1479,17.574575310874923
pairrm-tulu-2-13b,13.831901016757762,1.0835284665170843,110,694,1,805,13.72670807453416,community,1454,17.40520369795085
@@ -82,7 +89,9 @@ llama-2-70b-chat-hf,13.88825834374378,1.079984772728814,104,700,0,804,12.9353233
openchat-v3.1-13b,11.082230489416148,0.9501308701291292,80,720,5,805,10.248447204968944,community,1484,14.50338795683784
wizardlm-13b-v1.2,12.027480342770186,0.971761817748135,82,720,3,805,10.372670807453416,community,1635,14.462590694316631
ultralm-13b-v2.0-best-of-16,13.853373471242236,1.049344706038026,98,705,2,805,12.298136645962732,community,1720,14.198987566645036
zephyr-7b-beta-ExPO,11.06111683239833,1.0204784889272769,89,716,0,805,11.055900621118013,community,1405,14.001211980232686
wizardlm-13b-v1.1,11.233909572857142,0.95027112458742,79,723,3,805,10.0,community,1525,13.91572059284851
zephyr-7b-alpha-ExPO,10.55935434569986,0.977463444873356,79,725,1,805,9.875776397515528,community,1248,13.573089356781388
zephyr-7b-beta,10.992885755354038,0.9617876718039866,78,725,2,805,9.813664596273291,community,1444,13.203198493136666
dolphin-2.2.1-mistral-7b,9.039799728223604,0.8892901246776709,68,734,3,805,8.633540372670808,community,1130,13.121477650433736
humpback-llama-65b,9.425139047801242,0.9300866722901956,70,734,1,805,8.75776397515528,community,1232,12.799859995893623
@@ -92,6 +101,7 @@ Qwen-14B-Chat,7.502333484720497,0.8147265702205473,57,742,6,805,7.45341614906832
gpt4_gamed,3.7383373713788814,0.6278799633668313,32,771,2,805,4.099378881987578,community,68,12.188764057640531
cut-13b,10.779089202496897,0.9428953578911924,83,721,1,805,10.372670807453416,community,1637,12.154781753927743
openchat-v2-w-13b,9.615344158447204,0.8908241710735803,67,736,2,805,8.4472049689441,community,1566,12.03042777097436
tulu-2-dpo-7b-ExPO,11.529221038762385,1.049781489308991,91,714,0,805,11.304347826086957,community,1742,11.675059099417426
tulu-2-dpo-13b,10.119788388347828,0.929813366016608,75,728,2,805,9.440993788819876,community,1614,11.554479428088396
claude2-alpaca-13b,7.437351324770187,0.82494288683272,59,746,0,805,7.329192546583851,community,1127,11.498898213160734
minotaur-13b,5.738963669079602,0.7271241247374951,42,758,4,804,5.472636815920398,community,881,11.46525131683203
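
The hunks above splice the ten new ExPO entries into the AlpacaEval 2.0 leaderboard CSV. The header row sits outside the diff, so the column meanings are inferred from the alpaca_eval leaderboard format: win rate, standard error, wins, losses, draws, total comparisons, discrete win rate, evaluation mode, average output length, and (last column) the length-controlled win rate the table is sorted by. A minimal sketch for pulling out the new rows, under those assumptions and with a hypothetical local path:

# Minimal sketch: list the ExPO rows ranked by the last column, assumed to be
# the length-controlled win rate. "leaderboard.csv" is a hypothetical path;
# point it at the leaderboard file touched by this PR.
import pandas as pd

df = pd.read_csv("leaderboard.csv", index_col=0)  # first column: model name
expo = df[df.index.astype(str).str.endswith("-ExPO")]
print(expo.sort_values(by=df.columns[-1], ascending=False))
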
@@ -146,3 +146,13 @@ Qwen1.5-1.8B-Chat,-1.6003884852505712,0.9646855557741588,-4.6744303356917447
Qwen1.5-110B-Chat,-1.4481674391207744,0.9102999775192784,-0.2004892206655888
Storm-7B,-0.3778112670657819,0.5727965213879709,0.0000000000000000
SPPO-Mistral7B-PairRM,-1.0066475422582106,0.9046614612887018,-0.9905877944582094
SPPO-Mistral7B-PairRM-ExPO,-0.9297137384620632,0.7671267711136246,-0.8709792439039323
internlm2-chat-7b-ExPO,-1.1989304003616963,0.6968622384940820,-1.4260629123293445
internlm2-chat-20b-ExPO,-1.5168999523223183,0.5813753405809725,-1.0854248669923878
zephyr-7b-alpha-ExPO,-1.3654376133685404,1.1850496972307665,-2.7054511544809814
zephyr-7b-beta-ExPO,-1.2721841274702139,0.7693453406014726,-2.2016614295073751
Starling-LM-7B-alpha-ExPO,-1.1551552913433458,0.5427299165644314,-1.5681233228102427
Starling-LM-7B-beta-ExPO,-0.9995849824567026,0.8173555243885808,-1.2278737751496258
tulu-2-dpo-7b-ExPO,-1.2867594242188669,0.6986013668741516,-2.3831041176798933
tulu-2-dpo-13b-ExPO,-1.6247537554410800,0.6431373083501301,-1.7734311638958129
tulu-2-dpo-70b-ExPO,-1.2584665006823457,0.4518829275713181,-1.1294862478814247
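
The second touched CSV appends one row of three coefficients per ExPO model. alpaca_eval derives its length-controlled (LC) win rate from a small logistic regression fit per model; reading these three numbers as that model's regression terms (model quality, length, instruction difficulty) is an assumption here, so consult the alpaca_eval metrics code for the authoritative schema. As a hedged sketch of the underlying idea only, the LC win rate is a predicted win probability with the length term forced to zero:

# Hedged sketch of the length-controlled idea, not alpaca_eval's exact code:
# a logistic model predicts the probability of beating the baseline, and the
# LC metric evaluates that model with the length contribution zeroed out.
import math

def win_prob(model_term: float, length_term: float, difficulty_term: float) -> float:
    return 1.0 / (1.0 + math.exp(-(model_term + length_term + difficulty_term)))

# Illustration only, reusing the SPPO-Mistral7B-PairRM-ExPO row above and
# treating its columns as (model, length, difficulty) terms -- an assumption.
theta, phi, psi = -0.9297137384620632, 0.7671267711136246, -0.8709792439039323
print(win_prob(theta, 0.0, psi))  # length term zeroed
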
19 changes: 19 additions & 0 deletions src/alpaca_eval/models_configs/SPPO-Mistral7B-PairRM-ExPO/configs.yaml
@@ -0,0 +1,19 @@
SPPO-Mistral7B-PairRM-ExPO:
prompt_template: "SPPO-Mistral7B-PairRM/prompt.txt"
fn_completions: "vllm_local_completions"
completions_kwargs:
model_name: "chujiezheng/Mistral7B-PairRM-SPPO-ExPO"
model_kwargs:
torch_dtype: 'bfloat16'
tokenizer_mode: 'auto'
max_new_tokens: 2048
do_sample: True
seed: 42
temperature: 0.7
top_k: 50
top_p: 0.9
presence_penalty: 0.1
frequency_penalty: 0.1
batch_size: 1000
pretty_name: "ExPO + SPPO-Mistral7B-PairRM"
link: "https://huggingface.co/chujiezheng/Mistral7B-PairRM-SPPO-ExPO"
19 changes: 19 additions & 0 deletions src/alpaca_eval/models_configs/Starling-LM-7B-alpha-ExPO/configs.yaml
@@ -0,0 +1,19 @@
Starling-LM-7B-alpha-ExPO:
prompt_template: "Starling-LM-7B-alpha/prompt.txt"
fn_completions: "vllm_local_completions"
completions_kwargs:
model_name: "chujiezheng/Starling-LM-7B-alpha-ExPO"
model_kwargs:
torch_dtype: 'bfloat16'
tokenizer_mode: 'auto'
max_new_tokens: 2048
do_sample: True
seed: 42
temperature: 0.7
top_k: 50
top_p: 0.9
presence_penalty: 0.1
frequency_penalty: 0.1
batch_size: 1000
pretty_name: "ExPO + Starling LM 7B alpha"
link: "https://huggingface.co/chujiezheng/Starling-LM-7B-alpha-ExPO"
19 changes: 19 additions & 0 deletions src/alpaca_eval/models_configs/Starling-LM-7B-beta-ExPO/configs.yaml
@@ -0,0 +1,19 @@
Starling-LM-7B-beta-ExPO:
prompt_template: "Starling-LM-7B-alpha/prompt.txt"
fn_completions: "vllm_local_completions"
completions_kwargs:
model_name: "chujiezheng/Starling-LM-7B-beta-ExPO"
model_kwargs:
torch_dtype: 'bfloat16'
tokenizer_mode: 'auto'
max_new_tokens: 2048
do_sample: True
seed: 42
temperature: 0.7
top_k: 50
top_p: 0.9
presence_penalty: 0.1
frequency_penalty: 0.1
batch_size: 1000
pretty_name: "ExPO + Starling LM 7B beta"
link: "https://huggingface.co/chujiezheng/Starling-LM-7B-beta-ExPO"
20 changes: 20 additions & 0 deletions src/alpaca_eval/models_configs/internlm2-chat-20b-ExPO/configs.yaml
@@ -0,0 +1,20 @@
internlm2-chat-20b-ExPO:
prompt_template: "internlm2-chat-20b-ppo/prompt.txt"
fn_completions: "vllm_local_completions"
completions_kwargs:
model_name: "chujiezheng/internlm2-chat-20b-ExPO"
model_kwargs:
torch_dtype: "bfloat16"
trust_remote_code: True
tokenizer_mode: 'auto'
max_new_tokens: 2048
do_sample: True
seed: 42
temperature: 0.7
top_k: 50
top_p: 0.9
presence_penalty: 0.1
frequency_penalty: 0.1
batch_size: 1000
pretty_name: "ExPO + InternLM2 Chat 20B"
link: "https://huggingface.co/chujiezheng/internlm2-chat-20b-ExPO"
20 changes: 20 additions & 0 deletions src/alpaca_eval/models_configs/internlm2-chat-7b-ExPO/configs.yaml
@@ -0,0 +1,20 @@
internlm2-chat-7b-ExPO:
prompt_template: "internlm2-chat-20b-ppo/prompt.txt"
fn_completions: "vllm_local_completions"
completions_kwargs:
model_name: "chujiezheng/internlm2-chat-7b-ExPO"
model_kwargs:
torch_dtype: "bfloat16"
trust_remote_code: True
tokenizer_mode: 'auto'
max_new_tokens: 2048
do_sample: True
seed: 42
temperature: 0.7
top_k: 50
top_p: 0.9
presence_penalty: 0.1
frequency_penalty: 0.1
batch_size: 1000
pretty_name: "ExPO + InternLM2 Chat 7B"
link: "https://huggingface.co/chujiezheng/internlm2-chat-7b-ExPO"
19 changes: 19 additions & 0 deletions src/alpaca_eval/models_configs/tulu-2-dpo-13b-ExPO/configs.yaml
@@ -0,0 +1,19 @@
tulu-2-dpo-13b-ExPO:
prompt_template: "tulu-2-dpo-70b/prompt.txt"
fn_completions: "vllm_local_completions"
completions_kwargs:
model_name: "chujiezheng/tulu-2-dpo-13b-ExPO"
model_kwargs:
torch_dtype: 'bfloat16'
tokenizer_mode: 'auto'
max_new_tokens: 2048
do_sample: True
seed: 42
temperature: 0.7
top_k: 50
top_p: 0.9
presence_penalty: 0.1
frequency_penalty: 0.1
batch_size: 1000
pretty_name: "ExPO + Tulu-2-DPO-13B"
link: "https://huggingface.co/chujiezheng/tulu-2-dpo-13b-ExPO"
20 changes: 20 additions & 0 deletions src/alpaca_eval/models_configs/tulu-2-dpo-70b-ExPO/configs.yaml
@@ -0,0 +1,20 @@
tulu-2-dpo-70b-ExPO:
prompt_template: "tulu-2-dpo-70b/prompt.txt"
fn_completions: "vllm_local_completions"
completions_kwargs:
model_name: "chujiezheng/tulu-2-dpo-70b-ExPO"
model_kwargs:
torch_dtype: 'bfloat16'
tokenizer_mode: 'auto'
tp: 2 # you need at least 2 A100 80GB GPUs to run this model
max_new_tokens: 2048
do_sample: True
seed: 42
temperature: 0.7
top_k: 50
top_p: 0.9
presence_penalty: 0.1
frequency_penalty: 0.1
batch_size: 1000
pretty_name: "ExPO + Tulu-2-DPO-70B"
link: "https://huggingface.co/chujiezheng/tulu-2-dpo-70b-ExPO"
19 changes: 19 additions & 0 deletions src/alpaca_eval/models_configs/tulu-2-dpo-7b-ExPO/configs.yaml
@@ -0,0 +1,19 @@
tulu-2-dpo-7b-ExPO:
prompt_template: "tulu-2-dpo-70b/prompt.txt"
fn_completions: "vllm_local_completions"
completions_kwargs:
model_name: "chujiezheng/tulu-2-dpo-7b-ExPO"
model_kwargs:
torch_dtype: 'bfloat16'
tokenizer_mode: 'auto'
max_new_tokens: 2048
do_sample: True
seed: 42
temperature: 0.7
top_k: 50
top_p: 0.9
presence_penalty: 0.1
frequency_penalty: 0.1
batch_size: 1000
pretty_name: "ExPO + Tulu-2-DPO-7B"
link: "https://huggingface.co/chujiezheng/tulu-2-dpo-7b-ExPO"
19 changes: 19 additions & 0 deletions src/alpaca_eval/models_configs/zephyr-7b-alpha-ExPO/configs.yaml
@@ -0,0 +1,19 @@
zephyr-7b-alpha-ExPO:
prompt_template: "zephyr-7b-alpha-ExPO/prompt.txt"
fn_completions: "vllm_local_completions"
completions_kwargs:
model_name: "chujiezheng/zephyr-7b-alpha-ExPO"
model_kwargs:
torch_dtype: 'bfloat16'
tokenizer_mode: 'auto'
max_new_tokens: 2048
do_sample: True
seed: 42
temperature: 0.7
top_k: 50
top_p: 0.9
presence_penalty: 0.1
frequency_penalty: 0.1
batch_size: 1000
pretty_name: "ExPO + Zephyr 7B Alpha"
link: "https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha"
3 changes: 3 additions & 0 deletions src/alpaca_eval/models_configs/zephyr-7b-alpha-ExPO/prompt.txt
@@ -0,0 +1,3 @@
<|user|>
{instruction}</s>
<|assistant|>
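
The three-line prompt.txt above is the Zephyr chat template shared by both zephyr ExPO configs: {instruction} is substituted with each AlpacaEval instruction before generation. A minimal sketch of that substitution (whether a trailing newline follows <|assistant|> is an assumption):

# Minimal sketch: filling the zephyr prompt template above.
template = "<|user|>\n{instruction}</s>\n<|assistant|>\n"
prompt = template.format(instruction="Summarize what ExPO does in one sentence.")
print(prompt)
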
19 changes: 19 additions & 0 deletions src/alpaca_eval/models_configs/zephyr-7b-beta-ExPO/configs.yaml
@@ -0,0 +1,19 @@
zephyr-7b-beta-ExPO:
prompt_template: "zephyr-7b-alpha-ExPO/prompt.txt"
fn_completions: "vllm_local_completions"
completions_kwargs:
model_name: "chujiezheng/zephyr-7b-beta-ExPO"
model_kwargs:
torch_dtype: 'bfloat16'
tokenizer_mode: 'auto'
max_new_tokens: 2048
do_sample: True
seed: 42
temperature: 0.7
top_k: 50
top_p: 0.9
presence_penalty: 0.1
frequency_penalty: 0.1
batch_size: 1000
pretty_name: "ExPO + Zephyr 7B Beta"
link: "https://huggingface.co/HuggingFaceH4/zephyr-7b-beta"