forked from JayZhang42/FederatedGPT-Shepherd
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathtemplate.yaml
46 lines (40 loc) · 2.52 KB
/
template.yaml
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
# legend
# name is the hf name of the model
# prompt is the typical text-generation prompt used by the model
# evaluator_prompt is the formatted prompt to use for laaj (llm-as-a-judge)
# evaluator_prompt should place the expected placeholder first and then the generated
meta-llama/Meta-Llama-3-8B-Instruct:
candidate_prompt: <|start_header_id|>system<|end_header_id|> {}<|eot_id|><|start_header_id|>user<|end_header_id|> {}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
candidate_generation_config:
max_new_tokens: 256
do_sample: True
temperature: 0.6
top_p: 0.9
evaluator_prompt: |
<|start_header_id|>system<|end_header_id|>
You are going to act as an LLM evaluator to rate the answer of the LLM generated medical chatbot on factualness (i.e how contextually the generated output followed the expected reply). Penalize it appropriately for any hallucination/loss of context, refraining to answer (saying 'As an AI Language Model', 'I am not a doctor, but' ...), or trailing repetition. YOUR RESPONSE IS NOTHING ELSE BUT A FLOAT FROM 0.0 - 5.0 (with format x.x). Where 0.0 indicates the context of the generated response is very far from the expected one. And 5.0 represents otherwise. AGAIN IF YOUR GENERATED ANYTHING ELSE BUT A FLOAT YOU'RE GOING TO CRUSH MY SYSTEM!!<|eot_id|>
<|start_header_id|>user<|end_header_id|>
### Expected: {}
### Generated: {}<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
evaluator_generation_config:
max_new_tokens: 256
do_sample: True
temperature: 0.6
top_p: 0.9
microsoft/Phi-3-mini-4k-instruct:
candidate_prompt: |
<|system|>
{}<|end|>
<|user|>
{}<|end|>
<|assistant|>
evaluator_prompt:
Qwen/Qwen2-7B-Instruct:
evaluator_prompt: |
<|im_start|>system
You are going to act as an LLM evaluator to rate the answer of the LLM generated medical chatbot on factualness (i.e how contextually the generated output followed the expected reply). Penalize it appropriately for any hallucination/loss of context, refraining to answer (saying 'As an AI Language Model', 'I am not a doctor, but' ...), or trailing repetition. YOUR RESPONSE IS NOTHING ELSE BUT A FLOAT FROM 0.0 - 5.0 (with format x.x). Where 0.0 indicates the context of the generated response is very far from the expected one. And 5.0 represents otherwise. AGAIN IF YOUR GENERATED ANYTHING ELSE BUT A FLOAT YOU'RE GOING TO CRUSH MY SYSTEM!!<|im_end|>
<|im_start|>user
### Expected: {}
### Generated: {}<|eot_id|><|im_end|>
<|im_start|>assistant