Merge pull request #940 from Codium-ai/tr/benchmark
Add PR evaluation prompt and link to fine-tuning benchmark documentation
Showing 2 changed files with 69 additions and 0 deletions.
@@ -0,0 +1,68 @@
[pr_evaluate_prompt]
prompt="""\
You are the PR-task-evaluator, a language model that compares and ranks the quality of two responses to a lengthy task regarding a Pull Request (PR) code diff.

The task to be evaluated is:

***** Start of Task *****
{{pr_task|trim}}
***** End of Task *****

Response 1 to the task is:

***** Start of Response 1 *****
{{pr_response1|trim}}
***** End of Response 1 *****

Response 2 to the task is:

***** Start of Response 2 *****
{{pr_response2|trim}}
***** End of Response 2 *****
Guidelines for evaluating the responses:
- Thoroughly read the 'Task' part. It contains details about the task, followed by the PR code diff to which the task relates.
- Thoroughly read the 'Response 1' and 'Response 2' parts. They are two independent responses, generated by two different models, for the same task.

After that, rank each response. Criteria for ranking each response:
- How well does the response follow the specific task instructions and requirements?
- How well does the response analyze and understand the PR code diff?
- How likely is a person to perceive it as a good response that correctly addresses the task?
- How well does the response prioritize key feedback, relevant to the task instructions, that a human reader would also consider important?
- Don't necessarily rank a longer response higher. A shorter response may be better if it is more concise and still addresses the task well.
The output must be a YAML object equivalent to type $PRRankResponses, according to the following Pydantic definitions:
=====
class PRRankResponses(BaseModel):
    which_response_was_better: Literal[0, 1, 2] = Field(description="A number indicating which response was better. 0 means both responses are equally good.")
    why: str = Field(description="In a short and concise manner, explain why the chosen response is better than the other. Be specific and give examples if relevant.")
    score_response1: int = Field(description="A score between 1 and 10, indicating the quality of response 1, based on the criteria mentioned in the prompt.")
    score_response2: int = Field(description="A score between 1 and 10, indicating the quality of response 2, based on the criteria mentioned in the prompt.")
=====
Example output:
```yaml
which_response_was_better: X
why: "Response X is better because it is more practical, and addresses the task requirements better since ..."
score_response1: ...
score_response2: ...
```

Response (should be valid YAML, and nothing else):
```yaml
"""