Ghostwriter evaluation results 2024-12-21_13-57-31

There are 4 scenarios and 4 test cases, each run for 3 attempts (4 × 4 × 3 = 48 total tests).
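
As a rough illustration of how the total comes about, the run matrix can be enumerated as the product of tests, scenarios, and attempts. This is only a sketch: the scenario and test names are copied from this report, while `enumerate_runs` and the 1-based attempt numbering are hypothetical and not taken from the Ghostwriter evaluation code.

```python
from itertools import product

# Names copied from the report; the attempt count is the stated 3.
SCENARIOS = [
    "claude_sonnet_latest_with_seg",
    "gpt-4o-mini_no_seg",
    "gpt-4o_with_seg",
    "claude_sonnet_latest_no_seg",
]
TESTS = ["blank_math", "tic_tac_toe_1", "x_in_box", "x_in_boxes"]
ATTEMPTS = 3


def enumerate_runs(tests=TESTS, scenarios=SCENARIOS, attempts=ATTEMPTS):
    """Hypothetical helper: list every (test, scenario, attempt) combination."""
    # Attempt numbering starting at 1 is an assumption made for readability.
    return list(product(tests, scenarios, range(1, attempts + 1)))


runs = enumerate_runs()
print(len(runs))  # 4 tests x 4 scenarios x 3 attempts
```

Each tuple corresponds to one cell of the results below: one test, one scenario, one attempt.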

Test: blank_math

claude_sonnet_latest_with_seg

10

gpt-4o-mini_no_seg

gpt-4o_with_seg

10

claude_sonnet_latest_no_seg

Test: tic_tac_toe_1

claude_sonnet_latest_with_seg

Your turn! Place an O anywhere you'd like.

gpt-4o-mini_no_seg

gpt-4o_with_seg

claude_sonnet_latest_no_seg

Test: x_in_box

claude_sonnet_latest_with_seg

gpt-4o-mini_no_seg

gpt-4o_with_seg

claude_sonnet_latest_no_seg

Test: x_in_boxes

claude_sonnet_latest_with_seg

gpt-4o-mini_no_seg

gpt-4o_with_seg

claude_sonnet_latest_no_seg