MastermindEval for evaluating reasoning capabilities in LLMs @ ICLR 2025 Workshop on Reasoning and Planning of LLMs.

MastermindEval

Evaluating Reasoning Capabilities of LLMs Using the Mastermind Board Game.

Game Overview (figure)

Installation | Evaluation Paradigms | Basic Concepts | Running the Evaluations | Citation

🚀 Installation

To set up the environment and install dependencies, run the following:

conda create -n mastermind python=3.11
conda activate mastermind
pip install -e .
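
As a quick sanity check after installation, you can verify that the package imports cleanly. The module paths below are the same ones used in the usage example further down in this README:

# Sanity check: these imports mirror the usage example below.
from mastermind.evaluator import Evaluator
from mastermind.game import Mastermind
from mastermind.models import HFModel
from mastermind.utils import print_summary

print("mastermind installed correctly")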

🏆 Evaluation Paradigms

We provide three different evaluation paradigms:

  1. 🤖 Agentic Evaluation: The model actively plays Mastermind, interacting with the game environment.
  2. 📝 Prompt-Based Evaluation: The model is presented with pre-played game scenarios and must deduce the last remaining code that is consistent with the feedback.
  3. 🎯 Multiple-Choice Evaluation: The model ranks different code options based on log-likelihood, aligning with pretraining objectives; a minimal scoring sketch is shown after this list. We will integrate this option into the lm-eval-harness library soon.
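
Conceptually, the multiple-choice paradigm scores each candidate code by the log-likelihood the model assigns to it given the game transcript, and picks the highest-scoring one. The following is a minimal, generic sketch of that scoring with Hugging Face transformers; it is not the pending lm-eval-harness integration, and the model name, prompt, and candidate codes are placeholders:

# Generic log-likelihood ranking sketch for the multiple-choice paradigm.
# Model name, prompt, and candidates are placeholders for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any Hugging Face causal LM works the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Guess 1: red blue green yellow -> 2 black, 1 white\nThe secret code is:"
candidates = ["red yellow blue green", "blue red green yellow", "red green yellow blue"]

def continuation_logprob(prompt: str, continuation: str) -> float:
    """Sum of token log-probabilities of `continuation` conditioned on `prompt`."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + " " + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Position t predicts token t+1, so shift logits and targets by one.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Sum only over the continuation tokens (approximate boundary at prompt_len).
    return token_lp[:, prompt_len - 1:].sum().item()

best = max(candidates, key=lambda c: continuation_logprob(prompt, c))
print("Highest log-likelihood candidate:", best)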

🔑 Basic Concepts

  • 🧩 Model Class: Defines the LLM interface that interacts with the game. We provide support for:
    • Hugging Face Model Hub
    • OpenAI
    • Anthropic
  • 🎲 Mastermind Game Class: Represents a game instance with customizable parameters such as num_colors and possible_colors. The feedback rule at the heart of the game is sketched right after this list.
  • 📊 Evaluator Class: Manages the evaluation process by executing multiple rounds of the game and assessing model performance.
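
For context, the game class revolves around Mastermind's feedback rule: each guess is answered with the number of exact matches (correct color, correct position) and partial matches (correct color, wrong position). Below is a minimal, self-contained sketch of that rule; it is illustrative only and independent of the actual Mastermind class in this repository:

# Minimal sketch of the Mastermind feedback rule (not the repository's class).
from collections import Counter

def score_guess(secret: list[str], guess: list[str]) -> tuple[int, int]:
    """Return (exact, partial) feedback for a guess against a secret code.

    exact   = pegs with the correct color in the correct position
    partial = remaining pegs with the correct color in the wrong position
    """
    exact = sum(s == g for s, g in zip(secret, guess))
    # Total color overlap regardless of position, minus the exact matches.
    overlap = sum((Counter(secret) & Counter(guess)).values())
    return exact, overlap - exact

# Secret "red blue green yellow" vs. guess "red green blue blue" -> (1, 2)
print(score_guess(["red", "blue", "green", "yellow"],
                  ["red", "green", "blue", "blue"]))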

Running the Evaluations

We provide various scripts for running different evaluation methods:

  • 🤖 Agentic Evaluation: run_full_game.py (Python script) and run_full_game.sh (Bash script)
  • 📝 Prompt-Based Evaluation: run_instructions.py (Python script) and run_instructions.sh (Bash script) - These splits are also available on the 🤗 Hugging Face hub!
  • 🎯 Multiple-Choice Evaluation: run_multiple_choice.sh (Bash) – relies on lm-eval-harness (pending) - These splits are also available on the 🤗 Hugging Face hub! A sketch for loading the hub splits directly follows this list.
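
Because the prompt-based and multiple-choice splits are hosted on the 🤗 Hugging Face hub, they can also be inspected directly with the datasets library. Note that the dataset identifier and split name below are placeholders, not verified values; check the hub page linked from this repository for the actual names:

from datasets import load_dataset

# NOTE: the dataset ID and split name are placeholders for illustration only;
# look up the actual identifier on the 🤗 Hugging Face hub.
dataset = load_dataset("flairNLP/mastermind", split="test")

print(dataset)     # schema and number of pre-played scenarios
print(dataset[0])  # a single pre-played game scenario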

Example Usage

Below is a conceptual overview of running an evaluation using a Hugging Face model:

from mastermind.evaluator import Evaluator
from mastermind.game import Mastermind
from mastermind.models import HFModel
from mastermind.utils import print_summary

# Load the model
model = HFModel(model_name='deepseek-ai/DeepSeek-R1-Distill-Qwen-7B')

# Initialize the game environment
game = Mastermind(code_length=4, num_colors=6)

# Create the evaluator
evaluator = Evaluator(game, model, use_cot=True, use_fewshot_example=True)

# Run the evaluation
result = evaluator.run(num_games=100, save_results=True, save_path="results", compute_progress=True)

# Display summary
print_summary(model, game, result, num_runs=100)

📚 Citation

Coming soon.

If you run into any problems, feel free to open an issue or contribute to the repository! 🚀
