MCP Evaluation System

A comprehensive command-line tool for evaluating MCP (Model Context Protocol) server performance and tool effectiveness from a user perspective. The system executes real-world scenarios, records detailed interaction logs, compares actual vs expected trajectories, and provides quantitative metrics with visual HTML reports.

🚀 Quick Start

cp .env.example .env                                    # Configure paths
pip install -e .                                       # Install as package
./testing/reset-mcpproxy.sh                            # Start Docker MCPProxy

# Pytest-style runner (recommended)
PYTHONPATH=src uv run python -m mcp_eval.cli test --scenario scenarios/list_all_servers.yaml

# Or traditional workflow
PYTHONPATH=src uv run python -m mcp_eval.cli record --scenario scenarios/search_tools_simple.yaml
PYTHONPATH=src uv run python -m mcp_eval.cli compare --scenario scenarios/search_tools_simple.yaml --baseline baselines/search_tools_simple_baseline/search_tools_simple_baseline

Overview

The MCP Evaluation System helps developers and researchers:

  • Evaluate MCP Server Performance: Test real-world scenarios against the MCP proxy and individual servers
  • Measure Tool Effectiveness: Quantify how well MCP tools execute user intents
  • Analyze Trajectories: Compare actual tool usage patterns against expected patterns using dialog trajectory metrics
  • Run Regression Tests: Ensure MCP implementations maintain quality across versions
  • Visualize Results: Generate HTML reports with side-by-side conversation comparisons

Architecture

Core Components

  1. CLI Interface: Click-based command parser with baseline recording, evaluation, and comparison modes
  2. FailureAwareScenarioRunner: Enhanced scenario executor with tool discovery and git version tracking
  3. Docker Isolation: MCPProxy runs in containerized environment for reproducible, isolated testing
  4. HTML Reporter: Generates comprehensive visual reports with expandable tool calls and conversation logs
  5. Trajectory Evaluator: Compares execution patterns using dialog trajectory similarity metrics

Evaluation Flow

Scenario Definition → Baseline Recording → Current Evaluation → Trajectory Comparison → HTML Report
     (YAML)              (Docker)            (Docker)           (Metrics)          (Visual)

Prerequisites

  • Python 3.11+ with uv package manager
  • Docker for MCPProxy isolation
  • MCPProxy Go project (configurable location)

Installation

# Clone and install
git clone https://github.com/smart-mcp-proxy/mcp-eval.git
cd mcp-eval
uv sync

# Install as development package
pip install -e .

Configuration

The system is designed to be path-independent and configurable. Set up your environment:

1. Environment Variables

Copy the example environment file and configure paths:

cp .env.example .env
# Edit .env with your specific paths

Key configuration variables:

# Path to MCPProxy source code (required for building proxy binary)
MCPPROXY_SOURCE_PATH=../mcpproxy-go  # or absolute path to your mcpproxy-go clone

# Your Anthropic API key (required for baseline recording)
ANTHROPIC_API_KEY=your_api_key_here

# Optional: Custom configuration paths
MCP_SERVERS_CONFIG=./mcp_servers_test.json
TEST_SESSION=test777-dind
TEST_PORT=8081

2. MCPProxy Source

Ensure MCPProxy source is available:

# Option 1: Clone next to this repository (recommended)
cd ..
git clone https://github.com/smart-mcp-proxy/mcpproxy-go.git

# Option 2: Set custom path in .env
echo "MCPPROXY_SOURCE_PATH=/path/to/your/mcpproxy-go" >> .env

3. Initial Setup

# Setup Docker MCPProxy (will use your configured paths)
./testing/reset-mcpproxy.sh

Usage

1. Reset MCPProxy State (Required Before Each Run)

CRITICAL: Always reset the MCPProxy Docker container state before each baseline recording or evaluation run to ensure reproducible results.

# Reset using the script (uses your configured paths)
./testing/reset-mcpproxy.sh

# Manual restart (if needed)
cd testing/docker
TEST_SESSION=test777-dind docker compose down
TEST_SESSION=test777-dind docker compose up -d

2. Record Baseline (Reference Implementation)

Record a baseline execution that represents the expected behavior:

# Reset MCPProxy state first (see step 1)
./testing/reset-mcpproxy.sh

# Record baseline (output defaults to baselines/{scenario_name}_baseline)
PYTHONPATH=src uv run python -m mcp_eval.cli record --scenario scenarios/search_tools_simple.yaml

# View generated HTML report
open reports/search_tools_simple_baseline_*.html

3. Run Evaluation (Current Implementation)

Execute the current implementation and compare against the baseline:

# Reset MCPProxy state first (see step 1)
./testing/reset-mcpproxy.sh

# Run comparison (output defaults to comparison_results/{scenario_name}_comparison)
PYTHONPATH=src uv run python -m mcp_eval.cli compare --scenario scenarios/search_tools_simple.yaml \
  --baseline baselines/search_tools_simple_baseline/search_tools_simple_baseline

# View results
open reports/search_tools_simple_comparison_*.html

4. View Results

HTML Reports provide comprehensive visual analysis:

  • Baseline Reports: Complete conversation logs, tool calls, termination analysis, MCPProxy version tracking
  • Comparison Reports: Side-by-side current vs baseline execution with trajectory metrics

Key Metrics:

  • Tool Trajectory Score: Similarity-based comparison of MCP tool usage patterns (0.0-1.0)
  • Per-Invocation Analysis: Detailed similarity scores for each individual tool call with visual comparison
  • MCP-Only Filtering: Focuses evaluation on MCP tool calls only (excludes TodoWrite, Bash, etc.)
  • Multi-Level Similarity: Evaluates tool name matching, argument key similarity, and value similarity using multiple algorithms
  • Pass/Fail Threshold: 0.8 (configurable)

Similarity-Based Trajectory Evaluation

Overview

The MCP Evaluation System uses multi-level similarity calculations to compare tool usage patterns between current and baseline executions. This approach provides a more nuanced evaluation than simple exact matching.

MCP-Only Focus

The system restricts comparisons to MCP tool calls (tools with the mcp__ prefix), excluding framework tools such as:

  • TodoWrite (task management)
  • Bash (command execution)
  • Read, Write, Edit (file operations)

This ensures evaluation focuses on actual MCP server interactions rather than agent implementation details.
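
A minimal sketch of this filtering, assuming tool calls are recorded as dictionaries with a "tool" name field (the function and data shape here are illustrative, not the actual mcp_eval internals):

def filter_mcp_calls(tool_calls):
    """Keep only MCP tool calls (names prefixed with 'mcp__');
    framework tools such as TodoWrite, Bash, or Read are dropped."""
    return [c for c in tool_calls if c.get("tool", "").startswith("mcp__")]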

Multi-Level Similarity Calculation

Tool Call Similarity (0.0-1.0)

Each tool call comparison evaluates:

  1. Tool Name Matching: Must be identical (different tools = 0.0 similarity)
  2. Argument Similarity: 30% key structure + 70% value similarity
    • Key Similarity: Jaccard similarity of argument keys
    • Value Similarity: Multi-method comparison:
      • String values: Word intersection with Jaccard similarity
      • Numeric values: Distance-based with configurable thresholds
      • JSON objects: Cosine similarity using character frequency vectors

Trajectory Similarity

The overall trajectory score averages individual tool call similarities, providing a single metric for execution quality assessment.
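
To make the scoring concrete, here is a condensed sketch of both levels, assuming tool calls are recorded as dictionaries with "tool" and "args" fields. Only the word-set branch of value similarity is shown; numeric and JSON values use the distance-based and cosine methods described above, and all names are illustrative rather than the actual mcp_eval API:

def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 1.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def tool_call_similarity(current, baseline):
    """Per-invocation score: 0.0 on a tool name mismatch, otherwise
    30% argument key structure + 70% argument value similarity."""
    if current["tool"] != baseline["tool"]:
        return 0.0
    cur_args = current.get("args", {})
    base_args = baseline.get("args", {})
    key_sim = jaccard(set(cur_args), set(base_args))  # Jaccard over argument keys
    shared = set(cur_args) & set(base_args)
    value_sims = [
        jaccard(set(str(cur_args[k]).split()), set(str(base_args[k]).split()))
        for k in shared
    ]  # word-set similarity per shared argument (string branch only)
    value_sim = sum(value_sims) / len(value_sims) if value_sims else 1.0
    return 0.3 * key_sim + 0.7 * value_sim

def trajectory_score(current, baseline):
    """Overall score: the average of paired per-invocation similarities."""
    pairs = list(zip(current, baseline))
    if not pairs:
        return 1.0
    return sum(tool_call_similarity(c, b) for c, b in pairs) / len(pairs)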

Algorithms Used

  • Jaccard Similarity: For set-based comparisons (keys, word sets)
  • String Intersection: Word-level comparison for natural language queries
  • Distance-Based Numeric: Configurable thresholds for numeric variations
  • Cosine Similarity: Character frequency analysis for complex JSON structures
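
As one concrete example, the character-frequency cosine comparison used for complex JSON values might be sketched as follows (illustrative only, not the exact implementation):

from collections import Counter
import math

def char_cosine(a, b):
    """Cosine similarity of the character-frequency vectors of two strings."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[ch] * vb[ch] for ch in va.keys() & vb.keys())
    norm = math.sqrt(sum(n * n for n in va.values())) * \
           math.sqrt(sum(n * n for n in vb.values()))
    return dot / norm if norm else 0.0

Because character frequencies are order-insensitive, two JSON objects carrying the same fields in a different order still score close to 1.0.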

Available Scenarios

The system includes 19+ test scenarios covering all major MCPProxy functionality:

Core Functionality

  • list_all_servers - Server discovery
  • basic_tool_search - Tool discovery with BM25 search
  • list_quarantined_servers - Security quarantine listing

Server Management

  • add_simple_server - Add new MCP servers
  • remove_server - Remove existing servers
  • check_server_logs - Server log inspection

Security Operations

  • inspect_quarantined_server - Detailed security analysis
  • server_status_check - Configuration validation

Registry Operations

  • list_registries - Registry discovery
  • search_docker_registry - Docker registry search

GitHub Integration

  • github_tool_discovery - GitHub tool discovery
  • And more...

CLI Reference

Commands

# Record a baseline execution
PYTHONPATH=src uv run python -m mcp_eval.cli record --scenario <scenario_file> [--output <output_dir>]

# Compare against baseline
PYTHONPATH=src uv run python -m mcp_eval.cli compare --scenario <scenario_file> --baseline <baseline_dir> [--output <output_dir>]

# Run multiple scenarios in batch
PYTHONPATH=src uv run python -m mcp_eval.cli batch --scenarios <scenarios_dir> [--output <output_dir>]

# Test runner - pytest-style scenario execution with compact output
PYTHONPATH=src uv run python -m mcp_eval.cli test [--tag <tag>] [--scenario <file>] [--fail-fast]

Test Command - Pytest-Style Runner

The test command provides a pytest-style interface for running MCP scenarios with compact output, automatic MCPProxy state management, and flexible filtering:

# Run all enabled scenarios
PYTHONPATH=src uv run python -m mcp_eval.cli test

# Filter scenarios by tags (security, server_management, tool_discovery, etc.)
PYTHONPATH=src uv run python -m mcp_eval.cli test --tag security --tag quarantine

# Run specific scenario files
PYTHONPATH=src uv run python -m mcp_eval.cli test --scenario scenarios/list_all_servers.yaml --scenario scenarios/search_tools_simple.yaml

# Stop on first failure (like pytest -x)
PYTHONPATH=src uv run python -m mcp_eval.cli test --tag server_management --fail-fast

# Verbose output for debugging
PYTHONPATH=src uv run python -m mcp_eval.cli test --scenario scenarios/debug_scenario.yaml --verbose

Output Format:

🧪 Running 3 scenarios
   Filtered by tags: security

list_all_servers               PASS   1.00
search_tools_simple            FAIL   0.54
new_scenario                   RECORDED    N/A

✅ 1 passed, 1 recorded, 1 failed

Test Runner Features:

  • Automatic MCPProxy Restart: Ensures clean state between test runs
  • Baseline Comparison: Compares against existing baselines if available, records new ones otherwise
  • Tag Filtering: Filter scenarios using tags like security, server_management, tool_discovery
  • File Selection: Run specific scenario files instead of entire directories
  • Compact Output: Pytest-style output showing scenario name, status (PASS/FAIL/RECORDED/ERROR), and similarity score
  • Fail-Fast Mode: Stop execution on first failure for quick debugging
  • Status Types:
    • PASS: Similarity score ≥ 0.8 compared to baseline
    • FAIL: Similarity score < 0.8 or execution issues
    • RECORDED: New baseline recorded (no existing baseline found)
    • ERROR: Scenario loading or execution failure
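
For illustration, the status decision could be expressed as the following sketch (the result fields here are hypothetical, not the actual mcp_eval data model):

def classify(result):
    """Map a scenario result to a pytest-style status (0.8 pass threshold)."""
    if result.get("error"):
        return "ERROR"
    if result.get("baseline") is None:
        return "RECORDED"  # no existing baseline; a new one was recorded
    return "PASS" if result["score"] >= 0.8 else "FAIL"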

Options

  • --scenario: Path to YAML scenario file
  • --baseline: Path to baseline directory for comparison
  • --output: Output directory for results
  • --mcp-config: MCP servers configuration file (default: mcp_servers.json)
  • --tag: Filter scenarios by tag (can be used multiple times)
  • --fail-fast: Stop on first failure
  • --verbose: Enable verbose output for debugging

Development

Running Tests

# Run all unit tests
PYTHONPATH=src uv run python -m pytest tests/ -v

# Run specific test file
PYTHONPATH=src uv run python -m pytest tests/test_similarity.py -v

Adding New Scenarios

  1. Create a YAML file in scenarios/ directory
  2. Define user intent, expected trajectory, and success criteria
  3. Optionally create custom config file in configs/
  4. Test with baseline recording

Example scenario structure:

enabled: true
name: "My Test Scenario"
description: "Test description"
config_file: "configs/minimal_config.json"
user_intent: "What the user wants to accomplish"

expected_trajectory:
  - action: "tool_action"
    tool: "mcp__tool_name"
    args:
      parameter: "value"

success_criteria:
  - "keyword_in_response"
  - "expected_behavior"

tags:
  - "category"

Troubleshooting

Common Issues

  1. MCPProxy container fails to start

    # Check Docker is running
    docker info
    
    # Verify MCPProxy source exists
    ls $MCPPROXY_SOURCE_PATH
    
    # Check container logs
    cd testing/docker && docker compose logs
  2. Tool discovery fails

    • This is normal and handled gracefully
    • Tool discovery failure doesn't affect scenario execution
    • MCP tools remain functional during conversations
  3. Permission errors

    # Ensure scripts are executable
    chmod +x testing/reset-mcpproxy.sh
    chmod +x testing/build-mcpproxy.sh

Debug Mode

Enable detailed logging by setting environment variables:

export LOG_LEVEL=debug
export PYTHONPATH=src
uv run python -m mcp_eval.cli record --scenario scenarios/your_scenario.yaml

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Ensure all tests pass
  5. Submit a pull request

License

MIT License - see LICENSE file for details.

Support

  • Issues: Report bugs and feature requests via GitHub Issues
  • Documentation: See CLAUDE.md for detailed implementation notes
  • Examples: Check scenarios/ directory for usage examples
