A comprehensive command-line tool for evaluating MCP (Model Context Protocol) server performance and tool effectiveness from a user perspective. The system executes real-world scenarios, records detailed interaction logs, compares actual vs expected trajectories, and provides quantitative metrics with visual HTML reports.
cp .env.example .env # Configure paths
pip install -e . # Install as package
./testing/reset-mcpproxy.sh # Start Docker MCPProxy
# Pytest-style runner (recommended)
PYTHONPATH=src uv run python -m mcp_eval.cli test --scenario scenarios/list_all_servers.yaml
# Or traditional workflow
PYTHONPATH=src uv run python -m mcp_eval.cli record --scenario scenarios/search_tools_simple.yaml
PYTHONPATH=src uv run python -m mcp_eval.cli compare --scenario scenarios/search_tools_simple.yaml --baseline baselines/search_tools_simple_baseline/search_tools_simple_baseline
The MCP Evaluation System helps developers and researchers:
- Evaluate MCP Server Performance: Test real-world scenarios against MCP proxy and individual servers
- Measure Tool Effectiveness: Quantify how well MCP tools execute user intents
- Analyze Trajectories: Compare actual tool usage patterns against expected patterns using dialog trajectory metrics
- Run Regression Tests: Ensure MCP implementations maintain quality across versions
- Visualize Results: Generate HTML reports with side-by-side conversation comparisons
- CLI Interface: Click-based command parser with baseline recording, evaluation, and comparison modes
- FailureAwareScenarioRunner: Enhanced scenario executor with tool discovery and git version tracking
- Docker Isolation: MCPProxy runs in containerized environment for reproducible, isolated testing
- HTML Reporter: Generates comprehensive visual reports with expandable tool calls and conversation logs
- Trajectory Evaluator: Compares execution patterns using dialog trajectory similarity metrics
Scenario Definition (YAML) → Baseline Recording (Docker) → Current Evaluation (Docker) → Trajectory Comparison (Metrics) → HTML Report (Visual)
- Python 3.11+ with uv package manager
- Docker for MCPProxy isolation
- MCPProxy Go project (configurable location)
# Clone and install
git clone https://github.com/anthropics/mcp-eval.git
cd mcp-eval
uv sync
# Install as development package
pip install -e .
The system is designed to be path-independent and configurable. Set up your environment:
Copy the example environment file and configure paths:
cp .env.example .env
# Edit .env with your specific paths
Key configuration variables:
# Path to MCPProxy source code (required for building proxy binary)
MCPPROXY_SOURCE_PATH=../mcpproxy-go # or absolute path to your mcpproxy-go clone
# Your Anthropic API key (required for baseline recording)
ANTHROPIC_API_KEY=your_api_key_here
# Optional: Custom configuration paths
MCP_SERVERS_CONFIG=./mcp_servers_test.json
TEST_SESSION=test777-dind
TEST_PORT=8081
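As an illustration only (not the CLI's actual configuration loader), the variables above could be resolved at runtime roughly like this, assuming python-dotenv is available:
# Illustrative only: not the CLI's actual configuration loader.
import os
from pathlib import Path
from dotenv import load_dotenv  # python-dotenv (assumed available)

load_dotenv()  # read variables from .env in the current directory

mcpproxy_source = Path(os.environ["MCPPROXY_SOURCE_PATH"]).expanduser()
api_key = os.environ["ANTHROPIC_API_KEY"]
servers_config = Path(os.getenv("MCP_SERVERS_CONFIG", "./mcp_servers_test.json"))
test_session = os.getenv("TEST_SESSION", "test777-dind")
test_port = int(os.getenv("TEST_PORT", "8081"))

if not mcpproxy_source.exists():
    raise FileNotFoundError(f"MCPPROXY_SOURCE_PATH not found: {mcpproxy_source}")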
Ensure MCPProxy source is available:
# Option 1: Clone next to this repository (recommended)
cd ..
git clone https://github.com/modelcontextprotocol/mcpproxy-go.git
# Option 2: Set custom path in .env
echo "MCPPROXY_SOURCE_PATH=/path/to/your/mcpproxy-go" >> .env
# Setup Docker MCPProxy (will use your configured paths)
./testing/reset-mcpproxy.sh
CRITICAL: Always reset the MCPProxy Docker container state before each baseline recording or evaluation run to ensure reproducible results.
# Reset using the script (uses your configured paths)
./testing/reset-mcpproxy.sh
# Manual restart (if needed)
cd testing/docker
TEST_SESSION=test777-dind docker compose down
TEST_SESSION=test777-dind docker compose up -d
Record a baseline execution that represents the expected behavior:
# Reset MCPProxy state first (see step 1)
./testing/reset-mcpproxy.sh
# Record baseline (output defaults to baselines/{scenario_name}_baseline)
PYTHONPATH=src uv run python -m mcp_eval.cli record --scenario scenarios/search_tools_simple.yaml
# View generated HTML report
open reports/search_tools_simple_baseline_*.html
Execute the current implementation and compare against the baseline:
# Reset MCPProxy state first (see step 1)
./testing/reset-mcpproxy.sh
# Run comparison (output defaults to comparison_results/{scenario_name}_comparison)
PYTHONPATH=src uv run python -m mcp_eval.cli compare --scenario scenarios/search_tools_simple.yaml \
--baseline baselines/search_tools_simple_baseline/search_tools_simple_baseline
# View results
open reports/search_tools_simple_comparison_*.html
HTML Reports provide comprehensive visual analysis:
- Baseline Reports: Complete conversation logs, tool calls, termination analysis, MCPProxy version tracking
- Comparison Reports: Side-by-side current vs baseline execution with trajectory metrics
Key Metrics:
- Tool Trajectory Score: Similarity-based comparison of MCP tool usage patterns (0.0-1.0)
- Per-Invocation Analysis: Detailed similarity scores for each individual tool call with visual comparison
- MCP-Only Filtering: Focuses evaluation on MCP tool calls only (excludes TodoWrite, Bash, etc.)
- Multi-Level Similarity: Evaluates tool name matching, argument key similarity, and value similarity using multiple algorithms
- Pass/Fail Threshold: 0.8 (configurable)
The MCP Evaluation System compares tool usage patterns between current and baseline executions using multi-level similarity calculations, which give a more nuanced evaluation than simple exact matching.
The system restricts comparisons to MCP tool calls only (tools with the mcp__ prefix), excluding framework tools such as:
- TodoWrite (task management)
- Bash (command execution)
- Read, Write, Edit (file operations)
This ensures evaluation focuses on actual MCP server interactions rather than agent implementation details.
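A minimal sketch of that filtering step, assuming each recorded tool call is a dict with a "name" field (the real log schema may differ):
# Illustrative sketch: keep only MCP tool calls for trajectory comparison.
# Assumes each recorded call is a dict with a "name" key; the real log schema may differ.
MCP_PREFIX = "mcp__"

def filter_mcp_calls(tool_calls: list[dict]) -> list[dict]:
    """Drop framework tools (TodoWrite, Bash, Read, ...) and keep MCP calls."""
    return [call for call in tool_calls if call.get("name", "").startswith(MCP_PREFIX)]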
Each tool call comparison evaluates:
- Tool Name Matching: Must be identical (different tools = 0.0 similarity)
- Argument Similarity: 30% key structure + 70% value similarity
- Key Similarity: Jaccard similarity of argument keys
- Value Similarity: Multi-method comparison:
- String values: Word intersection with Jaccard similarity
- Numeric values: Distance-based with configurable thresholds
- JSON objects: Cosine similarity using character frequency vectors
The overall trajectory score averages individual tool call similarities, providing a single metric for execution quality assessment.
- Jaccard Similarity: For set-based comparisons (keys, word sets)
- String Intersection: Word-level comparison for natural language queries
- Distance-Based Numeric: Configurable thresholds for numeric variations
- Cosine Similarity: Character frequency analysis for complex JSON structures
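To make the weighting concrete, here is a hedged sketch of a per-call comparison and the trajectory average under the rules above (30% key Jaccard, 70% value similarity, word-level Jaccard for string values). It omits the numeric and cosine methods and is not the actual implementation:
# Illustrative sketch of the per-call similarity and trajectory averaging described above.
def jaccard(a: set, b: set) -> float:
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def value_similarity(v1, v2) -> float:
    # Word-level Jaccard for strings; exact-match fallback for other types.
    if isinstance(v1, str) and isinstance(v2, str):
        return jaccard(set(v1.lower().split()), set(v2.lower().split()))
    return 1.0 if v1 == v2 else 0.0

def call_similarity(current: dict, baseline: dict) -> float:
    if current["name"] != baseline["name"]:
        return 0.0  # different tools never match
    cur_args, base_args = current.get("args", {}), baseline.get("args", {})
    if not cur_args and not base_args:
        return 1.0
    key_sim = jaccard(set(cur_args), set(base_args))
    shared = set(cur_args) & set(base_args)
    val_sim = (
        sum(value_similarity(cur_args[k], base_args[k]) for k in shared) / len(shared)
        if shared else 0.0
    )
    return 0.3 * key_sim + 0.7 * val_sim

def trajectory_score(current_calls: list[dict], baseline_calls: list[dict]) -> float:
    pairs = list(zip(current_calls, baseline_calls))
    if not pairs:
        return 0.0
    return sum(call_similarity(c, b) for c, b in pairs) / len(pairs)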
The system includes 19+ comprehensive test scenarios covering all major MCPProxy functionality:
- list_all_servers - Server discovery
- basic_tool_search - Tool discovery with BM25 search
- list_quarantined_servers - Security quarantine listing
- add_simple_server - Add new MCP servers
- remove_server - Remove existing servers
- check_server_logs - Server log inspection
- inspect_quarantined_server - Detailed security analysis
- server_status_check - Configuration validation
- list_registries - Registry discovery
- search_docker_registry - Docker registry search
- github_tool_discovery - GitHub tool discovery
- And more...
# Record a baseline execution
PYTHONPATH=src uv run python -m mcp_eval.cli record --scenario <scenario_file> [--output <output_dir>]
# Compare against baseline
PYTHONPATH=src uv run python -m mcp_eval.cli compare --scenario <scenario_file> --baseline <baseline_dir> [--output <output_dir>]
# Run multiple scenarios in batch
PYTHONPATH=src uv run python -m mcp_eval.cli batch --scenarios <scenarios_dir> [--output <output_dir>]
# Test runner - pytest-style scenario execution with compact output
PYTHONPATH=src uv run python -m mcp_eval.cli test [--tag <tag>] [--scenario <file>] [--fail-fast]
The test command provides a pytest-style interface for running MCP scenarios with compact output, automatic MCPProxy state management, and flexible filtering:
# Run all enabled scenarios
PYTHONPATH=src uv run python -m mcp_eval.cli test
# Filter scenarios by tags (security, server_management, tool_discovery, etc.)
PYTHONPATH=src uv run python -m mcp_eval.cli test --tag security --tag quarantine
# Run specific scenario files
PYTHONPATH=src uv run python -m mcp_eval.cli test --scenario scenarios/list_all_servers.yaml --scenario scenarios/search_tools_simple.yaml
# Stop on first failure (like pytest -x)
PYTHONPATH=src uv run python -m mcp_eval.cli test --tag server_management --fail-fast
# Verbose output for debugging
PYTHONPATH=src uv run python -m mcp_eval.cli test --scenario scenarios/debug_scenario.yaml --verbose
Output Format:
🧪 Running 3 scenarios
Filtered by tags: security
list_all_servers PASS 1.00
search_tools_simple FAIL 0.54
new_scenario RECORDED N/A
✅ 1 passed, 1 recorded, 1 failed
Test Runner Features:
- Automatic MCPProxy Restart: Ensures clean state between test runs
- Baseline Comparison: Compares against existing baselines if available, records new ones otherwise
- Tag Filtering: Filter scenarios using tags like security, server_management, tool_discovery
- File Selection: Run specific scenario files instead of entire directories
- Compact Output: Pytest-style output showing scenario name, status (PASS/FAIL/RECORDED/ERROR), and similarity score
- Fail-Fast Mode: Stop execution on first failure for quick debugging
- Status Types:
  - PASS: Similarity score ≥ 0.8 compared to baseline
  - FAIL: Similarity score < 0.8 or execution issues
  - RECORDED: New baseline recorded (no existing baseline found)
  - ERROR: Scenario loading or execution failure
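A hedged sketch of how these statuses could be assigned, reusing the 0.8 threshold from the metrics section (the branching itself is illustrative, not the CLI's code):
# Illustrative mapping of the statuses listed above; not the CLI's actual code.
def scenario_status(score: float | None, has_baseline: bool, error: bool = False,
                    threshold: float = 0.8) -> str:
    if error:
        return "ERROR"      # scenario failed to load or execute
    if not has_baseline:
        return "RECORDED"   # no baseline yet, so this run becomes the baseline
    return "PASS" if score is not None and score >= threshold else "FAIL"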
- --scenario: Path to YAML scenario file
- --baseline: Path to baseline directory for comparison
- --output: Output directory for results
- --mcp-config: MCP servers configuration file (default: mcp_servers.json)
- --tag: Filter scenarios by tag (can be used multiple times)
- --fail-fast: Stop on first failure
- --verbose: Enable verbose output for debugging
# Run all unit tests
PYTHONPATH=src uv run python -m pytest tests/ -v
# Run specific test file
PYTHONPATH=src uv run python -m pytest tests/test_similarity.py -v
- Create a YAML file in the scenarios/ directory
- Define user intent, expected trajectory, and success criteria
- Optionally create a custom config file in configs/
- Test with baseline recording
Example scenario structure:
enabled: true
name: "My Test Scenario"
description: "Test description"
config_file: "configs/minimal_config.json"
user_intent: "What the user wants to accomplish"
expected_trajectory:
  - action: "tool_action"
    tool: "mcp__tool_name"
    args:
      parameter: "value"
success_criteria:
  - "keyword_in_response"
  - "expected_behavior"
tags:
  - "category"
- MCPProxy container fails to start
  # Check Docker is running
  docker info
  # Verify MCPProxy source exists
  ls $MCPPROXY_SOURCE_PATH
  # Check container logs
  cd testing/docker && docker compose logs
- Tool discovery fails
  - This is normal and handled gracefully
  - Tool discovery failure doesn't affect scenario execution
  - MCP tools remain functional during conversations
- Permission errors
  # Ensure scripts are executable
  chmod +x testing/reset-mcpproxy.sh
  chmod +x testing/build-mcpproxy.sh
Enable detailed logging by setting environment variables:
export LOG_LEVEL=debug
export PYTHONPATH=src
uv run python -m mcp_eval.cli record --scenario scenarios/your_scenario.yaml
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure all tests pass
- Submit a pull request
MIT License - see LICENSE file for details.
- Issues: Report bugs and feature requests via GitHub Issues
- Documentation: See CLAUDE.md for detailed implementation notes
- Examples: Check the scenarios/ directory for usage examples