Benchmark System
The Claude Flow Benchmark System provides comprehensive performance testing and analysis for AI agent orchestration, swarm coordination, and development workflows. It enables systematic evaluation of different coordination strategies, agent configurations, and task execution patterns.
Key Update (v2.0.0-alpha.88): The benchmark CLI now uses real claude-flow execution by default. The `real` subcommand is deprecated; use the direct commands `swarm`, `hive-mind`, and `sparc` instead.
- Swarm Performance Testing - Evaluate multi-agent coordination effectiveness
- SPARC Mode Benchmarking - Test-driven development workflow analysis
- Hive-Mind Intelligence Testing - Collective decision-making performance
- Real Execution Metrics - Token usage, execution time, agent efficiency
- Comparative Analysis - Side-by-side strategy comparison
- Non-Interactive Mode - Full automation for CI/CD pipelines
- Comprehensive Reporting - HTML, JSON, and Markdown outputs
- Node.js 18+
- Python 3.8+
- Claude CLI (for real swarm execution)
- 8GB+ RAM recommended for large swarms
# Clone the repository
git clone https://github.com/ruvnet/claude-flow
cd claude-flow/benchmark
# Install Python dependencies
pip install -r requirements.txt
# Install Claude Flow CLI
npm install -g claude-flow@alpha
# Verify installation
swarm-bench --version
The `swarm-bench` CLI is the primary interface for running benchmarks.
# Run a swarm benchmark (always uses real claude-flow)
swarm-bench swarm "Build a REST API" --strategy development
# Run with specific agent count
swarm-bench swarm "Create authentication system" --max-agents 8
# Run hive-mind benchmark
swarm-bench hive-mind "Design architecture" --max-workers 8
# Run SPARC mode (mode before task)
swarm-bench sparc coder "Create user service"
# Run with parallel execution and monitoring
swarm-bench swarm "Implement database schema" --parallel --monitor
# List recent benchmark runs
swarm-bench list
# Show details for a specific run
swarm-bench show <run-id>
# Use mock execution for testing (without claude-flow)
swarm-bench run "Test objective" --mock
swarm-bench [command] [options]
Command | Description | Example |
---|---|---|
`swarm` | Run real swarm benchmarks | `swarm-bench swarm "Build API" --strategy development` |
`hive-mind` | Run hive-mind benchmarks | `swarm-bench hive-mind "Design system" --max-workers 8` |
`sparc` | Run SPARC mode benchmarks | `swarm-bench sparc coder "Implement feature"` |
`run` | Run benchmark (with `--mock` option) | `swarm-bench run "Test" --mock` |
`list` | List recent benchmark runs | `swarm-bench list` |
`show` | Show details for a run | `swarm-bench show <run-id>` |
`serve` | Start web interface | `swarm-bench serve` |
`clean` | Clean up results | `swarm-bench clean` |
-v, --verbose # Enable verbose output
-c, --config PATH # Configuration file path
--mock # Use mock execution instead of real claude-flow
--claude-flow-path TEXT # Path to claude-flow (default: ./claude-flow)
--timeout INTEGER # Command timeout in seconds
--stream / --no-stream # Stream output
--version # Show version
--help # Show help message
swarm-bench swarm OBJECTIVE [OPTIONS]
Positional:
OBJECTIVE # The goal or task for claude-flow (required)
Options:
--strategy [auto|research|development|analysis|testing|optimization|maintenance]
--mode [centralized|distributed|hierarchical|mesh|hybrid]
--sparc-mode TEXT # Specific SPARC mode to test
--all-modes # Test all SPARC modes and swarm strategies
--max-agents INTEGER # Maximum agents (default: 5)
--timeout INTEGER # Timeout in minutes (default: 60)
--task-timeout INTEGER # Individual task timeout in seconds (default: 300)
--parallel # Enable parallel execution
--monitor # Enable monitoring
-o, --output [json|sqlite] # Output formats
--output-dir PATH # Output directory (default: ./reports)
--name TEXT # Benchmark name
--description TEXT # Benchmark description
swarm-bench hive-mind TASK [OPTIONS]
Positional:
TASK # The task for the hive-mind (required)
Options:
--queen-type [strategic|tactical|adaptive] # Queen type (default: strategic)
--max-workers INTEGER # Maximum worker agents (default: 8)
--consensus [majority|weighted|byzantine] # Consensus algorithm (default: majority)
--timeout INTEGER # Timeout in minutes (default: 60)
--monitor # Enable monitoring
--output-dir PATH # Output directory (default: ./reports)
swarm-bench sparc MODE TASK [OPTIONS]
Positional:
MODE # SPARC mode: coder|architect|tdd|reviewer|tester|optimizer|documenter|debugger
TASK # The task to accomplish (required)
Options:
--namespace TEXT # Memory namespace
--timeout INTEGER # Timeout in minutes (default: 60)
--non-interactive # Non-interactive mode (default: True)
--output-dir PATH # Output directory (default: ./reports)
Tests multi-agent coordination with different topologies:
# Hierarchical topology
swarm-bench swarm "Build e-commerce platform" \
--mode hierarchical --max-agents 8
# Mesh topology for collaborative tasks
swarm-bench swarm "Research and implement ML pipeline" \
--mode mesh --max-agents 5
# Distributed mode with monitoring
swarm-bench swarm "Refactor legacy codebase" \
--mode distributed --monitor
# Development strategy with parallel execution
swarm-bench swarm "Build authentication system" \
--strategy development --max-agents 6 --parallel
# Test all SPARC modes
swarm-bench swarm "Shopping cart feature" \
--all-modes --parallel
Tests collective intelligence and consensus:
# Basic hive-mind benchmark
swarm-bench hive-mind "Design microservices architecture" \
--max-workers 8
# With adaptive queen type
swarm-bench hive-mind "Solve complex problem" \
--queen-type adaptive --max-workers 10
# Byzantine consensus for critical tasks
swarm-bench hive-mind "Critical security audit" \
--consensus byzantine --max-workers 7 --monitor
Tests different SPARC development phases:
# Run SPARC benchmark (mode is positional argument before task)
swarm-bench sparc coder "Create user service"
# Different SPARC modes
swarm-bench sparc architect "Design microservices"
swarm-bench sparc tdd "Create test suite"
swarm-bench sparc reviewer "Review codebase"
swarm-bench sparc optimizer "Optimize performance"
swarm-bench sparc debugger "Fix critical bug"
# With memory namespace
swarm-bench sparc coder "Implement authentication" \
--namespace auth-project --timeout 120
Metric | Description | Unit |
---|---|---|
`execution_time` | Total execution duration | seconds |
`input_tokens` | Tokens consumed by inputs | count |
`output_tokens` | Tokens generated | count |
`agents_spawned` | Number of agents created | count |
`tasks_completed` | Successfully completed tasks | count |
`coordination_efficiency` | Inter-agent coordination score | 0-100% |
`memory_usage` | Peak memory consumption | MB |
`error_rate` | Task failure percentage | % |
# Example metric thresholds
{
"execution_time": {"max": 300, "target": 150},
"token_efficiency": {"min": 0.7, "target": 0.85},
"success_rate": {"min": 0.95, "target": 0.99},
"coordination_score": {"min": 80, "target": 90}
}
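These thresholds are not enforced by swarm-bench itself, but they can be checked in CI with a few lines of Python. A minimal sketch, assuming a threshold file in the format above and a JSON report in the format shown later on this page (the script name and file paths are illustrative):
# check_thresholds.py - illustrative helper, not part of swarm-bench
import json
import sys

def check_thresholds(report_path, thresholds_path):
    results = json.load(open(report_path))["results"]
    thresholds = json.load(open(thresholds_path))
    ok = True
    for metric, bounds in thresholds.items():
        value = results.get(metric)
        if value is None:
            continue  # metric not present in this report
        if "max" in bounds and value > bounds["max"]:
            print(f"FAIL {metric}: {value} exceeds max {bounds['max']}")
            ok = False
        if "min" in bounds and value < bounds["min"]:
            print(f"FAIL {metric}: {value} below min {bounds['min']}")
            ok = False
    return ok

if __name__ == "__main__":
    report, thresholds = sys.argv[1], sys.argv[2]
    sys.exit(0 if check_thresholds(report, thresholds) else 1)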
Create `benchmark-config.yaml`:
benchmark:
  name: "API Development Benchmark"
  description: "Test REST API creation efficiency"
  defaults:
    mode: swarm
    agents: 6
    timeout: 1800
    iterations: 3
  tasks:
    - name: "Simple CRUD API"
      description: "Build basic CRUD endpoints"
      complexity: low
      expected_time: 120
    - name: "Authenticated API"
      description: "API with JWT authentication"
      complexity: medium
      expected_time: 300
    - name: "Microservices API"
      description: "Multi-service architecture"
      complexity: high
      expected_time: 600
  strategies:
    - hierarchical
    - mesh
    - adaptive
  metrics:
    collect:
      - execution_time
      - token_usage
      - agent_efficiency
      - coordination_score
  output:
    format: html
    include_charts: true
    export_raw: true
from swarm_benchmark import BenchmarkConfig, SwarmBenchmark

config = BenchmarkConfig(
    name="Custom Benchmark",
    mode="swarm",
    default_agents=6,
    timeout=1800,
    strategies=["hierarchical", "mesh"],
    metrics=["execution_time", "token_usage"]
)

benchmark = SwarmBenchmark(config)
results = benchmark.run(task="Build authentication system")
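The same API can drive a small strategy sweep from one script. A minimal sketch that reuses only the `BenchmarkConfig` fields and `run()` call shown above; how the returned results are inspected is left open, since their structure is not documented here:
from swarm_benchmark import BenchmarkConfig, SwarmBenchmark

# Run the same task under each strategy and keep whatever run() returns
results_by_strategy = {}
for strategy in ["hierarchical", "mesh"]:
    config = BenchmarkConfig(
        name=f"Auth benchmark ({strategy})",
        mode="swarm",
        default_agents=6,
        timeout=1800,
        strategies=[strategy],
        metrics=["execution_time", "token_usage"],
    )
    results_by_strategy[strategy] = SwarmBenchmark(config).run(
        task="Build authentication system"
    )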
# 1. Simple swarm benchmark
swarm-bench swarm "Create TODO app with React" --strategy development
# 2. Hive-mind benchmark
swarm-bench hive-mind "Design payment system" --max-workers 8
# 3. SPARC mode benchmark
swarm-bench sparc coder "Build authentication"
# 4. CI/CD integration with parallel execution
swarm-bench swarm "$TASK" \
--parallel \
--output json \
--output-dir ./results
# Swarm with all SPARC modes testing
swarm-bench swarm "Build microservices platform" \
--all-modes \
--parallel \
--timeout 120 \
--max-agents 10
# Hive-mind with Byzantine consensus
swarm-bench hive-mind "Performance-critical API" \
--consensus byzantine \
--max-workers 12 \
--queen-type tactical
# SPARC with namespace and extended timeout
swarm-bench sparc architect "Complex system refactoring" \
--namespace project-v2 \
--timeout 180
# Custom configuration with verbose output
swarm-bench swarm "Enterprise application" \
-c custom-config.yaml \
--strategy optimization \
--max-agents 15 \
-v
HTML output can be generated with the `--output html` option:
swarm-bench run "Build API" --output html --output-dir ./reports
Note: The HTML output is currently basic. Enhanced visualization features are planned for future releases.
Structured data for programmatic analysis:
{
  "benchmark": {
    "id": "bench-2024-01-15-001",
    "task": "Build REST API",
    "mode": "swarm",
    "timestamp": "2024-01-15T10:30:00Z"
  },
  "results": {
    "execution_time": 245.3,
    "input_tokens": 15234,
    "output_tokens": 48291,
    "agents_spawned": 6,
    "tasks_completed": 12,
    "success_rate": 1.0
  },
  "analysis": {
    "efficiency_score": 87.5,
    "coordination_score": 92.3,
    "recommendations": [...]
  }
}
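Because the report is plain JSON, it can be post-processed directly. A short sketch that prints a one-line summary of a report, using the field names from the example above (the file path is illustrative):
import json

# Summarize one benchmark report (schema as in the example above)
with open("reports/bench-2024-01-15-001.json") as f:  # illustrative path
    report = json.load(f)

res = report["results"]
tokens = res["input_tokens"] + res["output_tokens"]
print(f"{report['benchmark']['task']}: "
      f"{res['execution_time']:.1f}s, {tokens} tokens, "
      f"{res['success_rate'] * 100:.0f}% success")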
To view benchmark results, use the `show` command:
swarm-bench show <benchmark-id> --format json
name: Benchmark Suite
on:
  schedule:
    - cron: '0 2 * * *'  # Daily at 2 AM
  workflow_dispatch:
jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: |
          pip install -r benchmark/requirements.txt
          npm install -g claude-flow@alpha
      - name: Run benchmarks
        run: |
          cd benchmark
          swarm-bench swarm "Run test suite" \
            --strategy auto \
            --output json \
            --output-dir ./results
      - name: Upload results
        uses: actions/upload-artifact@v3
        with:
          name: benchmark-results
          path: benchmark/results/
      - name: List benchmark results
        run: |
          cd benchmark
          swarm-bench list --format json > results-list.json
      - name: Comment on PR
        if: github.event_name == 'pull_request'
        run: |
          cd benchmark
          echo "## Benchmark Results" > comment.md
          swarm-bench list >> comment.md
          gh pr comment $PR_NUMBER --body-file comment.md
FROM python:3.10-slim
WORKDIR /benchmark
# Install dependencies
COPY requirements.txt .
RUN pip install -r requirements.txt
# Install Node.js and Claude Flow
RUN apt-get update && apt-get install -y nodejs npm
RUN npm install -g claude-flow@alpha
# Copy benchmark code
COPY . .
# Run benchmark
ENTRYPOINT ["swarm-bench"]
CMD ["swarm", "Default benchmark", "--strategy", "auto"]
# Load testing with increasing agent counts
for agents in 2 4 6 8 10 12; do
swarm-bench swarm "Scale test" --max-agents $agents \
--output json --output-dir ./results
done
# Stress testing with complex tasks
swarm-bench swarm "Build complete SaaS platform" \
--max-agents 12 --timeout 120 --mode distributed
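The scale-test loop above writes one JSON report per run into ./results, so a small script can line up execution time against agent count. A sketch, assuming each report follows the JSON format shown earlier:
import glob
import json

# Collect (agents_spawned, execution_time) pairs from the scale-test reports
rows = []
for path in glob.glob("results/*.json"):
    with open(path) as f:
        results = json.load(f)["results"]
    rows.append((results["agents_spawned"], results["execution_time"]))

for agents, seconds in sorted(rows):
    print(f"{agents:>3} agents: {seconds:8.1f}s")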
# Run benchmark and save results
swarm-bench swarm "Standard benchmark suite" \
--output json --output-dir ./baseline
# Run current version
swarm-bench swarm "Standard benchmark suite" \
--output json --output-dir ./current
# Compare results manually or with custom scripts
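One way to do that comparison is a small script that pairs baseline and current reports by task name and flags slowdowns. A sketch under the same JSON-report assumption; the 10% tolerance is arbitrary:
import glob
import json

TOLERANCE = 1.10  # flag runs more than 10% slower than baseline (arbitrary)

def load_times(directory):
    """Map task name -> execution time for each JSON report in a directory."""
    times = {}
    for path in glob.glob(f"{directory}/*.json"):
        with open(path) as f:
            report = json.load(f)
        times[report["benchmark"]["task"]] = report["results"]["execution_time"]
    return times

baseline = load_times("baseline")
current = load_times("current")

for task, base_time in baseline.items():
    if task not in current:
        continue
    status = "REGRESSION" if current[task] > base_time * TOLERANCE else "ok"
    print(f"{status:>10}  {task}: {base_time:.1f}s -> {current[task]:.1f}s")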
- Use consistent, measurable tasks
- Include complexity indicators
- Define clear success criteria
- Provide adequate context
- Start with auto-detection
- Scale based on task complexity
- Balance agent types
- Monitor coordination overhead
- Focus on relevant metrics
- Set realistic thresholds
- Track trends over time
- Consider cost vs performance
- Use non-interactive mode for CI/CD
- Implement regression checks
- Archive historical results
- Generate automated reports
Issue | Solution |
---|---|
Timeout errors | Increase --timeout or optimize task |
High token usage | Reduce agent count or use efficient strategies |
Low coordination score | Switch to mesh or adaptive topology |
Inconsistent results | Increase iterations for statistical significance |
# Enable detailed logging
swarm-bench swarm "Debug test" -v
# Run with mock execution for testing
swarm-bench run "Test task" --mock
# Check configuration
swarm-bench --version
swarm-bench --help
from swarm_benchmark import SwarmBenchmark
from swarm_benchmark.core import BaseExecutor

class CustomExecutor(BaseExecutor):
    def execute_swarm(self, config):
        # Custom swarm execution logic
        pass

    def execute_sparc(self, config):
        # Custom SPARC execution logic
        pass

# Register custom executor
benchmark = SwarmBenchmark(executor=CustomExecutor())
from swarm_benchmark.analysis import StatisticalAnalyzer

analyzer = StatisticalAnalyzer()
results = analyzer.compare_strategies(
    data=benchmark_results,
    test="anova",
    confidence=0.95
)

print(f"P-value: {results.p_value}")
print(f"Significant difference: {results.is_significant}")
- SWE-bench Evaluation - Official software engineering benchmark
- Performance Benchmarking - Comprehensive performance testing guide
- Agent Performance Analysis - Understanding agent efficiency
- Multi-Mode Comparison - Comparing execution strategies
- Swarm Orchestration
- Agent System Overview
- SPARC Methodology
- Performance Optimization
- Non-Interactive Mode
- GitHub Issues: Report bugs or request features
- Documentation: Full documentation
- Examples: Benchmark examples
Claude Flow Benchmark System - Measure, analyze, and optimize AI agent performance