Benchmark System
The Claude Flow Benchmark System provides comprehensive performance testing and analysis for AI agent orchestration, swarm coordination, and development workflows. It enables systematic evaluation of different coordination strategies, agent configurations, and task execution patterns.
Key Update (v2.0.0-alpha.88): The benchmark CLI now uses real claude-flow execution by default. The `real` subcommand is deprecated; use the direct commands `swarm`, `hive-mind`, and `sparc` instead.
- Swarm Performance Testing - Evaluate multi-agent coordination effectiveness
- SPARC Mode Benchmarking - Test-driven development workflow analysis
- Hive-Mind Intelligence Testing - Collective decision-making performance
- Real Execution Metrics - Token usage, execution time, agent efficiency
- Comparative Analysis - Side-by-side strategy comparison
- Non-Interactive Mode - Full automation for CI/CD pipelines
- Comprehensive Reporting - HTML, JSON, and Markdown outputs
- Node.js 18+
- Python 3.8+
- Claude CLI (for real swarm execution)
- 8GB+ RAM recommended for large swarms
# Clone the repository
git clone https://github.com/ruvnet/claude-flow
cd claude-flow/benchmark
# Install Python dependencies
pip install -r requirements.txt
# Install Claude Flow CLI
npm install -g claude-flow@alpha
# Verify installation
swarm-bench --version
The `swarm-bench` CLI is the primary interface for running benchmarks.
# Run a swarm benchmark (always uses real claude-flow)
swarm-bench swarm "Build a REST API" --strategy development
# Run with specific agent count
swarm-bench swarm "Create authentication system" --max-agents 8
# Run hive-mind benchmark
swarm-bench hive-mind "Design architecture" --max-workers 8
# Run SPARC mode (mode before task)
swarm-bench sparc coder "Create user service"
# Run with parallel execution and monitoring
swarm-bench swarm "Implement database schema" --parallel --monitor
# List recent benchmark runs
swarm-bench list
# Show details for a specific run
swarm-bench show <run-id>
# Use mock execution for testing (without claude-flow)
swarm-bench run "Test objective" --mock
swarm-bench [command] [options]
Command | Description | Example |
---|---|---|
`swarm` | Run real swarm benchmarks | `swarm-bench swarm "Build API" --strategy development` |
`hive-mind` | Run hive-mind benchmarks | `swarm-bench hive-mind "Design system" --max-workers 8` |
`sparc` | Run SPARC mode benchmarks | `swarm-bench sparc coder "Implement feature"` |
`run` | Run benchmark (with `--mock` option) | `swarm-bench run "Test" --mock` |
`list` | List recent benchmark runs | `swarm-bench list` |
`show` | Show details for a run | `swarm-bench show <run-id>` |
`serve` | Start web interface | `swarm-bench serve` |
`clean` | Clean up results | `swarm-bench clean` |
-v, --verbose # Enable verbose output
-c, --config PATH # Configuration file path
--mock # Use mock execution instead of real claude-flow
--claude-flow-path TEXT # Path to claude-flow (default: ./claude-flow)
--timeout INTEGER # Command timeout in seconds
--stream / --no-stream # Stream output
--version # Show version
--help # Show help message
swarm-bench swarm OBJECTIVE [OPTIONS]
Positional:
OBJECTIVE # The goal or task for claude-flow (required)
Options:
--strategy [auto|research|development|analysis|testing|optimization|maintenance]
--mode [centralized|distributed|hierarchical|mesh|hybrid]
--sparc-mode TEXT # Specific SPARC mode to test
--all-modes # Test all SPARC modes and swarm strategies
--max-agents INTEGER # Maximum agents (default: 5)
--timeout INTEGER # Timeout in minutes (default: 60)
--task-timeout INTEGER # Individual task timeout in seconds (default: 300)
--parallel # Enable parallel execution
--monitor # Enable monitoring
-o, --output [json|sqlite] # Output formats
--output-dir PATH # Output directory (default: ./reports)
--name TEXT # Benchmark name
--description TEXT # Benchmark description
swarm-bench hive-mind TASK [OPTIONS]
Positional:
TASK # The task for the hive-mind (required)
Options:
--queen-type [strategic|tactical|adaptive] # Queen type (default: strategic)
--max-workers INTEGER # Maximum worker agents (default: 8)
--consensus [majority|weighted|byzantine] # Consensus algorithm (default: majority)
--timeout INTEGER # Timeout in minutes (default: 60)
--monitor # Enable monitoring
--output-dir PATH # Output directory (default: ./reports)
swarm-bench sparc MODE TASK [OPTIONS]
Positional:
MODE # SPARC mode: coder|architect|tdd|reviewer|tester|optimizer|documenter|debugger
TASK # The task to accomplish (required)
Options:
--namespace TEXT # Memory namespace
--timeout INTEGER # Timeout in minutes (default: 60)
--non-interactive # Non-interactive mode (default: True)
--output-dir PATH # Output directory (default: ./reports)
Tests multi-agent coordination with different topologies:
# Hierarchical topology
swarm-bench swarm "Build e-commerce platform" \
--mode hierarchical --max-agents 8
# Mesh topology for collaborative tasks
swarm-bench swarm "Research and implement ML pipeline" \
--mode mesh --max-agents 5
# Distributed mode with monitoring
swarm-bench swarm "Refactor legacy codebase" \
--mode distributed --monitor
# Development strategy with parallel execution
swarm-bench swarm "Build authentication system" \
--strategy development --max-agents 6 --parallel
# Test all SPARC modes
swarm-bench swarm "Shopping cart feature" \
--all-modes --parallel
Tests collective intelligence and consensus:
# Basic hive-mind benchmark
swarm-bench hive-mind "Design microservices architecture" \
--max-workers 8
# With adaptive queen type
swarm-bench hive-mind "Solve complex problem" \
--queen-type adaptive --max-workers 10
# Byzantine consensus for critical tasks
swarm-bench hive-mind "Critical security audit" \
--consensus byzantine --max-workers 7 --monitor
Tests different SPARC development phases:
# Run SPARC benchmark (mode is positional argument before task)
swarm-bench sparc coder "Create user service"
# Different SPARC modes
swarm-bench sparc architect "Design microservices"
swarm-bench sparc tdd "Create test suite"
swarm-bench sparc reviewer "Review codebase"
swarm-bench sparc optimizer "Optimize performance"
swarm-bench sparc debugger "Fix critical bug"
# With memory namespace
swarm-bench sparc coder "Implement authentication" \
--namespace auth-project --timeout 120
Metric | Description | Unit |
---|---|---|
`execution_time` | Total execution duration | seconds |
`input_tokens` | Tokens consumed by inputs | count |
`output_tokens` | Tokens generated | count |
`agents_spawned` | Number of agents created | count |
`tasks_completed` | Successfully completed tasks | count |
`coordination_efficiency` | Inter-agent coordination score | 0-100% |
`memory_usage` | Peak memory consumption | MB |
`error_rate` | Task failure percentage | % |
# Example metric thresholds
{
"execution_time": {"max": 300, "target": 150},
"token_efficiency": {"min": 0.7, "target": 0.85},
"success_rate": {"min": 0.95, "target": 0.99},
"coordination_score": {"min": 80, "target": 90}
}
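These thresholds are not enforced by swarm-bench itself, but they can be checked in CI with a few lines of Python. A minimal sketch, assuming a threshold file in the format above and a JSON report in the format shown later on this page (the script name and file paths are illustrative):
# check_thresholds.py - illustrative helper, not part of swarm-bench
import json
import sys

def check_thresholds(report_path, thresholds_path):
    results = json.load(open(report_path))["results"]
    thresholds = json.load(open(thresholds_path))
    ok = True
    for metric, bounds in thresholds.items():
        value = results.get(metric)
        if value is None:
            continue  # metric not present in this report
        if "max" in bounds and value > bounds["max"]:
            print(f"FAIL {metric}: {value} exceeds max {bounds['max']}")
            ok = False
        if "min" in bounds and value < bounds["min"]:
            print(f"FAIL {metric}: {value} below min {bounds['min']}")
            ok = False
    return ok

if __name__ == "__main__":
    report, thresholds = sys.argv[1], sys.argv[2]
    sys.exit(0 if check_thresholds(report, thresholds) else 1)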
Create `benchmark-config.yaml`:
benchmark:
  name: "API Development Benchmark"
  description: "Test REST API creation efficiency"
  defaults:
    mode: swarm
    agents: 6
    timeout: 1800
    iterations: 3
  tasks:
    - name: "Simple CRUD API"
      description: "Build basic CRUD endpoints"
      complexity: low
      expected_time: 120
    - name: "Authenticated API"
      description: "API with JWT authentication"
      complexity: medium
      expected_time: 300
    - name: "Microservices API"
      description: "Multi-service architecture"
      complexity: high
      expected_time: 600
  strategies:
    - hierarchical
    - mesh
    - adaptive
  metrics:
    collect:
      - execution_time
      - token_usage
      - agent_efficiency
      - coordination_score
  output:
    format: html
    include_charts: true
    export_raw: true
from swarm_benchmark import BenchmarkConfig, SwarmBenchmark

config = BenchmarkConfig(
    name="Custom Benchmark",
    mode="swarm",
    default_agents=6,
    timeout=1800,
    strategies=["hierarchical", "mesh"],
    metrics=["execution_time", "token_usage"]
)

benchmark = SwarmBenchmark(config)
results = benchmark.run(task="Build authentication system")
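The same API can drive a small strategy sweep from one script. A minimal sketch that reuses only the `BenchmarkConfig` fields and `run()` call shown above; how the returned results are inspected is left open, since their structure is not documented here:
from swarm_benchmark import BenchmarkConfig, SwarmBenchmark

# Run the same task under each strategy and keep whatever run() returns
results_by_strategy = {}
for strategy in ["hierarchical", "mesh"]:
    config = BenchmarkConfig(
        name=f"Auth benchmark ({strategy})",
        mode="swarm",
        default_agents=6,
        timeout=1800,
        strategies=[strategy],
        metrics=["execution_time", "token_usage"],
    )
    results_by_strategy[strategy] = SwarmBenchmark(config).run(
        task="Build authentication system"
    )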
# 1. Simple swarm benchmark
swarm-bench swarm "Create TODO app with React" --strategy development
# 2. Hive-mind benchmark
swarm-bench hive-mind "Design payment system" --max-workers 8
# 3. SPARC mode benchmark
swarm-bench sparc coder "Build authentication"
# 4. CI/CD integration with parallel execution
swarm-bench swarm "$TASK" \
--parallel \
--output json \
--output-dir ./results
# Swarm with all SPARC modes testing
swarm-bench swarm "Build microservices platform" \
--all-modes \
--parallel \
--timeout 120 \
--max-agents 10
# Hive-mind with Byzantine consensus
swarm-bench hive-mind "Performance-critical API" \
--consensus byzantine \
--max-workers 12 \
--queen-type tactical
# SPARC with namespace and extended timeout
swarm-bench sparc architect "Complex system refactoring" \
--namespace project-v2 \
--timeout 180
# Custom configuration with verbose output
swarm-bench swarm "Enterprise application" \
-c custom-config.yaml \
--strategy optimization \
--max-agents 15 \
-v
HTML output can be generated with the `--output html` option:
swarm-bench run "Build API" --output html --output-dir ./reports
Note: The HTML output is currently basic. Enhanced visualization features are planned for future releases.
Structured data for programmatic analysis:
{
  "benchmark": {
    "id": "bench-2024-01-15-001",
    "task": "Build REST API",
    "mode": "swarm",
    "timestamp": "2024-01-15T10:30:00Z"
  },
  "results": {
    "execution_time": 245.3,
    "input_tokens": 15234,
    "output_tokens": 48291,
    "agents_spawned": 6,
    "tasks_completed": 12,
    "success_rate": 1.0
  },
  "analysis": {
    "efficiency_score": 87.5,
    "coordination_score": 92.3,
    "recommendations": [...]
  }
}
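Because the report is plain JSON, it can be post-processed directly. A short sketch that prints a one-line summary of a report, using the field names from the example above (the file path is illustrative):
import json

# Summarize one benchmark report (schema as in the example above)
with open("reports/bench-2024-01-15-001.json") as f:  # illustrative path
    report = json.load(f)

res = report["results"]
tokens = res["input_tokens"] + res["output_tokens"]
print(f"{report['benchmark']['task']}: "
      f"{res['execution_time']:.1f}s, {tokens} tokens, "
      f"{res['success_rate'] * 100:.0f}% success")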
To view benchmark results, use the `show` command:
swarm-bench show <benchmark-id> --format json
name: Benchmark Suite
on:
  schedule:
    - cron: '0 2 * * *'  # Daily at 2 AM
  workflow_dispatch:
jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: |
          pip install -r benchmark/requirements.txt
          npm install -g claude-flow@alpha
      - name: Run benchmarks
        run: |
          cd benchmark
          swarm-bench swarm "Run test suite" \
            --strategy auto \
            --output json \
            --output-dir ./results
      - name: Upload results
        uses: actions/upload-artifact@v3
        with:
          name: benchmark-results
          path: benchmark/results/
      - name: List benchmark results
        run: |
          cd benchmark
          swarm-bench list --format json > results-list.json
      - name: Comment on PR
        if: github.event_name == 'pull_request'
        run: |
          cd benchmark
          echo "## Benchmark Results" > comment.md
          swarm-bench list >> comment.md
          gh pr comment $PR_NUMBER --body-file comment.md
FROM python:3.10-slim
WORKDIR /benchmark
# Install dependencies
COPY requirements.txt .
RUN pip install -r requirements.txt
# Install Node.js and Claude Flow
RUN apt-get update && apt-get install -y nodejs npm
RUN npm install -g claude-flow@alpha
# Copy benchmark code
COPY . .
# Run benchmark
ENTRYPOINT ["swarm-bench"]
CMD ["swarm", "Default benchmark", "--strategy", "auto"]
# Load testing with increasing agent counts
for agents in 2 4 6 8 10 12; do
swarm-bench swarm "Scale test" --max-agents $agents \
--output json --output-dir ./results
done
# Stress testing with complex tasks
swarm-bench swarm "Build complete SaaS platform" \
--max-agents 12 --timeout 120 --mode distributed
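The scale-test loop above writes one JSON report per run into ./results, so a small script can line up execution time against agent count. A sketch, assuming each report follows the JSON format shown earlier:
import glob
import json

# Collect (agents_spawned, execution_time) pairs from the scale-test reports
rows = []
for path in glob.glob("results/*.json"):
    with open(path) as f:
        results = json.load(f)["results"]
    rows.append((results["agents_spawned"], results["execution_time"]))

for agents, seconds in sorted(rows):
    print(f"{agents:>3} agents: {seconds:8.1f}s")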
# Run benchmark and save results
swarm-bench swarm "Standard benchmark suite" \
--output json --output-dir ./baseline
# Run current version
swarm-bench swarm "Standard benchmark suite" \
--output json --output-dir ./current
# Compare results manually or with custom scripts
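One way to do that comparison is a small script that pairs baseline and current reports by task name and flags slowdowns. A sketch under the same JSON-report assumption; the 10% tolerance is arbitrary:
import glob
import json

TOLERANCE = 1.10  # flag runs more than 10% slower than baseline (arbitrary)

def load_times(directory):
    """Map task name -> execution time for each JSON report in a directory."""
    times = {}
    for path in glob.glob(f"{directory}/*.json"):
        with open(path) as f:
            report = json.load(f)
        times[report["benchmark"]["task"]] = report["results"]["execution_time"]
    return times

baseline = load_times("baseline")
current = load_times("current")

for task, base_time in baseline.items():
    if task not in current:
        continue
    status = "REGRESSION" if current[task] > base_time * TOLERANCE else "ok"
    print(f"{status:>10}  {task}: {base_time:.1f}s -> {current[task]:.1f}s")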
- Use consistent, measurable tasks
- Include complexity indicators
- Define clear success criteria
- Provide adequate context
- Start with auto-detection
- Scale based on task complexity
- Balance agent types
- Monitor coordination overhead
- Focus on relevant metrics
- Set realistic thresholds
- Track trends over time
- Consider cost vs performance
- Use non-interactive mode for CI/CD
- Implement regression checks
- Archive historical results
- Generate automated reports
Issue | Solution |
---|---|
Timeout errors | Increase --timeout or optimize task |
High token usage | Reduce agent count or use efficient strategies |
Low coordination score | Switch to mesh or adaptive topology |
Inconsistent results | Increase iterations for statistical significance |
# Enable detailed logging
swarm-bench swarm "Debug test" -v
# Run with mock execution for testing
swarm-bench run "Test task" --mock
# Check configuration
swarm-bench --version
swarm-bench --help
from swarm_benchmark import SwarmBenchmark
from swarm_benchmark.core import BaseExecutor

class CustomExecutor(BaseExecutor):
    def execute_swarm(self, config):
        # Custom swarm execution logic
        pass

    def execute_sparc(self, config):
        # Custom SPARC execution logic
        pass

# Register custom executor
benchmark = SwarmBenchmark(executor=CustomExecutor())
from swarm_benchmark.analysis import StatisticalAnalyzer

analyzer = StatisticalAnalyzer()
results = analyzer.compare_strategies(
    data=benchmark_results,
    test="anova",
    confidence=0.95
)

print(f"P-value: {results.p_value}")
print(f"Significant difference: {results.is_significant}")
- SWE-bench Evaluation - Official software engineering benchmark
- Performance Benchmarking - Comprehensive performance testing guide
- Agent Performance Analysis - Understanding agent efficiency
- Multi-Mode Comparison - Comparing execution strategies
- Swarm Orchestration
- Agent System Overview
- SPARC Methodology
- Performance Optimization
- Non-Interactive Mode
- GitHub Issues: Report bugs or request features
- Documentation: Full documentation
- Examples: Benchmark examples
Claude Flow Benchmark System - Measure, analyze, and optimize AI agent performance