# Performance Benchmarking
Claude-flow includes comprehensive benchmarking capabilities to evaluate and optimize AI agent performance across different execution modes, coordination strategies, and task types. This guide covers all benchmarking features and best practices.
```mermaid
graph TD
    A[Benchmark Engine] --> B[Task Generator]
    A --> C[Execution Monitor]
    A --> D[Metrics Collector]
    A --> E[Report Generator]

    B --> F[Synthetic Tasks]
    B --> G[Real-World Tasks]
    B --> H[SWE-bench Tasks]

    C --> I[Agent Coordination]
    C --> J[Resource Usage]
    C --> K[Error Tracking]

    D --> L[Performance Metrics]
    D --> M[Success Rates]
    D --> N[Timing Data]

    E --> O[HTML Reports]
    E --> P[JSON Data]
    E --> Q[Comparison Analysis]
```
- **Synthetic Benchmarks**: Controlled tasks for consistent measurement
- **Real-World Benchmarks**: Actual software engineering tasks
- **SWE-bench Integration**: Official academic benchmark
- **Load Testing**: High-concurrency performance evaluation
- **Stress Testing**: Resource limit and failure-mode analysis
```bash
# Run standard benchmark suite
swarm-bench run --strategy development --mode centralized

# Compare multiple configurations
swarm-bench run --strategy development,optimization,research --mode centralized,distributed

# Load test with high concurrency
swarm-bench load-test --agents 50 --tasks 100
```

```bash
# Quick SWE-bench test
swarm-bench swe-bench official --limit 10

# Multi-mode comparison
swarm-bench swe-bench multi-mode --instances 5

# Full evaluation
swarm-bench swe-bench official --lite
```
```bash
swarm-bench run [OPTIONS]
```

**Options:**

- `--strategy STRATEGY`: Execution strategy (development, optimization, research, testing)
- `--mode MODE`: Coordination mode (centralized, distributed, hierarchical, mesh)
- `--agents N`: Number of agents (1-50)
- `--tasks N`: Number of tasks to execute
- `--timeout N`: Task timeout in seconds
- `--iterations N`: Number of benchmark iterations
- `--output DIR`: Output directory for results
- `--config FILE`: Custom configuration file
**Examples:**

```bash
# Basic performance test
swarm-bench run --strategy development --mode centralized --agents 5

# Comprehensive evaluation
swarm-bench run --strategy development,optimization --mode centralized,distributed --agents 3,5,8

# Custom configuration
swarm-bench run --config benchmark-config.yaml --iterations 10
```
```bash
swarm-bench load-test [OPTIONS]
```

**Options:**

- `--agents N`: Concurrent agents (default: 10)
- `--tasks N`: Total tasks (default: 100)
- `--duration N`: Test duration in seconds
- `--ramp-up N`: Ramp-up time in seconds
- `--target-rate N`: Target tasks per second
**Examples:**

```bash
# Standard load test
swarm-bench load-test --agents 20 --tasks 200

# Sustained load test
swarm-bench load-test --duration 300 --target-rate 5

# Stress test
swarm-bench load-test --agents 100 --tasks 1000
```
```bash
swarm-bench compare [OPTIONS] RESULT1 RESULT2 [RESULT3...]
```

**Options:**

- `--metric METRIC`: Primary comparison metric (success_rate, duration, throughput)
- `--format FORMAT`: Output format (table, json, html)
- `--output FILE`: Save comparison report
**Examples:**

```bash
# Compare two benchmark runs
swarm-bench compare results/run1.json results/run2.json

# Generate HTML comparison report
swarm-bench compare --format html --output comparison.html results/*.json
```
```yaml
# benchmark-config.yaml
benchmark:
  name: "Custom Performance Test"
  description: "Testing optimal configurations"

  tasks:
    types: ["coding", "analysis", "debugging"]
    difficulty: ["easy", "medium", "hard"]
    count: 50
    timeout: 300

  execution:
    strategies: ["development", "optimization"]
    modes: ["centralized", "distributed", "mesh"]
    agent_counts: [3, 5, 8, 12]
    max_retries: 3

  output:
    directory: "benchmark/results"
    formats: ["json", "html", "csv"]
    include_logs: true

reporting:
  metrics:
    - "success_rate"
    - "average_duration"
    - "throughput"
    - "error_rate"
    - "resource_usage"
  charts:
    - "performance_trends"
    - "agent_efficiency"
    - "error_distribution"
```
```yaml
# agent-profiles.yaml
profiles:
  lightweight:
    max_agents: 3
    coordination: "centralized"
    timeout: 120
    memory_limit: "256MB"

  standard:
    max_agents: 5
    coordination: "distributed"
    timeout: 300
    memory_limit: "512MB"

  high_performance:
    max_agents: 8
    coordination: "mesh"
    timeout: 600
    memory_limit: "1GB"

  maximum:
    max_agents: 12
    coordination: "hierarchical"
    timeout: 900
    memory_limit: "2GB"
```
### Success Rate

- **Definition**: Percentage of tasks completed successfully
- **Target**: >85% for production workloads
- **Calculation**: `successful_tasks / total_tasks * 100`

### Average Duration

- **Definition**: Mean time to complete a task
- **Target**: <5 minutes for simple tasks, <15 minutes for complex tasks
- **Includes**: Queue time + execution time + coordination overhead

### Throughput

- **Definition**: Tasks completed per unit time
- **Target**: >10 tasks/hour for development workflows
- **Calculation**: `completed_tasks / total_time_hours`
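For concreteness, the two calculations above can be written as small helper functions. This is an illustrative sketch only; the function names are hypothetical and not part of the swarm-bench package.

```python
# Illustrative helpers for the formulas above (hypothetical names,
# not part of the swarm-bench package).

def success_rate(successful_tasks: int, total_tasks: int) -> float:
    """Percentage of tasks completed successfully."""
    return successful_tasks / total_tasks * 100 if total_tasks else 0.0

def throughput(completed_tasks: int, total_time_hours: float) -> float:
    """Tasks completed per hour of wall-clock time."""
    return completed_tasks / total_time_hours if total_time_hours else 0.0

print(success_rate(87, 100))  # 87.0 -> meets the >85% target
print(throughput(87, 6.0))    # 14.5 -> exceeds the >10 tasks/hour target
```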
### Agent Efficiency

- **Definition**: Percentage of time agents spend on productive work
- **Target**: >70% utilization
- **Factors**: Coordination overhead, idle time, resource contention

### Resource Utilization

- **CPU Usage**: Target <80% average, <95% peak
- **Memory Usage**: Target <70% of available memory
- **Network I/O**: Bandwidth utilization for distributed coordination

### Code Quality (for software engineering tasks)

- Syntax correctness
- Logical soundness
- Best-practices adherence
- Test coverage

### Solution Accuracy

- Requirement fulfillment
- Edge-case handling
- Error-handling robustness

### Coordination Effectiveness

- Message-passing efficiency
- Consensus achievement time
- Conflict-resolution success
### Agent Count Optimization

```bash
# Find optimal agent count for your workload
swarm-bench optimize --metric throughput --param agents --range 1,20

# Results example:
#  3 agents: 12.5 tasks/hour
#  5 agents: 18.3 tasks/hour ← Optimal
#  8 agents: 16.2 tasks/hour (coordination overhead)
```
### Coordination Mode Selection

```bash
# Test different coordination modes
swarm-bench run --mode centralized,distributed,hierarchical,mesh --agents 5

# Performance comparison:
# centralized:  Fast for simple tasks, bottlenecks at scale
# distributed:  Good scalability, higher latency
# hierarchical: Balanced, good for complex workflows
# mesh:         Highest throughput, most resource intensive
```
### Strategy Optimization

```bash
# Compare strategies for your task type
swarm-bench run --strategy development,optimization,research,testing

# Task-specific recommendations:
# development:  General purpose, good baseline
# optimization: Best performance for complex tasks
# research:     Better for exploratory/analytical work
# testing:      Focused on validation and quality
```
### Memory Optimization

```bash
# Monitor memory usage during benchmarks
swarm-bench run --monitor-memory --agents 8

# Recommended settings:
# JVM heap: -Xmx2G (adjust based on agent count)
# Node.js:  --max-old-space-size=4096
# Python:   Set appropriate limits for worker processes
```
### Concurrency Tuning

```bash
# Test different concurrency levels
swarm-bench load-test --agents 10,20,30,40,50

# Find the sweet spot between throughput and stability
```
```yaml
# .github/workflows/benchmark.yml
name: Performance Benchmark

on:
  push:
    branches: [main]
  schedule:
    - cron: '0 2 * * *'  # Daily at 2 AM

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Setup Node.js
        uses: actions/setup-node@v3
        with:
          node-version: '18'

      - name: Install dependencies
        run: npm install

      - name: Build claude-flow
        run: npm run build

      - name: Run performance benchmarks
        run: |
          swarm-bench run --agents 5 --iterations 3
          swarm-bench swe-bench official --limit 10

      - name: Upload results
        uses: actions/upload-artifact@v3
        with:
          name: benchmark-results
          path: benchmark/results/
```
```bash
#!/bin/bash
# performance-regression.sh

# Run current benchmark
swarm-bench run --config baseline-config.yaml --output current/

# Compare with baseline
swarm-bench compare baseline/results.json current/results.json --format json --output comparison.json

# Check for regression (>10% performance decrease)
python scripts/check-regression.py comparison.json --threshold 0.10

if [ $? -ne 0 ]; then
    echo "Performance regression detected!"
    exit 1
fi
```
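The `check-regression.py` script invoked above is not shown in this guide. Below is a minimal sketch of what it could look like, assuming a hypothetical `comparison.json` schema in which each metric reports a `relative_change` value normalized so that negative means worse than baseline.

```python
#!/usr/bin/env python3
# scripts/check-regression.py -- illustrative sketch only.
# Assumes a hypothetical comparison.json schema of the form:
#   {"metrics": {"success_rate": {"relative_change": -0.04}, ...}}
# where relative_change is normalized so negative = regression,
# regardless of whether the raw metric is higher-is-better.
import argparse
import json
import sys

parser = argparse.ArgumentParser()
parser.add_argument("comparison_file")
parser.add_argument("--threshold", type=float, default=0.10)
args = parser.parse_args()

with open(args.comparison_file) as f:
    comparison = json.load(f)

regressions = []
for name, data in comparison.get("metrics", {}).items():
    change = data.get("relative_change", 0.0)
    if change < -args.threshold:
        regressions.append(f"{name}: {change:+.1%}")

if regressions:
    print("Regressions exceeding threshold:")
    print("\n".join(regressions))
    sys.exit(1)  # Non-zero exit fails the CI gate in the shell script above

print("No regressions beyond threshold.")
```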
```python
# custom_tasks.py
from benchmark.src.swarm_benchmark.core.task_generator import TaskGenerator

class CustomTaskGenerator(TaskGenerator):
    def generate_coding_task(self, difficulty="medium"):
        return {
            "type": "coding",
            "difficulty": difficulty,
            "description": "Implement a REST API endpoint",
            "requirements": [
                "Create /api/users endpoint",
                "Support GET, POST, PUT, DELETE",
                "Include input validation",
                "Add error handling",
            ],
            "expected_duration": 300,
            "validation_criteria": [
                "Code compiles successfully",
                "All endpoints return correct responses",
                "Error cases handled appropriately",
            ],
        }
```
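Standalone use of this generator might look like the following, assuming `TaskGenerator` can be constructed without arguments (how custom generators are registered with the benchmark runner is not covered here):

```python
# Exercise the custom generator directly (illustrative; assumes a
# no-argument constructor on the TaskGenerator base class).
generator = CustomTaskGenerator()
task = generator.generate_coding_task(difficulty="hard")
print(task["description"], "-", task["difficulty"])
# Implement a REST API endpoint - hard
```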
```yaml
# distributed-benchmark.yaml
distributed:
  coordinator_node: "benchmark-controller"
  worker_nodes:
    - "worker-1"
    - "worker-2"
    - "worker-3"

  load_distribution:
    strategy: "round_robin"
    tasks_per_node: 25

  synchronization:
    barrier_timeout: 300
    result_collection_timeout: 600
```
```bash
# Start real-time monitoring dashboard
swarm-bench monitor --dashboard --port 8080

# Access at http://localhost:8080 to see:
# - Live performance metrics
# - Agent status and coordination
# - Task completion rates
# - Resource utilization graphs
```
```json
{
  "benchmark_id": "perf-test-2025-01-07",
  "timestamp": "2025-01-07T16:30:00Z",
  "configuration": {
    "strategy": "optimization",
    "mode": "mesh",
    "agents": 8,
    "tasks": 100
  },
  "results": {
    "success_rate": 0.87,
    "average_duration": 245.3,
    "throughput": 14.6,
    "total_duration": 4107.2
  },
  "agent_performance": {
    "utilization": 0.73,
    "coordination_overhead": 0.12,
    "idle_time": 0.15
  },
  "resource_usage": {
    "peak_memory_mb": 1024,
    "average_cpu_percent": 65,
    "network_bytes": 2048576
  }
}
```
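A result file in this format can be checked against the targets listed earlier in this guide. The sketch below is illustrative only; the file path is hypothetical, and the thresholds mirror the targets from the metrics section.

```python
# Validate a benchmark result file against the targets from this guide.
# Illustrative sketch; field names follow the sample JSON above,
# and the file path is hypothetical.
import json

TARGETS = {
    "success_rate": 0.85,  # >85% for production workloads
    "utilization": 0.70,   # >70% agent utilization
}

with open("benchmark/results/perf-test-2025-01-07.json") as f:
    report = json.load(f)

checks = {
    "success_rate": report["results"]["success_rate"],
    "utilization": report["agent_performance"]["utilization"],
}

for name, value in checks.items():
    status = "OK" if value >= TARGETS[name] else "BELOW TARGET"
    print(f"{name}: {value:.2f} ({status})")
```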
### High Success Rate + High Duration

- Consider adding more agents
- Check for inefficient coordination
- Optimize task algorithms

### Low Success Rate + Normal Duration

- Review error patterns
- Improve error handling
- Validate input data quality

### Good Performance + High Resource Usage

- Optimize memory allocation
- Reduce coordination overhead
- Consider lighter coordination modes
### Low Throughput

```bash
# Diagnose throughput issues
swarm-bench diagnose --metric throughput

# Common causes:
# - Insufficient agent count
# - Coordination bottlenecks
# - Resource constraints
# - Network latency (distributed mode)
```
### High Error Rate

```bash
# Analyze error patterns
swarm-bench analyze-errors --input results/benchmark.json

# Check for:
# - Timeout issues (increase task timeout)
# - Resource exhaustion (reduce concurrent agents)
# - Coordination failures (check network stability)
```
### Memory Issues

```bash
# Monitor memory usage during benchmarks
swarm-bench run --monitor-memory --profile-memory

# Optimize based on results:
# - Reduce agent count if memory constrained
# - Implement memory cleanup strategies
# - Use streaming for large data processing
```
```bash
# Enable debug logging
export LOG_LEVEL=DEBUG
swarm-bench run --debug --agents 3 --tasks 10

# Trace individual agent actions
swarm-bench run --trace-agents --output-traces traces/
```
- **Establish Baselines**: Run benchmarks on known configurations
- **Control Variables**: Change one parameter at a time
- **Statistical Significance**: Run multiple iterations
- **Environment Consistency**: Use identical hardware and software
- **Realistic Workloads**: Use representative task types

- **Trend Analysis**: Look for patterns over time
- **Outlier Investigation**: Understand extreme values
- **Cross-Validation**: Verify results with different approaches
- **Business Context**: Relate metrics to business objectives

- **Regular Benchmarking**: Automated performance testing
- **Performance Goals**: Set and track specific targets
- **Regression Prevention**: Detect performance degradation early
- **Optimization Cycles**: Regular performance tuning

- **Performance Dashboard**: Real-time monitoring interface
- **Comparison Tools**: Side-by-side result analysis
- **Report Generators**: Automated HTML/PDF reports
- **Data Export**: CSV/JSON export for external analysis

- **Grafana Integration**: Custom dashboards for metrics
- **Prometheus Metrics**: Time-series performance data
- **CI/CD Integration**: Automated performance gates
- **Slack Notifications**: Performance alerts and reports
---
*Last updated: January 2025 · Version: Alpha-88*