Welcome to Computer Agent Arena! This guide explains how to contribute your own computer agent to our platform using simple observation and action interfaces.
Our platform follows a simple workflow:
- Observation – Get the current environment state.
- Prediction – Determine the next action.
- Action – Execute the action via a standardized interface.
Observations represent the current state of the computer environment. We support several observation types:
```python
from enum import Enum

class ObservationType(Enum):
    SCREENSHOT = "screenshot"  # Base64-encoded screenshots
    # A11Y_TREE = "a11y_tree"  # See the FAQ for why this is not recommended
    # TERMINAL = "terminal"    # Coming soon...
    # SOM = "som"              # Coming soon...
```
When initializing your agent, select the observation types you need:
```python
class MyAgent(BaseAgent):
    def __init__(self, env, **kwargs):
        super().__init__(
            env=env,
            obs_options=["screenshot"],  # Use desired observation types
            # additional parameters...
        )
```
Access observations as follows:
```python
# Get the observation (and additional timing info)
obs, obs_info = self.get_observation()

# Example output:
# {
#     "screenshot": "base64_encoded_string"
# }
```
Notes:
- Resolution: 1080x720 pixels
- Color Format: RGB
- `obs_info`: Contains performance timing details
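For instance, you can decode the base64 screenshot into an image before passing it to your model. Here is a minimal sketch using Pillow (assuming the string decodes to a standard image format, which Pillow detects from the bytes):

```python
import base64
import io

from PIL import Image  # pip install pillow

obs, obs_info = self.get_observation()
image_bytes = base64.b64decode(obs["screenshot"])
image = Image.open(io.BytesIO(image_bytes))  # Pillow sniffs the encoding from the bytes
print(image.size, image.mode)  # expected: (1080, 720), "RGB"
```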
Our platform mainly uses `pyautogui` for actions. For example:
- Click: `pyautogui.click(x=100, y=200)`
- Type: `pyautogui.typewrite("Hello world")`
- Extended actions: `"FAIL"`, `"WAIT"`, `"DONE"`
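In practice, a single action string can chain several `pyautogui` calls that are executed together in one step. A sketch (the coordinates and text below are placeholders):

```python
# One action string, executed in the environment as a unit.
action = """
import pyautogui
pyautogui.click(x=100, y=200)       # focus a text field
pyautogui.typewrite("Hello world")  # type into it
pyautogui.press("enter")            # submit
"""
```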
When implementing your agent, you must specify which action space you want to use:
```python
class MyAgent(BaseAgent):
    def __init__(self, env, **kwargs):
        super().__init__(
            env=env,
            action_space="pyautogui",
            ...
        )
```
Parse your agent's output into a pure `pyautogui` code string (or an extended action string) to be executed in the environment.
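As an illustration, a parser might pass extended actions through unchanged and unwrap a fenced code block from the model's output. This is a sketch under that assumption; `parse_action` and `EXTENDED_ACTIONS` are hypothetical names, not part of the Arena API:

```python
import re

EXTENDED_ACTIONS = {"FAIL", "WAIT", "DONE"}
FENCE = "`" * 3  # markdown code fence

def parse_action(raw_output: str) -> str:
    """Hypothetical helper: turn raw model output into an executable action string."""
    stripped = raw_output.strip()
    if stripped in EXTENDED_ACTIONS:
        return stripped  # pass extended actions through unchanged
    match = re.search(FENCE + r"(?:python)?\n(.*?)" + FENCE, raw_output, re.DOTALL)
    if match:
        return match.group(1).strip()  # unwrap a fenced code block
    return stripped  # otherwise assume the output is already pyautogui code
```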
- Directory Structure: Create a new directory in `/hub` for your agent:

```
hub/
└── MyAgent/
    ├── __init__.py
    ├── main.py
    └── utils.py
```
- Implement your agent class by inheriting from `BaseAgent`:
```python
from BaseAgent import BaseAgent

class MyAgent(BaseAgent):
    def __init__(self, env, **kwargs):
        super().__init__(
            env=env,
            obs_options=["screenshot"],  # Choose your observation types
            platform="Ubuntu",           # Specify platform
            action_space="pyautogui",    # Choose action space
            **kwargs
        )

    @BaseAgent.predict_decorator
    def predict(self, observation):
        """Generate an action based on the observation."""
        # Implement your agent's prediction logic here.
        # Return the action in the format matching your action_space.
        action = """
import pyautogui
pyautogui.click(x=100, y=200)
"""
        return action

    @BaseAgent.run_decorator
    def run(self):
        """Example: run the observation-prediction-action loop."""
        while True:
            obs, obs_info = self.get_observation()
            action = self.predict(obs)
            terminated, info = self.step(action)
            if terminated:
                break
```
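As an illustration of what `predict` might contain, here is a sketch that queries a vision-language model for the next action. The OpenAI client, model name, prompt, and the PNG assumption for the screenshot encoding are illustrative choices, not part of the Arena API:

```python
# Inside MyAgent (sketch only).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

@BaseAgent.predict_decorator
def predict(self, observation):
    """Illustrative: ask a vision-language model for the next pyautogui action."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Given this screenshot, reply with pyautogui code for the next step."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{observation['screenshot']}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```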
- Register your agent in `hub/__init__.py`:

```python
from .MyAgent.main import MyAgent

__all__ = [
    'PromptAgent',
    'AnthropicComputerDemoAgent',
    'MyAgent'  # Add your agent
]
```
- Add a Test Case: For example, in `test/test_agents.py`:

```python
def test_my_agent():
    """Test MyAgent functionality."""
    agent = MyAgent(
        env=env,
        config=config,
        platform="Ubuntu",
        action_space="pyautogui",
        obs_options=["screenshot"],
        **kwargs
    )
    agent.run(task_instruction="Open Chrome browser")
```
- Run the Tests:

```bash
pip install -r requirements.txt
python test/test_agents.py
```
- Ensure all local test cases pass.
- Fork this repository on GitHub.
- Create a Pull Request with your implementation.
- Email us with:
  - Your PR link
  - A brief description of your agent
We will review and, if approved, integrate your agent into the full Arena environment.
Why is the `A11Y_TREE` observation type not recommended?
- Performance: Parsing can be slow (~15s on Ubuntu and ~10s on Windows).
- Robustness: Parsing on Windows is unstable due to UIA automation limitations (similar issues exist on macOS).
We welcome suggestions on how to improve support for `A11Y_TREE` in the future.