PromptOps is a local-first LLM automation system that mimics how a human operates a computer: through reasoning, vision, and keystrokes. It interprets natural-language goals and executes them the way a real user would, using keyboard input and visual feedback to interact with applications.
Given a natural-language prompt, PromptOps plans the necessary steps and simulates human-like actions (typing, scrolling, reading screen content) to carry out the task on the desktop autonomously.
We used Python for the core logic, integrating PyAutoGUI/pynput for UI simulation and Gemini for LLM reasoning. The system includes a planner, a skill execution engine, and a vision layer that parses screen content to guide decisions.
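As a rough illustration of that UI-simulation layer, the sketch below shows keyboard-only action primitives built on PyAutoGUI. The function names are illustrative, not the actual PromptOps API:

```python
# Minimal sketch of keyboard-only action primitives built on PyAutoGUI.
# Function names are illustrative; they are not the actual PromptOps API.
import pyautogui

pyautogui.PAUSE = 0.2  # small delay between actions so the UI can keep up


def open_app_via_start_menu(app_name: str) -> None:
    """Launch an application the way a user would: Win key, type, Enter."""
    pyautogui.press("win")
    pyautogui.write(app_name, interval=0.05)
    pyautogui.press("enter")


def type_text(text: str) -> None:
    """Type text into the currently focused window."""
    pyautogui.write(text, interval=0.03)


def scroll_down(steps: int = 3) -> None:
    """Scroll with Page Down instead of the mouse wheel."""
    for _ in range(steps):
        pyautogui.press("pagedown")


if __name__ == "__main__":
    open_app_via_start_menu("notepad")
    type_text("Hello from a keyboard-only agent.")
```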
- Reliable UI control without clicking
- Parsing dynamic screen content contextually
- Balancing flexibility with deterministic execution
- Designing prompt interpretation without rigid skill trees
- A modular LLM-agent pipeline with screen-grounded actions
- Local-first design with no external APIs required
- Real-time execution based on visible UI context
- Planner that adapts actions based on outcomes
- LLMs can simulate goal-directed human behavior when grounded in visual input
- Skill-based design is brittle early on; prompt-based planning is more flexible
- Abstracting actions into reusable modules improves maintainability and growth potential
- Add support for dynamic skill generation using LLMs
- Integrate full vision-based UI navigation
- Build memory and long-term goal management
- Extend to goal-based software creation from prompts
- main.py: Entry point that loads model and initializes all agents and controller
- PlannerAgent: Converts user prompt into a structured plan (dict of steps)
- EvaluatorAgent: Validates execution outcomes and identifies failures
- FixerAgent: Attempts to replan or fix issues if execution fails
- ClarifierAgent: Requests clarification from the user if the prompt is ambiguous
- VisionAgent: Takes screenshots and interprets screen state using an LLM vision analyzer
- Memory: Tracks plan steps, history, and prior context
- Controller: Central executor coordinating planner, vision, and evaluator to run the task (see the sketch after this list)
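A simplified sketch of how the Controller might coordinate these agents is shown below. The class layout follows the component list above, but the method names and interfaces are assumptions rather than the actual code:

```python
# Simplified sketch of the Controller loop coordinating planner, vision,
# evaluator, and fixer. Interfaces are assumptions for illustration only.
from typing import Any, Dict


class Controller:
    def __init__(self, planner, vision, evaluator, fixer, memory, max_retries: int = 2):
        self.planner = planner
        self.vision = vision
        self.evaluator = evaluator
        self.fixer = fixer
        self.memory = memory
        self.max_retries = max_retries

    def run(self, prompt: str) -> None:
        # PlannerAgent turns the prompt into a structured plan (dict of steps).
        plan: Dict[str, Any] = self.planner.plan(prompt)
        self.memory.store_plan(plan)

        for step in plan["steps"]:
            for _attempt in range(self.max_retries + 1):
                self._execute(step)                      # keyboard-only action
                screen = self.vision.describe_screen()   # screenshot + LLM read
                verdict = self.evaluator.check(step, screen)
                self.memory.record(step, verdict)
                if verdict["success"]:
                    break
                # FixerAgent replans or patches the failing step.
                step = self.fixer.fix(step, verdict, screen)

    def _execute(self, step: Dict[str, Any]) -> None:
        ...  # dispatch to keyboard skills (typing, hotkeys, scrolling)
```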
- "Search Google for latest tech news"
- "Write a three-line summary in Notepad"
- "List all files in Downloads folder via terminal"
All of these are executed via reasoning + keyboard, without app-specific UI automation hooks or APIs.
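For illustration, the structured plan the PlannerAgent produces for the first prompt might look roughly like this; the schema and action names are assumptions, not the real plan format:

```python
# Hypothetical plan for "Search Google for latest tech news".
# The schema is illustrative; the real plan format may differ.
plan = {
    "goal": "Search Google for latest tech news",
    "steps": [
        {"action": "open_app", "target": "browser"},            # Win key + type + Enter
        {"action": "hotkey", "keys": ["ctrl", "l"]},             # focus the address bar
        {"action": "type", "text": "latest tech news"},
        {"action": "press", "key": "enter"},
        {"action": "read_screen", "expect": "search results"},  # VisionAgent verification
    ],
}
```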
- Doesn’t rely on hardcoded scripts, XPath selectors, or app-specific APIs
- Doesn’t use robotic mouse control—fully keyboard driven
- Uses vision as a feedback mechanism to emulate human perception (see the sketch below)
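The vision feedback step could look roughly like the sketch below, which captures the screen with PyAutoGUI and asks a Gemini vision model to describe it. The model name, prompt wording, and helper name are assumptions:

```python
# Sketch of the vision feedback step: capture the screen and ask an LLM
# what is visible. Uses the google-generativeai client; the model name and
# prompt wording are assumptions for illustration.
import os

import google.generativeai as genai
import pyautogui

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")


def describe_screen() -> str:
    screenshot = pyautogui.screenshot()  # returns a PIL Image
    response = model.generate_content(
        [
            "Describe the visible application, the focused element, and any "
            "text relevant to the current task. Be concise.",
            screenshot,
        ]
    )
    return response.text
```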
- Python (core logic)
- OpenAI Vision or Gemini Vision (LLM-based screen reading)
- PyAutoGUI / pynput (keyboard control)
- FastAPI for optional API hooks (example below)
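A minimal sketch of such an optional API hook, assuming a hypothetical `/tasks` endpoint and payload shape:

```python
# Optional API hook: expose task submission over HTTP with FastAPI.
# The endpoint path and payload shape are hypothetical.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class TaskRequest(BaseModel):
    prompt: str


@app.post("/tasks")
def submit_task(req: TaskRequest) -> dict:
    # In the real system this would hand the prompt to the Controller;
    # here we just echo it back.
    return {"status": "accepted", "prompt": req.prompt}

# Run with: uvicorn api:app --reload
```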
- Full multi-agent loop (planner, executor, evaluator, fixer)
- File system awareness + context memory
- Human-like web browsing & data extraction
- Task persistence + retry logs (see the sketch below)
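One way task persistence and retry logs could be kept is an append-only JSONL file, as in the sketch below; the file name and record fields are assumptions:

```python
# Sketch of task persistence: append one JSON record per attempt so a run
# can be resumed or audited later. File name and fields are assumptions.
import json
import time
from pathlib import Path

LOG_PATH = Path("promptops_runs.jsonl")


def log_attempt(step: dict, success: bool, detail: str = "") -> None:
    record = {
        "timestamp": time.time(),
        "step": step,
        "success": success,
        "detail": detail,
    }
    with LOG_PATH.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```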
MIT License