PeriGuru - A Peripheral Robotic Mobile App Operation Assistant based on GUI Image Understanding and Prompting with LLM

Overview

Smartphones have become an integral part of daily life, serving as tools for reading, learning, socializing, shopping, and entertainment. However, some users, including seniors and people with visual impairments, find mobile apps difficult to operate. This creates a need for mobile app operation assistants, also known as mobile app agents.

With privacy, permission, and cross-platform compatibility concerns in mind, we designed and built PeriGuru, a peripheral robotic mobile app operation assistant. PeriGuru applies a suite of computer vision (CV) techniques to analyze GUI screenshot images and prompts an LLM to decide on actions, which a robotic arm then executes.

Open Source Credits

This project makes use of the following open-source projects, for which we are grateful:

Name	License	Link
AppAgent	MIT license	https://github.com/mnotgod96/AppAgent
BGEM3	MIT license	https://hf-mirror.com/BAAI/bge-m3/tree/main
GUI-Perceptual-Grouping	Apache-2.0 license	https://github.dev/MulongXie/GUI-Perceptual-Grouping
LabelDroid	GPL-3.0 license	https://github.com/chenjshnn/LabelDroid
UIED	Apache-2.0 license	https://github.com/MulongXie/UIED
YOLOv5	AGPL-3.0 license	https://github.com/ultralytics/yolov5

Installation and Usage Instructions

The code was tested on Ubuntu 20.04, with Python 3.8.10, PyTorch 2.2.1, and torchvision 0.17.1.
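Before going further, it can help to see how your environment compares with the tested one. The following is a minimal sketch, not part of the repository: the helper names `installed_version` and `compare_environment` are hypothetical, and only the version numbers come from the line above.

```python
import sys
from importlib import metadata

# Versions quoted above as the tested environment.
TESTED = {"torch": "2.2.1", "torchvision": "0.17.1"}
TESTED_PYTHON = (3, 8, 10)


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def compare_environment(tested=TESTED):
    """Map each package name to (tested_version, installed_version_or_None)."""
    return {name: (want, installed_version(name)) for name, want in tested.items()}


if __name__ == "__main__":
    print("python: tested", TESTED_PYTHON, "installed", tuple(sys.version_info[:3]))
    for name, (want, have) in compare_environment().items():
        # Drop a local build suffix such as "+cu121" before comparing.
        same = have is not None and have.split("+")[0] == want
        print(f"{name}: tested {want}, installed {have} ({'ok' if same else 'differs'})")
```

A mismatch does not necessarily mean the code will fail; it only tells you that you are outside the configuration the authors tested.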

Clone this repository.

git clone https://github.com/Z2sJ4t/PeriGuru.git
cd PeriGuru

Install the requirements.

pip install -r requirements.txt

Install third-party repositories.

Please install the third-party repositories and models in the third_party folder, which includes BGEM3, LabelDroid, and YOLOv5.

Set your API Key.

Please replace the OpenAI API key at task_executor/LLM/LLM_agent.py line 8 with your own. If you need to use other LLM models, you can make modifications in task_executor/LLM/model.py.
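If you would rather not store a key in a source file, one option is to read it from an environment variable and assign the result where the key is currently hard-coded. This is a sketch under assumptions: `load_api_key` is a hypothetical helper, `OPENAI_API_KEY` is a common convention rather than something this repository is known to read, and how LLM_agent.py stores the key is not shown here.

```python
import os


def load_api_key(env_var="OPENAI_API_KEY", environ=None):
    """Return the API key stored in `env_var`, raising if it is missing or blank."""
    # `environ` is injectable so the function can be tested without touching os.environ.
    environ = os.environ if environ is None else environ
    key = environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running.")
    return key
```

With this in place, you would export the variable in your shell before starting main.py, and the key stays out of version control.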

The OCR model is configured in GUI/UIED/text/ocr_method. You can use the Baidu API example in GUI/UIED/text/ocr_method/baidu.py, replacing the API key on line 9 with your own, or use it as a reference when writing a calling interface for another OCR API.

Configure your robotic arm.

The robotic arm used in PeriGuru's testing is the Yahboom DOFBOT SE. Robot motion is configured in robot_movement/robot.py. You can modify this file to fit your own robotic arm.
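When adapting to a different arm, one piece you will likely need is a mapping from a point on the phone screen to a position the arm can reach. The sketch below is purely illustrative and does not reflect the contents of robot_movement/robot.py: the class `ScreenCalibration`, its fields, and the assumption that the screen edges are parallel to the arm's x and y axes are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScreenCalibration:
    """Hypothetical calibration: where the phone screen sits in the arm frame (mm)."""
    left_mm: float    # arm-frame x of the screen's top-left corner
    top_mm: float     # arm-frame y of the screen's top-left corner
    width_mm: float   # physical screen width along the arm's x axis
    height_mm: float  # physical screen height along the arm's y axis

    def to_arm_xy(self, u, v):
        """Map normalized screen coords (u, v in [0, 1], origin top-left) to arm-frame mm."""
        if not (0.0 <= u <= 1.0 and 0.0 <= v <= 1.0):
            raise ValueError("u and v must lie in [0, 1]")
        return (self.left_mm + u * self.width_mm, self.top_mm + v * self.height_mm)


# Example: a 70 mm x 150 mm screen whose top-left corner is at (100, -35) mm.
calibration = ScreenCalibration(left_mm=100.0, top_mm=-35.0, width_mm=70.0, height_mm=150.0)
center = calibration.to_arm_xy(0.5, 0.5)  # screen center in arm coordinates
```

A rotated or tilted phone would need a fuller transform (for example a homography estimated from the camera image); the linear map here only covers the simplest aligned setup.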

Folder structure

camera/

Configure the camera and obtain screenshots.

GUI/

Identify UI elements and layout.

task_executor/

Generate action strategies to complete tasks.

robot_movement/

Guide the movement of the robotic arm and execute actions.
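The four folders correspond to four stages that run in a loop. The sketch below shows one way such a loop could be wired together; it is not the code in main.py. Every name in it (`UIElement`, `Action`, `run_task`, and the `"tap"`/`"done"` action kinds) is hypothetical, and the stages are passed in as plain callables so the example runs without a camera, an LLM, or an arm.

```python
from dataclasses import dataclass


@dataclass
class UIElement:
    label: str  # text or caption describing the element
    x: int      # center x in screenshot pixels
    y: int      # center y in screenshot pixels


@dataclass
class Action:
    kind: str   # illustrative action names: "tap" or "done"
    x: int = 0
    y: int = 0


def run_task(task, capture, detect, decide, execute, max_steps=10):
    """Repeat capture -> detect -> decide -> execute until the decider returns "done"."""
    history = []
    for _ in range(max_steps):
        screenshot = capture()                    # camera/: obtain a screenshot
        elements = detect(screenshot)             # GUI/: identify UI elements
        action = decide(task, elements, history)  # task_executor/: choose the next action
        history.append(action)
        if action.kind == "done":
            break
        execute(action)                           # robot_movement/: perform the action
    return history


# Stub stages standing in for the real modules.
stub_elements = [UIElement("Search", 120, 40)]


def stub_decide(task, elements, history):
    """Tap the first element once, then report completion."""
    if history:
        return Action("done")
    return Action("tap", elements[0].x, elements[0].y)


executed = []
history = run_task("open search", lambda: "frame", lambda shot: stub_elements,
                   stub_decide, executed.append)
```

Passing the stages in as arguments keeps each folder's responsibility separate, which also makes it straightforward to swap in a different arm or a different LLM backend.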
