
Commit e1088ec

Update readme
1 parent 972d972 commit e1088ec

1 file changed (+9, -9 lines)


docs/visual_actions_comparison.md (+9, -9)
@@ -1,4 +1,4 @@
-# Visual Actions Comparison: GroundingDINO+EasyOCR vs YOLO-based Detection
+# Visual Actions Comparison: GroundingDINO+EasyOCR vs OmniParser Detection

 This document provides a detailed comparison between the two visual action approaches used in the Crab framework for GUI element detection and interaction.

@@ -11,7 +11,7 @@ This document provides a detailed comparison between the two visual action appro
 - [EasyOCR](https://github.com/JaidedAI/EasyOCR) for text recognition
 - **Primary Use Case**: General-purpose object detection with text recognition

-### New Approach (YOLO-based Detection)
+### New Approach (OmniParser Detection)
 - **Implementation**: Located in `crab/actions/omniparser_visual_actions.py`
 - **Core Technologies**:
   - Custom YOLO model optimized for GUI element detection
@@ -22,7 +22,7 @@ This document provides a detailed comparison between the two visual action appro

 ### 1. Model Architecture

-| Aspect | Legacy Approach | YOLO-based Detection |
+| Aspect | Legacy Approach | OmniParser Detection |
 |--------|----------------|------------|
 | Architecture | Transformer-based (GroundingDINO)<br>+ Separate OCR model | Single YOLO model<br>+ Configurable OCR |
 | Model Size | ~1.5GB combined | ~50MB (YOLO)<br>+ ~250MB (OCR) |
@@ -31,7 +31,7 @@ This document provides a detailed comparison between the two visual action appro

 ### 2. Advanced Capabilities

-| Capability | Legacy Approach | YOLO-based Detection |
+| Capability | Legacy Approach | OmniParser Detection |
 |------------|----------------|---------------------------|
 | **OCR** | EasyOCR only | PaddleOCR and EasyOCR |
 | **Caption Generation** | Basic element labels | Basic element labels |
@@ -42,7 +42,7 @@ This document provides a detailed comparison between the two visual action appro

 ### 3. Performance Metrics

-| Metric | Legacy Approach | YOLO-based Detection |
+| Metric | Legacy Approach | OmniParser Detection |
 |--------|----------------|------------|
 | Total Processing Time | 3-5s per image | 0.8-1.5s per image |
 | Object Detection Time | 2-3s | 0.5-1s |
@@ -64,7 +64,7 @@ This document provides a detailed comparison between the two visual action appro
 - No fast processing
 - No confidence scores

-#### YOLO-based Detection
+#### OmniParser Detection
 - Fast GUI element detection
 - Confidence scores
 - Low resource usage
@@ -211,7 +211,7 @@ final_image, prompt = get_elements_prompt(
 ).run()
 ```

-### YOLO-based Detection
+### OmniParser Detection
 ```python
 from crab.actions.omniparser_visual_actions import detect_and_annotate_gui_elements

@@ -247,12 +247,12 @@ The comparison tests evaluate:

 ## Conclusion

-The YOLO-based detection now offers a complete alternative to the legacy approach:
+The OmniParser Detection now offers a complete alternative to the legacy approach:
 - Faster processing times (2-3x speedup)
 - Smaller core model size (30x smaller)
 - Choice of OCR engines
 - Better GUI element detection accuracy
 - Enhanced box filtering with OCR awareness
 - Confidence-based classification

-While the legacy approach still has some unique capabilities (multi-image processing, general object detection), the YOLO-based approach provides a more efficient and specialized solution for GUI automation tasks. Future improvements will focus on adding multi-image support and enhancing semantic understanding using OmniParser's capabilities.
+While the legacy approach still has some unique capabilities (multi-image processing, general object detection), the OmniParser Detection provides a more efficient and specialized solution for GUI automation tasks. Future improvements will focus on adding multi-image support and enhancing semantic understanding using OmniParser's capabilities.
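The quoted hunk for the renamed "OmniParser Detection" usage section cuts off after the import of `detect_and_annotate_gui_elements`. As a reading aid, here is a minimal sketch of how that action might be invoked, assuming it follows the same chained `.run()` pattern as the legacy `get_elements_prompt` example quoted above; the parameter name and return values are illustrative assumptions, not the confirmed signature from `omniparser_visual_actions.py`.

```python
# Minimal sketch (assumed API), mirroring the legacy get_elements_prompt(...).run()
# pattern shown in the document. Only the import below appears in the diff; the
# parameter name `image` and the (annotated_image, elements) return shape are
# illustrative assumptions. Consult crab/actions/omniparser_visual_actions.py
# for the actual signature.
from crab.actions.omniparser_visual_actions import detect_and_annotate_gui_elements

def annotate(image):
    # Chain the action and execute it, as the legacy example does with .run().
    annotated_image, elements = detect_and_annotate_gui_elements(image).run()
    return annotated_image, elements
```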
