A Python-based document region analyzer and content extraction tool.
Warning
The project is under active development, so most features are not yet implemented. This README is written to convey the project's intended scope.
Docproc is an opinionated document region analyzer that helps extract text, equations, images and handwriting from documents. It provides both a library interface and a command-line tool.
flowchart TD
%% User Interface Layer
subgraph "User Interface"
UI["User Input"]:::cli
CLI["CLI"]:::cli
end
UI -->|"initiates"| CLI
%% Core Processing Layer
subgraph "Core Processing"
DA["Document Analyzer"]:::core
ED["Equations Detector"]:::core
HD["Handwriting Detector"]:::core
RD["Regions Detector"]:::core
end
CLI -->|"processes"| DA
DA -->|"detects"| ED
DA -->|"detects"| HD
DA -->|"detects"| RD
%% Output Generation Layer
subgraph "Output Generation"
CSV["CSV Writer"]:::writer
JSON["JSON Writer"]:::writer
SQLITE["SQLite Writer"]:::writer
FILE["Generic File Writer"]:::writer
end
DA -->|"exports"| CSV
DA -->|"exports"| JSON
DA -->|"exports"| SQLITE
DA -->|"exports"| FILE
%% Environment & Testing Layer
subgraph "Environment & Testing"
DE1["pyproject.toml"]:::env
DE2["shell.nix"]:::env
TS["Test Suite"]:::test
CI[".github Directory"]:::env
end
DE1 -.->|"env"| CLI
DE2 -.->|"env"| CLI
CI -.->|"CI"| CLI
TS -.->|"tests"| DA
%% Styles
classDef cli fill:#ADD8E6,stroke:#000,stroke-width:1px;
classDef core fill:#90EE90,stroke:#000,stroke-width:1px;
classDef writer fill:#FFD700,stroke:#000,stroke-width:1px;
classDef env fill:#D3D3D3,stroke:#000,stroke-width:1px;
classDef test fill:#FFB6C1,stroke:#000,stroke-width:1px;
%% Click Events
click CLI "https://github.com/rithulkamesh/docproc/blob/main/docproc/bin/cli.py"
click DA "https://github.com/rithulkamesh/docproc/blob/main/docproc/doc/analyzer.py"
click ED "https://github.com/rithulkamesh/docproc/blob/main/docproc/doc/equations.py"
click HD "https://github.com/rithulkamesh/docproc/blob/main/docproc/doc/handwriting.py"
click RD "https://github.com/rithulkamesh/docproc/blob/main/docproc/doc/regions.py"
click CSV "https://github.com/rithulkamesh/docproc/blob/main/docproc/writer/csv.py"
click JSON "https://github.com/rithulkamesh/docproc/blob/main/docproc/writer/json.py"
click SQLITE "https://github.com/rithulkamesh/docproc/blob/main/docproc/writer/sqlite.py"
click FILE "https://github.com/rithulkamesh/docproc/blob/main/docproc/writer/filewriter.py"
click DE1 "https://github.com/rithulkamesh/docproc/blob/main/pyproject.toml"
click DE2 "https://github.com/rithulkamesh/docproc/blob/main/shell.nix"
click TS "https://github.com/rithulkamesh/docproc/tree/main/tests/"
click CI "https://github.com/rithulkamesh/docproc/tree/main/.github"
This diagram was generated by GitDiagram; a shoutout to that project.
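The data flow in the diagram (detectors produce regions, writers serialize them) can be illustrated with a minimal sketch. It is not docproc's actual code: the `Region` dataclass, its fields, and the `to_csv`/`to_json` helpers are hypothetical names chosen for illustration, and only the Python standard library is used.

```python
import csv
import io
import json
from dataclasses import asdict, dataclass


@dataclass
class Region:
    """Hypothetical record for one detected region (not docproc's real type)."""
    page: int
    kind: str      # e.g. "text", "equation", "image"
    content: str


def to_json(regions):
    """Serialize regions as a JSON array of objects."""
    return json.dumps([asdict(region) for region in regions], indent=2)


def to_csv(regions):
    """Serialize regions as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["page", "kind", "content"], lineterminator="\n"
    )
    writer.writeheader()
    for region in regions:
        writer.writerow(asdict(region))
    return buffer.getvalue()


# Stand-in for what the detectors might hand to a writer.
regions = [Region(1, "text", "Introduction"), Region(1, "equation", "E = mc^2")]
csv_text = to_csv(regions)
json_text = to_json(regions)
```

Keeping the region record independent of the output format is one way to let each writer stay small; whether docproc structures its writers this way is an assumption.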
# Using pip
pip install docproc
# Basic usage
docproc input.pdf
# Specify output format and file
docproc input.pdf -w csv -o output.csv
docproc input.pdf -w sqlite -o database.db
docproc input.pdf -w json -o output.json
# Extract only specific region types
docproc input.pdf --regions text equation
docproc input.pdf -r text image # Short form
# Enable verbose logging
docproc input.pdf -v
Supported output formats:
- CSV (default)
- SQLite
- JSON
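When the CLI is driven from a script, the command line can be assembled programmatically. The sketch below uses only the flags shown above (`-w`, `-o`, `--regions`, `-v`); the helper name `build_docproc_command` is hypothetical and not part of docproc.

```python
import shlex


def build_docproc_command(input_path, writer=None, output=None,
                          regions=None, verbose=False):
    """Build an argument list for the docproc CLI from the documented flags."""
    command = ["docproc", input_path]
    if writer:
        command += ["-w", writer]          # output format: csv, sqlite or json
    if output:
        command += ["-o", output]          # output file path
    if regions:
        command += ["--regions", *regions]  # restrict to these region types
    if verbose:
        command.append("-v")               # verbose logging
    return command


command = build_docproc_command(
    "input.pdf", writer="json", output="output.json",
    regions=["text", "equation"],
)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` once docproc is installed; passing a list rather than a single string avoids shell-quoting problems with file names that contain spaces.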
from docproc.doc.analyzer import DocumentAnalyzer
from docproc.writer import CSVWriter
# Using context manager (recommended)
with DocumentAnalyzer("input.pdf", CSVWriter, output_path="output.csv") as analyzer:
    regions = analyzer.detect_regions()
    analyzer.export_regions()
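As a rough illustration of why a context manager is a sensible default here, the sketch below shows the general pattern: a resource opened on entry is closed on exit, even if processing fails midway. `SketchWriter` is a hypothetical stand-in built on the standard library; it does not reflect how `DocumentAnalyzer` or `CSVWriter` are actually implemented.

```python
import csv
import os
import tempfile


class SketchWriter:
    """Hypothetical writer: opens a CSV file on entry and closes it on exit."""

    def __init__(self, output_path):
        self.output_path = output_path
        self._file = None
        self._csv = None

    def __enter__(self):
        self._file = open(self.output_path, "w", newline="", encoding="utf-8")
        self._csv = csv.writer(self._file)
        return self

    def write(self, row):
        self._csv.writerow(row)

    def __exit__(self, exc_type, exc, tb):
        # Runs even if the body of the `with` block raised.
        self._file.close()
        return False  # do not swallow the exception


output_path = os.path.join(tempfile.mkdtemp(), "output.csv")
try:
    with SketchWriter(output_path) as writer:
        writer.write(["text", "first region"])
        raise RuntimeError("simulated failure mid-export")
except RuntimeError:
    pass

# The file was closed and the row written before the failure is on disk.
file_closed = writer._file.closed
with open(output_path, encoding="utf-8") as handle:
    saved = handle.read()
```

Without the `with` block, the caller would have to remember to close the output in a `try`/`finally`; the context manager makes that cleanup automatic.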
The following features are planned for upcoming releases:
- Handwriting Recognition: Detect and extract handwritten content from documents
uv sync
Pull requests are welcome. Please ensure tests pass before submitting.
For questions, feedback, or suggestions, please contact the author at hi@rithul.dev.