Opinionated and Sophisticated Document Region Analyzer.

rithulkamesh/docproc

Docproc

A Python-based document region analyzer and content extraction tool.

Warning

This project is under active development, and most features are not yet implemented. This README is written to convey the project's intended scope.

Overview

Docproc is an opinionated document region analyzer that helps extract text, equations, images and handwriting from documents. It provides both a library interface and a command-line tool.

Repository Flow

flowchart TD
    %% User Interface Layer
    subgraph "User Interface"
        UI["User Input"]:::cli
        CLI["CLI"]:::cli
    end
    UI -->|"initiates"| CLI

    %% Core Processing Layer
    subgraph "Core Processing"
        DA["Document Analyzer"]:::core
        ED["Equations Detector"]:::core
        HD["Handwriting Detector"]:::core
        RD["Regions Detector"]:::core
    end
    CLI -->|"processes"| DA
    DA -->|"detects"| ED
    DA -->|"detects"| HD
    DA -->|"detects"| RD

    %% Output Generation Layer
    subgraph "Output Generation"
        CSV["CSV Writer"]:::writer
        JSON["JSON Writer"]:::writer
        SQLITE["SQLite Writer"]:::writer
        FILE["Generic File Writer"]:::writer
    end
    DA -->|"exports"| CSV
    DA -->|"exports"| JSON
    DA -->|"exports"| SQLITE
    DA -->|"exports"| FILE

    %% Environment & Testing Layer
    subgraph "Environment & Testing"
        DE1["pyproject.toml"]:::env
        DE2["shell.nix"]:::env
        TS["Test Suite"]:::test
        CI[".github Directory"]:::env
    end
    DE1 -.->|"env"| CLI
    DE2 -.->|"env"| CLI
    CI -.->|"CI"| CLI
    TS -.->|"tests"| DA

    %% Styles
    classDef cli fill:#ADD8E6,stroke:#000,stroke-width:1px;
    classDef core fill:#90EE90,stroke:#000,stroke-width:1px;
    classDef writer fill:#FFD700,stroke:#000,stroke-width:1px;
    classDef env fill:#D3D3D3,stroke:#000,stroke-width:1px;
    classDef test fill:#FFB6C1,stroke:#000,stroke-width:1px;

    %% Click Events
    click CLI "https://github.com/rithulkamesh/docproc/blob/main/docproc/bin/cli.py"
    click DA "https://github.com/rithulkamesh/docproc/blob/main/docproc/doc/analyzer.py"
    click ED "https://github.com/rithulkamesh/docproc/blob/main/docproc/doc/equations.py"
    click HD "https://github.com/rithulkamesh/docproc/blob/main/docproc/doc/handwriting.py"
    click RD "https://github.com/rithulkamesh/docproc/blob/main/docproc/doc/regions.py"
    click CSV "https://github.com/rithulkamesh/docproc/blob/main/docproc/writer/csv.py"
    click JSON "https://github.com/rithulkamesh/docproc/blob/main/docproc/writer/json.py"
    click SQLITE "https://github.com/rithulkamesh/docproc/blob/main/docproc/writer/sqlite.py"
    click FILE "https://github.com/rithulkamesh/docproc/blob/main/docproc/writer/filewriter.py"
    click DE1 "https://github.com/rithulkamesh/docproc/blob/main/pyproject.toml"
    click DE2 "https://github.com/rithulkamesh/docproc/blob/main/shell.nix"
    click TS "https://github.com/rithulkamesh/docproc/tree/main/tests/"
    click CI "https://github.com/rithulkamesh/docproc/tree/main/.github"

This diagram was generated with GitDiagram (a shoutout to that project).

Installation

# Using pip
pip install docproc

Usage

As a Command-line Tool

# Basic usage
docproc input.pdf

# Specify output format and file
docproc input.pdf -w csv -o output.csv
docproc input.pdf -w sqlite -o database.db
docproc input.pdf -w json -o output.json

# Extract only specific region types
docproc input.pdf --regions text equation
docproc input.pdf -r text image  # Short form

# Enable verbose logging
docproc input.pdf -v
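
When several documents need processing, the commands above can be assembled from a script. The following is a minimal sketch, not part of docproc: `build_command` and the `EXTENSIONS` mapping are hypothetical names, and the pairing of writer names with file extensions is an assumption based on the examples above. Only the flags shown in this section (`-w`, `-o`, `--regions`, `-v`) are used.

```python
import shlex
from pathlib import Path

# Writer names come from the usage examples; the extension chosen for
# each writer is an assumption made for this sketch.
EXTENSIONS = {"csv": ".csv", "sqlite": ".db", "json": ".json"}


def build_command(pdf, writer="csv", regions=None, verbose=False):
    """Return the argument list for one docproc run (hypothetical helper)."""
    if writer not in EXTENSIONS:
        raise ValueError(f"unsupported writer: {writer}")
    pdf = Path(pdf)
    # Place the output next to the input, swapping only the extension.
    output = pdf.with_suffix(EXTENSIONS[writer])
    cmd = ["docproc", str(pdf), "-w", writer, "-o", str(output)]
    if regions:
        cmd += ["--regions", *regions]
    if verbose:
        cmd.append("-v")
    return cmd


if __name__ == "__main__":
    # Print one command per PDF in the current directory.
    # To execute instead of print: subprocess.run(cmd, check=True)
    for pdf in sorted(Path(".").glob("*.pdf")):
        cmd = build_command(pdf, writer="json", regions=["text", "equation"])
        print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell quoting problems when file names contain spaces.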

Supported output formats:

  • CSV (default)
  • SQLite
  • JSON
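
Because the project is still in flux, the exact columns, keys and table names of each output format are not documented here. The sketch below therefore inspects an output file without assuming a schema, using only the Python standard library. `summarize` is a hypothetical helper, and treating the first CSV row as a header and `.db` as the SQLite extension are assumptions.

```python
import csv
import json
import sqlite3
from pathlib import Path


def summarize(path):
    """Return a small, schema-agnostic summary of an output file."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".csv":
        with path.open(newline="", encoding="utf-8") as fh:
            rows = list(csv.reader(fh))
        # Assumption: the first row is a header row.
        header = rows[0] if rows else []
        return {"format": "csv", "header": header, "rows": max(len(rows) - 1, 0)}
    if suffix == ".json":
        with path.open(encoding="utf-8") as fh:
            data = json.load(fh)
        # Count entries for containers; scalars count as a single item.
        items = len(data) if isinstance(data, (list, dict)) else 1
        return {"format": "json", "top_level_type": type(data).__name__, "items": items}
    if suffix == ".db":
        con = sqlite3.connect(str(path))
        try:
            query = "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
            tables = [row[0] for row in con.execute(query)]
        finally:
            con.close()
        return {"format": "sqlite", "tables": tables}
    raise ValueError(f"unrecognized output extension: {suffix}")
```

Calling `summarize("output.csv")` after a run gives a quick check that the file was written and shows which fields the current version emits.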

As a Library

from docproc.doc.analyzer import DocumentAnalyzer
from docproc.writer import CSVWriter

# Using context manager (recommended)
with DocumentAnalyzer("input.pdf", CSVWriter, output_path="output.csv") as analyzer:
    regions = analyzer.detect_regions()  # detect regions in the document
    analyzer.export_regions()  # export the detected regions via the writer
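
To illustrate why the context manager form is a sensible default, here is a self-contained sketch of the same call shape. It does not use docproc's actual classes: `Region`, `SketchCSVWriter` and `SketchAnalyzer` are hypothetical stand-ins, the column names are invented for illustration, and the detection step returns fixed sample data. The point is only that `__exit__` closes the writer even if detection or export raises.

```python
import csv
from dataclasses import dataclass


@dataclass
class Region:
    """Hypothetical region record; the real docproc schema may differ."""
    kind: str      # e.g. "text", "equation", "image"
    page: int
    content: str


class SketchCSVWriter:
    """Hypothetical writer that keeps a file open until close() is called."""

    def __init__(self, output_path):
        self._file = open(output_path, "w", newline="", encoding="utf-8")
        self._rows = csv.writer(self._file)
        self._rows.writerow(["kind", "page", "content"])

    def write(self, region):
        self._rows.writerow([region.kind, region.page, region.content])

    def close(self):
        self._file.close()


class SketchAnalyzer:
    """Hypothetical analyzer mirroring the constructor arguments shown above."""

    def __init__(self, path, writer_cls, output_path):
        self.path = path
        self.writer = writer_cls(output_path)
        self.regions = []

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Runs on normal exit and on exceptions, so the file is always closed.
        self.writer.close()
        return False  # do not swallow exceptions

    def detect_regions(self):
        # Placeholder data; a real detector would parse self.path.
        self.regions = [
            Region("text", 1, "Hello"),
            Region("equation", 1, "E = mc^2"),
        ]
        return self.regions

    def export_regions(self):
        for region in self.regions:
            self.writer.write(region)
```

Passing the writer class rather than an instance lets the analyzer own the writer's lifetime, which is what makes the cleanup in `__exit__` possible.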

Roadmap

The following features are planned for upcoming releases:

  • Handwriting Recognition: Detect and extract handwritten content from documents

Development

# Install dependencies with uv
uv sync

Contributing

Pull requests are welcome. Please ensure tests pass before submitting.

Contact

For questions, feedback or suggestions, please contact the author at hi@rithul.dev.