Docproc

A Python-based document region analyzer and content extraction tool.

Warning

This project is under active development, and most features are not yet implemented. This README describes the intended scope of the project.

Overview

Docproc is an opinionated document region analyzer that helps extract text, equations, images and handwriting from documents. It provides both a library interface and a command-line tool.

Repository Flow

flowchart TD
    %% User Interface Layer
    subgraph "User Interface"
        UI["User Input"]:::cli
        CLI["CLI"]:::cli
    end
    UI -->|"initiates"| CLI

    %% Core Processing Layer
    subgraph "Core Processing"
        DA["Document Analyzer"]:::core
        ED["Equations Detector"]:::core
        HD["Handwriting Detector"]:::core
        RD["Regions Detector"]:::core
    end
    CLI -->|"processes"| DA
    DA -->|"detects"| ED
    DA -->|"detects"| HD
    DA -->|"detects"| RD

    %% Output Generation Layer
    subgraph "Output Generation"
        CSV["CSV Writer"]:::writer
        JSON["JSON Writer"]:::writer
        SQLITE["SQLite Writer"]:::writer
        FILE["Generic File Writer"]:::writer
    end
    DA -->|"exports"| CSV
    DA -->|"exports"| JSON
    DA -->|"exports"| SQLITE
    DA -->|"exports"| FILE

    %% Environment & Testing Layer
    subgraph "Environment & Testing"
        DE1["pyproject.toml"]:::env
        DE2["shell.nix"]:::env
        TS["Test Suite"]:::test
        CI[".github Directory"]:::env
    end
    DE1 -.->|"env"| CLI
    DE2 -.->|"env"| CLI
    CI -.->|"CI"| CLI
    TS -.->|"tests"| DA

    %% Styles
    classDef cli fill:#ADD8E6,stroke:#000,stroke-width:1px;
    classDef core fill:#90EE90,stroke:#000,stroke-width:1px;
    classDef writer fill:#FFD700,stroke:#000,stroke-width:1px;
    classDef env fill:#D3D3D3,stroke:#000,stroke-width:1px;
    classDef test fill:#FFB6C1,stroke:#000,stroke-width:1px;

    %% Click Events
    click CLI "https://github.com/rithulkamesh/docproc/blob/main/docproc/bin/cli.py"
    click DA "https://github.com/rithulkamesh/docproc/blob/main/docproc/doc/analyzer.py"
    click ED "https://github.com/rithulkamesh/docproc/blob/main/docproc/doc/equations.py"
    click HD "https://github.com/rithulkamesh/docproc/blob/main/docproc/doc/handwriting.py"
    click RD "https://github.com/rithulkamesh/docproc/blob/main/docproc/doc/regions.py"
    click CSV "https://github.com/rithulkamesh/docproc/blob/main/docproc/writer/csv.py"
    click JSON "https://github.com/rithulkamesh/docproc/blob/main/docproc/writer/json.py"
    click SQLITE "https://github.com/rithulkamesh/docproc/blob/main/docproc/writer/sqlite.py"
    click FILE "https://github.com/rithulkamesh/docproc/blob/main/docproc/writer/filewriter.py"
    click DE1 "https://github.com/rithulkamesh/docproc/blob/main/pyproject.toml"
    click DE2 "https://github.com/rithulkamesh/docproc/blob/main/shell.nix"
    click TS "https://github.com/rithulkamesh/docproc/tree/main/tests/"
    click CI "https://github.com/rithulkamesh/docproc/tree/main/.github"

This diagram was generated with GitDiagram; a shoutout to that project.
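
The flow in the diagram (an analyzer that fans out to detectors and then hands results to a writer) can be illustrated with a small, self-contained sketch. Everything here is hypothetical: `Region`, the toy detectors, `ListWriter` and `analyze` are stand-ins invented for illustration and are not docproc's actual classes or detection logic.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Region:
    """One detected region; fields are illustrative, not docproc's schema."""
    kind: str      # e.g. "text", "equation", "handwriting"
    page: int
    content: str


# A detector takes a page number and raw page text and yields regions
# (hypothetical signature, chosen only for this sketch).
Detector = Callable[[int, str], Iterable[Region]]


def detect_text(page: int, raw: str) -> Iterable[Region]:
    # Toy rule: every non-empty line without '=' counts as text.
    for line in raw.splitlines():
        if line.strip() and "=" not in line:
            yield Region("text", page, line.strip())


def detect_equations(page: int, raw: str) -> Iterable[Region]:
    # Toy rule: every line containing '=' counts as an equation.
    for line in raw.splitlines():
        if "=" in line:
            yield Region("equation", page, line.strip())


class ListWriter:
    """Stand-in for the CSV/JSON/SQLite writers: collects regions in memory."""

    def __init__(self) -> None:
        self.rows: List[Region] = []

    def write(self, region: Region) -> None:
        self.rows.append(region)


def analyze(pages: List[str], detectors: List[Detector], writer: ListWriter) -> int:
    """Run every detector over every page and export each result."""
    count = 0
    for page_no, raw in enumerate(pages, start=1):
        for detector in detectors:
            for region in detector(page_no, raw):
                writer.write(region)
                count += 1
    return count


writer = ListWriter()
total = analyze(
    ["Intro line\nE = mc^2", "Second page"],
    [detect_text, detect_equations],
    writer,
)
print(total, [r.kind for r in writer.rows])  # 3 ['text', 'equation', 'text']
```

Keeping detectors and writers behind small, uniform interfaces is one way such a design lets new region types or output formats be added without touching the analyzer loop.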

Installation

# Using pip
pip install docproc

Usage

As a Command-line Tool

# Basic usage
docproc input.pdf

# Specify output format and file
docproc input.pdf -w csv -o output.csv
docproc input.pdf -w sqlite -o database.db
docproc input.pdf -w json -o output.json

# Extract only specific region types
docproc input.pdf --regions text equation
docproc input.pdf -r text image  # Short form

# Enable verbose logging
docproc input.pdf -v

Supported output formats:

CSV (default)
SQLite
JSON
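
When docproc is driven from a script, the command line can be assembled programmatically. The sketch below uses only the flags shown in the examples above (`-w`, `-o`, `--regions`, `-v`); the helper name `build_docproc_command` is hypothetical, and the code only builds the argument list without running anything.

```python
import shlex
from typing import List, Optional, Sequence


def build_docproc_command(
    input_path: str,
    writer: Optional[str] = None,
    output: Optional[str] = None,
    regions: Sequence[str] = (),
    verbose: bool = False,
) -> List[str]:
    """Assemble an argv list using only the flags shown in this README."""
    cmd = ["docproc", input_path]
    if writer is not None:
        cmd += ["-w", writer]
    if output is not None:
        cmd += ["-o", output]
    if regions:
        cmd += ["--regions", *regions]
    if verbose:
        cmd.append("-v")
    return cmd


cmd = build_docproc_command(
    "input.pdf", writer="json", output="output.json", regions=["text", "equation"]
)
print(shlex.join(cmd))
# docproc input.pdf -w json -o output.json --regions text equation

# To actually run it (requires docproc to be installed and on PATH):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the list to `subprocess.run` rather than a single shell string avoids quoting problems with file names that contain spaces.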

As a Library

from docproc.doc.analyzer import DocumentAnalyzer
from docproc.writer import CSVWriter

# Using context manager (recommended)
with DocumentAnalyzer("input.pdf", CSVWriter, output_path="output.csv") as analyzer:
    regions = analyzer.detect_regions()
    analyzer.export_regions()
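
A likely reason the `with` form is recommended is that cleanup runs even when an error occurs inside the block. The sketch below illustrates that mechanism with a hypothetical `SketchAnalyzer` writing CSV rows to an in-memory buffer; it is not docproc's `DocumentAnalyzer`, and the column names are assumptions made for the example.

```python
import csv
import io
from typing import List, Tuple


class SketchAnalyzer:
    """Hypothetical stand-in for DocumentAnalyzer; not docproc's real class."""

    def __init__(self, regions: List[Tuple[str, str]], sink: io.StringIO) -> None:
        self._regions = regions
        self._sink = sink
        self._writer = None
        self.closed = True

    def __enter__(self) -> "SketchAnalyzer":
        # Acquire the output resource when the with-block starts.
        self._writer = csv.writer(self._sink, lineterminator="\n")
        self._writer.writerow(["kind", "content"])
        self.closed = False
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        # Runs on normal exit and on exceptions, so cleanup is never skipped.
        self._writer = None
        self.closed = True
        return False  # do not swallow exceptions

    def export_regions(self) -> None:
        if self._writer is None:
            raise RuntimeError("export_regions() called outside a with-block")
        self._writer.writerows(self._regions)


sink = io.StringIO()
analyzer = SketchAnalyzer([("text", "Hello"), ("equation", "E = mc^2")], sink)
try:
    with analyzer:
        analyzer.export_regions()
        raise ValueError("simulated failure after export")
except ValueError:
    pass

print(analyzer.closed)   # True: __exit__ ran despite the exception
print(sink.getvalue())
```

The rows written before the failure are still present in the buffer, and the analyzer is marked closed, which is the behavior a context manager is meant to guarantee.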

Roadmap

The following features are planned for upcoming releases:

Handwriting Recognition: Detect and extract handwritten content from documents

Development

uv sync

Contributing

Pull requests are welcome. Please ensure tests pass before submitting.

Contact

For any questions, feedback, or suggestions, please contact the author at hi@rithul.dev
