akora/pdf-to-markdown

PDF to Markdown

A simple local tool to convert PDF files into Obsidian-friendly Markdown files using EasyOCR or Docling engines with intelligent document date extraction.

  • Inbox processing: PDFs in inbox/ are converted and moved to processed/, with the .md written next to the .pdf; on failure the PDF is moved to failed/.
  • Automatic language detection between English and Hungarian via a 1-page probe, enhanced with accent- and stopword-based scoring.
  • Image-only PDF detection with configurable thresholds and default skip behavior (no Markdown created, file moved and logged).
  • Two engines: EasyOCR + pdf2image (Poppler) at 300 DPI, or Docling for advanced PDF understanding.
  • Plain-text logging in logs/run.log and resumable processing via --resume and logs/state.json.

Prerequisites

  • macOS
  • Homebrew Poppler (for pdf2image, only needed with the EasyOCR engine):
brew install poppler
  • Python 3.10+ recommended

Setup

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

If you need a specific torch build (GPU/CUDA), refer to PyTorch install instructions. On Apple Silicon, EasyOCR runs on CPU; CUDA is not available.

Directory layout

  • inbox/ — place your input PDFs here
  • processed/ — processed originals and generated Markdown files (paired by basename)
  • failed/ — PDFs that could not be processed (e.g., image-only per thresholds or runtime error)
  • logs/ — run logs and state.json

These will be created on first run if missing.
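The directory handling described above can be sketched as follows. This is a minimal illustration, not the tool's actual code: the function names `ensure_layout` and `file_result` are hypothetical, and the sketch assumes a failed conversion is signalled by passing `None` instead of Markdown text.

```python
import shutil
from pathlib import Path


def ensure_layout(root="."):
    """Create inbox/, processed/, failed/ and logs/ under root if missing."""
    dirs = {name: Path(root) / name
            for name in ("inbox", "processed", "failed", "logs")}
    for path in dirs.values():
        path.mkdir(parents=True, exist_ok=True)
    return dirs


def file_result(pdf, dirs, markdown):
    """Move a PDF out of the inbox after a conversion attempt.

    On success the PDF goes to processed/ and the Markdown is written next
    to it under the same basename. When markdown is None (hypothetical
    failure signal) the PDF goes to failed/ and no .md is written.
    """
    if markdown is None:
        return Path(shutil.move(str(pdf), str(dirs["failed"] / pdf.name)))
    target = dirs["processed"] / pdf.name
    shutil.move(str(pdf), str(target))
    target.with_suffix(".md").write_text(markdown, encoding="utf-8")
    return target
```

Keeping the .md and .pdf under one basename in the same folder is what makes the pair sort together in Obsidian.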

Usage

Docling engine (recommended)

Best working command on Apple Silicon (no images, accurate tables, automatic language detection, Obsidian-friendly Markdown written to processed/):

python pdf_to_markdown.py \
  --engine docling \
  --device mps \
  --image-export-mode placeholder \
  --table-extraction-mode accurate \
  --fix-markdown-lint \
  --inbox inbox \
  --processed processed \
  --failed failed \
  --logs logs
  • --engine docling: switches to Docling for robust PDF→Markdown conversion.
  • --device mps: Apple Silicon GPU acceleration path (Docling handles this internally).
  • --image-export-mode placeholder: no image files are exported (smallest Markdown output). Use referenced to keep images next to the Markdown.
  • --table-extraction-mode accurate: higher-quality table extraction.
  • --fix-markdown-lint: applies post-processing fixes (removes trailing punctuation from headings, adds a space after #, ensures proper list spacing, enforces a single trailing newline).
  • YAML frontmatter is prepended to match Obsidian usage (title, source, detected_language, page_count, processed_at).

Docling does not require Poppler. Poppler is only needed for the EasyOCR path.
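To give a feel for what the --fix-markdown-lint fixes listed above involve, here is a rough sketch. It is an assumption about how such a pass could work, not the tool's implementation: the function name `fix_markdown_lint`, the punctuation set, and the list-spacing rule (a blank line before the first item of a list) are all illustrative, and the sketch does not special-case fenced code blocks.

```python
import re

HEADING = re.compile(r"^(#{1,6})(?!#)\s*(.+?)\s*$")
LIST_ITEM = re.compile(r"^\s*([-*+]|\d+\.)\s+")


def fix_markdown_lint(md):
    """Apply a few simple lint fixes to Markdown text (illustrative only)."""
    fixed = []
    for line in md.splitlines():
        match = HEADING.match(line)
        if match:
            # Add a space after the #s and drop trailing punctuation.
            line = f"{match.group(1)} {match.group(2).rstrip('.,;:!')}"
        prev = fixed[-1] if fixed else ""
        # Put a blank line between a paragraph and the list that follows it.
        if LIST_ITEM.match(line) and prev.strip() and not LIST_ITEM.match(prev):
            fixed.append("")
        fixed.append(line)
    # Collapse any trailing blank lines into a single newline.
    return "\n".join(fixed).rstrip("\n") + "\n"
```

For example, `fix_markdown_lint("##Page 1:\ntext\n- a\n- b\n\n\n")` yields a heading `## Page 1`, a blank line before the list, and exactly one newline at the end.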

EasyOCR engine

python pdf_to_markdown.py \
  --engine easyocr \
  --fix-markdown-lint \
  --inbox inbox \
  --processed processed \
  --failed failed \
  --dpi 300 \
  --workers 4 \
  --gpu-preference auto \
  --resume \
  --log logs/run.log \
  --language-detection-pages 1 \
  --image-only-action skip \
  --min-text-chars 50 \
  --min-boxes 3 \
  --min-conf 0.35
  • --fix-markdown-lint: applies post-processing fixes to improve Markdown quality.
  • --resume: append missing pages if an output .md already exists in processed/ (EasyOCR only).
  • --gpu-preference auto: uses CUDA if available, otherwise CPU. Apple Silicon MPS is not used by EasyOCR; it will fall back to CPU.
  • --image-only-action {skip,note}: on low-text PDFs, default skip moves the PDF to failed/ and does not create Markdown.
  • Thresholds used during the probe to decide image-only: --min-text-chars, --min-boxes, --min-conf.

Output format

Each output file in processed/ has the same basename as the PDF and consists of YAML frontmatter followed by per-page sections. The .md sits next to the .pdf, so they pair by name and sort together. For example:

---
title: SampleDoc
source: [SampleDoc](SampleDoc.pdf)
detected_language: en
page_count: 3
document_date: 2024-03-15
---

## Page 1

...text...

## Page 2

...text...

The source field links to the sibling PDF (same directory, relative filename only).
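The layout shown above could be assembled along these lines. This is a sketch under assumptions: `build_markdown` is a hypothetical helper, and only the fields visible in the example are emitted.

```python
from pathlib import Path


def build_markdown(pdf_path, pages, detected_language, document_date=None):
    """Render frontmatter plus one '## Page N' section per page of text."""
    pdf = Path(pdf_path)
    lines = [
        "---",
        f"title: {pdf.stem}",
        # Relative filename only, so the link resolves to the sibling PDF.
        f"source: [{pdf.stem}]({pdf.name})",
        f"detected_language: {detected_language}",
        f"page_count: {len(pages)}",
    ]
    if document_date:
        lines.append(f"document_date: {document_date}")
    lines.append("---")
    for number, text in enumerate(pages, start=1):
        lines += ["", f"## Page {number}", "", text.strip()]
    return "\n".join(lines) + "\n"
```

Using `pdf.name` rather than the full path keeps the link valid wherever the processed/ folder ends up inside a vault.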

Language detection and accuracy

  • A 1-page probe runs OCR in both en and hu.
  • Scoring combines: average OCR confidence, accented-letter ratio (weight 0.2), and stopword hit rate (weight 0.3).
  • The higher score determines detected_language for the full run.
  • Observed accuracy: correct for our sample set except one mixed-language document (it started with some English but was mostly Hungarian). Mixed pages can bias the probe; if that happens, split the document by language and rerun each part separately.
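The scoring described above might look roughly like the following. Only the two weights (0.2 and 0.3) come from this README; everything else is an assumption made for illustration: the stopword lists are tiny hypothetical samples, the accent term is applied to the Hungarian score only, and the three terms are simply summed.

```python
import re

# Hypothetical sample lists; the tool's real stopword lists are not shown here.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "in", "is", "that", "for"},
    "hu": {"a", "az", "és", "hogy", "nem", "is", "egy", "meg"},
}
HU_ACCENTS = set("áéíóöőúüűÁÉÍÓÖŐÚÜŰ")


def score_language(text, avg_conf, lang):
    """Combine OCR confidence, accent ratio (0.2) and stopword rate (0.3)."""
    letters = [ch for ch in text if ch.isalpha()]
    accent_ratio = (
        sum(ch in HU_ACCENTS for ch in letters) / len(letters) if letters else 0.0
    )
    words = re.findall(r"\w+", text.lower())
    stop_rate = (
        sum(w in STOPWORDS[lang] for w in words) / len(words) if words else 0.0
    )
    # Assumption: accented letters only count in favour of Hungarian.
    accent_term = accent_ratio if lang == "hu" else 0.0
    return avg_conf + 0.2 * accent_term + 0.3 * stop_rate


def detect_language(probe):
    """probe maps a language code to (ocr_text, avg_confidence) from its OCR run."""
    return max(probe, key=lambda lang: score_language(*probe[lang], lang))
```

Because the probe covers a single page, a document whose first page is in the minority language will be scored on that page alone, which matches the mixed-language miss noted above.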

Image-only PDFs

  • Considered image-only if, during the probe, either:
    • max(chars) < --min-text-chars AND max(boxes) < --min-boxes, or
    • max(avg_conf) < --min-conf AND max(stopword_rate) < 0.005.
  • Default action: --image-only-action skip.
    • Logs an IMAGE_ONLY ... line and moves the PDF to failed/.
    • No Markdown is produced.
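The two conditions above translate into a small predicate. This is a sketch: `ProbeResult` and `is_image_only` are hypothetical names, the sketch assumes one probe result per OCR run with the maximum taken across them, and the default thresholds are the values from the example EasyOCR command.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    chars: int            # characters recognised on the probe page
    boxes: int            # text boxes detected
    avg_conf: float       # mean OCR confidence
    stopword_rate: float  # share of words that are stopwords


def is_image_only(results, min_text_chars=50, min_boxes=3,
                  min_conf=0.35, min_stopword_rate=0.005):
    """Return True when every probe run found too little or too unreliable text."""
    too_little_text = (
        max(r.chars for r in results) < min_text_chars
        and max(r.boxes for r in results) < min_boxes
    )
    too_unreliable = (
        max(r.avg_conf for r in results) < min_conf
        and max(r.stopword_rate for r in results) < min_stopword_rate
    )
    return too_little_text or too_unreliable
```

Taking the maximum across runs means a PDF is skipped only if no probed language produced usable text.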

Notes

  • If Poppler is not installed or not in PATH, you will see an error about pdftoppm missing.
  • Logs are appended to logs/run.log. A per-file state is kept in logs/state.json for resume support.
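If you want to catch the missing-Poppler case before a run starts, a preflight check along these lines is one option. The function name `missing_prerequisites` is hypothetical and not part of the tool; the check relies only on `shutil.which` looking up pdftoppm on PATH.

```python
import shutil


def missing_prerequisites(engine, which=shutil.which):
    """Return a list of human-readable problems for the chosen engine."""
    # Only the EasyOCR path renders pages through Poppler's pdftoppm.
    if engine == "easyocr" and which("pdftoppm") is None:
        return ["pdftoppm not found on PATH; install Poppler (brew install poppler)"]
    return []
```

The `which` parameter exists only so the lookup can be replaced in tests.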

Testing

Place a few sample PDFs in inbox/ and run one of the commands from the Usage section. Confirm:

  • Language detection chooses en vs hu appropriately.
  • Markdown created in processed/ (paired .md next to .pdf) with correct frontmatter and content sections.
  • Original PDFs moved to processed/ after success.
  • Re-running with --resume appends missing pages when applicable (EasyOCR only).
  • Image-only PDFs are skipped (no .md), moved to failed/, and IMAGE_ONLY entries appear in logs/run.log alongside PROBE, PAGE, and FINISH lines.
