A simple local tool to convert PDF files into Obsidian-friendly Markdown files using EasyOCR or Docling engines with intelligent document date extraction.
- Inbox processing: `inbox/` -> conversion -> `processed/`; writes `.md` next to `.pdf`; on failure moves the PDF to `failed/`.
- Automatic language detection between English and Hungarian via a 1-page probe, enhanced with accent- and stopword-based scoring.
- Image-only PDF detection with configurable thresholds and default skip behavior (no Markdown created, file moved and logged).
- Two engines: EasyOCR + pdf2image (Poppler) at 300 DPI, or Docling for advanced PDF understanding.
- Plain-text logging in `logs/run.log` and resumable processing via `--resume` and `logs/state.json`.
- macOS
- Homebrew Poppler (for `pdf2image`, only needed with the EasyOCR engine): `brew install poppler`
- Python 3.10+ recommended
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
If you need a specific torch build (GPU/CUDA), refer to PyTorch install instructions. On Apple Silicon, EasyOCR runs on CPU; CUDA is not available.
- `inbox/`: place your input PDFs here
- `processed/`: processed originals and generated Markdown files (paired by basename)
- `failed/`: PDFs that could not be processed (e.g., image-only per thresholds or runtime error)
- `logs/`: run logs and `state.json`
These will be created on first run if missing.
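As an illustration of the layout above, the following minimal sketch creates the four folders when they are missing and routes a PDF to `processed/` or `failed/`. The helper names `ensure_folders` and `route_pdf` are hypothetical and are not taken from `pdf_to_markdown.py`.

```python
from pathlib import Path
import shutil

# Folder names come from the layout above; everything else is illustrative.
FOLDERS = ("inbox", "processed", "failed", "logs")


def ensure_folders(root: Path) -> None:
    """Create the working folders under root if they do not exist yet."""
    for name in FOLDERS:
        (root / name).mkdir(parents=True, exist_ok=True)


def route_pdf(pdf: Path, root: Path, succeeded: bool) -> Path:
    """Move a PDF to processed/ on success, or to failed/ otherwise."""
    target_dir = root / ("processed" if succeeded else "failed")
    target = target_dir / pdf.name
    shutil.move(str(pdf), str(target))
    return target
```

Calling `ensure_folders(Path("."))` before the first run reproduces the "created on first run" behavior in this sketch.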
Best working command on Apple Silicon (no images, accurate tables, automatic language detection, Obsidian-friendly Markdown written to `processed/`):
python pdf_to_markdown.py \
--engine docling \
--device mps \
--image-export-mode placeholder \
--table-extraction-mode accurate \
--fix-markdown-lint \
--inbox inbox \
--processed processed \
--failed failed \
--logs logs
- engine: switches to Docling for robust PDF→Markdown.
- device mps: Apple Silicon GPU acceleration path (Docling handles this internally).
- image-export-mode placeholder: no image files exported (smallest Markdown output). Use `referenced` to keep images next to the Markdown.
- table-extraction-mode accurate: higher quality table extraction.
- fix-markdown-lint: applies post-processing fixes (removes trailing punctuation from headings, adds spaces after #, ensures proper list spacing, single trailing newline).
- YAML frontmatter is prepended to match Obsidian usage (`title`, `source`, `detected_language`, `page_count`, `processed_at`).
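The lint fixes listed for `--fix-markdown-lint` can be approximated with a few regular expressions. This is a rough sketch rather than the tool's own implementation: the function name `fix_markdown_lint` is hypothetical, and "proper list spacing" is assumed here to mean a blank line before a list that directly follows other text.

```python
import re

# A heading: 1-6 '#' characters, optional spacing, then the title text.
HEADING = re.compile(r"^(#{1,6})[ \t]*([^#\s].*)$")
# A bullet or numbered list item.
LIST_ITEM = re.compile(r"^\s*([-*+]|\d+\.)\s+")


def fix_markdown_lint(text: str) -> str:
    """Apply four small lint fixes to a Markdown string."""
    fixed = []
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            # Add the space after '#' and drop trailing punctuation.
            title = match.group(2).rstrip().rstrip(".,;:!")
            line = f"{match.group(1)} {title}"
        elif LIST_ITEM.match(line) and fixed:
            prev = fixed[-1]
            # Assumed rule: separate a list from preceding text with a blank line.
            if prev.strip() and not LIST_ITEM.match(prev):
                fixed.append("")
        fixed.append(line)
    # Collapse any run of trailing newlines into exactly one.
    return "\n".join(fixed).rstrip("\n") + "\n"
```

One caveat of this sketch: a line that starts with an Obsidian tag such as `#todo` also matches the heading pattern and would be turned into a heading.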
Docling does not require Poppler. Poppler is only needed for the EasyOCR path.
python pdf_to_markdown.py \
--engine easyocr \
--fix-markdown-lint \
--inbox inbox \
--processed processed \
--failed failed \
--dpi 300 \
--workers 4 \
--gpu-preference auto \
--resume \
--log logs/run.log \
--language-detection-pages 1 \
--image-only-action skip \
--min-text-chars 50 \
--min-boxes 3 \
--min-conf 0.35
- `--fix-markdown-lint`: applies post-processing fixes to improve Markdown quality.
- `--resume`: appends missing pages if an output `.md` already exists in `processed/` (EasyOCR only).
- `--gpu-preference auto`: uses CUDA if available, otherwise CPU. Apple Silicon MPS is not used by EasyOCR; it will fall back to CPU.
- `--image-only-action {skip,note}`: on low-text PDFs, the default `skip` moves the PDF to `failed/` and does not create Markdown.
- Thresholds used during the probe to decide image-only: `--min-text-chars`, `--min-boxes`, `--min-conf`.
Each output file in `processed/` has the same basename as the PDF, with YAML frontmatter and per-page sections. The `.md` sits next to the `.pdf`, so they pair by name and sort together. For example:
---
title: SampleDoc
source: [SampleDoc](SampleDoc.pdf)
detected_language: en
page_count: 3
document_date: 2024-03-15
---
## Page 1
...text...
## Page 2
...text...
The `source` field links to the sibling PDF (same directory, relative filename only).
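To make the output layout concrete, here is a small sketch that renders the same structure as the sample: frontmatter followed by one `## Page N` section per page. The function `render_markdown` and its parameters are hypothetical; the field names and the `source` link format simply mirror the sample above.

```python
from typing import Optional, Sequence


def render_markdown(
    title: str,
    pdf_name: str,
    language: str,
    pages: Sequence[str],
    document_date: Optional[str] = None,
) -> str:
    """Build frontmatter plus per-page sections for one document."""
    lines = [
        "---",
        f"title: {title}",
        # Relative link to the sibling PDF, filename only.
        f"source: [{title}]({pdf_name})",
        f"detected_language: {language}",
        f"page_count: {len(pages)}",
    ]
    if document_date:
        lines.append(f"document_date: {document_date}")
    lines.append("---")
    for number, text in enumerate(pages, start=1):
        lines.append(f"## Page {number}")
        lines.append(text)
    return "\n".join(lines) + "\n"
```

Because the link target is just the filename, the pairing keeps working if both files are moved together to another folder in the vault.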
- A 1-page probe runs OCR in both `en` and `hu`.
- Scoring combines: average OCR confidence, accented-letter ratio (weight 0.2), and stopword hit rate (weight 0.3).
- The higher score determines `detected_language` for the full run.
- Observed accuracy: correct for our sample set except one mixed-language document (it started with some English but was mostly Hungarian). Mixed pages can bias the probe; if needed, split such a PDF by language and rerun the parts separately.
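The scoring described above might look roughly like the sketch below. Several details are assumptions: the three terms are combined additively, the accent ratio counts Hungarian accented letters in each pass's own OCR text, and the stopword lists are short illustrative samples. The names `probe_score` and `detect_language` are hypothetical.

```python
import re
from typing import Dict, Tuple

# Hungarian accented letters (assumed character set).
HU_ACCENTS = set("áéíóöőúüűÁÉÍÓÖŐÚÜŰ")
# Tiny illustrative stopword lists; a real list would be much longer.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "in", "with"},
    "hu": {"az", "és", "hogy", "nem", "egy", "van"},
}


def probe_score(lang: str, text: str, avg_conf: float) -> float:
    """Combine OCR confidence, accent ratio (0.2) and stopword rate (0.3)."""
    letters = [c for c in text if c.isalpha()]
    accent_ratio = (
        sum(c in HU_ACCENTS for c in letters) / len(letters) if letters else 0.0
    )
    words = re.findall(r"\w+", text.lower())
    stopword_rate = (
        sum(w in STOPWORDS[lang] for w in words) / len(words) if words else 0.0
    )
    return avg_conf + 0.2 * accent_ratio + 0.3 * stopword_rate


def detect_language(probes: Dict[str, Tuple[str, float]]) -> str:
    """probes maps a language code to (ocr_text, avg_conf) from its pass."""
    return max(probes, key=lambda lang: probe_score(lang, *probes[lang]))
```

In this form, a page that opens in English can outscore the Hungarian pass even when the rest of the document is Hungarian, which is the mixed-language bias noted above.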
- Considered image-only if, during the probe, either:
  - max(chars) < `--min-text-chars` AND max(boxes) < `--min-boxes`, or
  - max(avg_conf) < `--min-conf` AND max(stopword_rate) < 0.005.
- Default action: `--image-only-action skip`.
  - Logs an `IMAGE_ONLY ...` line and moves the PDF to `failed/`.
  - No Markdown is produced.
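The image-only rule above translates directly into a small predicate. In this sketch, `ProbeResult` and `is_image_only` are hypothetical names, each `max(...)` is assumed to be taken across the per-language probe passes, and the default thresholds are the values used in the EasyOCR sample command, which may differ from the tool's built-in defaults.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ProbeResult:
    """Measurements from one probe pass (for example the en or hu pass)."""
    chars: int
    boxes: int
    avg_conf: float
    stopword_rate: float


def is_image_only(
    results: Iterable[ProbeResult],
    min_text_chars: int = 50,
    min_boxes: int = 3,
    min_conf: float = 0.35,
    min_stopword_rate: float = 0.005,
) -> bool:
    """Return True when the probe found too little usable text."""
    results = list(results)
    max_chars = max(r.chars for r in results)
    max_boxes = max(r.boxes for r in results)
    max_conf = max(r.avg_conf for r in results)
    max_stop = max(r.stopword_rate for r in results)
    # Rule 1: hardly any text and hardly any detected boxes.
    too_little_text = max_chars < min_text_chars and max_boxes < min_boxes
    # Rule 2: text was found, but it is low-confidence and has no stopwords.
    unreadable = max_conf < min_conf and max_stop < min_stopword_rate
    return too_little_text or unreadable
```

Both conditions inside each rule must hold, so a page with few characters but many detected boxes is still processed.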
- If Poppler is not installed or not in PATH, you will see an error about `pdftoppm` missing.
- Logs are appended to `logs/run.log`. A per-file state is kept in `logs/state.json` for resume support.
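A quick way to rule out the Poppler problem before starting an EasyOCR run is to look for `pdftoppm` on PATH. The sketch below uses the standard library's `shutil.which`; the function name `poppler_available` is hypothetical, and its `which` parameter exists only to make the check easy to test.

```python
import shutil
from typing import Callable, Optional


def poppler_available(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> bool:
    """Return True if the pdftoppm binary can be found on PATH."""
    return which("pdftoppm") is not None


if __name__ == "__main__":
    if not poppler_available():
        print("pdftoppm not found; install Poppler with: brew install poppler")
```

This check is only relevant for the EasyOCR engine, since Docling does not need Poppler.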
Place a few sample PDFs in `inbox/` and run the command above. Confirm:
- Language detection chooses `en` vs `hu` appropriately.
- Markdown is created in `processed/` (paired `.md` next to `.pdf`) with correct frontmatter and content sections.
- Original PDFs are moved to `processed/` after success.
- Re-running with `--resume` appends missing pages when applicable (EasyOCR only).
- Image-only PDFs are skipped (no `.md`), moved to `failed/`, and `IMAGE_ONLY` entries appear in `logs/run.log` alongside `PROBE`, `PAGE`, and `FINISH` lines.
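The last check can be partly automated by counting the marker lines in the log. The exact log line format is not documented here, so this sketch only assumes that each marker appears as a whitespace-separated token somewhere on its line; `tally_markers` is a hypothetical helper.

```python
from collections import Counter

# Marker names taken from the checklist above.
MARKERS = ("PROBE", "PAGE", "IMAGE_ONLY", "FINISH")


def tally_markers(log_text: str) -> Counter:
    """Count log lines that contain each marker as a standalone token."""
    counts = Counter()
    for line in log_text.splitlines():
        tokens = line.split()
        for marker in MARKERS:
            if marker in tokens:
                counts[marker] += 1
    return counts
```

For example, `tally_markers(open("logs/run.log", encoding="utf-8").read())` gives a per-marker count that can be compared against the number of PDFs placed in `inbox/`.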