A robust Python-based tool for organizing and maintaining a digital file vault with intelligent file processing, metadata extraction, and consistent naming conventions.
Intelligent File Processing
- Specialized processors for images, videos, PDFs, and text files
- Automatic file type detection and categorization
- Consistent file naming across all formats
- Metadata extraction and preservation
- Duplicate detection using XXH64 checksums
Hierarchical Organization
- Configurable date-based directory structure (none/year/month/day)
- Type-based categorization (images, videos, documents, notes)
- Smart subcategorization (e.g., scanned vs regular documents)
Metadata Management
- EXIF data extraction from images
- Video technical metadata (resolution, fps, HDR)
- PDF metadata and document type detection
- Text file analysis (word count, frontmatter parsing)
- Preservation of creation dates and timestamps
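As an illustration of the video metadata step, the sketch below reads resolution, frame rate, duration, and an HDR flag from ffprobe's JSON output. The function names are hypothetical, and the HDR check (treating the PQ and HLG transfer characteristics as HDR) is an assumed heuristic, not necessarily what this project does.

```python
import json
import subprocess
from fractions import Fraction

# Transfer characteristics commonly used by HDR video (PQ and HLG).
# Assumption: this heuristic is illustrative, not the project's own rule.
HDR_TRANSFERS = {"smpte2084", "arib-std-b67"}


def run_ffprobe(path):
    """Return ffprobe's JSON report for a file (requires ffprobe on PATH)."""
    result = subprocess.run(
        ["ffprobe", "-v", "quiet", "-print_format", "json",
         "-show_format", "-show_streams", str(path)],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)


def summarize_video(report):
    """Extract resolution, fps, duration, and an HDR flag from a report."""
    video = next(s for s in report["streams"] if s.get("codec_type") == "video")
    # ffprobe reports the frame rate as a fraction string such as "30000/1001".
    fps = float(Fraction(video["r_frame_rate"]))
    return {
        "width": video["width"],
        "height": video["height"],
        "fps": round(fps, 2),
        "duration": float(report["format"]["duration"]),
        "hdr": video.get("color_transfer") in HDR_TRANSFERS,
    }
```

Keeping the parsing separate from the subprocess call means `summarize_video` can be exercised with a saved report, without ffprobe installed.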
Database Integration
- SQLite database for file tracking
- Checksum-based duplicate prevention
- Quick file lookup and metadata querying
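Checksum-based duplicate prevention can be sketched with a uniqueness constraint on the checksum column. The table layout and function names below are assumptions for illustration only, and BLAKE2b from the standard library stands in for XXH64, which needs a third-party package.

```python
import hashlib
import sqlite3


def file_checksum(path, chunk_size=1 << 20):
    """Hash a file in chunks. BLAKE2b is a stand-in for XXH64 here."""
    digest = hashlib.blake2b(digest_size=8)  # 8 bytes = 64 bits, like XXH64
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def open_db(db_path=":memory:"):
    """Open the tracking database and create the (hypothetical) table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        " checksum TEXT PRIMARY KEY,"
        " vault_path TEXT NOT NULL)"
    )
    return conn


def register_file(conn, checksum, vault_path):
    """Insert a record; return False if the checksum is already tracked."""
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(
            "INSERT OR IGNORE INTO files (checksum, vault_path) VALUES (?, ?)",
            (checksum, vault_path),
        )
    return cursor.rowcount == 1
```

Letting the database enforce uniqueness avoids a separate lookup before every insert: a second file with the same checksum is simply reported as a duplicate.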
- Clone this repository
- Create a virtual environment:
  python -m venv venv
- Activate the virtual environment:
  - On Unix/MacOS:
    source venv/bin/activate
  - On Windows:
    .\venv\Scripts\activate
- Install dependencies:
  pip install -r requirements.txt
- Python 3.8 or higher
- External dependencies:
  - exiftool for image metadata extraction
  - ffprobe (part of ffmpeg) for video metadata extraction
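Before a first run, it can help to confirm that both external tools are reachable. The following is a minimal sketch using only the standard library; `missing_tools` is a hypothetical helper, not part of this repository.

```python
import shutil

REQUIRED_TOOLS = ("exiftool", "ffprobe")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing external tools:", ", ".join(missing))
    else:
        print("All external tools found.")
```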
Edit config.py to customize:
- Input directory (INBOX_DIR)
- Vault directory (VAULT_DIR)
- Date hierarchy level:
  - DATE_HIERARCHY_NONE: flat structure
  - DATE_HIERARCHY_YEAR: year folders
  - DATE_HIERARCHY_MONTH: year/month folders
  - DATE_HIERARCHY_DAY: year/month/day folders
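To show how the date hierarchy level might translate into a destination folder, here is a small sketch. The constant values are placeholders (the real ones live in config.py), `dated_directory` is a hypothetical helper, and nesting the date folders under the category folder is an assumption.

```python
from datetime import datetime
from pathlib import Path

# Placeholder values; the real constants are defined in config.py.
DATE_HIERARCHY_NONE = "none"
DATE_HIERARCHY_YEAR = "year"
DATE_HIERARCHY_MONTH = "month"
DATE_HIERARCHY_DAY = "day"

# strftime fragments that make up the folder chain for each level.
_DATE_PARTS = {
    DATE_HIERARCHY_NONE: [],
    DATE_HIERARCHY_YEAR: ["%Y"],
    DATE_HIERARCHY_MONTH: ["%Y", "%m"],
    DATE_HIERARCHY_DAY: ["%Y", "%m", "%d"],
}


def dated_directory(category_dir, created, level):
    """Append year/month/day folders to a category directory."""
    parts = [created.strftime(fmt) for fmt in _DATE_PARTS[level]]
    return Path(category_dir).joinpath(*parts)


created = datetime(2024, 3, 9)
print(dated_directory("vault/images", created, DATE_HIERARCHY_MONTH).as_posix())
```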
- Regular formats (stored in images/): jpg, jpeg, png, gif, bmp, webp, tiff, tif
- RAW formats (stored in photos/raw/): heic, arw (Sony), cr2 (Canon), nef (Nikon), raf (Fuji), dng, raw
Naming format:
d[YYYYMMDD]-t[HHMMSS]-s[BYTES]-r[WIDTHxHEIGHT][-sc######][-MAKE-MODEL].ext
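The image naming format can be assembled along these lines. This sketch assumes the square brackets mark placeholders rather than literal characters, and it treats the optional `sc` field as an opaque six-digit number because its meaning is not stated here. `image_filename` and its parameters are hypothetical.

```python
from datetime import datetime


def image_filename(taken, size_bytes, width, height, ext,
                   sc=None, make=None, model=None):
    """Build a name following the documented image pattern."""
    parts = [
        taken.strftime("d%Y%m%d"),   # d[YYYYMMDD]
        taken.strftime("t%H%M%S"),   # t[HHMMSS]
        f"s{size_bytes}",            # s[BYTES]
        f"r{width}x{height}",        # r[WIDTHxHEIGHT]
    ]
    if sc is not None:
        parts.append(f"sc{sc:06d}")  # optional [-sc######]
    if make and model:
        parts.append(f"{make}-{model}")  # optional [-MAKE-MODEL]
    return "-".join(parts) + "." + ext.lower().lstrip(".")


print(image_filename(datetime(2024, 3, 9, 14, 5, 7), 2048, 4000, 3000, "JPG"))
```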
- Formats: mp4, mov, avi, mkv, wmv, flv, webm, mpg, mpeg, mts, m2ts, m4v, 3gp
Naming format:
d[YYYYMMDD]-t[HHMMSS]-s[BYTES]-[RESOLUTION]-[FPS]-[DURATION][-HDR].ext
- Categories: ebooks, scanned documents, regular PDFs
Naming format:
[scanned-]d[YYYYMMDD]-t[HHMMSS]-s[BYTES]-p[PAGES][-TITLE/SCANNER].pdf
- Formats: txt, md
- Support for YAML frontmatter in markdown
- Automatic title extraction from frontmatter or content
Naming format:
d[YYYYMMDD]-t[HHMMSS]-s[BYTES]-w[WORDCOUNT][-TITLE].ext
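Title extraction and word counting for text files might look like the sketch below. It uses a simplified line-based reader for the frontmatter rather than a full YAML parser, and the fallback to the first Markdown heading is an assumption about what "from content" means. All function names are hypothetical.

```python
import re


def split_frontmatter(text):
    """Split a leading '---' delimited block from the body, if present."""
    match = re.match(r"\A---\s*\n(.*?)\n---\s*\n?(.*)\Z", text, re.DOTALL)
    if match:
        return match.group(1), match.group(2)
    return "", text


def extract_title(text):
    """Prefer a frontmatter 'title:' key, then the first '# ' heading."""
    frontmatter, body = split_frontmatter(text)
    for line in frontmatter.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "title" and value.strip():
            return value.strip().strip("\"'")
    for line in body.splitlines():
        if line.startswith("# "):
            return line[2:].strip()
    return None


def word_count(text):
    """Count whitespace-separated words in the body, excluding frontmatter."""
    return len(split_frontmatter(text)[1].split())
```

A real implementation would likely hand the frontmatter block to a YAML library to handle nested keys and multi-line values.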
# Process new files in inbox
python vault_builder.py
# Clean up inbox (remove empty dirs and hidden files)
python inbox_cleaner.py
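The cleanup described in the comment above (removing hidden files and empty directories) could be approximated as follows. This is only a sketch of the idea, not the contents of inbox_cleaner.py: `clean_inbox` is a hypothetical function, and "hidden" is assumed to mean a file name starting with a dot.

```python
import os
from pathlib import Path


def clean_inbox(inbox_dir):
    """Delete hidden files, then remove directories left empty."""
    inbox = Path(inbox_dir)
    removed = []
    # Walk bottom-up so child directories are handled before their parents.
    for root, _dirs, files in os.walk(inbox, topdown=False):
        for name in files:
            if name.startswith("."):
                target = Path(root, name)
                target.unlink()
                removed.append(target)
        current = Path(root)
        # Never remove the inbox itself, only empty directories inside it.
        if current != inbox and not any(current.iterdir()):
            current.rmdir()
            removed.append(current)
    return removed
```

Because the function deletes files, it is worth trying on a copy of the inbox first.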
Files are organized into these main directories:
- images/: Regular image files (jpg, png, etc.)
- photos/raw/: RAW photo files (arw, cr2, nef, etc.)
- videos/: All video files
- documents/: PDFs and other documents
  - documents/scanned/: Scanned documents
  - documents/ebooks/: Detected ebooks
- notes/: Text and markdown files
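The mapping from file extension to top-level directory can be summarized in a few lines. The extension sets are taken from the format lists in this README; `vault_subdirectory` is a hypothetical helper, and the scanned/ebook subcategories are left out because they depend on inspecting PDF content rather than the extension.

```python
from pathlib import Path

IMAGE_EXTS = {"jpg", "jpeg", "png", "gif", "bmp", "webp", "tiff", "tif"}
RAW_EXTS = {"heic", "arw", "cr2", "nef", "raf", "dng", "raw"}
VIDEO_EXTS = {"mp4", "mov", "avi", "mkv", "wmv", "flv", "webm", "mpg",
              "mpeg", "mts", "m2ts", "m4v", "3gp"}
NOTE_EXTS = {"txt", "md"}


def vault_subdirectory(filename):
    """Return the vault directory for a file, or None if unsupported."""
    ext = Path(filename).suffix.lower().lstrip(".")
    if ext in IMAGE_EXTS:
        return "images"
    if ext in RAW_EXTS:
        return "photos/raw"
    if ext in VIDEO_EXTS:
        return "videos"
    if ext == "pdf":
        return "documents"
    if ext in NOTE_EXTS:
        return "notes"
    return None
```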
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request