This repository is a proof-of-concept (POC) for detecting duplicate images using facial recognition techniques. It leverages DeepFace for face detection and feature embedding, and provides options for both single-process and multi-process deduplication. Celery integration allows for distributed task processing, making it suitable for larger datasets.
- Face Recognition:
  - Detect faces in images.
  - Generate feature embeddings using DeepFace.
- Duplicate Detection:
  - Compare embeddings to identify duplicate or similar images.
  - Flexible backend and model configurations.
- Multi-Process Support:
  - Local execution with single or multiple processes.
  - Distributed task processing using Celery.
- Performance Metrics:
  - Tracks encoding and deduplication times.
  - Provides a comprehensive HTML report.
uv venv
uv sync
export DEEPFACE_HOME=$PWD
This sets the working directory as the DeepFace home, ensuring all necessary models and configurations are correctly accessed.
Prepare a directory with images (e.g., data/IMAGES) and run:
dedupe data/IMAGES -p 1
For improved performance on larger datasets, specify the number of processes (e.g., 4):
dedupe data/IMAGES -p 4
In the first terminal, start a Celery worker:
watchmedo auto-restart --directory=./src/ --pattern *.py --recursive -- celery -A recognizeapp.c.app worker
In a second terminal, run the deduplication task with Celery:
dedupe data/IMAGES -p 4 --queue
Flower provides a web interface to monitor Celery workers and tasks.
In the first terminal:
watchmedo auto-restart --directory=./src/ --pattern *.py --recursive -- celery -A recognizeapp.c.app flower
Open your browser and navigate to:
http://localhost:5555
The project uses DeepFace for face detection and embedding generation. Supported models and backends include:
- Models: VGG-Face, Facenet, DeepFace, ArcFace, and others.
- Backends: opencv, mtcnn, retinaface, and more.
The DeepFace.represent() function generates a feature vector (embedding) for each detected face; these vectors are then compared to identify duplicates. The vector length depends on the chosen model (for example, 128 dimensions for Facenet).
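The encoding step can be sketched as follows. This is a minimal illustration, not the repository's actual code: `encode_image` and its injectable `represent` parameter are hypothetical names, and it assumes a recent DeepFace release in which `DeepFace.represent()` returns one dict per detected face with an `"embedding"` key and raises `ValueError` when no face is found under the default `enforce_detection=True`.

```python
from typing import Callable, Optional

NO_FACE_DETECTED = "NO_FACE_DETECTED"


def encode_image(path: str,
                 model_name: str = "VGG-Face",
                 detector_backend: str = "opencv",
                 represent: Optional[Callable] = None):
    """Return a list of embeddings for `path`, or NO_FACE_DETECTED.

    `represent` is a hypothetical hook so the sketch can be exercised
    without DeepFace installed; by default DeepFace.represent is used.
    """
    if represent is None:
        # Imported lazily: loading DeepFace pulls in heavy dependencies.
        from deepface import DeepFace
        represent = DeepFace.represent
    try:
        faces = represent(img_path=path,
                          model_name=model_name,
                          detector_backend=detector_backend)
    except ValueError:
        # Assumption: DeepFace signals "no face found" with ValueError.
        return NO_FACE_DETECTED
    # One embedding (a list of floats) per detected face.
    return [face["embedding"] for face in faces]
```

Calling `encode_image("data/IMAGES/photo.jpg", model_name="Facenet")` would, under these assumptions, return one 128-element list per face found in the image.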
- Encoding:
  - Images are processed to extract face embeddings.
  - Images without detectable faces are flagged as NO_FACE_DETECTED.
- Comparison:
  - Feature vectors are compared using cosine similarity or distance-based metrics.
  - Duplicate pairs are identified if the similarity exceeds a defined threshold.
- Report Generation:
  - Results are saved in JSON format and an HTML report is generated.
  - Metrics like total time, new images processed, and findings are included.
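The comparison and reporting steps above can be sketched in plain Python. This is an illustration under stated assumptions rather than the project's implementation: the function names, the one-embedding-per-image input, the 0.8 threshold, and the shape of the JSON findings are all hypothetical.

```python
import json
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_duplicates(encodings, threshold=0.8):
    """Return pairs of images whose similarity exceeds `threshold`.

    `encodings` maps an image path to one embedding, or to the string
    "NO_FACE_DETECTED" for images that could not be encoded.
    """
    usable = {p: e for p, e in encodings.items() if not isinstance(e, str)}
    findings = []
    for (p1, e1), (p2, e2) in combinations(sorted(usable.items()), 2):
        score = cosine_similarity(e1, e2)
        if score > threshold:
            findings.append({"first": p1, "second": p2,
                             "similarity": round(score, 4)})
    return findings


encodings = {
    "a.jpg": [1.0, 0.0],
    "b.jpg": [0.9, 0.1],
    "c.jpg": [0.0, 1.0],
    "d.jpg": "NO_FACE_DETECTED",
}
findings = find_duplicates(encodings)
# Serialize the findings the way a JSON results file might store them.
report_json = json.dumps({"findings": findings}, indent=2)
```

With this toy input only `a.jpg` and `b.jpg` are reported as a duplicate pair. Comparing every pair is quadratic in the number of images, which is one reason to spread the work across processes on larger datasets.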
- For local execution, Python's multiprocessing module is used to parallelize encoding and deduplication.
- Distributed task execution with Celery allows for scaling across multiple machines.
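One way local parallelism of this kind could be arranged is to split the file list into one chunk per process and merge the partial results. The sketch below assumes that layout; `split_chunks`, `encode_chunk`, and `encode_parallel` are hypothetical names, and `encode_chunk` is a stand-in that does no real face encoding.

```python
from multiprocessing import Pool


def split_chunks(items, n):
    """Split `items` into at most `n` contiguous, near-equal chunks."""
    n = max(1, min(n, len(items)))
    size, extra = divmod(len(items), n)
    chunks, start = [], 0
    for i in range(n):
        # The first `extra` chunks take one additional item.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return [c for c in chunks if c]


def encode_chunk(paths):
    """Stand-in for per-image encoding; maps each path to a dummy value."""
    return {p: len(p) for p in paths}


def encode_parallel(paths, processes):
    """Encode `paths` using up to `processes` worker processes."""
    chunks = split_chunks(paths, processes)
    if processes <= 1 or len(chunks) <= 1:
        # Single-process path: no pool overhead.
        results = [encode_chunk(c) for c in chunks]
    else:
        with Pool(processes=len(chunks)) as pool:
            results = pool.map(encode_chunk, chunks)
    merged = {}
    for part in results:
        merged.update(part)
    return merged
```

On platforms that start workers with spawn (Windows, macOS), the call to `encode_parallel` with more than one process should sit under an `if __name__ == "__main__":` guard.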
- -p, --processes: Number of processes to use (default: CPU count).
- --queue: Use Celery for distributed task processing.
- --reset: Reset findings and encodings before processing.
- --report: Generate an HTML report after deduplication.
- --model-name: Specify the model name to use (e.g., VGG-Face, ArcFace, Facenet, ...).
- --detector-backend: Specify the detector backend to use (e.g., retinaface, mtcnn, ...).
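To make the option semantics concrete, here is how an equivalent command line could be declared with Python's argparse. This is only a sketch: the real dedupe command may be built with a different library, and the default for --detector-backend is not stated here, so it is left unset.

```python
import argparse
import os


def build_parser():
    """Hypothetical parser mirroring the options listed above."""
    parser = argparse.ArgumentParser(prog="dedupe")
    parser.add_argument("path", help="directory containing the images")
    parser.add_argument("-p", "--processes", type=int,
                        default=os.cpu_count(),
                        help="number of processes (default: CPU count)")
    parser.add_argument("--queue", action="store_true",
                        help="use Celery for distributed processing")
    parser.add_argument("--reset", action="store_true",
                        help="reset findings and encodings first")
    parser.add_argument("--report", action="store_true",
                        help="generate an HTML report afterwards")
    parser.add_argument("--model-name", default="VGG-Face",
                        help="DeepFace model to use")
    # Assumption: no default is documented, so None means "library default".
    parser.add_argument("--detector-backend", default=None,
                        help="face detector backend to use")
    return parser


# Equivalent to: dedupe data/IMAGES -p 4 --queue
args = build_parser().parse_args(["data/IMAGES", "-p", "4", "--queue"])
```

Boolean switches such as --queue, --reset, and --report take no value; their presence alone enables the behavior.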
- NO_FACE_DETECTED for Valid Images:
  - Ensure the correct model (e.g., VGG-Face, ArcFace; the default is VGG-Face) or the correct detector backend is specified.
  - Try setting enforce_detection=False in DeepFace.
- Celery Worker Not Starting:
  - Check that watchmedo is installed (run uv sync).
  - Verify Celery configurations.
- Performance Issues:
  - Use multiple processes for large datasets.
  - Choose lighter models (at some cost in accuracy).
- Add support for storing encodings in databases (e.g., postgres, redis).
- Integrate GPU acceleration for faster embedding generation.
- Extend reporting capabilities with more detailed analytics.