This repo uses Dreambooth to teach a diffusion model my pictures and generate images of me from text prompts. I fine-tune the stable-diffusion-xl model from Hugging Face (over 10 GB in size) on a single Turing T4 GPU (16 GB) on Google Colab using LoRA and Accelerate from Hugging Face. The repo also explores merging different LoRA adapters to combine styles.
- Can Low Rank Adapters work well for training Dreambooth?
- Dreambooth works well on pictures of objects; can it learn to represent human faces well?
- How many images do I need to teach the model about myself?
- What is Prior Preservation?
- How will a model recognize me from the text prompt?
- Can we merge different Adapters to learn different styles?
- What are some difficulties when it comes to training on human faces and how can we offset them?
- On what text prompts does the model do well and when does it mess up?
- Project Structure
- Dataset
- Prior-Preservation
- Training
- Results
- Merging Adapters
- Limitations
- References
Project Structure ↩
- The data directory contains 6 high-resolution images of me. This folder also includes prior.zip, which holds 197 images of human faces (excluding my own). These images are used to train the model with prior preservation.
- The Train_Raj.ipynb contains a notebook to train with and without Prior Preservation.
- The dream_booth.py script contains the model and the training loop. It is a simpler adaptation of this script from Hugging Face.
- The Dreambooth_Qualitative_Inference.ipynb contains a comprehensive and structured inference of the models trained with and without Prior Preservation. This contains all the images generated from text prompts post training.
- The Dreambooth_Quantitaive_Inference.ipynb notebook contains quantitative evaluation metrics.
- The eval.py file contains the official evaluation script from the original Dreambooth repo: Google Research - Dreambooth
Dataset ↩
The dataset contains 6 high-resolution images of me. For Dreambooth, it is important that these images cover different angles and clearly display the face. According to the experiments, 5-6 images are enough to train stable-diffusion-xl (SDXL) with LoRA. For prior preservation, we also use 197 images of other human faces to increase diversity and reduce language drift. These images are generated by the same diffusion model itself.
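A rough sketch of how such class-prior images can be generated with the base SDXL pipeline (the class prompt, step count, and output path here are illustrative; the repo ships the result as prior.zip):

```python
import os
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

os.makedirs("prior", exist_ok=True)
class_prompt = "a photo of a person"  # class prior only, no rare identifier token

# Generate the class images that anchor the model's prior during fine-tuning
for i in range(197):
    image = pipe(class_prompt, num_inference_steps=50).images[0]
    image.save(f"prior/{i:03d}.png")
```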
Prior-Preservation ↩
Fine-tuning layers that are conditioned on the text embeddings gives rise to the problem of language drift, where a model that is pre-trained on a large text corpus and later fine-tuned for a specific task progressively loses syntactic and semantic knowledge of the language. This phenomenon also affects diffusion models, where the model slowly forgets how to generate subjects of the same class as the target subject.
Another problem is the possibility of reduced output diversity. Text-to-image diffusion models naturally possess high output diversity. When fine-tuning on a small set of images, we would like to be able to generate the subject in novel viewpoints, poses and articulations. Yet, there is a risk of reducing the amount of variability in the output poses and views of the subject. To mitigate the two aforementioned issues, the paper proposes an autogenous class-specific prior preservation loss that encourages diversity and counters language drift. The method is to supervise the model with its own generated samples, so that it retains the prior once the few-shot fine-tuning begins. This allows it to generate diverse images of the class prior, as well as retain knowledge about the class prior that it can use in conjunction with knowledge about the subject instance.
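Concretely, the objective adds a weighted class-prior term to the usual denoising loss, roughly as in the Hugging Face DreamBooth training script. A minimal sketch, assuming each batch stacks an instance half (my images) and a class half (the generated prior images):

```python
import torch
import torch.nn.functional as F

prior_loss_weight = 1.0  # lambda in the paper; 1.0 is a common default


def dreambooth_loss(model_pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """model_pred and target stack the instance half and the class (prior)
    half along the batch dimension."""
    model_pred, model_pred_prior = torch.chunk(model_pred, 2, dim=0)
    target, target_prior = torch.chunk(target, 2, dim=0)

    # Standard denoising loss on the subject images
    instance_loss = F.mse_loss(model_pred.float(), target.float(), reduction="mean")
    # Prior-preservation loss on the model's own class samples
    prior_loss = F.mse_loss(model_pred_prior.float(), target_prior.float(), reduction="mean")

    return instance_loss + prior_loss_weight * prior_loss
```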
Training ↩
To accommodate such a large model on a 16GB Turing T4 GPU, I make use of gradient accumulation, gradient checkpointing, and 8-bit fused Adam (instead of regular Adam). Training on 6 images for 1000 steps is conducted with and without the prior preservation loss to verify that prior preservation actually helps.
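A minimal sketch of that memory-saving setup, assuming diffusers, accelerate, and bitsandbytes; `unet`, `lora_params`, `dataloader`, and `compute_loss` are placeholders for the objects built in dream_booth.py:

```python
import bitsandbytes as bnb
from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=4, mixed_precision="fp16")

unet.enable_gradient_checkpointing()                    # trade compute for activation memory
optimizer = bnb.optim.AdamW8bit(lora_params, lr=1e-4)   # 8-bit optimizer states

unet, optimizer, dataloader = accelerator.prepare(unet, optimizer, dataloader)

for step, batch in enumerate(dataloader):
    with accelerator.accumulate(unet):                  # gradients accumulate across micro-batches
        loss = compute_loss(unet, batch)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
```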
In order to teach the model a mapping between text and a subject, Dreambooth proposes using a rare token from the model's vocabulary and combining it with the subject's class prior. For instance, to train on my face, I use the prompt
A photo of rraj person
Here rraj is the rare vocabulary token and person is the class prior for the subject.
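For reference, the two prompts used during training could be configured along these lines (the exact class-prompt wording is illustrative, not taken from the repo):

```python
instance_prompt = "a photo of rraj person"  # used with the 6 subject images
class_prompt = "a photo of a person"        # used with the 197 prior-preservation images
```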
Results ↩
It turns out that LoRA + Dreambooth with 1000 steps works decently well on human faces too. Prior preservation clearly improves the model (as seen in the images below). For me, the PNDM scheduler works well with just 50 inference steps and DDIM with 80.
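A sketch of how the scheduler can be swapped on the fine-tuned pipeline at inference time (the LoRA path is a placeholder for the adapter saved by training):

```python
import torch
from diffusers import StableDiffusionXLPipeline, PNDMScheduler, DDIMScheduler

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("path/to/trained_lora")  # placeholder path

# PNDM with 50 inference steps
pipe.scheduler = PNDMScheduler.from_config(pipe.scheduler.config)
image = pipe("a photo of rraj person", num_inference_steps=50).images[0]

# DDIM with 80 inference steps
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
image = pipe("a photo of rraj person", num_inference_steps=80).images[0]
```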
prompt = "a painting of rraj person at Oktoberfest"
prompt = "a painting of rraj person in the style of Van Gogh"
prompt = "a painting of rraj person with blonde hair"
prompt = "a side view photo of rraj person"
prompt = "a photo of rraj person with sunglasses"
CLIP-I is the average pairwise cosine similarity between CLIP embeddings of generated and real images.
Scheduler | Steps | Prior Preservation | CLIP-I |
---|---|---|---|
DDIM | 50 | No | 0.9580 |
DDIM | 50 | Yes | 0.9760 |
DDIM | 80 | No | 0.9663 |
DDIM | 80 | Yes | 0.9683 |
PNDM | 50 | No | 0.9761 |
PNDM | 50 | Yes | 0.9702 |
PNDM | 80 | No | 0.9751 |
PNDM | 80 | Yes | 0.9688 |
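A sketch of how CLIP-I can be computed with the transformers CLIP model (the checkpoint name and the image lists are placeholders, not taken from the repo's evaluation code):

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


@torch.no_grad()
def clip_i(real_images, generated_images):
    """Average pairwise cosine similarity between CLIP image embeddings of
    generated and real images (lists of PIL images)."""
    def embed(images):
        pixel_values = processor(images=images, return_tensors="pt").pixel_values
        feats = model.get_image_features(pixel_values=pixel_values)
        return feats / feats.norm(dim=-1, keepdim=True)

    real, gen = embed(real_images), embed(generated_images)
    return (gen @ real.T).mean().item()
```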
Merging Adapters ↩
I experiment with generating images of myself in pixel-art style by merging two LoRA adapters: my Dreambooth adapter and a Pixel Art style adapter.
prompt = "pixel, a photo of rraj person wearing sunglasses"
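A sketch of how the two adapters might be combined using diffusers' PEFT integration; adapter names, paths, and weights here are illustrative:

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

pipe.load_lora_weights("path/to/rraj_lora", adapter_name="rraj")         # placeholder
pipe.load_lora_weights("path/to/pixel_art_lora", adapter_name="pixel")   # placeholder
pipe.set_adapters(["rraj", "pixel"], adapter_weights=[1.0, 1.0])

image = pipe(
    "pixel, a photo of rraj person wearing sunglasses",
    num_inference_steps=50,
).images[0]
```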
Limitations ↩
Generating faces is tough; sometimes eyes and teeth are not rendered properly or are mismatched. For instance, below, my eyes are rendered green although they are black in the training images.
The 16GB of GPU memory did not allow me to fine-tune the text encoder (SDXL has two text encoders). Fine-tuning the text encoders certainly improves image generation quality.
References ↩
[2] "DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation"