This is our final Master's thesis project for the Data Science MSc program at the Barcelona School of Economics (BSE). The goal is to create variations of a new character designed by an artist using computational image generation. We collaborated with Jordane Meignaud (Instagram: @surunnuagecreation), who invented a character and provided six unique images of it. All rights to these images belong to Jordane Meignaud, and any other use of them must be approved by her. Using these six images (or fewer) and various models, particularly diffusion models, we show that fine-tuning is the best method for addressing data scarcity and generating new poses for the character.
For any additional questions, feel free to reach out to the authors of this project:
- Maëlys Boudier (maelys.boudier@bse.eu)
- Natalia Beltrán (natalia.beltran@bse.eu)
- Arianna Michelangelo (arianna.michelangelo@bse.eu)
Set-Up Model and Dataset on Hugging Face: To facilitate saving model weights, we created a free account on Hugging Face and created a "New Model" with a name of our choice. Whenever we ran code, we could manually upload weights to this model space or sometimes save them directly to the Hugging Face repository instead of our local directory. We also set up a "New Dataset" with our training images; this is a useful alternative to keeping the images in a local directory, since we tested many environments to run our code (Google Colab, different local computers, Kaggle). Having the dataset on Hugging Face meant we could access it directly without having to update file paths.
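For example, once uploaded, the dataset can be loaded identically in any environment. This is a minimal sketch using the `datasets` library; the repository name "username/unicorngirl-images" is a hypothetical placeholder for your own private dataset, and the token is the Access Token described below.

```python
from datasets import load_dataset

# Load the private training images straight from the Hugging Face Hub.
# "username/unicorngirl-images" is a hypothetical repository name;
# the token grants access to private repositories.
dataset = load_dataset(
    "username/unicorngirl-images",
    split="train",
    token="hf_...",  # your Access Token
)
print(dataset)  # inspect the number of images and features
```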
Secret Key on Hugging Face: We chose to keep our models and datasets on Hugging Face private. As such, we accessed them using a secret key called an "Access Token", found in the account settings on Hugging Face. Make sure you use a token of type "WRITE" if you intend to save your models directly to the model repository on Hugging Face.
Note: You can also store your keys in Kaggle as hidden User Secrets and load them at runtime, so you avoid accidentally sharing your private keys.
```python
from kaggle_secrets import UserSecretsClient

# Retrieve the Hugging Face Access Token stored as a Kaggle User Secret
user_secrets = UserSecretsClient()
secret_value_0 = user_secrets.get_secret("HF_TOKEN")
```
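The token can then be used to authenticate the session so that models and datasets can be pushed or pulled without re-entering credentials, as in this short sketch using the `huggingface_hub` login helper:

```python
from huggingface_hub import login

# Authenticate this session with the WRITE token retrieved above
login(token=secret_value_0)
```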
GPU on Kaggle: After struggling with GPU availability, we decided to run our computationally demanding code on Kaggle, which gives each user 30 hours of GPU usage per week. We primarily used the P100 GPU, which made our LoRA and DreamBooth training runs over 10 times faster than on our laptops.
GAN Model
We explored the available literature and found limited research on comic character generation, particularly under data scarcity. While Generative Adversarial Networks (GANs) have been used for image generation (see Marnix Verduyn's paper below), they face significant challenges with fewer than roughly 100k training images: the discriminator tends to overfit while the generator underfits (as discussed in the NVIDIA blog post below).
- Comic Art Generation using GANs, ir. Marnix Verduyn, academic year 2021–2022. Read the paper
- NVIDIA Research Achieves AI Training Breakthrough Using Limited Datasets, Isha Salian, December 7, 2020. Read the blog
Our attempt to generate a baseline using a GAN with just six images demonstrated the inadequacy of this approach: the model failed to converge, highlighting the necessity of alternative methods or larger datasets.
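For illustration, this is a minimal sketch of the kind of adversarial training loop such a baseline involves, not our exact notebook code (see GAN_Model_Version1.ipynb and GAN_Model_Version2.ipynb). The architectures, 64x64 resolution, and hyperparameters are simplified assumptions.

```python
import torch
import torch.nn as nn

latent_dim = 100

# Generator: latent vector -> 64x64 RGB image
G = nn.Sequential(
    nn.ConvTranspose2d(latent_dim, 256, 4, 1, 0), nn.BatchNorm2d(256), nn.ReLU(),
    nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(),
    nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(),
    nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.BatchNorm2d(32), nn.ReLU(),
    nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Tanh(),
)

# Discriminator: 64x64 RGB image -> real/fake logit
D = nn.Sequential(
    nn.Conv2d(3, 64, 4, 2, 1), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.LeakyReLU(0.2),
    nn.Conv2d(128, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.LeakyReLU(0.2),
    nn.Conv2d(256, 1, 8, 1, 0),  # -> (B, 1, 1, 1)
)

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
bce = nn.BCEWithLogitsLoss()

real_images = torch.rand(6, 3, 64, 64) * 2 - 1  # stand-in for the 6 training images

for step in range(1000):
    # --- Train the discriminator on real vs. generated images ---
    z = torch.randn(6, latent_dim, 1, 1)
    fake_images = G(z).detach()
    d_loss = bce(D(real_images).view(-1), torch.ones(6)) + \
             bce(D(fake_images).view(-1), torch.zeros(6))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # --- Train the generator to fool the discriminator ---
    z = torch.randn(6, latent_dim, 1, 1)
    g_loss = bce(D(G(z)).view(-1), torch.ones(6))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

With only six real images, the discriminator quickly memorizes the training set and the generator never receives a useful learning signal, which matches the failure to converge we observed.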
Diffusion Model
Our next attempt at establishing a baseline involved training a full diffusion model from scratch, which operates through a noising and denoising process. While this approach also failed to achieve conclusive results, it occasionally generated a few blue pixels similar to the colors in our training images. This indicated some potential but remained inconclusive.
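The core training step of such a model is sketched below, assuming the `diffusers` building blocks (a UNet2DModel and a DDPMScheduler); this illustrates the noising/denoising objective rather than reproducing our notebooks (Diffusion_32x32.ipynb, Diffusion_256x256.ipynb).

```python
import torch
import torch.nn.functional as F
from diffusers import UNet2DModel, DDPMScheduler

# Small unconditional UNet for 32x32 images (simplified assumption)
model = UNet2DModel(sample_size=32, in_channels=3, out_channels=3)
scheduler = DDPMScheduler(num_train_timesteps=1000)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

clean_images = torch.rand(6, 3, 32, 32) * 2 - 1  # stand-in for the training images

# One training step: add noise at a random timestep, then predict that noise
noise = torch.randn_like(clean_images)
timesteps = torch.randint(0, scheduler.config.num_train_timesteps, (clean_images.shape[0],))
noisy_images = scheduler.add_noise(clean_images, noise, timesteps)

noise_pred = model(noisy_images, timesteps).sample
loss = F.mse_loss(noise_pred, noise)  # the model learns to undo the noising
loss.backward()
optimizer.step()
optimizer.zero_grad()
```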
Stable Diffusion Model
Our final attempt at establishing a baseline used an off-the-shelf, fully trained Stable Diffusion model with a few carefully crafted prompts designed to approximate our training images. This allowed us to gauge the visual quality we could hope to achieve, even though the generated images did not capture all the specific features of our training data.
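A minimal sketch of this prompt-only baseline with the `diffusers` SDXL pipeline; the prompt wording here is an illustrative assumption, not the exact one from Stable-Diffusion-XL-Prompt.ipynb.

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Hand-crafted prompt approximating the character's look (hypothetical wording)
prompt = "a cute cartoon girl with a unicorn horn, pastel blue colors, children's book illustration"
image = pipe(prompt, num_inference_steps=30).images[0]
image.save("baseline_sdxl.png")
```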
DreamBooth Model
We began by implementing the DreamBooth training technique, which lets us teach a new concept to a Stable Diffusion model through fine-tuning. This method adjusts the weights of the complete diffusion model while training it on a small set of images paired with a text embedding. Essentially, it converts prompts into text embeddings, adds noise to the images, and directs the model to denoise them conditioned on the concept. Through iterative refinement, the model learns the association until it can recognize and link the unique identifier "UnicornGirl" in a prompt with the associated image data.
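The heart of that procedure is sketched below as a single training step, assuming a Stable Diffusion v1.5 base model; paths, hyperparameters, and the random stand-in images are illustrative assumptions (see DreamBooth.ipynb for our exact settings).

```python
import torch
import torch.nn.functional as F
from diffusers import StableDiffusionPipeline, DDPMScheduler

# Load a pretrained Stable Diffusion model; all UNet weights are fine-tuned
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
noise_scheduler = DDPMScheduler.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="scheduler"
)
optimizer = torch.optim.AdamW(pipe.unet.parameters(), lr=5e-6)

# Embed the prompt containing the unique identifier "UnicornGirl"
text_ids = pipe.tokenizer(
    "a photo of UnicornGirl",
    padding="max_length",
    max_length=pipe.tokenizer.model_max_length,
    return_tensors="pt",
).input_ids
text_emb = pipe.text_encoder(text_ids)[0]

instance_images = torch.rand(6, 3, 512, 512) * 2 - 1  # stand-in for the 6 training images

# One DreamBooth step: encode to latents, noise them, predict the noise
latents = pipe.vae.encode(instance_images).latent_dist.sample() * pipe.vae.config.scaling_factor
noise = torch.randn_like(latents)
t = torch.randint(0, noise_scheduler.config.num_train_timesteps, (latents.shape[0],))
noisy_latents = noise_scheduler.add_noise(latents, noise, t)

noise_pred = pipe.unet(
    noisy_latents, t, encoder_hidden_states=text_emb.expand(latents.shape[0], -1, -1)
).sample
loss = F.mse_loss(noise_pred, noise)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```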
LoRA Model
Additionally, we implemented Low-Rank Adaptation (LoRA), a technique originally developed to address the challenge of fine-tuning large language models. Applied to Stable Diffusion, it adapts only certain parts of the neural network: LoRA is applied to the cross-attention layers that link our image data with the textual prompts. This allows the diffusion model to recognize new words as distinct concepts, enhancing its performance without altering its underlying structure and existing knowledge, and without the need to retrain all the weights each time.
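Concretely, this means injecting small trainable low-rank matrices into the attention projections of the UNet while freezing everything else. A minimal sketch using the `peft` integration in `diffusers`; the rank and target module names follow common defaults and are assumptions, not our exact notebook configuration (see LoRA.ipynb).

```python
from diffusers import StableDiffusionPipeline
from peft import LoraConfig

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

# Freeze the original weights; only the LoRA matrices will be trained
pipe.unet.requires_grad_(False)

# Attach rank-4 LoRA adapters to the attention projection layers
lora_config = LoraConfig(
    r=4,
    lora_alpha=4,
    init_lora_weights="gaussian",
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
)
pipe.unet.add_adapter(lora_config)

# Only the LoRA parameters are trainable -- a tiny fraction of the model
lora_params = [p for p in pipe.unet.parameters() if p.requires_grad]
print(sum(p.numel() for p in lora_params), "trainable parameters")
```

Training then proceeds exactly as in the DreamBooth step above, except the optimizer receives only `lora_params`.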
DreamBooth + LoRA Model
Lastly, we implemented DreamBooth with LoRA fine-tuning, which offers notable advantages by adding trainable low-rank layers on top of the DreamBooth setup without altering the original model weights. During fine-tuning, the DreamBooth objective is optimized to strengthen the association between the concept and the provided prompt and image data, while the updates flow through the LoRA weights, which selectively adjust the significance of various features so the model can focus more effectively on the nuances of the specific concept. Through this combined training process, the model progressively improves its ability to denoise images and associate the unique identifier with the represented concept.
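Once trained, the resulting adapter is lightweight and can be loaded on top of the base model at inference time. A minimal sketch, where the adapter repository name is a hypothetical placeholder (see DreamBooth-LoRA-Inference.ipynb):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Load the fine-tuned DreamBooth + LoRA adapter (hypothetical repo/path)
pipe.load_lora_weights("username/unicorngirl-dreambooth-lora")

# The unique identifier now triggers the learned concept
image = pipe("UnicornGirl riding a bicycle, cartoon style", num_inference_steps=30).images[0]
image.save("unicorngirl_new_pose.png")
```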
We also experimented with three key hyperparameters, whose effects on output quality are detailed in our report:
- Training Steps
- Learning Rate
- Inference Steps
For further information on the intricacies of any of the above techniques and findings, we invite you to read our comprehensive report, available in the `5. Documents` folder of the repository structure below.
├── 1. Data
├── 2. Descriptive Statistics
│ ├── Image_DataOverview.ipynb
│ ├── Image_ColorBreakdown.ipynb
│ └── Image_HSV.ipynb
├── 3. Baseline Models
│ ├── GAN Models
│ │ ├── GAN_Model_Version1.ipynb
│ │ └── GAN_Model_Version2.ipynb
│ ├── Diffusion Models
│ │ ├── Diffusion_32x32.ipynb
│ │ └── Diffusion_256x256.ipynb
│ └── Stable-Diffusion-XL-Prompt.ipynb
├── 4. Fine Tuning Models
│ ├── DreamBooth
│ │ ├── DreamBooth.ipynb
│ │ ├── DreamBooth_Inference.ipynb
│ │ └── DreamBooth_GoogleColab.ipynb
│ ├── LoRA
│ │ ├── LoRA-Inference.ipynb
│ │ └── LoRA.ipynb
│ └── DreamBooth-LoRA
│       ├── DreamBooth-LoRA.ipynb
│       └── DreamBooth-LoRA-Inference.ipynb
├── 5. Documents
│ ├── Comic Story Board
│ ├── Comic Strip
│ ├── Presentation
│ └── Report
├── 6. Generated Images
│ ├── DreamBooth
│ ├── DreamBooth-LoRA
│ └── LoRA
└── README.md