Unofficial implementation of the paper "Cones: Concept Neurons in Diffusion Models for Customized Generation".
Configure the provided Colab notebook with your DreamBooth dataset. Set an appropriate threshold (activation_thresh) for the number of steps passed to cones_call, then run cones_call to generate the concept-neuron masks.
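A minimal sketch of the mask-computation step, assuming cones_call is called roughly like this; argument names other than activation_thresh and rho are guesses, so check the notebook for the exact signature:

```python
import torch
from diffusers import StableDiffusionPipeline

# cones_call is defined in the notebook; dreambooth_dataset is a placeholder
# for your DreamBooth images and prompts.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # assumed base model; swap in your own checkpoint
    torch_dtype=torch.float16,
).to("cuda")

masks = cones_call(
    pipe,
    dataset=dreambooth_dataset,
    steps=50,                 # passes through the dataset (the paper uses ~1000)
    rho=2e-5,                 # learning rate, see the note on rho below
    activation_thresh=1e-3,   # pick this according to the number of steps
)
```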
After the concept masks have been computed, reload the image pipeline with the pretrained weights (generating the masks alters the weights), then pass the dictionaries of masks to the image pipeline in cones_inference to apply them and generate images with the neuron masks.
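A hedged sketch of the inference step under the same assumptions (the cones_inference argument names are guesses):

```python
# Reload fresh pretrained weights, since mask computation modifies the
# attention weights in place, then hand the mask dictionaries to cones_inference.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # same assumed base model as above
    torch_dtype=torch.float16,
).to("cuda")

images = cones_inference(
    pipe,
    masks=masks,                              # concept-neuron masks from cones_call
    prompt="a photo of sks dog in a bucket",  # example prompt; use your own concept token
)
```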
As far as I know, the paper does not provide any threshold values, so this is up for experimentation. I have found that anything below roughly 1e-4 destroys the attention layer and produces just random noise, like the example below:
The learning rate (rho) defaults to 2e-5, which is apparently a good value for single-subject learning.
I have not been able to reproduce the paper's results, though I have only computed masks with around 50 passes through the dataset (the researchers use 1000). A few images from the training dataset (20 prior images of the class "dog" and 5 images of a dog as the concept) are below:
Here are a few generated images:
Not much of the concept was learnt.
The implementation is slow; pointers on how to optimize it would be appreciated (this is my first paper implementation).
PRs are welcome!
Currently the method runs on Colab GPUs using around 8-9 GB of VRAM in fp16.
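If that peak does not fit your GPU, the standard diffusers memory options below (not part of this repo) can lower it at some speed cost:

```python
pipe.enable_attention_slicing()  # compute attention in slices to reduce peak VRAM
pipe.enable_vae_slicing()        # decode latents one image at a time
```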
1. Restore the default attention weights after each cones_inference call.
2. Get the attention weights directly instead of looping over all UNet modules to find the k/v layers, to reduce mask-computation time (see the sketch after this list).
3. Look into implementing algorithm A2 from the paper (faster).
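One possible approach for item 2 (an assumption about this codebase, not how it currently works) is to collect the cross-attention k/v projections of the diffusers UNet once by name and reuse that dictionary on every step:

```python
def get_kv_layers(unet):
    """Return {module_name: module} for every cross-attention to_k / to_v layer."""
    return {
        name: module
        for name, module in unet.named_modules()
        if name.endswith(("attn2.to_k", "attn2.to_v"))
    }

kv_layers = get_kv_layers(pipe.unet)  # build once, reuse for every mask-computation step
```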