---
layout: hub_detail
background-class: hub-background
body-class: hub
title: Automodulator
summary: Generative autoencoder for scale-specific fusion of multiple input face images.
category: researchers
image: automodulator1.png
author: Ari Heljakka
tags:
github-link:
github-id: AaltoVision/automodulator
featured_image_1: automodulator2.png
featured_image_2: no-image
accelerator: cuda
order: 10
---
```python
import torch
model = torch.hub.load('AaltoVision/automodulator:hub', 'ffhq512', pretrained=True, force_reload=True, source='github')
model.eval(useLN=False)
```
Loads the automodulator [1] model for 512x512 faces (trained on FFHQ [2]). Scale-specific mixing of multiple real input images is now a breeze; see below. For the basic workflow, you load in N images and encode them into an [N, 512] latent tensor with `model.encode(imgs)`. As a sanity check, you can reconstruct them back into images with `model.decode(zz)`, where `zz` can be a single-image latent or an instance of `model.zbuilder()`, which can mix the original latents in arbitrary ways.
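As a quick sketch of that round trip (assuming `imgs` is a batch of face images already preprocessed with `model.tf()`, as in the full example below):

```python
# Sketch only: `imgs` is assumed to be an already-preprocessed batch of face images (see below).
with torch.no_grad():
    z = model.encode(imgs)     # [N, 512] latent codes, one row per input image
    recon = model.decode(z)    # straight reconstruction of each input
    mix = model.decode(model.zbuilder().hi(z[0]).mid(z[1]).lo(z[1]))  # per-scale mix of two inputs
```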
```python
# Preliminaries
import sys
sys.argv = ['none'] # For Jupyter/Colab only
import torch
from torchvision.utils import make_grid
import matplotlib.pyplot as plt
import urllib.request
from PIL import Image

def show(img):
    # Lay the images out on a grid of roughly 2 rows and plot it
    nrow = max(2, (len(img)+1)//2)
    ncol = min(2, (len(img)+1)//2)
    img = make_grid(img, nrow=nrow, scale_each=True, normalize=True)
    plt.figure(figsize=(4*nrow, 4*ncol))
    plt.imshow(img.permute(1, 2, 0).cpu().numpy())
```
Load images and reconstruct (replace URLs with your own):
```python
simg = ['https://github.com/AaltoVision/automodulator/raw/hub/fig/source-0.png',
        'https://github.com/AaltoVision/automodulator/raw/hub/fig/source-1.png']
imgs = torch.stack([model.tf()(Image.open(urllib.request.urlopen(simg[0]))),
                    model.tf()(Image.open(urllib.request.urlopen(simg[1])))]).to('cuda')

with torch.no_grad():
    z = model.encode(imgs)
    omgs = model.decode(z).clamp(min=-1, max=1)
    # OR: omgs = model.reconstruct(imgs).clamp(min=-1, max=1)

for (i, o) in zip(imgs, omgs):
    show([i, o])
```
Start mixing. For instance, drive the coarse features (4x4 to 8x8) of the bottom-left image BY the top-right:
```python
mixed = model.decode(model.zbuilder().hi(z[1])
                                     .mid(z[0])
                                     .lo(z[0]))
# Equivalent to: model.zbuilder().use(z[1],[0,2]).use(z[0],[2,5]).use(z[0],[5,-1])
show([torch.ones_like(imgs[0]), imgs[1], imgs[0], mixed[0]])
```
You can use either the shorthand `model.zbuilder().hi(z[i])` etc. or the lower-level `model.zbuilder().use(z[i], [first_block, last_block])`, where `last_block = -1` denotes the rest of the blocks.
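For instance, the mix above can be spelled out with explicit block ranges (the same ranges given in the equivalence comment); a sketch, assuming the same `z` and `imgs` as before:

```python
# Same mix as above, written with explicit decoder-block ranges:
# hi ~ blocks [0,2], mid ~ blocks [2,5], lo ~ blocks [5,-1]
mixed = model.decode(model.zbuilder()
                     .use(z[1], [0, 2])    # coarse scales from image 1
                     .use(z[0], [2, 5])    # middle scales from image 0
                     .use(z[0], [5, -1]))  # fine scales from image 0 (-1 = rest of the blocks)
show([imgs[1], imgs[0], mixed[0]])
```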
You can also do random sampling:
```python
with torch.no_grad():
    random_imgs = model.generate(2).cpu()
show(random_imgs[0].unsqueeze(0))
```
Or, you can generate random samples conditioned on the scale-specific features of your input image:
```python
with torch.no_grad():
    mixed = model.decode(model.zbuilder(batch_size=6).mid(z[1]).lo(z[1]))
show(mixed)
```
For unsupervised attribute manipulation with features more specific than what you get with style mixing, you can use exemplars of your own (just take the average difference of the latent codes of opposing exemplars) or use the pre-calculated vectors (computed from 2x16 exemplars) as below. The larger the exemplar set, the better the attribute vector, but you can get by with as few as two opposing sets of 1 to 4 exemplars each.
```python
urllib.request.urlretrieve('https://github.com/AaltoVision/automodulator/raw/master/pioneer/attrib/smile_delta512-16', 'smile_delta512-16')
smile_delta = torch.load('smile_delta512-16')
# OR: my_attribute_delta = (model.encode(imgs_with_attr) - model.encode(imgs_without_attr)).mean(0)  # yields a 512-d latent (difference) vector

z_add_smile = z[0].unsqueeze(0) + 0.5*smile_delta
z_no_smile  = z[0].unsqueeze(0) - 1.5*smile_delta

with torch.no_grad():
    mod = model.decode(model.zbuilder().hi(z[0]).mid(z_no_smile).lo(z[0]))
show([imgs[0], mod[0]])
```
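If you compute your own attribute vector from exemplars instead, a sketch along the lines of the comment above (the exemplar batches `imgs_with_attr` / `imgs_without_attr` are hypothetical and would be preprocessed like `imgs`):

```python
# Sketch: imgs_with_attr / imgs_without_attr are hypothetical exemplar batches that
# do / do not show the attribute, preprocessed with model.tf() like `imgs` above.
with torch.no_grad():
    my_delta = (model.encode(imgs_with_attr) - model.encode(imgs_without_attr)).mean(0)  # 512-d attribute vector
    z_more = z[0].unsqueeze(0) + 0.5*my_delta   # push towards the attribute
    mod = model.decode(model.zbuilder().hi(z[0]).mid(z_more).lo(z[0]))
show([imgs[0], mod[0]])
```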
The model incorporates the encoder-decoder architecture of Deep Automodulators [1], trained on FFHQ [2] up to 512x512 resolution. It allows for instant style mixing of real input images, as well as generation of random samples whose properties at chosen scales are fixed to those of a specific input image.
Input images are expected to be centered and aligned as in FFHQ (see the FFHQ alignment script). `model.tf()` then provides the necessary pre-inference transformations.
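For a local photo that is already aligned this way (the file name below is a placeholder), a minimal sketch of the same encode/reconstruct flow:

```python
# Sketch: 'my_face.png' is a placeholder for your own FFHQ-aligned, centered image.
my_img = model.tf()(Image.open('my_face.png')).unsqueeze(0).to('cuda')
with torch.no_grad():
    my_z = model.encode(my_img)                          # [1, 512] latent code
    my_recon = model.decode(my_z).clamp(min=-1, max=1)   # reconstruction as a sanity check
show([my_img[0], my_recon[0]])
```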
The `ffhq512` model is recommended for all face-modification tasks. The other models in the paper have only been optimized for random sampling.
[1] Heljakka, A., Hou, Y., Kannala, J., and Solin, A. (2020). Deep Automodulators. In Advances in Neural Information Processing Systems (NeurIPS). [arXiv preprint].
[2] Karras, T., Laine, S., and Aila, T. (2019). A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4401–4410.