<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>ControlNet-XS</title>
<!-- <link href="style.css" rel="stylesheet"> -->
<link href="./ControlNet-XS_files/style.css" rel="stylesheet" >
</head>
<body>
<div class="content">
<h1><strong>ControlNet-XS: Rethinking the Control of Text-to-Image Diffusion Models as Feedback-Control Systems</strong></h1>
<h2 style="text-align: center;"><strong>ECCV 2024 (Oral)</strong></h2>
<p id="authors">Denis Zavadski, Johann-Friedrich Feiden, <a href="https://hci.iwr.uni-heidelberg.de/vislearn/people/carsten-rother/">Carsten Rother</a><br>
<br>
<span style="font-size: 24px">Computer Vision and Learning Lab, IWR, Heidelberg University
</span></p>
<br>
<img src="./ControlNet-XS_files/teaser_small.gif" class="teaser-gif" style="width:100%;"><br>
<h3 style="text-align:center"><em></em></h3>
<font size="+2">
<p style="text-align: center;">
<a href="https://arxiv.org/abs/2312.06573" target="_blank">[Paper]</a>
<a href="https://github.com/vislearn/ControlNet-XS" target="_blank">[GitHub Code]</a>
<a href="https://huggingface.co/CVL-Heidelberg/ControlNet-XS" target="_blank">[Pretrained Models]</a>
</p>
</font>
</div>
<div class="content">
<img src="./ControlNet-XS_files/banner_image.png" class="summary-img" style="width:100%;"><br>
</div>
<div class="content">
<h2 style="text-align:center;">Overview</h2>
<p>
With increasing computing capabilities, current model architectures appear to
follow the trend of simply upscaling all components without validating the necessity
for doing so. In this project, we investigate the size and architectural design of
ControlNet [Zhang et al., 2023] for controlling the image generation process of
Stable Diffusion-based models. We show that a new architecture with as little as
1% of the parameters of the base model achieves state-of-the-art results and performs considerably better than ControlNet in terms of FID score.
Hence, we call it ControlNet-XS. We provide the code for controlling
StableDiffusion-XL [Podell et al., 2023] (Model B, 48M parameters), StableDiffusion 1.5 [Rombach et al., 2022] (Model B,
14M parameters) and StableDiffusion 2.1 (Model B, 14M parameters), all under the OpenRAIL license. The different models are explained in the Method section below.
</p>
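<p>
As a hedged illustration only, the sketch below shows how a pretrained ControlNet-XS model could be used for guided generation, assuming the ControlNet-XS integration shipped with the diffusers library (ControlNetXSAdapter, StableDiffusionXLControlNetXSPipeline). The checkpoint path, prompt and conditioning scale are placeholders, not values from the paper.
</p>
<pre>
# Minimal inference sketch (assumption: diffusers' ControlNet-XS integration).
# The ControlNet-XS checkpoint path below is a placeholder -- substitute weights
# from the "Pretrained Models" link above.
import torch
from diffusers import StableDiffusionXLControlNetXSPipeline, ControlNetXSAdapter
from diffusers.utils import load_image

controlnet = ControlNetXSAdapter.from_pretrained(
    "path/to/controlnet-xs-sdxl-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionXLControlNetXSPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

control_image = load_image("canny_control.png")   # precomputed edge map
result = pipe(
    prompt="a photograph of a vintage car in the desert",
    image=control_image,
    controlnet_conditioning_scale=0.95,
    num_inference_steps=50,
).images[0]
result.save("result.png")
</pre>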
</div>
<div class="content">
<h2>StableDiffusion-XL Control</h2>
<p>
We evaluate differently sized control models and find that the control model does not even
have to be of the same order of magnitude as the base U-Net, which has 2.6B parameters.
The control is evident for ControlNet-XS sizes of 400M, 104M and 48M parameters,
as shown below for guidance with depth maps (MiDaS [Ranftl et al., 2020]) and Canny edges,
respectively. Each row shows three example results of Model B, each with a different seed. Note that we use the same seed for each column.
</p>
<br>
<img class="summary-img" src="./ControlNet-XS_files/sdxl_midas.jpg" style="width:100%;"> <br>
<img class="summary-img" src="./ControlNet-XS_files/sdxl_canny.jpg" style="width:100%;"> <br>
</div>
<div class="content">
<h2>StableDiffusion Control</h2>
<p>
We show generations of three versions of ControlNet-XS with 491M, 55M and 14M parameters, respectively.
We control StableDiffusion with depth maps (MiDaS) and Canny edges.
Even the smallest model, with only 1.6% of the parameters of the 865M-parameter base model, is able to
reliably guide the generation process.
As above, each row shows three example results of Model B, each with a different seed. Note that we use the same seed for each column.
</p>
<br>
<br>
<p></p>
<img class="summary-img" src="./ControlNet-XS_files/sd_midas.png" style="width:100%;"> <br>
<img class="summary-img" src="./ControlNet-XS_files/sd_canny.png" style="width:100%;"> <br>
</div>
<div class="content">
<h2>Method</h2>
<p>
The original ControlNet is a copy of the U-Net encoder of the StableDiffusion base model, and hence receives
the same input as the base model plus an additional guidance signal such as an edge map. The intermediate
outputs of the trained ControlNet are then added to the inputs of the decoder layers of the base model.
Throughout the training of ControlNet, the weights of the base model are kept frozen. We identify
several conceptual issues with this approach that lead to an unnecessarily large ControlNet and to a
significant reduction in the quality of the generated images:
</p>
<ul type="i">
<li>
The final output image of Stable Diffusion, which we call the base model, is generated iteratively over a series of time steps.
At each time step, a U-Net with an encoder and a decoder is executed, as illustrated below.
At each iteration, the input to both the base model and the control model is the image generated
at the previous time step; the control model additionally receives a control image. The problem is that
during the encoding phase the two models operate independently, and the feedback from the control model
enters only in the decoding phase of the base model. The result is a delayed correction/controlling mechanism,
and it implies that ControlNet has to do two jobs. Instead of focusing all of its network
capacity solely on correction/controlling, ControlNet additionally has to anticipate in advance
what "mistakes" the encoder of the base model is going to make.
</li>
<li>
By assuming that image generation and controlling require similar model capacities, it is natural to initialize
the weights of ControlNet with the weights of the base model and then fine-tune them. With our ControlNet-XS
we diverge in design from the base model, and hence train the weights of ControlNet-XS from scratch.
</li>
</ul>
<p>
We address the first problem (i) of delayed feedback by adding connections from the encoder of the base model into the controlling
encoder (A). In this way, the corrections can adapt more quickly to the generation process of the base model.
Nonetheless, this does not eliminate the delay entirely, since the encoder of the base model still remains unguided.
Hence, we add additional connections from ControlNet-XS into the encoder of the base model, directly influencing the entire
generative process (B). For completeness, we also evaluate whether there is any benefit in using a
mirrored decoding architecture in the ControlNet setup (C).
</p>
<br>
<img class="summary-img" src="./ControlNet-XS_files/method.png" style="width:100%;"> <br>
</div>
<div class="content">
<h2>Size and FID-Score Comparison</h2>
<p>
We evaluate the performance of the three variants (A, B, C) for Canny
edge guidance against the original ControlNet in terms of FID score on the COCO2017 validation set
[Lin et al., 2014]. All of our variants achieve a significant improvement while having just a fraction of the
parameters of the original ControlNet.
</p>
<img class="summary-img" src="./ControlNet-XS_files/fid_versions.png" style="width:60%;">
<p>
We focus our attention on variant B and train it at different model sizes for Canny and depth-map
guidance, respectively, for StableDiffusion 1.5, StableDiffusion 2.1 and the current StableDiffusion-XL version.
</p>
<img class="summary-img" src="./ControlNet-XS_files/fid_comparison.png" style="width:55%;">
</div>
<div class="content">
<h2>BibTeX</h2>
<code>
@misc{zavadski2024controlnetxs,<br>
title={ControlNet-XS: Rethinking the Control of Text-to-Image Diffusion Models as Feedback-Control Systems}, <br>
author={Denis Zavadski and Johann-Friedrich Feiden and Carsten Rother},<br>
year={2024},<br>
eprint={2312.06573},<br>
archivePrefix={arXiv},<br>
primaryClass={cs.CV},<br>
}
</code>
</div>
<div class="content">
<h2>References</h2>
<dl>
<dt>[Ranftl et al., 2020]</dt>
<dd>
René Ranftl, Katrin Lasinger, David Hafner, Konrad
Schindler, and Vladlen Koltun. Towards robust monocular
depth estimation: Mixing datasets for zero-shot cross-dataset
transfer. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 44(3):1623–1637, 2020.
</dd> <br>
<dt>[Rombach et al., 2022]</dt>
<dd>
Robin Rombach, Andreas Blattmann, Dominik Lorenz,
Patrick Esser, and Björn Ommer. High-resolution image
synthesis with latent diffusion models. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 10684–10695, 2022.
</dd><br>
<dt>[Podell et al., 2023]</dt>
<dd>
Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: improving latent diffusion models for high-resolution image synthesis.
CoRR, abs/2307.01952, 2023.
</dd><br>
<dt>[Zhang et al., 2023]</dt>
<dd>
Lvmin Zhang and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543, 2023.
</dd><br>
<dt>[Lin et al., 2014]</dt>
<dd>
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
</dd><br>
<dt>[Zhao et al., 2023]</dt>
<dd>
Shihao Zhao, Dongdong Chen, Yen-Chun Chen, Jianmin Bao, Shaozhe Hao, Lu Yuan, and Kwan-Yee K. Wong. Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models. arXiv preprint arXiv:2305.16322, 2023.
</dd><br>
</dl>
</div>
</body>
</html>