<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>ControlNet-XS</title>
<!-- <link href="style.css" rel="stylesheet"> -->
<link href="./ControlNet-XS_files/style.css" rel="stylesheet" >
</head>
<body>
<div class="content">
<h1><strong>ControlNet-XS: Rethinking the Control of Text-to-Image Diffusion Models as Feedback-Control Systems</strong></h1>
<h2 style="text-align: center;"><strong>ECCV 2024 (Oral)</strong></h2>
<p id="authors">Denis Zavadski, Johann-Friedrich Feiden, <a href="https://hci.iwr.uni-heidelberg.de/vislearn/people/carsten-rother/">Carsten Rother</a><br>
<br>
<span style="font-size: 24px">Computer Vision and Learning Lab, IWR, Heidelberg University
</span></p>
<br>
<img src="./ControlNet-XS_files/teaser_small.gif" class="teaser-gif" style="width:100%;"><br>
<h3 style="text-align:center"><em></em></h3>
<font size="+2">
<p style="text-align: center;">
<a href="https://arxiv.org/abs/2312.06573" target="_blank">[Paper]</a>
<a href="https://github.com/vislearn/ControlNet-XS" target="_blank">[GitHub Code]</a>
<a href="https://huggingface.co/CVL-Heidelberg/ControlNet-XS" target="_blank">[Pretrained Models]</a>
</p>
</font>
</div>
<div class="content">
<img src="./ControlNet-XS_files/banner_image.png" class="summary-img" style="width:100%;"><br>
</div>
<div class="content">
<h2 style="text-align:center;">Overview</h2>
<p>
With increasing computing capabilities, current model architectures appear to
follow the trend of simply upscaling all components without validating the necessity
for doing so. In this project, we investigate the size and architectural design of
ControlNet [Zhang et al., 2023] for controlling the image generation process of
Stable Diffusion-based models. We show that a new architecture with as little as
1% of the parameters of the base model achieves state-of-the-art results and performs considerably better than ControlNet in terms of FID score.
Hence, we call it ControlNet-XS. We provide the code for controlling
StableDiffusion-XL [Podell et al., 2023] (Model B, 48M parameters), StableDiffusion 1.5 [Rombach et al., 2022] (Model B,
14M parameters) and StableDiffusion 2.1 (Model B, 14M parameters), all under the OpenRAIL license. The different models are explained in the Method section below.
</p>
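<p>
As a hedged illustration only, the sketch below shows how a pretrained ControlNet-XS model could be used for guided generation, assuming the ControlNet-XS integration shipped with the diffusers library (ControlNetXSAdapter, StableDiffusionXLControlNetXSPipeline). The checkpoint path, prompt and conditioning scale are placeholders, not values from the paper.
</p>
<pre>
# Minimal inference sketch (assumption: diffusers' ControlNet-XS integration).
# The ControlNet-XS checkpoint path below is a placeholder -- substitute weights
# from the "Pretrained Models" link above.
import torch
from diffusers import StableDiffusionXLControlNetXSPipeline, ControlNetXSAdapter
from diffusers.utils import load_image

controlnet = ControlNetXSAdapter.from_pretrained(
    "path/to/controlnet-xs-sdxl-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionXLControlNetXSPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

control_image = load_image("canny_control.png")   # precomputed edge map
result = pipe(
    prompt="a photograph of a vintage car in the desert",
    image=control_image,
    controlnet_conditioning_scale=0.95,
    num_inference_steps=50,
).images[0]
result.save("result.png")
</pre>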
</div>
<div class="content">
<h2>StableDiffusion-XL Control</h2>
<p>
We evaluate differently sized control models and find that the control model does not even
have to be of the same order of magnitude as the base U-Net, which has 2.6B parameters.
The control is evident for ControlNet-XS sizes of 400M, 104M and 48M parameters,
as shown below for guidance with depth maps (MiDaS [Ranftl et al., 2020]) and Canny edges,
respectively. Each row shows three example results of Model B, each with a different seed. Note that we use the same seed for each column.
</p>
<br>
<img class="summary-img" src="./ControlNet-XS_files/sdxl_midas.jpg" style="width:100%;"> <br>
<img class="summary-img" src="./ControlNet-XS_files/sdxl_canny.jpg" style="width:100%;"> <br>
</div>
<div class="content">
<h2>StableDiffusion Control</h2>
<p>
We show generations of three versions of ControlNet-XS with 491M, 55M and 14M parameters, respectively.
We control StableDiffusion with depth maps (MiDaS) and Canny edges.
Even the smallest model, with only 1.6% of the parameters of the 865M-parameter base model, is able to
reliably guide the generation process.
As above, each row shows three example results of Model B, each with a different seed. Note that we use the same seed for each column.
</p>
<br>
<br>
<p></p>
<img class="summary-img" src="./ControlNet-XS_files/sd_midas.png" style="width:100%;"> <br>
<img class="summary-img" src="./ControlNet-XS_files/sd_canny.png" style="width:100%;"> <br>
</div>
<div class="content">
<h2>Method</h2>
<p>
The original ControlNet is a copy of the U-Net encoder of the StableDiffusion base model, and hence receives
the same input as the base model plus an additional guidance signal such as an edge map. The intermediate
outputs of the trained ControlNet are then added to the inputs of the decoder layers of the base model.
Throughout the training of ControlNet, the weights of the base model are kept frozen. We identify
several conceptual issues with this approach that lead to an unnecessarily large ControlNet and to a
significant reduction in the quality of the generated images:
</p>
<ul type="i">
<li>
The final output image of Stable Diffusion, which we call the base model, is generated iteratively over a series of time steps.
At each time step, a U-Net with an encoder and a decoder is executed, as illustrated below.
At each iteration, the input to both the base model and the control model is the image generated
at the previous time step; the control model additionally receives a control image. The problem is that
during the encoding phase the two models operate independently, and the feedback from the control model
enters only in the decoding phase of the base model. The result is a delayed correction/controlling mechanism,
and it implies that ControlNet has to do two jobs. Instead of focusing all of its network
capacity solely on correction/controlling, ControlNet additionally has to anticipate in advance
what "mistakes" the encoder of the base model is going to make.
</li>
<li>
By assuming that image generation and controlling require similar model capacities, it is natural to initialize
the weights of ControlNet with the weights of the base model and then fine-tune them. With our ControlNet-XS
we diverge in design from the base model, and hence train the weights of ControlNet-XS from scratch.
</li>
</ul>
<p>
We address the first problem (i) of delayed feedback by adding connections from the encoder of the base model into the controlling
encoder (A). In this way, the corrections can adapt more quickly to the generation process of the base model.
Nonetheless, this does not eliminate the delay entirely, since the encoder of the base model still remains unguided.
Hence, we add additional connections from ControlNet-XS into the encoder of the base model, directly influencing the entire
generative process (B). For completeness, we also evaluate whether there is any benefit in using a
mirrored decoding architecture in the ControlNet setup (C).
</p>
<br>
<img class="summary-img" src="./ControlNet-XS_files/method.png" style="width:100%;"> <br>
</div>
<div class="content">
<h2>Size and FID-Score Comparison</h2>
<p>
We evaluate the performance of the three variants (A, B, C) for Canny
edge guidance against the original ControlNet in terms of FID score on the COCO2017 validation set
[Lin et al., 2014]. All of our variants achieve a significant improvement while having just a fraction of the
parameters of the original ControlNet.
</p>
<img class="summary-img" src="./ControlNet-XS_files/fid_versions.png" style="width:60%;">
<p>
We focus our attention on variant B and train it at different model sizes for Canny and depth-map
guidance, respectively, for StableDiffusion 1.5, StableDiffusion 2.1 and the current StableDiffusion-XL version.
</p>
<img class="summary-img" src="./ControlNet-XS_files/fid_comparison.png" style="width:55%;">
</div>
<div class="content">
<h2>BibTeX</h2>
<code>
@misc{zavadski2024controlnetxs,<br>
title={ControlNet-XS: Rethinking the Control of Text-to-Image Diffusion Models as Feedback-Control Systems}, <br>
author={Denis Zavadski and Johann-Friedrich Feiden and Carsten Rother},<br>
year={2024},<br>
eprint={2312.06573},<br>
archivePrefix={arXiv},<br>
primaryClass={cs.CV},<br>
}
</code>
</div>
<div class="content">
<h2>References</h2>
<dl>
<dt>[Ranftl et al., 2020]</dt>
<dd>
René Ranftl, Katrin Lasinger, David Hafner, Konrad
Schindler, and Vladlen Koltun. Towards robust monocular
depth estimation: Mixing datasets for zero-shot cross-dataset
transfer. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 44(3):1623–1637, 2020.
</dd> <br>
<dt>[Rombach et al., 2022]</dt>
<dd>
Robin Rombach, Andreas Blattmann, Dominik Lorenz,
Patrick Esser, and Björn Ommer. High-resolution image
synthesis with latent diffusion models. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 10684–10695, 2022.
</dd><br>
<dt>[Podell et al., 2023]</dt>
<dd>
Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: improving latent diffusion models for high-resolution image synthesis.
CoRR, abs/2307.01952, 2023.
</dd><br>
<dt>[Zhang et al., 2023]</dt>
<dd>
Lvmin Zhang and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543, 2023.
</dd><br>
<dt>[Lin et al., 2014]</dt>
<dd>
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
</dd><br>
<dt>[Zhao et al., 2023]</dt>
<dd>
Shihao Zhao, Dongdong Chen, Yen-Chun Chen, Jianmin Bao, Shaozhe Hao, Lu Yuan, and Kwan-Yee K. Wong. Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models. arXiv preprint arXiv:2305.16322, 2023.
</dd><br>
</dl>
</div>
</body>
</html>