index.html

<!DOCTYPE html>
<html>
  <head>
    <title>Make-An-Audio 2</title>
    <link
      href="https://cdn.jsdelivr.net/npm/bootstrap@5.1.3/dist/css/bootstrap.min.css"
      rel="stylesheet"
    />
    <meta charset="utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.6.0/jquery.min.js"></script>
    <script src="helper.js" defer></script>
    <style>
      td {
        vertical-align: middle;
      }
      audio {
        width: 20vw;
        min-width: 100px;
        max-width: 250px;
      }
    </style>
  </head>
  <body>
    <div class="container pt-5 mt-5 shadow p-5 mb-5 bg-white rounded">
      <div class="text-center">
        <h1>Make-An-Audio 2</h1><h3>Temporal-Enhanced Text-to-Audio Generation</h3>
        <p class="lead fw-bold">
          |<a
            href="https://arxiv.org/abs/2305.18474"
            class="btn border-white bg-white fw-bold"
            >paper</a
          >|<a
          href="https://github.com/bytedance/Make-An-Audio-2"
          class="btn border-white bg-white fw-bold"
          >code</a
          >|
        </p>
        <p class="fst-italic mb-0">
          <span class="author-block">
            <a href="https://scholar.google.com/citations?user=wDgSBssAAAAJ&hl=en">Jiawei Huang</a><sup>1,2,*</sup>,</span>
            <a href="https://rayeren.github.io/">Yi Ren</a><sup>2,*</sup>,
          </span>
          <span class="author-block">
            <a href="https://scholar.google.com/citations?user=iRHBUsgAAAAJ&hl=zh-CN&oi=ao">Rongjie Huang</a><sup>1</sup>,
          </span>
          <span class="author-block">
            <a href="https://scholar.google.com/citations?user=WNiojyAAAAAJ&hl=zh-CN&oi=ao">Dongchao Yang</a><sup>3</sup>,
          </span>
          <span class="author-block">
            <a href="https://scholar.google.com/citations?hl=en&user=I1XtkC4AAAAJ">Zhenhui Ye</a><sup>1,2</sup>,
          </span>
          <span class="author-block">
            <a href="https://scholar.google.com/citations?hl=en&user=eBBFeVcAAAAJ">Chen Zhang</a><sup>2</sup>,
          </span>
          <span class="author-block">
            <a href="https://silentlin15.github.io/">Jinglin Liu</a><sup>2,*</sup>,</span>
          <span class="author-block">
          <span class="author-block">
            <a href="#">Xiang Yin</a><sup>2</sup>
          </span>
          <span class="author-block">
            <a href="#">Zejun Ma</a><sup>2</sup>
          </span>
          <span class="author-block">
            <a href="https://scholar.google.com/citations?hl=en&user=IIoFY90AAAAJ">Zhou Zhao</a><sup>1</sup>
          </span>
        </p>
        <p><b></b></p>
      </div>
      <div style="text-align: center">
        <span class="author-block"><sup>1</sup>Zhejiang University,</span>
        <span class="author-block"><sup>2</sup>ByteDance</span>
        <span class="author-block"><sup>3</sup>The Chinese University of HongKong</span>
      </div>
      <div style="text-align: center">
        <span class="author-block"><sup>*</sup>Equal Contribution</span>
      </div>
      <p>
        <b>Abstract.</b>  
        Large diffusion models have been successful in text-to-audio (T2A) synthesis tasks, but they often suffer from common issues such as semantic misalignment and poor temporal consistency due to limited natural language understanding and data scarcity. Additionally, 2D spatial structures widely used in T2A works lead to unsatisfactory audio quality when generating variable-length audio samples since they do not adequately prioritize temporal information. To address these challenges, we propose Make-an-Audio 2, a latent diffusion-based T2A method that builds on the success of Make-an-Audio. Our approach includes several techniques to improve semantic alignment and temporal consistency: Firstly, we use pre-trained large language models (LLMs) to parse the text into structured <event \& order> pairs for better temporal information capture. We also introduce another structured-text encoder to aid in learning semantic alignment during the diffusion denoising process. To improve the performance of variable length generation and enhance the temporal information extraction, we design a feed-forward Transformer-based diffusion denoiser. Finally, we use LLMs to augment and transform a large amount of audio-label data into audio-text datasets to alleviate the problem of scarcity of temporal data. Extensive experiments show that our method outperforms baseline models in both objective and subjective metrics, and achieves significant gains in temporal information understanding, semantic consistency, and sound quality.
      </p>
    </div>

<div class="container pt-5 mt-5 shadow p-5 mb-5 bg-white rounded">
    <h2 id="model-overview" style="text-align: center;">Make-An-Audio 2 Overview</h2>
    <body>
    <p style="text-align: center;">
        <img src="arch.png" height="200" width="800" class="img-fluid">
    </p>
    </body>
        <p>
            High-level overview of Make-An-Audio 2. Note that modules printed with a lock are
            frozen when training the T2A model. The Text Encoder takes original natural language text as input. And the Temporal Encoder takes the LLMs-parsed structured caption as its input. The structured inputs are parsed before the training process.
            The training process are seperated into two stages. The first stage we train the Audio VAE. The second stage we train the T2A diffusion module and freeze the parameters of both Audio VAE and Text Encoder.
        </p>
</div>

<div class="container pt-5 mt-5 shadow p-5 mb-5 bg-white rounded">
    <h2 id="model-overview" style="text-align: left;">Table of Contents</h2>
    <body>
    <p style="text-align: left;">
    <ul style="list-style: outside none none !important;">
       <li><a href="#efficiency" class="btn border-white bg-white fw-bold">Text-to-Audio generation</a></li>
       <li><a href="#diversity" class="btn border-white bg-white fw-bold">Variable-length audio generation</a></li>
       <li><a href="#prompting" class="btn border-white bg-white fw-bold">Precise Temporal Control</a></li>
       <li><a href="#dual_comparison" class="btn border-white bg-white fw-bold">Comparison between dual text encoders and only structured text encoder</a></li>
       <li><a href="#impact" class="btn border-white bg-white fw-bold">Broader Impact</a></li>
    </ul>
    </p>
    </body>
</div>


    <div class="container shadow p-5 mb-5 bg-white rounded">
      <h3>Text-to-Audio generation<a id="efficiency"/></h3>
      <p class="mb-0">
        We show the original natural language caption and the corresponding structured caption of Make-An-Audio 2. And we compare the audio generated by Make-An-Audio 2 to prior T2A works.
      </p>
      <div class="container pt-3 table-responsive">
        <table
          class="table table-hover"
          id="supervision-efficiency-table">
          <thead>
            <tr>
              <th style="text-align: center" > &nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbspInput&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp</th>
              <th style="text-align: center">Ground-truth</th>
              <th style="text-align: center">Make-An-Audio 2</th>
              <th style="text-align: center">Make-An-Audio</th>
              <th style="text-align: center">Audio-LDM</th>
              <th style="text-align: center">TANGO</th>
            </tr>
          </thead>
          <tbody>
             <tr height=200px> <td></td> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
             <tr height=200px> <td></td> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
             <tr height=200px> <td></td> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
             <tr height=200px> <td></td> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
             <tr height=200px> <td></td> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
             <tr height=200px> <td></td> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
             <tr height=200px> <td></td> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
             <tr height=200px> <td></td> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
             <tr height=200px> <td></td> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
             <tr height=200px> <td></td> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
          </tbody>
        </table>
      </div>


    </div>


    <div class="container shadow p-5 mb-5 bg-white rounded">
      <h3>Variable-length audio generation<a id="diversity"/></h3>
      <p class="mb-0">
      Trained with variable length data and with the design of 1D-convlution VAE and feed-forward Transformer-based diffusion backbone, Make-An-Audio 2 can generate audios of variable-length without performance dropping.
      </p>
      <div class="container pt-3 table-responsive">
        <table
          class="table table-hover"
          id="speech-diversity"
        >
          <thead>
            <tr>
              <th width="40%">&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbspInput&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp</th>
              <th width="15%">Make-An-Audio 2</th>
              <th width="15%">Make-An-Audio</th>
              <th width="15%">AudioLDM</th>
              <th width="15%">TANGO</th>
            </tr>
          </thead>
          <tbody>
             <tr> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
             <tr> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
             <tr> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
             <tr> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
             <tr> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
             <tr> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
             <tr> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
             <tr> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
             <tr> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
          </tbody>
        </table>
      </div>
    </div>

    <div class="container shadow p-5 mb-5 bg-white rounded">
      <h3>Precise Temporal Control<a id="prompting"/></h3>
      <p class="mb-0">
        Due to the ambiguity of natural language, the time period when some sound events occur may not be clearly described, and we can provide more precise temporal control by modifying the order in the structured input.
      </p>
      <div class="container pt-3 table-responsive">
        <table
          class="table table-hover"
          id="prompting-table"
        >
          <thead>
            <tr>
              <th width="20%">Origin Input</th>
              <th width="20%">Structured Input</th>
              <th width="20%">Generated Audio</th>
              <th width="20%">Structured Input</th>
              <th width="20%">Generated Audio</th>
            </tr>
          </thead>
          <tbody>
             <tr>
             </tr>
             <tr> <td><font size="-1">Wind blowing followed by people speaking then a loud burst of thunder</font></td> 
                <td><font size="-1">&ltwind blowing& all&gt@&ltpeople speaking& mid&gt@&ltthunder& end&gt</font></td> 
                <td><audio controls controlslist="nodownload" class="px-1"> <source src='data\precise\Wind-blowing-followed-by-people-speaking-then-a-loud-burst-of-thunder_wind blowing& all_@_people speaking& mid_@_thunder& end_.wav' type="audio/wav">Your browser does not support the audio element.</audio></td> 
                <td><font size="-1">&ltwind blowing& start&gt@&ltpeople speaking& mid&gt@&ltthunder& end&gt</font></td> 
                <td><audio controls controlslist="nodownload" class="px-1"> <source src='data\precise\Wind-blowing-followed-by-people-speaking-then-a-loud-burst-of-thunder_wind blowing& start_@_people speaking& mid_@_thunder& end_.wav' type="audio/wav">Your browser does not support the audio element.</audio></td> </tr>
             <tr> <td><font size="-1">A train running on railroad tracks followed by a train horn blowing and steam hissing</font></td> 
                <td><font size="-1">&lttrain running on railroad tracks& all&gt@&lttrain horn blowing& end&gt@&ltsteam hissing& end&gt</font></td> 
                <td><audio controls controlslist="nodownload" class="px-1"> <source src='data\precise\A-train-running-on-railroad-tracks-followed-by-a-train-horn-blowing-and-steam-hissing_train running on railroad tracks& all_@_train horn blowing& end_@_steam hissing& end_.wav' type="audio/wav">Your browser does not support the audio element.</audio></td> 
                <td><font size="-1">&lttrain running on railroad tracks& all&gt@&lttrain horn blowing& mid&gt@&ltsteam hissing& end&gt</font></td> 
                <td><audio controls controlslist="nodownload" class="px-1"> <source src='data\precise\A-train-running-on-railroad-tracks-followed-by-a-train-horn-blowing-and-steam-hissing_train running on railroad tracks& all_@_train horn blowing& mid_@_steam hissing& end_.wav' type="audio/wav">Your browser does not support the audio element.</audio></td> </tr>
             <tr> <td><font size="-1">Winds and ocean waves crashing while a chime instrument briefly plays a melody</font></td> 
                <td><font size="-1">&ltwinds& all&gt@&ltocean waves crashing& all&gt@&ltchime instrument melody& all&gt</font></td> 
                <td><audio controls controlslist="nodownload" class="px-1"> <source src='data\precise\Winds-and-ocean-waves-crashing-while-a-chime-instrument-briefly-plays-a-melody_winds& all_@_ocean waves crashing& all_@_chime instrument melody& all_.wav' type="audio/wav">Your browser does not support the audio element.</audio></td> 
                <td><font size="-1">&ltwinds& all&gt@&ltocean waves crashing& all&gt@&ltchime instrument melody& mid&gt</font></td> 
                <td><audio controls controlslist="nodownload" class="px-1"> <source src='data\precise\Winds-and-ocean-waves-crashing-while-a-chime-instrument-briefly-plays-a-melody_winds& all_@_ocean waves crashing& all_@_chime instrument melody& mid_.wav' type="audio/wav">Your browser does not support the audio element.</audio></td> </tr>
            <tr> <td><font size="-1">Constant faint humming and a few light knocks</font></td> 
              <td><font size="-1">&ltconstant faint humming & all&gt@&lta few light knocks & end&gt</font></td> 
              <td><audio controls controlslist="nodownload" class="px-1"> <source src='data\precise\Constant-faint-humming-and-a-few-light-knocks_constant faint humming& all_@_a few light knocks& end_.wav' type="audio/wav">Your browser does not support the audio element.</audio></td> 
              <td><font size="-1">&ltconstant faint humming & all&gt@&lta few light knocks & start&gt</font></td> 
              <td><audio controls controlslist="nodownload" class="px-1"> <source src='data\precise\Constant-faint-humming-and-a-few-light-knocks_constant faint humming& all_@_a few light knocks& start_.wav' type="audio/wav">Your browser does not support the audio element.</audio></td> </tr>
          </tbody>
        </table>
      </div>
    </div>

    <div class="container shadow p-5 mb-5 bg-white rounded">
      <h3>Comparison between dual text encoders and only structured text encoder<a id="dual_comparison"></h3>
      <p class="mb-0">
        When LLM parsing the original natural language input, some adjective or quantifier may be lost, and sometimes the structured inputs' format is incorrect. Dual text encoders can avoid information loss and are more robust in these situations. 
      </p>
      <div class="container pt-3 table-responsive">
        <table
          class="table table-hover"
          id="prompting-table">
          <thead>
            <tr>
              <th width="25%">Origin Input</th>
              <th width="25%">Wrongly Structured Input</th>
              <th width="25%">Dual text encoders</th>
              <th width="25%">Only structured text encoder</th>
            </tr>
          </thead>
          <tbody>
             <tr>
             </tr>
             <tr> <td><font size="-1">A strong torrent of rain is audible outside of a window</font></td> 
                <td><font size="-1">&ltstrong&gtSound of strong torrent of rain outside window & all&lt/strong&gt</font></td> 
                <td><audio controls controlslist="nodownload" class="px-1"> <source src='data/dual_comparison/dual/A-strong-torrent-of-rain-is-audible-outside-of-a-window.wav' type="audio/wav">Your browser does not support the audio element.</audio></td> 
                <td><audio controls controlslist="nodownload" class="px-1"> <source src='data/dual_comparison/structured/A-strong-torrent-of-rain-is-audible-outside-of-a-window.wav' type="audio/wav">Your browser does not support the audio element.</audio></td> </tr>
              <tr> <td><font size="-1">A motorcycle revving by quickly twice</font></td> 
                  <td><font size="-1">&ltmotorcycle revving & all&gt@&ltquickly twice & end&gt&lt/quickly&gt</font></td> 
                  <td><audio controls controlslist="nodownload" class="px-1"> <source src='data/dual_comparison/dual/A-motorcycle-revving-by-quickly-twice.wav' type="audio/wav">Your browser does not support the audio element.</audio></td> 
                  <td><audio controls controlslist="nodownload" class="px-1"> <source src='data/dual_comparison/structured/A-motorcycle-revving-by-quickly-twice.wav' type="audio/wav">Your browser does not support the audio element.</audio></td> </tr>
            <tr> <td><font size="-1">A car moves quickly and is followed by someone walking and other cars</font></td> 
                <td><font size="-1">&ltcar engine revving & start&gt@&ltcar tires screeching & mid&gt@&ltfootsteps running & mid&gt@&ltother car engines & mid to end&gt</font></td> 
                <td><audio controls controlslist="nodownload" class="px-1"> <source src='data/dual_comparison/dual/A-car-moves-quickly-and-is-followed-by-someone-walking-and-other-cars.wav' type="audio/wav">Your browser does not support the audio element.</audio></td> 
                <td><audio controls controlslist="nodownload" class="px-1"> <source src='data/dual_comparison/structured/A-car-moves-quickly-and-is-followed-by-someone-walking-and-other-cars.wav' type="audio/wav">Your browser does not support the audio element.</audio></td> </tr>
             <tr> <td><font size="-1">A metallic swirling and scraping that gets louder and more irregular</font></td> 
                <td><font size="-1">&ltmetallic swirling and scraping & all, getting louder and more irregular&gt@</font></td> 
                <td><audio controls controlslist="nodownload" class="px-1"> <source src='data/dual_comparison/dual/A-metallic-swirling-and-scraping-that-gets-louder-and-more-irregular.wav' type="audio/wav">Your browser does not support the audio element.</audio></td> 
                <td><audio controls controlslist="nodownload" class="px-1"> <source src='data/dual_comparison/structured/A-metallic-swirling-and-scraping-that-gets-louder-and-more-irregular.wav' type="audio/wav">Your browser does not support the audio element.</audio></td> </tr>
            <tr> <td><font size="-1">A gusting wind with waves crashing in the background from time to time</font></td> 
              <td><font size="-1">&ltgusting wind & all&gt@&ltwaves crashing & random intervals&gt</font></td> 
              <td><audio controls controlslist="nodownload" class="px-1"> <source src='data/dual_comparison/dual/A-gusting-wind-with-waves-crashing-in-the-background-from-time-to-time.wav' type="audio/wav">Your browser does not support the audio element.</audio></td> 
              <td><audio controls controlslist="nodownload" class="px-1"> <source src='data/dual_comparison/structured/A-gusting-wind-with-waves-crashing-in-the-background-from-time-to-time.wav' type="audio/wav">Your browser does not support the audio element.</audio></td> </tr>
          </tbody>
        </table>
      </div>
    </div>

    <div class="container shadow p-5 mb-5 bg-white rounded">
      <h3>Broader impact<a id="impact"/></h3>
      <p class="mb-0">
        We believe that our T2A work on temporal enhancement can serve as an important stepping stone for future work on generating semantically aligned and temporally consistent audio.
        And our approach of constructing complex audio and enhancing the data based on LLM can provide inspiration for future work.
        <br/>
        At the same time, we acknowledge that Make-An-Audio 2 may lead to unintended consequences such as increased unemployment for individuals in related fields such as sound engineering and radio hosting. Furthermore, there are potential concerns regarding the ethics of non-consensual voice cloning or the creation of fake media.
      </p>
      <br/>
    </div>


  </body>
</html>