<!DOCTYPE html>
<html>
<head>
<title>Make-An-Audio 2</title>
<link
href="https://cdn.jsdelivr.net/npm/bootstrap@5.1.3/dist/css/bootstrap.min.css"
rel="stylesheet"
/>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.6.0/jquery.min.js"></script>
<script src="helper.js" defer></script>
<style>
td {
vertical-align: middle;
}
audio {
width: 20vw;
min-width: 100px;
max-width: 250px;
}
</style>
</head>
<body>
<div class="container pt-5 mt-5 shadow p-5 mb-5 bg-white rounded">
<div class="text-center">
<h1>Make-An-Audio 2</h1><h3>Temporal-Enhanced Text-to-Audio Generation</h3>
<p class="lead fw-bold">
|<a
href="https://arxiv.org/abs/2305.18474"
class="btn border-white bg-white fw-bold"
>paper</a
>|<a
href="https://github.com/bytedance/Make-An-Audio-2"
class="btn border-white bg-white fw-bold"
>code</a
>|
</p>
<p class="fst-italic mb-0">
<span class="author-block">
<a href="https://scholar.google.com/citations?user=wDgSBssAAAAJ&hl=en">Jiawei Huang</a><sup>1,2,*</sup>,</span>
<span class="author-block">
<a href="https://rayeren.github.io/">Yi Ren</a><sup>2,*</sup>,
</span>
<span class="author-block">
<a href="https://scholar.google.com/citations?user=iRHBUsgAAAAJ&hl=zh-CN&oi=ao">Rongjie Huang</a><sup>1</sup>,
</span>
<span class="author-block">
<a href="https://scholar.google.com/citations?user=WNiojyAAAAAJ&hl=zh-CN&oi=ao">Dongchao Yang</a><sup>3</sup>,
</span>
<span class="author-block">
<a href="https://scholar.google.com/citations?hl=en&user=I1XtkC4AAAAJ">Zhenhui Ye</a><sup>1,2</sup>,
</span>
<span class="author-block">
<a href="https://scholar.google.com/citations?hl=en&user=eBBFeVcAAAAJ">Chen Zhang</a><sup>2</sup>,
</span>
<span class="author-block">
<a href="https://silentlin15.github.io/">Jinglin Liu</a><sup>2,*</sup>,</span>
<span class="author-block">
<a href="#">Xiang Yin</a><sup>2</sup>,
</span>
<span class="author-block">
<a href="#">Zejun Ma</a><sup>2</sup>,
</span>
<span class="author-block">
<a href="https://scholar.google.com/citations?hl=en&user=IIoFY90AAAAJ">Zhou Zhao</a><sup>1</sup>
</span>
</p>
<p><b></b></p>
</div>
<div style="text-align: center">
<span class="author-block"><sup>1</sup>Zhejiang University,</span>
<span class="author-block"><sup>2</sup>ByteDance,</span>
<span class="author-block"><sup>3</sup>The Chinese University of Hong Kong</span>
</div>
<div style="text-align: center">
<span class="author-block"><sup>*</sup>Equal Contribution</span>
</div>
<p>
<b>Abstract.</b>
Large diffusion models have been successful in text-to-audio (T2A) synthesis tasks, but they often suffer from common issues such as semantic misalignment and poor temporal consistency due to limited natural language understanding and data scarcity. Additionally, the 2D spatial structures widely used in T2A works lead to unsatisfactory audio quality when generating variable-length audio samples, since they do not adequately prioritize temporal information. To address these challenges, we propose Make-An-Audio 2, a latent diffusion-based T2A method that builds on the success of Make-An-Audio. Our approach includes several techniques to improve semantic alignment and temporal consistency. First, we use pre-trained large language models (LLMs) to parse the text into structured &lt;event &amp; order&gt; pairs for better temporal information capture. We also introduce another structured-text encoder to aid in learning semantic alignment during the diffusion denoising process. To improve the performance of variable-length generation and enhance temporal information extraction, we design a feed-forward Transformer-based diffusion denoiser. Finally, we use LLMs to augment and transform a large amount of audio-label data into audio-text datasets to alleviate the scarcity of temporal data. Extensive experiments show that our method outperforms baseline models in both objective and subjective metrics, and achieves significant gains in temporal information understanding, semantic consistency, and sound quality.
</p>
</div>
<div class="container pt-5 mt-5 shadow p-5 mb-5 bg-white rounded">
<h2 id="model-overview" style="text-align: center;">Make-An-Audio 2 Overview</h2>
<p style="text-align: center;">
<img src="arch.png" height="200" width="800" class="img-fluid" alt="Architecture of Make-An-Audio 2">
</p>
<p>
High-level overview of Make-An-Audio 2. Modules marked with a lock are
frozen when training the T2A model. The Text Encoder takes the original natural-language text as input, and the Temporal Encoder takes the LLM-parsed structured caption. The structured inputs are parsed before training begins.
Training is separated into two stages. In the first stage, we train the Audio VAE. In the second stage, we train the T2A diffusion module while keeping the parameters of both the Audio VAE and the Text Encoder frozen.
</p>
</div>
<div class="container pt-5 mt-5 shadow p-5 mb-5 bg-white rounded">
<h2 id="model-overview" style="text-align: left;">Table of Contents</h2>
<ul style="list-style: outside none none !important;">
<li><a href="#efficiency" class="btn border-white bg-white fw-bold">Text-to-Audio generation</a></li>
<li><a href="#diversity" class="btn border-white bg-white fw-bold">Variable-length audio generation</a></li>
<li><a href="#prompting" class="btn border-white bg-white fw-bold">Precise Temporal Control</a></li>
<li><a href="#dual_comparison" class="btn border-white bg-white fw-bold">Comparison between dual text encoders and only structured text encoder</a></li>
<li><a href="#impact" class="btn border-white bg-white fw-bold">Broader Impact</a></li>
</ul>
</div>
<div class="container shadow p-5 mb-5 bg-white rounded">
<h3>Text-to-Audio generation<a id="efficiency"></a></h3>
<p class="mb-0">
For each sample, we show the original natural-language caption and the corresponding structured caption used by Make-An-Audio 2, and compare the audio generated by Make-An-Audio 2 with that of prior T2A works.
</p>
<div class="container pt-3 table-responsive">
<table
class="table table-hover"
id="supervision-efficiency-table">
<thead>
<tr>
<th style="text-align: center" >                   Input                       </th>
<th style="text-align: center">Ground-truth</th>
<th style="text-align: center">Make-An-Audio 2</th>
<th style="text-align: center">Make-An-Audio</th>
<th style="text-align: center">AudioLDM</th>
<th style="text-align: center">TANGO</th>
</tr>
</thead>
<tbody>
<tr height=200px> <td></td> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
<tr height=200px> <td></td> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
<tr height=200px> <td></td> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
<tr height=200px> <td></td> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
<tr height=200px> <td></td> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
<tr height=200px> <td></td> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
<tr height=200px> <td></td> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
<tr height=200px> <td></td> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
<tr height=200px> <td></td> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
<tr height=200px> <td></td> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
</tbody>
</table>
</div>
</div>
<div class="container shadow p-5 mb-5 bg-white rounded">
<h3>Variable-length audio generation<a id="diversity"></a></h3>
<p class="mb-0">
Trained on variable-length data and built on a 1D-convolution VAE and a feed-forward Transformer-based diffusion backbone, Make-An-Audio 2 can generate audio of variable length without a drop in performance.
</p>
<div class="container pt-3 table-responsive">
<table
class="table table-hover"
id="speech-diversity"
>
<thead>
<tr>
<th width="40%">              Input                   </th>
<th width="15%">Make-An-Audio 2</th>
<th width="15%">Make-An-Audio</th>
<th width="15%">AudioLDM</th>
<th width="15%">TANGO</th>
</tr>
</thead>
<tbody>
<tr> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
<tr> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
<tr> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
<tr> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
<tr> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
<tr> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
<tr> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
<tr> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
<tr> <td></td> <td></td> <td></td> <td></td> <td></td> </tr>
</tbody>
</table>
</div>
</div>
<div class="container shadow p-5 mb-5 bg-white rounded">
<h3>Precise Temporal Control<a id="prompting"></a></h3>
<p class="mb-0">
Because natural language is ambiguous, the time period in which a sound event occurs may not be clearly described. Modifying the order in the structured input gives more precise temporal control.
</p>
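<!--
Developer note: the structured inputs in the table below follow the layout
"<event& order>@<event& order>". The snippet is a minimal sketch of how such a
string could be split into pairs and how one order token could be swapped,
assuming pairs are joined with "@" and the event and order are separated by "&".
parseStructuredCaption and setOrder are hypothetical names used only for
illustration; they are not part of helper.js or the Make-An-Audio 2 code.

```javascript
// Hypothetical helper: split a structured caption such as
// "<wind blowing& all>@<thunder& end>" into { event, order } pairs.
function parseStructuredCaption(caption) {
  return caption
    .split("@") // pairs are assumed to be joined with "@"
    .map((chunk) => chunk.trim())
    .filter((chunk) => chunk.length > 0)
    .map((chunk) => {
      // Each pair is expected to look like "<event& order>".
      const match = /^<([^<>&]+)&([^<>&]+)>$/.exec(chunk);
      if (match === null) {
        throw new Error("Malformed pair: " + chunk);
      }
      return { event: match[1].trim(), order: match[2].trim() };
    });
}

// Hypothetical helper: rebuild the caption with a new order for one event.
function setOrder(caption, eventName, newOrder) {
  return parseStructuredCaption(caption)
    .map((p) => (p.event === eventName ? { ...p, order: newOrder } : p))
    .map((p) => "<" + p.event + "& " + p.order + ">")
    .join("@");
}
```

Usage: setOrder(caption, "wind blowing", "start") changes only the timing of
that event and leaves the other pairs untouched, which mirrors the edits shown
in the second "Structured Input" column.
-->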
<div class="container pt-3 table-responsive">
<table
class="table table-hover"
id="prompting-table"
>
<thead>
<tr>
<th width="20%">Origin Input</th>
<th width="20%">Structured Input</th>
<th width="20%">Generated Audio</th>
<th width="20%">Structured Input</th>
<th width="20%">Generated Audio</th>
</tr>
</thead>
<tbody>
<tr>
</tr>
<tr> <td><font size="-1">Wind blowing followed by people speaking then a loud burst of thunder</font></td>
<td><font size="-1">&lt;wind blowing&amp; all&gt;@&lt;people speaking&amp; mid&gt;@&lt;thunder&amp; end&gt;</font></td>
<td><audio controls controlslist="nodownload" class="px-1"> <source src='data\precise\Wind-blowing-followed-by-people-speaking-then-a-loud-burst-of-thunder_wind blowing& all_@_people speaking& mid_@_thunder& end_.wav' type="audio/wav">Your browser does not support the audio element.</audio></td>
<td><font size="-1">&lt;wind blowing&amp; start&gt;@&lt;people speaking&amp; mid&gt;@&lt;thunder&amp; end&gt;</font></td>
<td><audio controls controlslist="nodownload" class="px-1"> <source src='data\precise\Wind-blowing-followed-by-people-speaking-then-a-loud-burst-of-thunder_wind blowing& start_@_people speaking& mid_@_thunder& end_.wav' type="audio/wav">Your browser does not support the audio element.</audio></td> </tr>
<tr> <td><font size="-1">A train running on railroad tracks followed by a train horn blowing and steam hissing</font></td>
<td><font size="-1">&lt;train running on railroad tracks&amp; all&gt;@&lt;train horn blowing&amp; end&gt;@&lt;steam hissing&amp; end&gt;</font></td>
<td><audio controls controlslist="nodownload" class="px-1"> <source src='data\precise\A-train-running-on-railroad-tracks-followed-by-a-train-horn-blowing-and-steam-hissing_train running on railroad tracks& all_@_train horn blowing& end_@_steam hissing& end_.wav' type="audio/wav">Your browser does not support the audio element.</audio></td>
<td><font size="-1">&lt;train running on railroad tracks&amp; all&gt;@&lt;train horn blowing&amp; mid&gt;@&lt;steam hissing&amp; end&gt;</font></td>
<td><audio controls controlslist="nodownload" class="px-1"> <source src='data\precise\A-train-running-on-railroad-tracks-followed-by-a-train-horn-blowing-and-steam-hissing_train running on railroad tracks& all_@_train horn blowing& mid_@_steam hissing& end_.wav' type="audio/wav">Your browser does not support the audio element.</audio></td> </tr>
<tr> <td><font size="-1">Winds and ocean waves crashing while a chime instrument briefly plays a melody</font></td>
<td><font size="-1">&lt;winds&amp; all&gt;@&lt;ocean waves crashing&amp; all&gt;@&lt;chime instrument melody&amp; all&gt;</font></td>
<td><audio controls controlslist="nodownload" class="px-1"> <source src='data\precise\Winds-and-ocean-waves-crashing-while-a-chime-instrument-briefly-plays-a-melody_winds& all_@_ocean waves crashing& all_@_chime instrument melody& all_.wav' type="audio/wav">Your browser does not support the audio element.</audio></td>
<td><font size="-1">&lt;winds&amp; all&gt;@&lt;ocean waves crashing&amp; all&gt;@&lt;chime instrument melody&amp; mid&gt;</font></td>
<td><audio controls controlslist="nodownload" class="px-1"> <source src='data\precise\Winds-and-ocean-waves-crashing-while-a-chime-instrument-briefly-plays-a-melody_winds& all_@_ocean waves crashing& all_@_chime instrument melody& mid_.wav' type="audio/wav">Your browser does not support the audio element.</audio></td> </tr>
<tr> <td><font size="-1">Constant faint humming and a few light knocks</font></td>
<td><font size="-1">&lt;constant faint humming &amp; all&gt;@&lt;a few light knocks &amp; end&gt;</font></td>
<td><audio controls controlslist="nodownload" class="px-1"> <source src='data\precise\Constant-faint-humming-and-a-few-light-knocks_constant faint humming& all_@_a few light knocks& end_.wav' type="audio/wav">Your browser does not support the audio element.</audio></td>
<td><font size="-1">&lt;constant faint humming &amp; all&gt;@&lt;a few light knocks &amp; start&gt;</font></td>
<td><audio controls controlslist="nodownload" class="px-1"> <source src='data\precise\Constant-faint-humming-and-a-few-light-knocks_constant faint humming& all_@_a few light knocks& start_.wav' type="audio/wav">Your browser does not support the audio element.</audio></td> </tr>
</tbody>
</table>
</div>
</div>
<div class="container shadow p-5 mb-5 bg-white rounded">
<h3>Comparison between dual text encoders and only structured text encoder<a id="dual_comparison"></a></h3>
<p class="mb-0">
When the LLM parses the original natural-language input, some adjectives or quantifiers may be lost, and the structured input is sometimes incorrectly formatted. Dual text encoders avoid this information loss and are more robust in these situations.
</p>
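<!--
Developer note: the "Wrongly Structured Input" column below contains both
format errors (for example an order of "mid to end", or a dangling "@") and
semantic losses (a dropped quantifier). The snippet is a minimal sketch of a
format check only, assuming the "<event& order>@<event& order>" layout and an
order vocabulary of start, mid, end and all inferred from the well-formed
examples on this page. KNOWN_ORDERS and findStructureProblems are hypothetical
names for illustration and are not part of the Make-An-Audio 2 code.

```javascript
// Assumption: order vocabulary inferred from the well-formed examples on
// this page; the real parser may accept other values.
const KNOWN_ORDERS = new Set(["start", "mid", "end", "all"]);

// Hypothetical check: returns a list of problems found in a structured
// caption. An empty list means the caption looks well formed.
function findStructureProblems(caption) {
  const problems = [];
  caption.split("@").forEach((chunk, index) => {
    const text = chunk.trim();
    const match = /^<([^<>&]+)&([^<>&]+)>$/.exec(text);
    if (match === null) {
      problems.push("pair " + index + " is not of the form <event& order>");
    } else if (!KNOWN_ORDERS.has(match[2].trim())) {
      problems.push("pair " + index + " has unknown order '" + match[2].trim() + "'");
    }
  });
  return problems;
}
```

A check like this can flag format errors, but it cannot detect a lost
adjective or quantifier, which is the case where keeping the original
natural-language caption alongside the structured one matters most.
-->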
<div class="container pt-3 table-responsive">
<table
class="table table-hover"
id="prompting-table">
<thead>
<tr>
<th width="25%">Origin Input</th>
<th width="25%">Wrongly Structured Input</th>
<th width="25%">Dual text encoders</th>
<th width="25%">Only structured text encoder</th>
</tr>
</thead>
<tbody>
<tr>
</tr>
<tr> <td><font size="-1">A strong torrent of rain is audible outside of a window</font></td>
<td><font size="-1"><strong>Sound of strong torrent of rain outside window & all</strong></font></td>
<td><audio controls controlslist="nodownload" class="px-1"> <source src='data/dual_comparison/dual/A-strong-torrent-of-rain-is-audible-outside-of-a-window.wav' type="audio/wav">Your browser does not support the audio element.</audio></td>
<td><audio controls controlslist="nodownload" class="px-1"> <source src='data/dual_comparison/structured/A-strong-torrent-of-rain-is-audible-outside-of-a-window.wav' type="audio/wav">Your browser does not support the audio element.</audio></td> </tr>
<tr> <td><font size="-1">A motorcycle revving by quickly twice</font></td>
<td><font size="-1">&lt;motorcycle revving &amp; all&gt;@&lt;quickly twice &amp; end&gt;</font></td>
<td><audio controls controlslist="nodownload" class="px-1"> <source src='data/dual_comparison/dual/A-motorcycle-revving-by-quickly-twice.wav' type="audio/wav">Your browser does not support the audio element.</audio></td>
<td><audio controls controlslist="nodownload" class="px-1"> <source src='data/dual_comparison/structured/A-motorcycle-revving-by-quickly-twice.wav' type="audio/wav">Your browser does not support the audio element.</audio></td> </tr>
<tr> <td><font size="-1">A car moves quickly and is followed by someone walking and other cars</font></td>
<td><font size="-1">&lt;car engine revving &amp; start&gt;@&lt;car tires screeching &amp; mid&gt;@&lt;footsteps running &amp; mid&gt;@&lt;other car engines &amp; mid to end&gt;</font></td>
<td><audio controls controlslist="nodownload" class="px-1"> <source src='data/dual_comparison/dual/A-car-moves-quickly-and-is-followed-by-someone-walking-and-other-cars.wav' type="audio/wav">Your browser does not support the audio element.</audio></td>
<td><audio controls controlslist="nodownload" class="px-1"> <source src='data/dual_comparison/structured/A-car-moves-quickly-and-is-followed-by-someone-walking-and-other-cars.wav' type="audio/wav">Your browser does not support the audio element.</audio></td> </tr>
<tr> <td><font size="-1">A metallic swirling and scraping that gets louder and more irregular</font></td>
<td><font size="-1">&lt;metallic swirling and scraping &amp; all, getting louder and more irregular&gt;@</font></td>
<td><audio controls controlslist="nodownload" class="px-1"> <source src='data/dual_comparison/dual/A-metallic-swirling-and-scraping-that-gets-louder-and-more-irregular.wav' type="audio/wav">Your browser does not support the audio element.</audio></td>
<td><audio controls controlslist="nodownload" class="px-1"> <source src='data/dual_comparison/structured/A-metallic-swirling-and-scraping-that-gets-louder-and-more-irregular.wav' type="audio/wav">Your browser does not support the audio element.</audio></td> </tr>
<tr> <td><font size="-1">A gusting wind with waves crashing in the background from time to time</font></td>
<td><font size="-1">&lt;gusting wind &amp; all&gt;@&lt;waves crashing &amp; random intervals&gt;</font></td>
<td><audio controls controlslist="nodownload" class="px-1"> <source src='data/dual_comparison/dual/A-gusting-wind-with-waves-crashing-in-the-background-from-time-to-time.wav' type="audio/wav">Your browser does not support the audio element.</audio></td>
<td><audio controls controlslist="nodownload" class="px-1"> <source src='data/dual_comparison/structured/A-gusting-wind-with-waves-crashing-in-the-background-from-time-to-time.wav' type="audio/wav">Your browser does not support the audio element.</audio></td> </tr>
</tbody>
</table>
</div>
</div>
<div class="container shadow p-5 mb-5 bg-white rounded">
<h3>Broader impact<a id="impact"></a></h3>
<p class="mb-0">
We believe that our T2A work on temporal enhancement can serve as an important stepping stone for future work on generating semantically aligned and temporally consistent audio.
Our LLM-based approach to constructing complex audio and augmenting data may also inspire future work.
<br/>
At the same time, we acknowledge that Make-An-Audio 2 may have unintended consequences, such as increased unemployment in related fields like sound engineering and radio hosting. There are also potential ethical concerns regarding non-consensual voice cloning and the creation of fake media.
</p>
<br/>
</div>
</body>
</html>