<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.3.2/jquery.min.js"></script>
<style type="text/css">
body {
font-family: "HelveticaNeue-Light", "Helvetica Neue Light", "Helvetica Neue", Helvetica, Arial, "Lucida Grande", sans-serif;
font-weight:300;
font-size:18px;
margin-left: auto;
margin-right: auto;
width: 1100px;
}
h1 {
font-size:32px;
font-weight:300;
}
.disclaimerbox {
background-color: #eee;
border: 1px solid #eeeeee;
border-radius: 10px ;
-moz-border-radius: 10px ;
-webkit-border-radius: 10px ;
padding: 20px;
}
video.header-vid {
height: 140px;
border: 1px solid black;
border-radius: 10px ;
-moz-border-radius: 10px ;
-webkit-border-radius: 10px ;
}
img.header-img {
height: 140px;
border: 1px solid black;
border-radius: 10px ;
-moz-border-radius: 10px ;
-webkit-border-radius: 10px ;
}
img.rounded {
border: 1px solid #eeeeee;
border-radius: 10px ;
-moz-border-radius: 10px ;
-webkit-border-radius: 10px ;
}
a:link,a:visited
{
color: #1367a7;
text-decoration: none;
}
a:hover {
color: #208799;
}
td.dl-link {
height: 160px;
text-align: center;
font-size: 22px;
}
.layered-paper-big { /* modified from: http://css-tricks.com/snippets/css/layered-paper/ */
box-shadow:
0px 0px 1px 1px rgba(0,0,0,0.35), /* The top layer shadow */
5px 5px 0 0px #fff, /* The second layer */
5px 5px 1px 1px rgba(0,0,0,0.35), /* The second layer shadow */
10px 10px 0 0px #fff, /* The third layer */
10px 10px 1px 1px rgba(0,0,0,0.35), /* The third layer shadow */
15px 15px 0 0px #fff, /* The fourth layer */
15px 15px 1px 1px rgba(0,0,0,0.35), /* The fourth layer shadow */
20px 20px 0 0px #fff, /* The fifth layer */
20px 20px 1px 1px rgba(0,0,0,0.35), /* The fifth layer shadow */
25px 25px 0 0px #fff, /* The sixth layer */
25px 25px 1px 1px rgba(0,0,0,0.35); /* The sixth layer shadow */
margin-left: 10px;
margin-right: 45px;
}
.paper-big { /* modified from: http://css-tricks.com/snippets/css/layered-paper/ */
box-shadow:
0px 0px 1px 1px rgba(0,0,0,0.35); /* The top layer shadow */
margin-left: 10px;
margin-right: 45px;
}
.layered-paper { /* modified from: http://css-tricks.com/snippets/css/layered-paper/ */
box-shadow:
0px 0px 1px 1px rgba(0,0,0,0.35), /* The top layer shadow */
5px 5px 0 0px #fff, /* The second layer */
5px 5px 1px 1px rgba(0,0,0,0.35), /* The second layer shadow */
10px 10px 0 0px #fff, /* The third layer */
10px 10px 1px 1px rgba(0,0,0,0.35); /* The third layer shadow */
margin-top: 5px;
margin-left: 10px;
margin-right: 30px;
margin-bottom: 5px;
}
.vert-cent {
position: relative;
top: 50%;
transform: translateY(-50%);
}
hr
{
border: 0;
height: 1px;
background-image: linear-gradient(to right, rgba(0, 0, 0, 0), rgba(0, 0, 0, 0.75), rgba(0, 0, 0, 0));
}
.slider-container {
max-width: 690px;
margin: auto;
display: flex; /* Use flexbox for layout */
align-items: center; /* Vertically center buttons */
}
.slide {
display: none;
}
.active {
display: block;
}
.prev, .next {
cursor: pointer;
padding: 16px;
color: white;
font-weight: bold;
font-size: 18px;
transition: 0.6s ease;
border-radius: 0 3px 3px 0;
user-select: none;
background-color: rgba(0,0,0,0.5);
}
.prev:hover, .next:hover {
background-color: rgba(0,0,0,0.8);
}
</style>
<html>
<head>
<title>MLLM Projections</title>
<meta property="og:image" content="./assets/teaser.png"/>
<meta property="og:title" content="Cross-Modal Projection in Multimodal LLMs" />
<meta property="og:description" content="Paper Title: Cross-Modal Projection in Multimodal LLMs Doesn't Really Project Visual Attributes to Textual Space (ACL 2024 Main); Authors: Gaurav Verma, Minje Choi, Kartik Sharma, Jamelle Watson-Daniels, Sejoon Oh, Srijan Kumar; Affiliations: Georgia Institute of Technology" />
</head>
<body>
<br>
<center>
<span style="font-size:36px">Cross-Modal Projection in Multimodal LLMs Doesn't Really Project Visual Attributes to Textual Space</span><br/>
<span style="font-size:25px">[<a href="https://arxiv.org/abs/2402.16832">Paper</a>] [<a href="https://github.com/claws-lab/projection-in-MLLMs">GitHub</a>]</span><br/><br/>
<span><img src="assets/acl-logo.png" width="250"/></span><br/><br/>
<span><a href="https://gaurav22verma.github.io/">Gaurav Verma</a><sup>1</sup>,
<a href="https://minjechoi.github.io/">Minje Choi</a><sup>1</sup>,
<a href="https://ksartik.github.io/">Kartik Sharma</a><sup>1</sup>,<br/>
<a href="https://www.jamellewd.com/">Jamelle Watson-Daniels</a><sup>2</sup>,
<a href="https://sejoonoh.github.io/">Sejoon Oh</a><sup>1</sup>,
and <a href="https://faculty.cc.gatech.edu/~srijan/">Srijan Kumar</a><sup>1</sup>
</span><br/><br/>
<span><sup>1</sup>Georgia Institute of Technology, <sup>2</sup>Harvard University</span><br/>
<a href="https://www.cc.gatech.edu/"><img src="./assets/gt-logo.png" width="200"></a> <a href="https://seas.harvard.edu/"><img src="./assets/hrvd-logo.png" width="140"></a><br/><br/>
</center>
<hr><br/><br/>
<center>
<table align=center width=650px>
<tr>
<td width=260px>
<center>
<img class="rounded" style="width:650px" src="./assets/overview.png"/><br/><br/>
</center>
<b>Overview of our study</b>: While the MLLM's domain-specific visual capability can be improved using fine-tuning strategies, the domain-specific richness of the image's post-projection representation does not improve. Results indicate that domain-specific visual attributes are predominantly modeled by the LLM parameters (whether frozen or not) and the projection does not necessarily play a role in mapping visual attributes to the LLM space. Through this study, we offer a potential reinterpretation of the role of cross-modal projections in MLLMs.
</td>
</tr>
</table><br/><br/>
<hr>
</center>
<table align=center width=850px>
<center><h1>Technical Abstract</h1></center>
<tr>
<td>
Multimodal large language models (MLLMs) like LLaVA and GPT-4(V) enable general-purpose conversations about images using the language modality. Since off-the-shelf MLLMs may have limited capabilities on images from domains like dermatology and agriculture, they must be fine-tuned to unlock domain-specific applications. The prevalent architecture of current open-source MLLMs comprises two major modules: an image-language (cross-modal) projection network and a large language model. It is desirable to understand the roles of these two modules in modeling domain-specific visual attributes to inform the design of future models and streamline the interpretability efforts on the current models. To this end, via experiments on 4 datasets and under 2 fine-tuning settings, we find that as the MLLM is fine-tuned, it indeed gains domain-specific visual capabilities, but the updates do <em>not</em> lead to the projection extracting relevant domain-specific visual attributes. Our results indicate that the domain-specific visual attributes are modeled by the LLM, even when only the projection is fine-tuned. Through this study, we offer a potential reinterpretation of the role of cross-modal projections in MLLM architectures.
</td>
</tr>
</table>
<br>
<hr>
<center><h1>Annotated Key Results</h1>Use the next and previous buttons to step through the annotated version of our results and insights.<br/><br/></center>
<table align=center width=950px>
<center>
<tr>
<td>
<center><div id="counter"></div></center><br/>
<div class="slider-container">
<button class="prev" onclick="plusSlides(-1)">❮</button>
<div class="slide active">
<img src="./assets/image1.png" alt="Image 1" width="600" height="340">
</div>
<div class="slide">
<img src="./assets/image2.png" alt="Image 2" width="600" height="340">
</div>
<div class="slide">
<img src="./assets/image3.png" alt="Image 3" width="600" height="340">
</div>
<div class="slide">
<img src="./assets/image4.png" alt="Image 4" width="600" height="340">
</div>
<button class="next" onclick="plusSlides(1)">❯</button>
</div>
</td>
</tr>
</center>
</table><br/>
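<!--
The prev/next buttons above call plusSlides(), which ./assets/script.js (loaded at the end of the page) is expected to define. A minimal sketch of such a handler, assuming the .slide/.active classes and the #counter div used above; all names and the counter text here are illustrative, not the actual contents of script.js:

```javascript
// Illustrative slider logic; the real ./assets/script.js may differ.
// Pure helper: next slide index with wraparound in both directions.
function nextSlideIndex(current, delta, total) {
  // ((x % n) + n) % n maps any integer into [0, total).
  return ((current + delta) % total + total) % total;
}

// Browser wiring, guarded so the helper stays testable outside a browser.
if (typeof document !== 'undefined') {
  let slideIndex = 0;
  const slides = document.querySelectorAll('.slide');
  const counter = document.getElementById('counter');

  // Called by the onclick handlers on the .prev / .next buttons.
  window.plusSlides = function (delta) {
    slides[slideIndex].classList.remove('active');
    slideIndex = nextSlideIndex(slideIndex, delta, slides.length);
    slides[slideIndex].classList.add('active');
    if (counter) {
      counter.textContent = 'Result ' + (slideIndex + 1) + ' of ' + slides.length;
    }
  };
}
```

Keeping the index math in a pure nextSlideIndex() helper makes the wraparound behavior easy to verify independently of the DOM.
-->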
<hr>
<table align=center width=1050px>
<center><h1>Paper and Bibtex</h1></center>
<tr>
<td><a href="./assets/projection-in-MLLMs.pdf"><img class="layered-paper-big" style="height:175px" src="./assets/screenshot.png"/></a></td>
<td>
<span style="font-size:12pt">Cross-Modal Projection in Multimodal LLMs Doesn't Really Project Visual Attributes to Textual Space<br>
Gaurav Verma, Minje Choi, Kartik Sharma, Jamelle Watson-Daniels, Sejoon Oh, Srijan Kumar<br>
62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024)<br/>
Webpage: <a href="https://claws-lab.github.io/projection-in-MLLMs">https://claws-lab.github.io/projection-in-MLLMs</a><br/>
Code: <a href="https://github.com/claws-lab/projection-in-MLLMs">https://github.com/claws-lab/projection-in-MLLMs</a><br/>
arXiv: <a href="https://arxiv.org/abs/2402.16832">https://arxiv.org/abs/2402.16832</a></span><br><br/><br/>
</td>
</tr>
</table>
<table align=center width=600px>
<tr>
<td><span style="font-size:11pt">
<span style="font-size: 14pt">Bibtex:</span><br/><br/>
<code>
@inproceedings{verma2024crossmodalprojection,<br/>
title={Cross-Modal Projection in Multimodal LLMs Doesn't Really Project Visual Attributes to Textual Space},<br/>
author={Verma, Gaurav and Choi, Minje and Sharma, Kartik and Watson-Daniels, Jamelle and Oh, Sejoon and Kumar, Srijan},<br/>
booktitle={Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL)},<br/>
year={2024}<br/>
}
</code>
</span></td>
</tr>
</table><br/>
<hr>
<br>
<table align=center width=900px>
<tr>
<td width=400px>
<center>
<span style="font-size: 10px;">
The template is built on top of the <a href="https://github.com/richzhang/webpage-template">one</a> built by <a href="http://web.mit.edu/phillipi/">Phillip Isola</a> and <a href="http://richzhang.github.io/">Richard Zhang</a>.
</span>
</center>
</td>
</tr>
</table>
<script src="./assets/script.js"></script>
<br>
</body>
</html>