<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Searchable paper list</title>
<link rel="stylesheet" href="styles.css">
</head>
<body>
<h1>Here's where I keep a list of papers I have read.</h1>
<p>
This list was curated by <a href="index.html">Lexington Whalen</a>, beginning from the first year of his PhD to the end. As he is me, I hope he keeps going!
</p>
<p>
I typically use this to organize papers I found interesting. Please feel free to do whatever you want with it. Note that this is not every single paper I have ever read, just a collection of ones that I remember to put down.
</p>
<p id="paperCount">
So far, we have read 206 papers. Let's keep it up!
</p>
<small id="searchCount">
Your search returned 206 papers. Nice!
</small>
<div class="search-inputs">
<input type="text" id="titleSearch" placeholder="Search title...">
<input type="text" id="authorSearch" placeholder="Search author...">
<input type="text" id="yearSearch" placeholder="Search year...">
<input type="text" id="topicSearch" placeholder="Search topic...">
<input type="text" id="venueSearch" placeholder="Search venue...">
<input type="text" id="descriptionSearch" placeholder="Search description...">
<button id="clearSearch">Clear Search</button>
</div>
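<!--
  Illustration only, not this page's actual script (the script wired to these search inputs is
  not shown in this excerpt and may differ): a minimal JavaScript sketch of how the inputs above
  could filter the table below. The element IDs (titleSearch, authorSearch, yearSearch,
  topicSearch, venueSearch, descriptionSearch, clearSearch, paperTable, searchCount) come from
  the markup on this page; everything else is assumed.

  document.addEventListener('DOMContentLoaded', () => {
    const fields = ['title', 'author', 'year', 'topic', 'venue', 'description'];
    const inputs = fields.map(f => document.getElementById(f + 'Search'));
    const rows = Array.from(document.querySelectorAll('#paperTable tbody tr'));
    const count = document.getElementById('searchCount');

    function applyFilters() {
      const terms = inputs.map(i => i.value.trim().toLowerCase());
      let shown = 0;
      for (const row of rows) {
        const cells = Array.from(row.cells).map(c => c.textContent.toLowerCase());
        // Keep a row only if every non-empty search term matches its corresponding column.
        const keep = terms.every((t, i) => !t || cells[i].includes(t));
        row.style.display = keep ? '' : 'none';
        if (keep) shown += 1;
      }
      count.textContent = 'Your search returned ' + shown + ' papers. Nice!';
    }

    inputs.forEach(i => i.addEventListener('input', applyFilters));
    document.getElementById('clearSearch').addEventListener('click', () => {
      inputs.forEach(i => { i.value = ''; });
      applyFilters();
    });
  });
-->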
<table id="paperTable">
<thead>
<tr>
<th data-sort="title">Title</th>
<th data-sort="author">Author</th>
<th data-sort="year">Year</th>
<th data-sort="topic">Topic</th>
<th data-sort="venue">Venue</th>
<th data-sort="description">Description</th>
<th>Link</th>
</tr>
</thead>
<tbody>
<tr>
<td>Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads</td>
<td>Tianle Cai et al</td>
<td>2024</td>
<td>speculative decoding, drafting, llm</td>
<td>ICML</td>
<td>This paper presents Medusa, which augments LLM inference by adding extra decoding heads to predict multiple subsequent tokens in parallel. They also introduce a form of tree-based attention to process candidates. Through the Medusa heads, they obtain probability predictions for the subsequent K+1 tokens. These predictions enable them to create length-K+1 continuations as the candidates. In order to process multiple candidates concurrently, they structure their attention such that only tokens from the same continuation are regarded as historical data. For instance, Figure 2 shows an example where the first Medusa head generates its top two predictions while the second Medusa head generates a top three for each of the top two from the first head. Instead of filling the entire attention mask, they only consider the mask from these 2*3 = 6 tokens, plus the standard identity line.</td>
<td><a href="https://arxiv.org/pdf/2401.10774" target="_blank">Link</a></td>
</tr>
<tr>
<td>Recurrent Drafter for Fast Speculative Decoding in Large Language Models</td>
<td>Yunfei Cheng et al</td>
<td>2024</td>
<td>speculative decoding, drafting, llm</td>
<td>Arxiv</td>
<td>This paper introduces ReDrafter (Recurrent Drafter), which uses an RNN as the draft model and conditions on the LLM's hidden states. They use a beam search to explore the candidate sequences and then apply a dynamic tree attention algorithm to remove duplicated prefixes among the candidates to improve the speedup. They also train via knowledge distillation from LLMs to improve the alignment of the draft model's predictions with those of the LLM.</td>
<td><a href="https://arxiv.org/pdf/2403.09919" target="_blank">Link</a></td>
</tr>
<tr>
<td>QuIP: 2-Bit Quantization of Large Language Models With Guarantees</td>
<td>Jerry Chee et al</td>
<td>2024</td>
<td>quantization, block-wise</td>
<td>Arxiv</td>
<td>QuIP (quantization with incoherence processing) is a method based on the insight that quantization benefits from incoherent weight and Hessian matrices, meaning it benefits from the weights being even in magnitude and from the directions in which they are rounded being unaligned with the coordinate axes.</td>
<td><a href="https://arxiv.org/pdf/2307.13304" target="_blank">Link</a></td>
</tr>
<tr>
<td>BlockDialect: Block-wise Fine-grained Mixed Format for Energy-Efficient LLM Inference</td>
<td>Wonsuk Jang et al</td>
<td>2024</td>
<td>quantization, block-wise</td>
<td>Arxiv</td>
<td>This paper introduces a block-wise quantization scheme that assigns a per-block optimal number format from a format book (they make their own format book called "DialectFP4"). "Focusing on how to represent over how to scale".</td>
<td><a href="https://arxiv.org/pdf/2501.01144v2" target="_blank">Link</a></td>
</tr>
<tr>
<td>SpinQuant: LLM Quantization with Learned Rotations</td>
<td>Zechun Liu et al</td>
<td>2024</td>
<td>quantization, spins, rotation</td>
<td>Arxiv</td>
<td>This paper uses two mergeable rotation matrices (R1, R2) that leave the full-precision network rotationally invariant, and then applies two online Hadamard rotations (R3, R4) to further reduce the outliers so they can quantize activations and the KV cache. They then show how one can optimize these rotation matrices on Stiefel manifolds (orthogonal manifolds) using Cayley SGD. The reason for Cayley SGD and Stiefel manifolds is that they need to optimize the rotation matrices (R1, R2) such that they stay orthogonal during optimization; regular SGD would break this constraint. By optimizing on Stiefel manifolds (the space of all orthonormal matrices), they ensure the optimization stays on a surface that only contains rotation matrices.</td>
<td><a href="https://arxiv.org/pdf/2405.16406" target="_blank">Link</a></td>
</tr>
<tr>
<td>SnapKV: LLM Knows What You are Looking for Before Generation</td>
<td>Yuhong Li et al</td>
<td>2024</td>
<td>llm, kv cache</td>
<td>Arxiv</td>
<td>This paper identifies and selects the most important features per head to create a compressed KV cache. It works in two stages: 1) vote for important prefix features: take the last segment of the prompt (the "observation window") and use it to analyze which parts of the earlier text (the prefix) are most important. For each attention head, the attention weights from queries in the observation window are aggregated, and the top-k positions are selected based on the aggregated weights (k = p*L_prefix, where p is the compression rate). 2) cluster and preserve context: a pooling layer is then used to cluster the selected important features. The last part of the prompt is kept as the observation window because the attention patterns observed in that window have high overlap rates (~80-90%) with the attention patterns actually used during generation.</td>
<td><a href="https://arxiv.org/pdf/2404.14469" target="_blank">Link</a></td>
</tr>
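<!--
  Not the paper's code: a toy JavaScript sketch of the SnapKV selection step described in the row
  above. For one head, attention weights from the observation-window queries to the prefix
  positions are aggregated, and the top-k positions (k = p * prefixLen) are kept. The function
  name, the attn matrix layout, and the omission of the pooling/clustering step are all
  assumptions made for illustration.

  // attn[q][j]: attention weight from observation-window query q to prefix position j.
  function selectTopPrefixPositions(attn, p) {
    const prefixLen = attn[0].length;
    const k = Math.max(1, Math.floor(p * prefixLen));
    // Aggregate the attention mass each prefix position receives from the window.
    const scores = new Array(prefixLen).fill(0);
    for (const row of attn) {
      row.forEach((w, j) => { scores[j] += w; });
    }
    // Rank positions by aggregated score and keep the indices of the top k.
    return scores
      .map((s, j) => [s, j])
      .sort((a, b) => b[0] - a[0])
      .slice(0, k)
      .map(pair => pair[1])
      .sort((a, b) => a - b);
  }
-->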
<tr>
<td>Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention</td>
<td>Angelos Katharopoulos et al</td>
<td>2020</td>
<td>attention, transformer</td>
<td>ICML</td>
<td>This paper rephrases transformers as RNNs (title). They express the self-attention mechanism as a linear dot-product of kernel feature maps to make the complexity go from O(N^2) to O(N). Personal note: this is the 200th paper recorded on here, and the last of 2024! Summer of 2024 was when I began studying machine learning. Let's keep it up!</td>
<td><a href="https://arxiv.org/pdf/2006.16236" target="_blank">Link</a></td>
</tr>
<tr>
<td>Prefix-Tuning: Optimizing Continuous Prompts for Generation</td>
<td>Xiang Lisa Li et al</td>
<td>2021</td>
<td>prefix-tuning, prompting, llm</td>
<td>Arxiv</td>
<td>This paper proposes prefix-tuning, which keeps language model params frozen but optimizes a continuous task-specific vector (prefix).</td>
<td><a href="https://arxiv.org/pdf/2101.00190" target="_blank">Link</a></td>
</tr>
<tr>
<td>The Power of Scale for Parameter-Efficient Prompt Tuning</td>
<td>Brian Lester et al</td>
<td>2021</td>
<td>prompting, llm</td>
<td>Arxiv</td>
<td>This paper explores adding soft prompts to condition frozen language models. Basically, soft prompts are learned through back-propagation and can be used to finetune language models without fully retraining. They also introduce the idea of "prompt ensembling" which is basically using multiple soft prompts on a model and ensembling their outputs.</td>
<td><a href="https://arxiv.org/pdf/2104.08691" target="_blank">Link</a></td>
</tr>
<tr>
<td>Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs</td>
<td>Nguyen Nhat Minh et al</td>
<td>2024</td>
<td>sampling, llm</td>
<td>Arxiv</td>
<td>This paper introduces a neat trick to sample the next token. Min-p sampling basically adjusts the sampling threshold based on the model's confidence. It does so by scaling according to the top token's probability. This is a compelling alternative to other common sampling methods, like nucleus sampling.</td>
<td><a href="https://arxiv.org/pdf/2407.01082" target="_blank">Link</a></td>
</tr>
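<!--
  Illustration, not the authors' implementation: a minimal JavaScript sketch of min-p sampling as
  summarized above. Tokens whose probability falls below pBase times the top token's probability
  are discarded, and the rest are renormalized and sampled. The names (minPSample, probs, pBase)
  are invented for this sketch.

  function minPSample(probs, pBase) {
    const pTop = Math.max(...probs);
    const threshold = pBase * pTop;                 // scales with the model's confidence
    const kept = probs.map(p => (p >= threshold ? p : 0));
    const total = kept.reduce((a, b) => a + b, 0);
    // Sample an index from the truncated, renormalized distribution.
    let r = Math.random() * total;
    for (let i = 0; i < kept.length; i++) {
      r -= kept[i];
      if (r <= 0) return i;
    }
    return kept.length - 1;
  }
-->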
<tr>
<td>LASER: Attention with Exponential Transformation</td>
<td>Sai Surya Duvvuri et al</td>
<td>2024</td>
<td>attention, gradients</td>
<td>Arxiv</td>
<td>This paper identifies that gradients backpropagated through the softmax operation often can be quite small. To mitigate this, they propose doing a dot-product attention with an exp()-transformed value matrix V (meaning, they do the attention calculation on exp(V)), which allows for a larger Jacobian (mitigating the small gradient issue).</td>
<td><a href="https://arxiv.org/pdf/2411.03493" target="_blank">Link</a></td>
</tr>
<tr>
<td>Hyper-Connections</td>
<td>Defa Zhu et al</td>
<td>2024</td>
<td>residual connections, hyper-connections</td>
<td>Arxiv</td>
<td>This paper introduces hyper-connections, which is a novel alternative to residual connections. Basically, they introduce learnable depth and width connections.</td>
<td><a href="https://arxiv.org/pdf/2409.19606" target="_blank">Link</a></td>
</tr>
<tr>
<td>Remix-DiT: Mixing Diffusion Transformers for Multi-Expert Denoising</td>
<td>Gongfan Fang et al</td>
<td>2024</td>
<td>dit, diffusion, moe</td>
<td>NeurIPS</td>
<td>This paper introduces a method of mixing diffusion models for multi-expert denoising. Basically, they increase the width of the linear layers by a factor of K, and then modify the forward pass to support it. This allows for K experts that are initialized from the original weights. </td>
<td><a href="https://arxiv.org/pdf/2412.05628" target="_blank">Link</a></td>
</tr>
<tr>
<td>Hymba: A Hybrid-head Architecture for Small Language Models</td>
<td>Xin Dong et al</td>
<td>2024</td>
<td>llm, hybrid, meta-tokens</td>
<td>Arxiv</td>
<td>This paper introduces a family of small language models that have a hybrid attention-SSM head parallel architecture. There are many interesting architectural designs to note here, but my favorite is the use of "meta tokens", learnable tokens that are prepended to prompts. These tokens help reduce the entropy of the attention and SSM heads, and can be seen as a good initialization for the KV cache and the SSM state.</td>
<td><a href="https://arxiv.org/pdf/2411.13676" target="_blank">Link</a></td>
</tr>
<tr>
<td>All are Worth Words: A ViT Backbone for Diffusion Models</td>
<td>Fan Bao et al</td>
<td>2023</td>
<td>diffusion, vit</td>
<td>Arxiv</td>
<td>This paper designs a general ViT-based architecture for diffusion models. Notably, it treats all inputs (time, condition, noisy image patches) as tokens and uses long skip connections between the shallow and deep layers.</td>
<td><a href="https://arxiv.org/pdf/2209.12152" target="_blank">Link</a></td>
</tr>
<tr>
<td>SliceGPT: Compress Large Language Models by Deleting Rows and Columns</td>
<td>Saleh Ashkboos et al</td>
<td>2024</td>
<td>pruning, llm</td>
<td>ICLR</td>
<td>The authors propose a method of slicing off entire rows or columns of weight matrices. They do this by applying a transformation that leaves the predictions invariant prior to the slice. The authors also introduce the notion of "computational invariance", AKA that one can apply orthogonal matrix transformations to each weight matrix in the transformer without changing the model, which they use to edit the blocks in a transformer to project the activation matrix between blocks onto its principal components, and then slice. They make the key insight that if you insert linear layers with the orthogonal matrix Q before RMSNorm and Q^{T} after, the network remains unchanged, i.e. RMSNorm(XQ)Q^{T} = RMSNorm(X). They also note that since LayerNorm can be converted to RMSNorm, LayerNorm is the same story. To find the Qs they use a calibration dataset from the training set and run it through the model. They then use the output of the network to find the orthogonal matrices of the next layers by computing the covariance matrix and then getting the eigenvalues (read the paper for more).</td>
<td><a href="https://arxiv.org/pdf/2401.15024" target="_blank">Link</a></td>
</tr>
<tr>
<td>Visual Autoregressive Modeling: Scaling Image Generation via Next-Scale Prediction</td>
<td>Keyu Tian et al</td>
<td>2024</td>
<td>autoregressive, image generation</td>
<td>NeurIPS</td>
<td>The paper proposes Visual AutoRegressive (VAR) modeling, which shifts the paradigm of autoregressive learning for image generation from sequential "next-token prediction" to "next-scale prediction." This approach treats entire token maps at progressively finer resolutions as the autoregressive units, reflecting the coarse-to-fine manner in which humans perceive images. Unlike traditional models that flatten 2D spatial structures into 1D sequences, VAR preserves spatial locality and leverages multi-scale visual representations to reduce computational inefficiencies. By adopting hierarchical generation aligned with natural image structures, VAR overcomes the limitations of standard autoregressive models, such as mathematical premise violations and loss of spatial coherence. Its design integrates autoregressive transformers with multi-scale tokenization, creating a framework that is theoretically scalable and generalizable across diverse visual generation tasks.</td>
<td><a href="https://openreview.net/pdf?id=gojL67CfS8" target="_blank">Link</a></td>
</tr>
<tr>
<td>Rho-1: Not All Tokens Are What You Need</td>
<td>Zhenghao Lin et al</td>
<td>2024</td>
<td>tokens, reference model</td>
<td>NeurIPS</td>
<td>This paper scores tokens using a reference model and then trains a language model to focus on the tokens with higher scores. They find that they can improve performance while training on fewer tokens.</td>
<td><a href="https://arxiv.org/pdf/2404.07965" target="_blank">Link</a></td>
</tr>
<tr>
<td>QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs</td>
<td>Saleh Ashkboos et al</td>
<td>2024</td>
<td>quantization, rotation</td>
<td>NeurIPS</td>
<td>This paper introduces a quantization scheme based on rotations that allows quantization down to 4 bits for weights, activations, and the KV cache. They rotate LLMs in such a way that removes outliers from the hidden states without changing the output. In particular, they use randomized Hadamard transformations on the weight matrices to remove outlier features and make activations easier to quantize. They then extend this by applying online Hadamard transformations to the attention module to remove outlier features in keys and values, which allows the KV cache to be quantized.</td>
<td><a href="https://arxiv.org/pdf/2404.00456" target="_blank">Link</a></td>
</tr>
<tr>
<td>Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch</td>
<td>Le Yu et al</td>
<td>2024</td>
<td>model merging</td>
<td>ICML</td>
<td>This paper shows that language models (LMs) can acquire new abilities by assimilating params from homologous models. They also note that LMs after Supervised Fine-Tuning (SFT) have many redundant delta parameters (i.e., the difference between the model params before and after SFT). They then present DARE (Drop And REscale) as a means of setting delta parameters to zero with a drop rate of p and then rescaling the remaining ones by a factor of 1/(1-p). They then use DARE to remove redundant delta parameters in each model prior to merging, which they find can help mitigate the interference of params among multiple models. Then they use standard model merging techniques to merge the models.</td>
<td><a href="https://arxiv.org/pdf/2311.03099" target="_blank">Link</a></td>
</tr>
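<!--
  Not the authors' code: a minimal sketch of the DARE operation summarized above. Each delta
  parameter (fine-tuned minus base weight) is dropped with probability p and the survivors are
  rescaled by 1/(1-p), which keeps the expected value of every delta unchanged. The function
  name and flat-array layout are assumptions made for illustration.

  // delta: flat array of delta parameters; p: drop rate in [0, 1).
  function dare(delta, p) {
    const scale = 1 / (1 - p);
    return delta.map(d => (Math.random() < p ? 0 : d * scale));
  }
-->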
<tr>
<td>Training-Free Pretrained Model Merging</td>
<td>Zhengqi Xu et al</td>
<td>2024</td>
<td>model merging</td>
<td>CVPR</td>
<td>This paper introduces Merging under Dual-Space Constraints (MuDSC), a novel framework for merging pretrained neural network models without additional training or requiring the same pretraining initialization. Unlike prior approaches that operate solely in either the weight space or the activation space, MuDSC addresses inconsistencies between these two spaces by combining their similarity measures into a unified objective using a weighted linear combination. This approach ensures that merged units are similar in both their structure and behavior, leading to more consistent and effective merging outcomes. The framework also adapts to networks with group structures, such as those using multi-head attention or group normalization, by proposing modifications to unit-matching algorithms. Overall, MuDSC simplifies model merging while enhancing performance across diverse architectures and tasks, enabling merged models to achieve balanced and overlapping multi-task performance.</td>
<td><a href="https://openaccess.thecvf.com/content/CVPR2024/papers/Xu_Training-Free_Pretrained_Model_Merging_CVPR_2024_paper.pdf" target="_blank">Link</a></td>
</tr>
<tr>
<td>Similarity of Neural Network Representations Revisited</td>
<td>Simon Kornblith et al</td>
<td>2019</td>
<td>network similarity</td>
<td>ICML</td>
<td>This paper examines methods for comparing neural network representations and proposes Centered Kernel Alignment (CKA) as a more effective similarity measure. The authors provide key theoretical insights about what properties a similarity metric should have - arguing it should be invariant to orthogonal transformations and isotropic scaling, but not to arbitrary invertible linear transformations, as neural network training itself isn't invariant to such transformations. They show that for representations with more dimensions than training examples, any metric invariant to arbitrary invertible transformations will give meaningless results. CKA works by first measuring the similarity between every pair of examples in each representation separately (creating representational similarity matrices), then comparing these similarity structures - when using inner products, this reduces to computing normalized Hilbert-Schmidt Independence Criterion between the representations. They demonstrate theoretically that CKA is closely related to canonical correlation analysis (CCA) and regression, but incorporates feature scale information that CCA discards. Finally, they show that unlike previous methods like CCA and SVCCA, CKA can reliably identify corresponding layers between networks trained from different initializations and reveal meaningful relationships between different architectures.</td>
<td><a href="https://arxiv.org/pdf/1905.00414" target="_blank">Link</a></td>
</tr>
<tr>
<td>What is being transferred in transfer learning?</td>
<td>Behnam Neyshabur et al</td>
<td>2020</td>
<td>transfer learning</td>
<td>NeurIPS</td>
<td>This paper investigated what exactly gets transferred during transfer learning in neural networks through a comprehensive series of analyses. Through experiments with block-shuffled images, the researchers demonstrated that successful transfer learning relies on both feature reuse and low-level statistics of the data, showing that even when visual features are disrupted, transfer learning still provides benefits. The study revealed that models fine-tuned from pre-trained weights tend to stay in the same basin in the loss landscape, make similar mistakes, and remain close to each other in parameter space, while models trained from scratch end up in different basins with more diverse behaviors. By analyzing module criticality, they found that lower layers handle more general features while higher layers become more specialized, confirming previous theories about feature hierarchy in neural networks. Finally, they showed that transfer learning can begin from earlier checkpoints of the pre-trained model without losing accuracy, suggesting that the benefits of pre-training emerge before the model fully converges on the source task.</td>
<td><a href="https://arxiv.org/pdf/2008.11687" target="_blank">Link</a></td>
</tr>
<tr>
<td>ZipIt! Merging Models from Different Tasks without Training</td>
<td>George Stoica et al</td>
<td>2024</td>
<td>model merging</td>
<td>ICLR</td>
<td>This paper presents a novel approach to model merging that significantly improves upon previous methods by recognizing that similar features can exist within the same model, not just across different models. The key insight is that when merging models trained on different tasks, it's often better to combine similar features within each model first, rather than forcing dissimilar features from different models to merge, as these features may have developed to solve fundamentally different problems. Their method first concatenates the feature spaces of both models and computes a comprehensive correlation matrix between all features (both within and across models), using these correlations to guide intelligent feature merging decisions. To handle the multi-layer nature of neural networks, they introduce an "unmerge" operation that allows the merged features to remain compatible with later layers in both original networks, essentially decompressing the merged features before they're processed by subsequent layers. Theoretically, they prove that this approach provides better guarantees than traditional cross-model merging, showing that when models have internal redundancy (which is common in practice), their method can achieve perfect merging with zero performance loss.</td>
<td><a href="https://arxiv.org/pdf/2305.03053" target="_blank">Link</a></td>
</tr>
<tr>
<td>TheoremLlama: Transforming General-Purpose LLMs into Lean4 Experts</td>
<td>Ruida Wang et al</td>
<td>2024</td>
<td>llm, llm agent</td>
<td>Arxiv</td>
<td>This research introduces TheoremLlama, a framework that transforms general-purpose Large Language Models (LLMs) into expert theorem provers for the Lean4 formal mathematics language, addressing a significant challenge in automated theorem proving. The key innovation is their "NL-FL bootstrapping" method, which integrates natural language reasoning steps directly into formal mathematical proofs as comments during training, helping LLMs bridge the gap between natural language understanding and formal mathematical reasoning. The researchers also contribute the Open Bootstrapped Theorems (OBT) dataset, containing over 100,000 theorem-proof pairs with aligned natural and formal language, helping address the scarcity of training data in this field. The framework introduces specialized training techniques like block training and curriculum learning that help LLMs gradually build theorem-proving capabilities, potentially offering a blueprint for adapting LLMs to other specialized domains that lack extensive training data.</td>
<td><a href="https://arxiv.org/pdf/2407.03203" target="_blank">Link</a></td>
</tr>
<tr>
<td>A Simple Early Exiting Framework for Accelerating Sampling in Diffusion Models</td>
<td>Taehong Moon et al</td>
<td>2024</td>
<td>diffusion, early exit</td>
<td>ICML</td>
<td>This paper presents Adaptive Score Estimation (ASE), a novel framework that accelerates diffusion model sampling by adaptively allocating computational resources based on the time step being processed. The authors observe that score estimation near the noise distribution (t→1) requires less computational power than estimation near the data distribution (t→0), leading them to develop a time-dependent early-exiting scheme where more neural network blocks are skipped during the noise-phase sampling steps. Their approach differs between architectures - for DiT models they skip entire blocks, while for U-ViT models they preserve the linear layers connected to skip connections while dropping other block components to maintain the residual pathway information. The authors fine-tune their models using a specially designed training procedure that employs exponential moving averages and weighted coefficients to ensure minimal information updates near t→0 while allowing more updates near t→1.</td>
<td><a href="https://arxiv.org/pdf/2408.05927" target="_blank">Link</a></td>
</tr>
<tr>
<td>Active Prompting with Chain-of-Thought for Large Language Models</td>
<td>Shizhe Diao et al</td>
<td>2023</td>
<td>prompting, cot</td>
<td>Arxiv</td>
<td>The paper introduces Active-Prompt, a novel method that improves chain-of-thought (CoT) prompting by strategically selecting which examples to use as demonstrations for large language models. Rather than using randomly selected or manually crafted examples, Active-Prompt identifies the most informative examples by measuring the model's uncertainty on different potential prompts through metrics like disagreement and entropy across multiple model outputs. The key insight is that by systematically choosing examples where the model shows high uncertainty, and then having humans provide detailed reasoning chains for those specific cases, the resulting prompts will be more effective at teaching the model how to approach challenging problems. This approach shifts the human effort from trying to intuitively guess good examples to a more principled selection process guided by the model's own uncertainty signals.</td>
<td><a href="https://arxiv.org/pdf/2302.12246" target="_blank">Link</a></td>
</tr>
<tr>
<td>RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment</td>
<td>Hanze Dong et al</td>
<td>2023</td>
<td>rlhf, alignment, finetuning</td>
<td>TMLR</td>
<td>The paper introduces RAFT (Reward rAnked FineTuning), a simpler alternative to RLHF for aligning generative models with human preferences. The key insight is decoupling the data generation and model fine-tuning steps - instead of using complex reinforcement learning, RAFT generates multiple samples for each prompt, ranks them by reward, and then fine-tunes the model on only the highest-scoring samples in an iterative process. This approach is more stable and efficient than RLHF because it uses standard supervised learning techniques rather than RL, while being less sensitive to reward scaling issues since it only uses relative rankings rather than absolute reward values. Additionally, the decoupled nature of RAFT means it requires less memory (only needs to load one model at a time vs multiple for RLHF) and allows more flexibility in data collection and processing.</td>
<td><a href="https://arxiv.org/pdf/2304.06767" target="_blank">Link</a></td>
</tr>
<tr>
<td>Finding needles in a haystack: A Black-Box Approach to Invisible Watermark Detection</td>
<td>Minzhou Pan et al</td>
<td>2024</td>
<td>watermark, offset learning</td>
<td>Arxiv</td>
<td>The key insight of this paper centers on using "offset learning" to detect invisible watermarks in images. The intuition is that by having a clean reference dataset of similar images, you can effectively "cancel out" the normal image features that are common between clean and watermarked images, leaving only the watermark perturbations. They design an asymmetric loss function where clean images use exponential/softmax loss (to focus on hard examples) while detection dataset uses linear loss (to give equal weight to all examples), helping isolate the watermark signal. This is combined with an iterative pruning strategy that gradually removes likely-clean images from the detection set, allowing the model to better focus on and learn the watermark patterns. By formulating watermark detection this way, they avoid needing any prior knowledge of watermarking techniques or labeled data, making it a truly black-box approach.</td>
<td><a href="https://arxiv.org/pdf/2403.15955" target="_blank">Link</a></td>
</tr>
<tr>
<td>Mitigating the Alignment Tax of RLHF</td>
<td>Yong Lin et al</td>
<td>2024</td>
<td>rlhf, alignment</td>
<td>Arxiv</td>
<td>This paper investigates the "alignment tax" problem where large language models lose some of their pre-trained abilities when aligned with human preferences through RLHF. The key insight is that model averaging (interpolating between pre-RLHF and post-RLHF model weights) is surprisingly effective at mitigating this trade-off because tasks share overlapping feature spaces, particularly in lower layers of the model. Building on this understanding, they propose Heterogeneous Model Averaging (HMA) which applies different averaging ratios to different layers of the transformer model, allowing optimization of the alignment-forgetting trade-off. The intuition is that since different layers capture different levels of features and task similarities, they should not be averaged equally, and finding optimal layer-specific averaging ratios can better preserve both alignment and pre-trained capabilities.</td>
<td><a href="https://arxiv.org/pdf/2309.06256" target="_blank">Link</a></td>
</tr>
<tr>
<td>AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising</td>
<td>Zigeng Chen et al</td>
<td>2024</td>
<td>diffusion, parallelization, denoising</td>
<td>Arxiv</td>
<td>This paper introduces AsyncDiff, a novel approach to accelerate diffusion models through parallel processing across multiple devices. The key insight is that hidden states between consecutive diffusion steps are highly similar, which allows them to break the traditional sequential dependency chain of the denoising process by transforming it into an asynchronous one. They execute this by dividing the denoising model into multiple components distributed across different devices, where each component uses the output from the previous component's prior step as an approximation of its input, enabling parallel computation. To further enhance efficiency, they introduce stride denoising, which completes multiple denoising steps simultaneously through a single parallel computation batch and reduces the frequency of communication between devices. This solution is particularly elegant because it's universal and plug-and-play, requiring no model retraining or architectural changes to achieve significant speedups while maintaining generation quality.</td>
<td><a href="https://arxiv.org/pdf/2406.06911" target="_blank">Link</a></td>
</tr>
<tr>
<td>DoRA: Weight-Decomposed Low-Rank Adaptation</td>
<td>Shih-Yang Liu et al</td>
<td>2024</td>
<td>peft, lora</td>
<td>Arxiv</td>
<td>This paper introduces DoRA (Weight-Decomposed Low-Rank Adaptation), a novel parameter-efficient fine-tuning method that decomposes pre-trained weights into magnitude and direction components for separate optimization. Through a detailed weight decomposition analysis, the authors reveal that LoRA and full fine-tuning exhibit distinct learning patterns, with LoRA showing proportional changes in magnitude and direction while full fine-tuning demonstrates more nuanced, independent adjustments between these components. Based on this insight, DoRA uses LoRA specifically for directional updates while allowing independent magnitude optimization, which simplifies the learning task compared to having LoRA learn both components simultaneously. The authors also provide theoretical analysis showing how this decomposition benefits optimization by aligning the gradient's covariance matrix more closely with the identity matrix and demonstrate mathematically why DoRA's learning pattern more closely resembles full fine-tuning.</td>
<td><a href="https://arxiv.org/pdf/2402.09353" target="_blank">Link</a></td>
</tr>
<tr>
<td>SphereFed: Hyperspherical Federated Learning</td>
<td>Xin Dong et al</td>
<td>2022</td>
<td>federated learning</td>
<td>Arxiv</td>
<td>This paper presents a novel approach to addressing the non-i.i.d. (non-independent and identically distributed) data challenge in federated learning by introducing hyperspherical federated learning (SphereFed). The key insight is that instead of letting clients independently learn their classifiers, which leads to inconsistent learning targets across clients, they should share a fixed classifier whose weights span a unit hypersphere, ensuring all clients work toward the same learning objectives. The approach normalizes features to project them onto this same hypersphere and uses mean squared error loss instead of cross-entropy to avoid scaling issues that arise when working with normalized features. Finally, after federated training is complete, they propose a computationally efficient way to calibrate the classifier using a closed-form solution that can be computed in a distributed manner without requiring direct access to private client data.</td>
<td><a href="https://arxiv.org/pdf/2207.09413" target="_blank">Link</a></td>
</tr>
<tr>
<td>A deeper look at depth pruning of LLMs</td>
<td>Shoaib Ahmed Siddiqui et al</td>
<td>2024</td>
<td>pruning, depth pruning, llm</td>
<td>ICML</td>
<td>This paper explored different approaches to pruning large language models, revealing that while static metrics like cosine similarity work well for maintaining MMLU performance, adaptive metrics like Shapley values show interesting trade-offs between different tasks. A key insight was that self-attention layers are significantly more amenable to pruning compared to feed-forward layers, suggesting that models can maintain performance even with substantial attention layer reduction. The paper also demonstrated that simple performance recovery techniques, like applying an average update in place of removed layers, can be as effective or better than more complex approaches like low-rank adapters. Finally, the work highlighted how pruning affects different tasks unequally - while some metrics preserve performance on certain tasks like MMLU, they may significantly degrade performance on others like mathematical reasoning tasks.</td>
<td><a href="https://www.arxiv.org/pdf/2407.16286" target="_blank">Link</a></td>
</tr>
<tr>
<td>Editing Models with Task Arithmetic</td>
<td>Gabriel Ilharco et al</td>
<td>2023</td>
<td>task arithmetic, finetuning, task</td>
<td>ICLR</td>
<td>This paper introduces a novel method for model editing called task arithmetic, where "task vectors" represent specific tasks by capturing the difference between pre-trained and fine-tuned model weights. Task vectors can be manipulated mathematically, such as being negated to unlearn tasks or added together to enable multi-tasking or improve performance in novel settings. A standout finding is the ability to create new task capabilities through analogies (e.g., "A is to B as C is to D"), which allows performance improvement on tasks with little or no data. This method is computationally efficient, leveraging linear operations on model weights without incurring extra inference costs, providing a flexible and modular framework for modifying models post-training. The approach highlights significant advantages in adapting existing models while bypassing costly re-training or data access constraints.</td>
<td><a href="https://arxiv.org/pdf/2212.04089" target="_blank">Link</a></td>
</tr>
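<!--
  Toy JavaScript sketch (names invented) of the task arithmetic summarized above: a task vector
  is the element-wise difference between fine-tuned and pre-trained weights; negating a task
  vector forgets a task, and adding several (scaled by a coefficient) combines tasks. Weights are
  treated as flat arrays purely for illustration.

  const sub = (a, b) => a.map((x, i) => x - b[i]);
  const add = (a, b) => a.map((x, i) => x + b[i]);
  const scale = (a, s) => a.map(x => x * s);

  // tau = theta_finetuned - theta_pretrained
  function taskVector(pretrained, finetuned) {
    return sub(finetuned, pretrained);
  }

  // theta_edited = theta_pretrained + lambda * (tau_1 + tau_2 + ...)
  function editModel(pretrained, taskVectors, lambda) {
    const combined = taskVectors.reduce(add);
    return add(pretrained, scale(combined, lambda));
  }
-->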
<tr>
<td>SaySelf: Teaching LLMs to Express Confidence with Self-Reflective Rationales</td>
<td>Tianyang Xu et al</td>
<td>2024</td>
<td>confidence estimation, llm</td>
<td>Arxiv</td>
<td>The SaySelf framework trains large language models (LLMs) to produce fine-grained confidence estimates and self-reflective rationales by focusing on internal uncertainties. It consists of two stages: supervised fine-tuning and reinforcement learning (RL). In the first stage, multiple reasoning chains are sampled from the LLM, clustered for semantic similarity, and analyzed by an advanced LLM to generate rationales summarizing uncertainties. The model is fine-tuned on a dataset that pairs questions with reasoning chains, rationales, and confidence estimates, using a loss function that optimizes the generation of all three outputs. In the second stage, RL refines the confidence predictions using a reward function that encourages accurate, high-confidence outputs while penalizing overconfidence in incorrect responses. The framework ensures that LLMs not only generate confidence scores but also provide explanations for their uncertainty, making their outputs more interpretable and calibrated.</td>
<td><a href="https://arxiv.org/pdf/2405.20974" target="_blank">Link</a></td>
</tr>
<tr>
<td>Deep Reinforcement Learning from Human Preferences</td>
<td>Paul F Christiano et al</td>
<td>2017</td>
<td>rl, rlhf</td>
<td>Arxiv</td>
<td>This paper introduces a method to train reinforcement learning (RL) systems using human preferences over trajectory segments rather than traditional reward functions. The approach allows agents to learn tasks that are hard to define programmatically, enabling non-expert users to provide feedback on agent behavior through comparisons of short video clips. By learning a reward model from these preferences, the method dramatically reduces the need for human oversight while maintaining adaptability to large-scale and complex RL environments. This paradigm bridges the gap between human-defined objectives and scalable RL systems, addressing challenges in alignment and usability for real-world applications.</td>
<td><a href="https://arxiv.org/pdf/1706.03741" target="_blank">Link</a></td>
</tr>
<tr>
<td>The Cost of Down-Scaling Language Models: Fact Recall Deteriorates before In-Context Learning</td>
<td>Tian Jin et al</td>
<td>2023</td>
<td>pruning, icl</td>
<td>Arxiv</td>
<td>This paper explores the effects of scaling the parameter count of large language models (LLMs) on two distinct capabilities: fact recall from pre-training and in-context learning (ICL). By investigating both dense scaling (training models of varying sizes) and pruning (removing weights), the authors identify that these approaches disproportionately affect fact recall while preserving ICL abilities. They demonstrate that a model's ability to learn from in-context information remains robust under significant parameter reductions, whereas the ability to recall pre-trained facts degrades with even moderate scaling down. This dichotomy highlights a fundamental difference in how these capabilities rely on model size and opens avenues for more efficient model design and deployment, emphasizing trade-offs between memory augmentation and parameter efficiency.</td>
<td><a href="https://arxiv.org/pdf/2310.04680" target="_blank">Link</a></td>
</tr>
<tr>
<td>Fine-Tuning Language Models with Just Forward Passes</td>
<td>Sadhika Malladi et al</td>
<td>2024</td>
<td>finetuning, zo, optimization</td>
<td>Arxiv</td>
<td>The paper introduces MeZO, a memory-efficient zeroth-order optimization method, to fine-tune large language models using forward passes alone. Classical zeroth-order methods scale poorly with model size, but MeZO adapts these approaches to leverage structured pre-trained model landscapes, avoiding catastrophic slowdown even with billions of parameters. The authors theoretically show that MeZO’s convergence depends on the local effective rank of the Hessian, not the number of parameters, enabling efficient optimization despite prior bounds suggesting otherwise. Furthermore, MeZO’s flexibility allows optimization of non-differentiable objectives (e.g., accuracy or F1 score) and compatibility with parameter-efficient tuning methods like LoRA and prefix-tuning.</td>
<td><a href="https://arxiv.org/pdf/2305.17333" target="_blank">Link</a></td>
</tr>
<tr>
<td>ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference</td>
<td>Hanshi Sun et al</td>
<td>2024</td>
<td>kv cache</td>
<td>Arxiv</td>
<td>The key insight of this paper lies in optimizing long-context large language model inference by addressing the memory and latency bottlenecks associated with managing the key-value (KV) cache. The authors observe that pre-Rotary Position Embedding (RoPE) keys exhibit a low-rank structure, allowing them to be compressed without accuracy loss, while value caches lack this property and are therefore offloaded to the CPU to reduce GPU memory usage. To minimize decoding latency, they leverage landmarks—compact representations of the low-rank key cache—and identify a small set of outliers to be retained on the GPU, enabling efficient reconstruction of sparse KV pairs on-the-fly. This approach allows the system to handle significantly longer contexts and larger batch sizes while maintaining inference throughput and accuracy.</td>
<td><a href="https://arxiv.org/pdf/2410.21465" target="_blank">Link</a></td>
</tr>
<tr>
<td>LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning</td>
<td>Rui Pan et al</td>
<td>2024</td>
<td>peft, finetuning, sampling</td>
<td>Arxiv</td>
<td>The key insight of this paper is the discovery of a skewed weight-norm distribution across layers during LoRA fine-tuning, where the majority of updates occur in the bottom (embedding) and top (language modeling head) layers, leaving middle layers underutilized. This highlights that different layers have varied importance and suggests that selectively updating layers could improve efficiency without sacrificing performance. Building on this, the authors propose Layerwise Importance Sampling AdamW (LISA), which randomly freezes most middle layers during training, using importance sampling to emulate LoRA’s fast learning pattern while avoiding its low-rank constraints. This approach achieves significant memory savings, faster convergence, and superior performance compared to LoRA and full-parameter fine-tuning, particularly in large-scale and domain-specific tasks.</td>
<td><a href="https://arxiv.org/pdf/2403.17919" target="_blank">Link</a></td>
</tr>
<tr>
<td>SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion</td>
<td>Muyang Li et al</td>
<td>2024</td>
<td>quantization, diffusion</td>
<td>Arxiv</td>
<td>SVDQuant introduces a novel approach to 4-bit quantization of diffusion models by using a low-rank branch to absorb outliers in both weights and activations, making quantization more feasible at such aggressive bit reduction. The method first consolidates outliers from activations to weights through smoothing, then decomposes the weights using Singular Value Decomposition (SVD) to separate the dominant components into a 16-bit low-rank branch while keeping the residual in 4 bits. To make this practical, they developed an inference engine called Nunchaku that fuses the low-rank and low-bit branch kernels together, eliminating redundant memory access that would otherwise negate the performance benefits. The approach is designed to work across different diffusion model architectures and can seamlessly integrate with existing low-rank adapters (LoRAs) without requiring re-quantization.</td>
<td><a href="https://arxiv.org/pdf/2411.05007" target="_blank">Link</a></td>
</tr>
<tr>
<td>One Weight Bitwidth to Rule Them All</td>
<td>Ting-Wu Chin et al</td>
<td>2020</td>
<td>quantization, bitwidth</td>
<td>Arxiv</td>
<td>This paper examines weight quantization in deep neural networks and challenges the common assumption that using the lowest possible bitwidth without accuracy loss is optimal. The key insight is that when considering model size as a constraint and allowing network width to vary, some bitwidths consistently outperform others - specifically, networks with standard convolutions work better with binary weights while networks with depthwise convolutions prefer higher bitwidths. The authors discover that this difference is related to the number of input channels (fan-in) per convolutional kernel, with higher fan-in making networks more resilient to aggressive quantization. Most surprisingly, they demonstrate that using a single well-chosen bitwidth throughout the network can outperform more complex mixed-precision quantization approaches when comparing networks of equal size, suggesting that the traditional focus on minimizing bitwidth without considering network width may be suboptimal.</td>
<td><a href="https://arxiv.org/pdf/2008.09916" target="_blank">Link</a></td>
</tr>
<tr>
<td>Consistency Models</td>
<td>Yang Song et al</td>
<td>2023</td>
<td>diffusion, ode, consistency</td>
<td>ICML</td>
<td>This paper introduces consistency models, a new family of generative models that can generate high-quality samples in a single step while preserving the ability to trade compute for quality through multi-step sampling. The key innovation is training models to map any point on a probability flow ODE trajectory to its origin point, enforcing consistency across different time steps through either distillation from pre-trained diffusion models or direct training. The models support zero-shot data editing capabilities like inpainting, colorization, and super-resolution without requiring explicit training on these tasks, similar to diffusion models. The authors provide two training approaches - consistency distillation which leverages existing diffusion models, and consistency training which allows training from scratch without any pre-trained models, establishing consistency models as an independent class of generative models.</td>
<td><a href="https://arxiv.org/pdf/2303.01469" target="_blank">Link</a></td>
</tr>
<tr>
<td>One Step Diffusion via ShortCut Models</td>
<td>Kevin Frans et al</td>
<td>2024</td>
<td>diffusion, ode, flow-matching</td>
<td>Arxiv</td>
<td>This paper introduces shortcut models, a new type of diffusion model that enables high-quality image generation in a single forward pass by conditioning the model not only on the timestep but also on the desired step size, allowing it to learn larger jumps during the denoising process. Unlike previous approaches that require multiple training phases or complex scheduling, shortcut models can be trained end-to-end in a single phase by leveraging a self-consistency property where one large step should equal two consecutive smaller steps, combined with flow-matching loss as a base case. The key insight is that by conditioning on step size, the model can account for future curvature in the denoising path and jump directly to the correct next point rather than following the curved path naively, which would lead to errors with large steps. The approach simplifies the training pipeline while maintaining flexibility in inference budget, as the same model can generate samples using either single or multiple steps after training.</td>
<td><a href="https://arxiv.org/abs/2410.12557" target="_blank">Link</a></td>
</tr>
<tr>
<td>Attention-Driven Training-Free Efficiency Enhancement of Diffusion Models</td>
<td>Hongjie Wang et al</td>
<td>2024</td>
<td>diffusion, training-free, attention, token pruning</td>
<td>CVPR</td>
<td>This paper introduces AT-EDM, a training-free framework to accelerate diffusion models by pruning redundant tokens during inference without requiring model retraining. The key innovation is a Generalized Weighted PageRank (G-WPR) algorithm that uses attention maps to identify and prune less important tokens, along with a novel similarity-based token recovery method that fills in pruned tokens based on attention patterns to maintain compatibility with convolutional layers. The authors also propose a Denoising-Steps-Aware Pruning (DSAP) schedule that prunes fewer tokens in early denoising steps when attention maps are more chaotic and less informative, and more tokens in later steps when attention patterns are better established. The overall approach focuses on making diffusion models more efficient by leveraging the rich information contained in attention maps to guide token pruning decisions while maintaining image generation quality.</td>
<td><a href="https://openaccess.thecvf.com/content/CVPR2024/papers/Wang_Attention-Driven_Training-Free_Efficiency_Enhancement_of_Diffusion_Models_CVPR_2024_paper.pdf" target="_blank">Link</a></td>
</tr>
<tr>
<td>Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks</td>
<td>Tim Salimans et al</td>
<td>2016</td>
<td>normalization, gradient descent</td>
<td>Arxiv</td>
<td>This paper introduces weight normalization, a simple reparameterization technique that decouples a neural network's weight vectors into their direction and magnitude by expressing w = (g/||v||)v, where g is a scalar and v is a vector. The key insight is that this decoupling improves optimization by making the conditioning of the gradient better - the direction and scale of weight updates can be learned somewhat independently, which helps avoid problems with pathological curvature in the optimization landscape. While inspired by batch normalization, weight normalization is deterministic and doesn't add noise to gradients or create dependencies between minibatch examples, making it well-suited for scenarios like reinforcement learning and RNNs where batch normalization is problematic. The authors also propose a data-dependent initialization scheme where g and bias terms are initialized to normalize the initial pre-activations of neurons, helping ensure good scaling of activations across layers at the start of training.</td>
<td><a href="https://arxiv.org/pdf/1602.07868" target="_blank">Link</a></td>
</tr>
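<!--
  Worked sketch, not the paper's code: the weight-normalization reparameterization
  w = (g / ||v||) v described above, with v supplying the direction and the scalar g supplying
  the magnitude. The function name is an assumption.

  function weightNorm(v, g) {
    const norm = Math.sqrt(v.reduce((s, x) => s + x * x, 0));
    return v.map(x => (g / norm) * x);
  }
  // Example: v = [3, 4] has norm 5, so weightNorm([3, 4], 2) gives [1.2, 1.6],
  // a vector in the same direction as v but with magnitude exactly g = 2.
-->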
<tr>
<td>Applying Guidance in a Limited Interval Improves Sample and Distribution Quality in Diffusion Models</td>
<td>Tuomas Kynkäänniemi et al</td>
<td>2024</td>
<td>diffusion, cfg, guidance</td>
<td>Arxiv</td>
<td>This paper's key insight is that classifier-free guidance (CFG) in diffusion models should only be applied during a specific interval of noise levels in the middle of the sampling process, rather than throughout the entire sampling chain as traditionally done. The intuition is that guidance is harmful at high noise levels (where it causes mode collapse and template-like outputs), largely unnecessary at low noise levels, and only truly beneficial in the middle range. They demonstrate this theoretically using a 1D synthetic example where they can visualize how guidance at high noise levels causes sampling trajectories to drift far from the smoothed data distribution, leading to mode dropping. Beyond this theoretical demonstration, they propose a simple solution of making the guidance weight a piecewise function that only applies guidance within a specific noise level interval.</td>
<td><a href="https://arxiv.org/pdf/2404.07724" target="_blank">Link</a></td>
</tr>
<tr>
<td>Cache Me if You Can: Accelerating Diffusion Models through Block Caching</td>
<td>Felix Wimbauer et al</td>
<td>2024</td>
<td>diffusion, caching, distillation</td>
<td>Arxiv</td>
<td>This paper introduces "block caching" to accelerate diffusion models by reusing computations across denoising steps. The key insight is that many layer blocks (particularly attention blocks) in diffusion models change very gradually during the denoising process, making their repeated computation redundant. The authors propose automatically determining which blocks to cache and when to refresh them based on measuring the relative changes in block outputs across timesteps. They also introduce a lightweight scale-shift adjustment mechanism that uses a student-teacher setup, where the student (cached model) learns additional scale and shift parameters to better align its cached block outputs with those of the teacher (uncached model), while keeping the original model weights frozen.</td>
<td><a href="https://arxiv.org/pdf/2312.03209" target="_blank">Link</a></td>
</tr>
<tr>
<td>DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads</td>
<td>Guangxuan Xiao et al</td>
<td>2024</td>
<td>llm, kv cache, attention</td>
<td>Arxiv</td>
<td>The key insight of DuoAttention is the observation that attention heads in LLMs naturally fall into two distinct categories: retrieval heads that need to access the full context to make connections across long distances, and streaming heads that mainly focus on recent tokens and attention sinks. This dichotomy makes intuitive sense because not all parts of language processing require long-range dependencies - while some aspects like fact recall or logical reasoning need broad context, others like local grammar or immediate context processing can work with nearby tokens. The paper's approach of using optimization to identify these heads (rather than just looking at attention patterns) is clever because it directly measures the impact on model outputs, capturing the true functional role of each head rather than just its surface behavior. Finally, the insight to maintain two separate KV caches (full for retrieval heads, minimal for streaming heads) is an elegant way to preserve the model's capabilities while reducing memory usage, since it aligns the memory allocation with each head's actual needs.</td>
<td><a href="https://arxiv.org/pdf/2410.10819" target="_blank">Link</a></td>
</tr>
<tr>
<td>Efficient Streaming Language Models with Attention Sinks</td>
<td>Guangxuan Xiao et al</td>
<td>2024</td>
<td>llm, kv cache, attention</td>
<td>ICLR</td>
<td>This paper introduces StreamingLLM, a framework that enables large language models to process infinitely long text sequences efficiently without fine-tuning, based on a key insight about "attention sinks." The authors discover that LLMs allocate surprisingly high attention scores to initial tokens regardless of their semantic relevance, which they explain is due to the softmax operation requiring attention scores to sum to one - even when a token has no strong matches in context, the model must distribute attention somewhere, and initial tokens become natural "sinks" since they're visible to all subsequent tokens during autoregressive training. Building on this insight, StreamingLLM maintains just a few initial tokens (as attention sinks) along with a sliding window of recent tokens, achieving up to 22.2x speedup compared to baselines while maintaining performance on sequences up to 4 million tokens long. Additionally, they show that incorporating a dedicated learnable "sink token" during model pre-training can further improve streaming capabilities by providing an explicit token for collecting excess attention.</td>
<td><a href="https://arxiv.org/pdf/2309.17453" target="_blank">Link</a></td>
</tr>
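<tr>
<td colspan="7">A toy sketch of the cache policy (my illustration; the cache is just a Python list of per-token entries, and the n_sink/window sizes are arbitrary):
<pre><code>
def evict(cache, n_sink=4, window=1020):
    """Keep the first n_sink tokens (attention sinks) plus the most recent
    `window` tokens; everything in between is dropped."""
    if len(cache) <= n_sink + window:
        return cache
    return cache[:n_sink] + cache[-window:]

# toy usage: cache entries are just token positions here
cache = list(range(5000))
cache = evict(cache)
print(cache[:6], len(cache))   # [0, 1, 2, 3, 3980, 3981] and 1024 entries
</code></pre>
</td>
</tr>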
<tr>
<td>MagicPIG: LSH Sampling for Efficient LLM Generation</td>
<td>Zhuoming Chen et al</td>
<td>2024</td>
<td>llm, kv cache</td>
<td>Arxiv</td>
<td>This paper challenges the common assumption that attention in LLMs is naturally sparse, showing that TopK attention (selecting only the highest attention scores) can significantly degrade performance on tasks that require aggregating information across the full context. The authors demonstrate that sampling-based approaches to attention can be more effective than TopK selection, leading them to develop MagicPIG, a system that uses Locality Sensitive Hashing (LSH) to efficiently sample attention keys and values. A key insight is that the geometry of attention in LLMs has specific patterns - notably that the initial attention sink token remains almost static regardless of input, and that query and key vectors typically lie in opposite directions - which helps explain why simple TopK selection is suboptimal. Their solution involves a heterogeneous system design that leverages both GPU and CPU resources, with hash computations on GPU and attention computation on CPU, allowing for efficient processing of longer contexts while maintaining accuracy.</td>
<td><a href="https://arxiv.org/pdf/2410.16179" target="_blank">Link</a></td>
</tr>
<tr>
<td>Guiding a Diffusion Model with a Bad Version of Itself</td>
<td>Tero Karras et al</td>
<td>2024</td>
<td>diffusion, guidance</td>
<td>Arxiv</td>
<td>The paper makes two key contributions: First, they show that Classifier-Free Guidance (CFG) improves image quality not just through better prompt alignment, but because the unconditional model D0 learns a more spread-out distribution than the conditional model D1, causing the guidance term ∇x log(p1/p0) to push samples toward high-probability regions of the data manifold. Second, based on this insight, they introduce "autoguidance" - using a smaller, less-trained version of the model itself as the guiding model D0 rather than an unconditional model, which allows for quality improvements without reducing variation and works even for unconditional models.</td>
<td><a href="https://arxiv.org/pdf/2406.02507" target="_blank">Link</a></td>
</tr>
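<tr>
<td colspan="7">A minimal sketch of the guidance combination (illustrative; D_main and D_bad stand in for the main model and the smaller, less-trained guiding model):
<pre><code>
import numpy as np

def autoguide(D_main, D_bad, x, sigma, w=2.0):
    """Extrapolate away from the bad model's denoising estimate:
    D_guided = D_bad + w * (D_main - D_bad); w = 1 recovers the main model."""
    d1 = D_main(x, sigma)
    d0 = D_bad(x, sigma)
    return d0 + w * (d1 - d0)

# toy denoisers standing in for trained networks
D_main = lambda x, sigma: 0.9 * x
D_bad = lambda x, sigma: 0.7 * x
x = np.ones(4)
print(autoguide(D_main, D_bad, x, sigma=1.0))   # pushed beyond the main estimate (1.1)
</code></pre>
</td>
</tr>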
<tr>
<td>LLM-Pruner: On the Structural Pruning of Large Language Models</td>
<td>Xinyin Ma et al</td>
<td>2023</td>
<td>llm, structural pruning</td>
<td>Arxiv</td>
<td>The authors introduce LLM-Pruner, a novel approach for compressing large language models that operates in a task-agnostic manner while requiring minimal access to the original training data. Their key insight is to first automatically identify groups of interdependent neural structures within the LLM by analyzing dependency patterns, ensuring that coupled structures are pruned together to maintain model coherence. The method then estimates the importance of these structural groups using both first-order gradients and approximated Hessian information from a small set of calibration samples, allowing them to selectively remove less critical groups while preserving the model's core functionality. Finally, they employ a rapid recovery phase using low-rank adaptation (LoRA) to fine-tune the pruned model with a limited dataset in just a few hours, enabling efficient compression while maintaining the LLM's general-purpose capabilities.</td>
<td><a href="https://arxiv.org/pdf/2305.11627" target="_blank">Link</a></td>
</tr>
<tr>
<td>SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models</td>
<td>Guangxuan Xiao et al</td>
<td>2023</td>
<td>llm, quantization, activations</td>
<td>ICML</td>
<td>The key insight of SmoothQuant is that in large language models, while weights are relatively easy to quantize, activations are much harder due to outliers. They observed that these outliers persistently appear in specific channels across different tokens, suggesting that the difficulty could be redistributed. Their solution is to mathematically transform the model by scaling down problematic activation channels while scaling up the corresponding weight channels proportionally, which maintains mathematical equivalence while making both weights and activations easier to quantize. This "difficulty migration" approach allows them to balance the quantization challenges between weights and activations using a tunable parameter α, rather than having all the difficulty concentrated in the activation values.</td>
<td><a href="https://arxiv.org/pdf/2211.10438" target="_blank">Link</a></td>
</tr>
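<tr>
<td colspan="7">A numpy sketch of the difficulty migration (my illustration, not the released code): per-input-channel scales s_j = max|X_j|^α / max|W_j|^(1-α) are divided out of the activations and folded into the weights, so the product is mathematically unchanged.
<pre><code>
import numpy as np

def smooth(X, W, alpha=0.5):
    """X: (tokens, in_features) activations, W: (in_features, out_features) weights."""
    act_max = np.abs(X).max(axis=0)            # per input-channel activation range
    w_max = np.abs(W).max(axis=1)              # per input-channel weight range
    s = act_max**alpha / w_max**(1.0 - alpha)  # migration strength controlled by alpha
    X_s = X / s                                # smoothed activations (outliers shrunk)
    W_s = W * s[:, None]                       # compensated weights
    return X_s, W_s

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 4))
X[:, 2] *= 50.0                                # an outlier activation channel
W = rng.standard_normal((4, 3))
X_s, W_s = smooth(X, W)
print(np.allclose(X @ W, X_s @ W_s))           # True: the layer output is unchanged
</code></pre>
</td>
</tr>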
<tr>
<td>ESPACE: Dimensionality Reduction of Activations for Model Compression</td>
<td>Charbel Sakr et al</td>
<td>2024</td>
<td>llm, dimensionality reduction, activations, compression</td>
<td>NeurIPS</td>
<td>Instead of decomposing weight matrices as done in previous work, ESPACE reduces the dimensionality of activation tensors by projecting them onto a pre-calibrated set of principal components using a static projection matrix P, where for an activation x, its projection is x̃ = PPᵀx. The projection matrix P is carefully constructed (using eigendecomposition of activation statistics) to preserve the most important components while reducing dimensionality, taking advantage of natural redundancies that exist in activation patterns due to properties like the Central Limit Theorem when stacking sequence/batch dimensions. During training, the weights remain uncompressed and fully trainable (maintaining model expressivity), while at inference time, the weight matrix can be folded together with the projection matrix into WᵀP to achieve compression through matrix multiplication associativity: Y = WᵀX ≈ Wᵀ(PPᵀX) = (WᵀP)(PᵀX). This activation-centric approach is fundamentally different from previous methods because it maintains full model expressivity during training while still achieving compression at inference time, and it takes advantage of natural statistical redundancies in activation patterns rather than trying to directly compress weights.</td>
<td><a href="https://arxiv.org/pdf/2410.05437" target="_blank">Link</a></td>
</tr>
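<tr>
<td colspan="7">A small numpy sketch of the activation-side projection (illustrative; building P from an eigendecomposition of the activation second-moment matrix is my reading of "pre-calibrated principal components"):
<pre><code>
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 16, 4, 256                        # activation dim, kept components, calibration samples
X = rng.standard_normal((d, n)) * np.linspace(3.0, 0.1, d)[:, None]   # redundant activations
W = rng.standard_normal((d, 8))             # weight matrix used as Y = W^T X

# calibration: top-k eigenvectors of the activation second-moment matrix
C = X @ X.T / n
eigvals, eigvecs = np.linalg.eigh(C)        # eigenvalues in ascending order
P = eigvecs[:, -k:]                         # (d, k), orthonormal columns

# inference: fold W^T P into one small matrix, project activations once
WtP = W.T @ P                               # (8, k), precomputed offline
Y_approx = WtP @ (P.T @ X)                  # equals W^T (P P^T X)
Y_exact = W.T @ X
print(np.linalg.norm(Y_exact - Y_approx) / np.linalg.norm(Y_exact))   # small relative error
</code></pre>
</td>
</tr>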
<tr>
<td>Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model</td>
<td>Chunting Zhou et al</td>
<td>2024</td>
<td>diffusion, transformer, multi-modal</td>
<td>Arxiv</td>
<td>The key insight of this paper is that a single transformer model can effectively handle both discrete data (like text) and continuous data (like images) by using different training objectives for each modality within the same model. They introduce "Transfusion," which uses traditional language modeling (next token prediction) for text sequences while simultaneously applying diffusion modeling for image sequences, combining these distinct objectives into a unified training approach. The architecture employs a novel attention pattern that allows for causal attention across the entire sequence while enabling bidirectional attention within individual images, letting image patches attend to each other freely while maintaining proper causality for text generation. This unified approach avoids the need for separate specialized models or complex architectures while still allowing each modality to be processed according to its most effective paradigm.</td>
<td><a href="https://arxiv.org/pdf/2408.11039" target="_blank">Link</a></td>
</tr>
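<tr>
<td colspan="7">A toy sketch of that attention pattern (my illustration, assuming a flat sequence where each position is tagged with 0 for text or a positive image id shared by the patches of one image):
<pre><code>
import numpy as np

def transfusion_mask(modality):
    """Return a boolean (L, L) mask where entry (i, j) is True if position i may attend to j:
    causal everywhere, plus full bidirectional attention among patches of the same image."""
    modality = np.asarray(modality)
    L = len(modality)
    causal = np.tril(np.ones((L, L), dtype=bool))
    same_image = modality[:, None] == modality[None, :]
    is_image = modality[:, None] != 0
    bidirectional = np.logical_and(same_image, is_image)
    return np.logical_or(causal, bidirectional)

# [text, text, three patches of image 1, text]
print(transfusion_mask([0, 0, 1, 1, 1, 0]).astype(int))
</code></pre>
</td>
</tr>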
<tr>
<td>GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection</td>
<td>Jiawei Zhao et al</td>
<td>2024</td>
<td>lora, low-rank projection</td>
<td>ICML</td>
<td>This paper introduces GaLore, a memory-efficient approach for training large language models that exploits the inherent low-rank structure of gradients rather than imposing low-rank constraints on the model weights themselves. The key insight is that while weight matrices may need to be full-rank for optimal performance, their gradients naturally become low-rank during training due to the specific structure of backpropagated gradients in neural networks, particularly in cases where the batch size is smaller than the matrix dimensions or when the gradients follow certain parametric forms. Building on this observation, GaLore projects gradients into low-rank spaces for memory-efficient optimization while still allowing full-parameter learning, contrasting with previous approaches like LoRA that restrict the weight updates to low-rank spaces. By periodically switching between different low-rank subspaces during training, GaLore maintains the flexibility of full-rank training while significantly reducing memory usage, particularly in storing optimizer states.</td>
<td><a href="https://arxiv.org/pdf/2403.03507" target="_blank">Link</a></td>
</tr>
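<tr>
<td colspan="7">A rough numpy sketch of the idea (not the released implementation): the gradient is projected into a rank-r subspace where the optimizer state lives, the update is projected back to full size, and the subspace is refreshed periodically from an SVD of the current gradient.
<pre><code>
import numpy as np

rng = np.random.default_rng(0)
rank, refresh_every, lr, beta = 8, 200, 1e-2, 0.9
W = rng.standard_normal((64, 32))            # full-rank weight matrix
m = np.zeros((rank, W.shape[1]))             # e.g. momentum, kept only in the small space
P = None

for step in range(1000):
    # toy gradient with a dominant low-rank component plus noise
    G = np.outer(np.sin(np.arange(64)), np.arange(32)) * 1e-3 + 0.01 * rng.standard_normal((64, 32))
    if step % refresh_every == 0:
        U, _, _ = np.linalg.svd(G, full_matrices=False)
        P = U[:, :rank]                      # refresh the projection subspace
    R = P.T @ G                              # low-rank gradient, shape (rank, 32)
    m = beta * m + (1 - beta) * R            # optimizer state update in the subspace
    W -= lr * (P @ m)                        # project the update back and apply it

print(W.shape, round(float(np.linalg.norm(W)), 3))
</code></pre>
</td>
</tr>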
<tr>
<td>Neural Discrete Representation Learning</td>
<td>Aaron van den Oord et al</td>
<td>2017</td>
<td>generative models, vae</td>
<td>NeurIPS</td>
<td>The key innovation of this paper is the introduction of the Vector Quantised-Variational AutoEncoder (VQ-VAE), which combines vector quantization with VAEs to learn discrete latent representations instead of continuous ones. Unlike previous approaches to discrete latent variables which struggled with high variance or optimization challenges, VQ-VAE uses a simple but effective nearest-neighbor lookup system in the latent space, along with a straight-through gradient estimator, to learn meaningful discrete codes. This approach allows the model to avoid the common posterior collapse problem where latents are ignored when paired with powerful decoders, while still maintaining good reconstruction quality comparable to continuous VAEs. The discrete nature of the latent space enables the model to focus on capturing important high-level features that span many dimensions in the input space (like objects in images or phonemes in speech) rather than local details, and these discrete latents can then be effectively modeled using powerful autoregressive priors for generation.</td>
<td><a href="https://arxiv.org/pdf/1711.00937" target="_blank">Link</a></td>
</tr>
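<tr>
<td colspan="7">A small PyTorch sketch of the quantization step with the straight-through estimator (illustrative; the names and toy shapes are mine, the codebook/commitment losses follow the paper's form):
<pre><code>
import torch

def vector_quantize(z_e, codebook, beta=0.25):
    """z_e: (N, D) encoder outputs, codebook: (K, D) embedding vectors."""
    d = torch.cdist(z_e, codebook)             # (N, K) pairwise distances
    idx = d.argmin(dim=1)                      # nearest-neighbour lookup
    z_q = codebook[idx]                        # (N, D) quantized latents

    # straight-through: the decoder sees z_q, but gradients flow back to z_e unchanged
    z_st = z_e + (z_q - z_e).detach()

    codebook_loss = ((z_q - z_e.detach()) ** 2).mean()        # moves codes toward encoder outputs
    commit_loss = beta * ((z_e - z_q.detach()) ** 2).mean()   # keeps encoder outputs near codes
    return z_st, idx, codebook_loss + commit_loss

z_e = torch.randn(10, 64, requires_grad=True)
codebook = torch.nn.Parameter(torch.randn(512, 64))
z_st, idx, aux_loss = vector_quantize(z_e, codebook)
print(z_st.shape, idx.shape, aux_loss.item())
</code></pre>
</td>
</tr>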
<tr>
<td>Improved Precision and Recall Metric for Assessing Generative Models</td>
<td>Tuomas Kynkäänniemi et al</td>
<td>2019</td>
<td>generative models, precision, recall</td>
<td>NeurIPS</td>
<td>This paper introduces an improved metric for evaluating generative models by separately measuring precision (quality of generated samples) and recall (coverage/diversity of generated distribution) using k-nearest neighbors to construct non-parametric manifold approximations of real and generated data distributions. The authors demonstrate their metric's effectiveness using StyleGAN and BigGAN, showing how it provides more nuanced insights than existing metrics like FID, particularly in revealing tradeoffs between image quality and variation that other metrics obscure. They use their metric to analyze and improve StyleGAN's architecture and training configurations, identifying new variants that achieve state-of-the-art results, and perform the first principled analysis of truncation methods. Finally, they extend their metric to evaluate individual sample quality, enabling quality assessment of interpolations and providing insights into the shape of the latent space that produces realistic images.</td>
<td><a href="https://arxiv.org/pdf/1904.06991" target="_blank">Link</a></td>
</tr>
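<tr>
<td colspan="7">A toy numpy sketch of the k-NN manifold idea (my illustration; in practice the points are deep features and the sample sizes are much larger):
<pre><code>
import numpy as np

def knn_radii(feats, k=3):
    """Distance from each point to its k-th nearest neighbour (column 0 is the point itself)."""
    d = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    return np.sort(d, axis=1)[:, k]

def coverage(a, b, radii_b):
    """Fraction of points in `a` lying inside the k-NN manifold approximation of `b`."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return float(np.mean((d <= radii_b[None, :]).any(axis=1)))

rng = np.random.default_rng(0)
real = rng.standard_normal((200, 16))
fake = rng.standard_normal((200, 16)) + 0.5         # a slightly shifted "generator"
precision = coverage(fake, real, knn_radii(real))   # realism of generated samples
recall = coverage(real, fake, knn_radii(fake))      # coverage of the real distribution
print(round(precision, 2), round(recall, 2))
</code></pre>
</td>
</tr>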
<tr>
<td>Generative Pretraining from Pixels</td>
<td>Mark Chen et al</td>
<td>2020</td>
<td>pretraining, gpt</td>
<td>PMLR</td>
<td>The paper demonstrates that transformer models can learn high-quality image representations by simply predicting pixels in a generative way, without incorporating any knowledge of the 2D structure of images. They show that as the generative models get better at predicting pixels (measured by log probability), they also learn better representations that can be used for downstream image classification tasks. The authors discover that, unlike in supervised learning where the best representations are in the final layers, their generative models learn the best representations in the middle layers - suggesting the model first builds up representations before using them to predict pixels. Finally, while their approach requires significant compute and works best at lower resolutions, it achieves competitive results with other self-supervised methods and shows that generative pre-training can be a promising direction for learning visual representations without labels.</td>
<td><a href="https://cdn.openai.com/papers/Generative_Pretraining_from_Pixels_V2.pdf" target="_blank">Link</a></td>
</tr>
<tr>
<td>Why Does Unsupervised Pre-Training Help Deep Learning?</td>
<td>Dumitru Erhan et al</td>
<td>2010</td>
<td>pretraining, unsupervised</td>
<td>JMLR</td>
<td>This paper argues that standard training schemes place parameters in regions of the parameter space that generalize poorly, while greedy layer-wise unsupervised pre-training allows each layer to learn a nonlinear transformation of its input that captures the main variations in the input, which acts as a regularizer: minimizing variance and introducing bias towards good initializations for the parameters. They argue that defining particular initialization points implicitly imposes constraints on the parameters in that it specifies which minima (out of many possible minima) of the cost function are allowed. They further argue that small perturbations in the trajectory of the parameters have a larger effect early on, and hint that early examples have larger influence and may trap model parameters in particular regions of parameter space corresponding to the arbitrary ordering of training examples (similar to the "critical period" in developmental psychology).</td>
<td><a href="https://jmlr.org/papers/volume11/erhan10a/erhan10a.pdf" target="_blank">Link</a></td>
</tr>
<tr>
<td>Improving Language Understanding by Generative Pre-Training</td>
<td>Alec Radford et al</td>
<td>2018</td>
<td>pretraining</td>
<td>Arxiv</td>
<td>The key insight of this paper is that language models can learn deep linguistic and world knowledge through unsupervised pre-training on large corpora of contiguous text, which can then be effectively transferred to downstream tasks. The authors demonstrate this by using a Transformer architecture that can capture long-range dependencies, pre-training it on a books dataset that contains extended narratives rather than shuffled sentences, making it particularly effective at understanding context. Their innovation extends to how they handle transfer learning - rather than creating complex task-specific architectures, they show that simple input transformations can adapt their pre-trained model to various tasks while preserving its learned capabilities. This elegant approach proves remarkably effective, with their single task-agnostic model outperforming specially-designed architectures across nine different natural language understanding tasks, suggesting that their pre-training method captures fundamental aspects of language understanding.</td>
<td><a href="https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf" target="_blank">Link</a></td>
</tr>
<tr>
<td>Learning Transferable Visual Models from Natural Language Supervision</td>
<td>Alec Radford et al</td>
<td>2021</td>
<td>CLIP</td>
<td>Arxiv</td>
<td>CLIP (Contrastive Language-Image Pre-training) works by simultaneously training two neural networks - one that encodes images and another that encodes text - to project their inputs into a shared multi-dimensional space where similar concepts end up close together. During training, CLIP takes a batch of image-text pairs and learns to identify which text descriptions actually match which images, doing this by maximizing the cosine similarity between embeddings of genuine pairs while minimizing similarity between mismatched pairs. The training data consists of hundreds of millions of (image, text) pairs collected from the internet, which helps CLIP learn broad visual concepts and their relationships to language without requiring hand-labeled data. What makes CLIP particularly powerful is its zero-shot capability - after training, it can make predictions about images it has never seen before by comparing them against any arbitrary text descriptions, rather than being limited to a fixed set of predetermined labels.</td>
<td><a href="https://arxiv.org/pdf/2103.00020" target="_blank">Link</a></td>
</tr>
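<tr>
<td colspan="7">A compact sketch of the symmetric contrastive objective over one batch (illustrative; in the real model img_emb and txt_emb come from the image and text encoders, and the temperature is learned):
<pre><code>
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """img_emb, txt_emb: (B, D) embeddings where row i of each is a genuine pair."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature            # (B, B) cosine similarities
    targets = torch.arange(img.size(0))             # the diagonal holds the true pairs
    loss_i = F.cross_entropy(logits, targets)       # match each image to its text
    loss_t = F.cross_entropy(logits.t(), targets)   # match each text to its image
    return 0.5 * (loss_i + loss_t)

print(clip_loss(torch.randn(8, 512), torch.randn(8, 512)).item())
</code></pre>
</td>
</tr>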
<tr>
<td>Adam: A Method for Stochastic Optimization</td>
<td>Diederik Kingma et al</td>
<td>2015</td>
<td>optimizers</td>
<td>ICLR</td>
<td>Adam combines momentum (through exponential moving average of gradients mt) and adaptive learning rates (through exponential moving average of squared gradients vt) to create an efficient optimizer, where mt captures the direction of updates while vt adapts the step size for each parameter based on its gradient history. The optimizer corrects initialization bias in these moving averages by scaling them with factors 1/(1-β₁ᵗ) and 1/(1-β₂ᵗ) respectively, ensuring unbiased estimates even in early training. The parameter update θt ← θt-1 - α·mt/(√vt + ϵ) is invariant to gradient scaling because it uses the ratio mt/√vt, while the adaptive learning rate 1/√vt approximates the diagonal of the Fisher Information Matrix's square root, making it a more conservative version of natural gradient descent that works well with sparse gradients and non-stationary objectives. The hyperparameters β₁ = 0.9 and β₂ = 0.999 mean the momentum term considers roughly the last 10 steps while the variance term considers the last 1000 steps, allowing Adam to both move quickly in consistent directions while being careful in directions with high historical variance.</td>
<td><a href="https://arxiv.org/pdf/1412.6980" target="_blank">Link</a></td>
</tr>
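<tr>
<td colspan="7">The update transcribed directly into numpy on a toy quadratic (the hyperparameters are the paper's defaults):
<pre><code>
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad                 # exp. moving average of gradients
    v = b2 * v + (1 - b2) * grad**2              # exp. moving average of squared gradients
    m_hat = m / (1 - b1**t)                      # bias correction (matters early on)
    v_hat = v / (1 - b2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# toy objective f(theta) = ||theta||^2, so grad = 2 * theta
theta = np.array([3.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 5001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
print(theta)   # close to [0, 0]
</code></pre>
</td>
</tr>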
<tr>
<td>Simplifying Neural Networks by Soft Weight-Sharing</td>
<td>Steven Nowlan et al</td>
<td>1992</td>
<td>soft weight sharing, mog</td>
<td>Neural Computation</td>
<td>This paper tackles the challenge of penalizing complexity and preventing overfitting in neural networks. Traditional methods, like L2 regularization, penalize the sum of squared weights but can favor multiple weak connections over a single strong one, leading to suboptimal weight configurations. To address this, the authors propose a mixture of Gaussians (MoG) prior: a narrow Gaussian encourages small weights to shrink to zero, while a broad Gaussian preserves large weights essential for modeling the data accurately. By clustering weights into near-zero and larger groups, this data-driven regularization avoids forcing all weights toward zero equally and demonstrates better generalization on 12 toy tasks compared to early stopping and traditional squared-weight penalties.</td>
<td><a href="https://www.cs.toronto.edu/~hinton/absps/sunspots.pdf" target="_blank">Link</a></td>
</tr>
<tr>
<td>DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models</td>
<td>Muyang Li et al</td>
<td>2024</td>
<td>diffusion, distributed inference</td>
<td>Arxiv</td>
<td>DistriFusion introduces *displaced patch parallelism*, where the input image is split into patches, each processed independently by different GPUs. To maintain fidelity and reduce communication costs, the method reuses activations from the previous timestep as context for the current step, ensuring interaction between patches without excessive synchronization. Synchronous communication is only used at the initial step, while subsequent steps leverage asynchronous communication, hiding communication overhead within computation. This technique allows each device to process only a portion of the workload efficiently, avoiding artifacts and achieving scalable parallelism tailored to the sequential nature of diffusion models.</td>
<td><a href="https://arxiv.org/pdf/2402.19481" target="_blank">Link</a></td>
</tr>
<tr>
<td>Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching</td>
<td>Xinyin Ma et al</td>
<td>2024</td>
<td>diffusion, caching</td>
<td>Arxiv</td>
<td>This paper proposes interpolating between computationally inexpensive but suboptimal solutions and expensive but optimal ones, by training a router that learns which layers of the diffusion transformer can be cached.</td>
<td><a href="https://arxiv.org/pdf/2406.01733" target="_blank">Link</a></td>
</tr>
<tr>
<td>Flash Attention</td>
<td>Tri Dao et al</td>
<td>2022</td>
<td>attention, transformer</td>
<td>Arxiv</td>
<td>This introduces FlashAttention, which is an IO-aware exact attention algo that uses tiling. Basically, they use tiling to avoid materializing the large NxN attention matrix in (slow) GPU HBM; FlashAttention iterates over blocks of the K and V matrices, loads them into fast on-chip SRAM, and accumulates the softmax and output incrementally, which increases speed! Neat!</td>
<td><a href="https://arxiv.org/pdf/2205.14135" target="_blank">Link</a></td>
</tr>
<tr>
<td>Token Merging for Fast Stable Diffusion</td>
<td>Daniel Bolya et al</td>
<td>2023</td>
<td>diffusion, token merging</td>
<td>Arxiv</td>
<td>This paper seeks to apply ToMe (https://arxiv.org/pdf/2210.09461) to diffusion models, introducing techniques for token partitioning (changing how the src and dst sets are chosen and merged) and a token unmerging operation (merged tokens are set to their average, and after the computation each original token position is reset to that averaged value). Remarkably, this works very well!</td>
<td><a href="https://arxiv.org/pdf/2303.17604" target="_blank">Link</a></td>
</tr>
<tr>
<td>DeepCache: Accelerating Diffusion Models for Free</td>
<td>Xinyin Ma et al</td>
<td>2023</td>
<td>diffusion, cache</td>
<td>Arxiv</td>
<td>Similarly to Faster Diffusion (Senmao Li et al, 2024), this paper exploits the temporal redundancy across the denoising steps. They cache the high-level features from the deeper UNet blocks and, on cached steps, only recompute the shallow blocks, routing the cached features back in through the skip connections. Basically, for timesteps t and t+1 that are similar, we can cache some of the high level features between them and directly use them. Also smart!</td>
<td><a href="https://arxiv.org/pdf/2312.00858" target="_blank">Link</a></td>
</tr>
<tr>
<td>Faster Diffusion: Rethinking the Role of UNet Encoder in Diffusion Models</td>
<td>Senmao Li et al</td>
<td>2024</td>
<td>diffusion, encoder</td>
<td>NeurIPS</td>
<td>This paper notes that the UNet encoder in diffusion models has very similar outputs across adjacent timesteps, while the decoder varies much more. Thus, they cyclically reuse encoder features for the decoder, skipping the encoder on many steps. Smart!</td>
<td><a href="https://arxiv.org/pdf/2312.09608" target="_blank">Link</a></td>
</tr>
<tr>
<td>Improved Denoising Diffusion Probabilistic Models</td>
<td>Alex Nichol et al</td>
<td>2021</td>
<td>diffusion, precision, recall</td>
<td>Arxiv</td>
<td>This paper is the first to show that DDPMs can get competitive log-likelihoods. They use a reparameterization and a hybrid learning objective to more tightly optimize the variational lower bound, and find that their objective has less gradient noise during training. They use learned variances and find that they can get convincing samples using fewer steps. They also use the improved precision and recall metrics (Kynkäänniemi et al 2019) to show that diffusion models have higher recall for similar FID, which suggests they cover a large portion of the target distribution. They focused on optimizing log-likelihood as it is believed that optimizing ll forces the model to capture all modes of the data distribution (Razavi et al 2019). Henighan et al 2020 have also shown that small improvements in ll can dramatically impact sample quality / learned feature representations. The authors argue that fixing \sigma_{t} (as Ho et al 2020 does) is reasonable in terms of sample quality, but does not explain much about the ll. Thus, to improve ll they look for a better choice of \Sigma_{\theta}(x_{t},t), and choose to learn it. They note that it is better to parameterize the variance as an interpolation between \beta_{t} and \tilde{\beta}_{t} in the log domain. Remember that \beta_{t} is the noise schedule, typically a small value that increases over time following some schedule; \tilde{\beta}_{t} = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_{t}}\beta_{t} is the reparameterization of \beta_{t} used in the posterior variance, with \alpha_{t} = 1-\beta_{t} and \bar{\alpha}_{t} = \prod_{s=1}^{t}\alpha_{s}. Finally, they note that a linear noise schedule destroys information faster than necessary, and they propose a different (cosine) noise schedule. Lots of insights!</td>
<td><a href="https://arxiv.org/pdf/2102.09672" target="_blank">Link</a></td>
</tr>
<tr>
<td>Improving Training Efficiency of Diffusion Models via Multi-Stage Framework and Tailored Multi-Decoder Architecture</td>
<td>Huijie Zhang et al</td>
<td>2024</td>
<td>diffusion, multi-stage</td>
<td>CVPR</td>
<td>This paper proposes a multi-stage framework for diffusion models that uses a shared encoder and separate decoders for different timestep intervals, along with an optimal denoiser-based timestep clustering method, to improve training and sampling efficiency while maintaining or enhancing image generation quality.</td>
<td><a href="https://openaccess.thecvf.com/content/CVPR2024/papers/Zhang_Improving_Training_Efficiency_of_Diffusion_Models_via_Multi-Stage_Framework_and_CVPR_2024_paper.pdf" target="_blank">Link</a></td>
</tr>
<tr>
<td>Temporal Dynamic Quantization for Diffusion Models</td>
<td>Junhyuk So et al</td>
<td>2023</td>
<td>diffusion, quantization</td>
<td>NeurIPS</td>
<td>Temporal Dynamic Quantization (TDQ) addresses the challenge of quantizing diffusion models by dynamically adjusting quantization parameters based on the denoising time step. TDQ employs a trainable module consisting of frequency encoding, a multi-layer perceptron (MLP), and a SoftPlus activation to predict optimal quantization intervals for each time step. This module maps the temporal information to appropriate quantization parameters, allowing the method to adapt to the varying activation distributions across different stages of the diffusion process. By pre-computing these quantization intervals, TDQ avoids the runtime overhead associated with traditional dynamic quantization methods while still providing the necessary flexibility to handle the temporal dynamics of diffusion models.</td>
<td><a href="https://arxiv.org/pdf/2306.02316v2" target="_blank">Link</a></td>
</tr>
<tr>
<td>Learning Efficient Convolutional Networks through Network Slimming</td>
<td>Zhuang Liu et al</td>
<td>2017</td>
<td>pruning, importance</td>
<td>CVPR</td>
<td>This paper introduces *network slimming*, a method to reduce the size, memory footprint, and computation of CNNs by enforcing channel-level sparsity without sacrificing accuracy. It works by identifying and pruning insignificant channels during training, leveraging the γ scaling factors in Batch Normalization (BN) layers to effectively determine channel importance. The approach introduces minimal training overhead and is compatible with modern CNN architectures, eliminating the need for specialized hardware or software. Using the BN layer’s built-in scaling properties makes this pruning efficient, avoiding redundant scaling layers or issues that arise from linear transformations in convolution layers.</td>
<td><a href="https://arxiv.org/pdf/1708.06519" target="_blank">Link</a></td>
</tr>
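<tr>
<td colspan="7">A sketch of the two ingredients (illustrative): an L1 penalty on the BN scale factors added to the training loss, and a global percentile threshold on γ to decide which channels to prune.
<pre><code>
import numpy as np

def l1_sparsity_penalty(gammas, lam=1e-4):
    """Add lam * sum(|gamma|) to the loss to push unimportant channel scales toward zero."""
    return lam * sum(np.abs(g).sum() for g in gammas)

def keep_masks(gammas, prune_ratio=0.5):
    """Global threshold over all BN scale factors; channels below it are pruned."""
    all_g = np.concatenate([np.abs(g) for g in gammas])
    thresh = np.quantile(all_g, prune_ratio)
    return [np.abs(g) >= thresh for g in gammas]

rng = np.random.default_rng(0)
gammas = [np.abs(rng.standard_normal(c)) for c in (32, 64, 128)]   # BN gammas per layer
print(l1_sparsity_penalty(gammas))
print([int(k.sum()) for k in keep_masks(gammas)])   # surviving channels per layer
</code></pre>
</td>
</tr>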
<tr>
<td>Q-Diffusion: Quantizing Diffusion Models</td>
<td>Xiuyu Li et al</td>
<td>2023</td>
<td>diffusion, sampling</td>
<td>ICCV</td>
<td>This paper tackles the inefficiencies of diffusion models, such as slow inference and high computational cost, by proposing a post-training quantization (PTQ) method designed specifically for their multi-timestep process. The key innovation includes a *time step-aware calibration data sampling* approach, which uniformly samples inputs across multiple time steps to better reflect real inference data, addressing quantization errors and varying activation distributions without the need for additional data. Additionally, the paper introduces *shortcut-splitting quantization* to handle the bimodal activation distributions caused by the concatenation of deep and shallow feature channels in shortcuts, quantizing them separately before concatenation for improved accuracy with minimal extra resources.</td>
<td><a href="https://arxiv.org/pdf/2302.04304" target="_blank">Link</a></td>
</tr>
<tr>
<td>Mixture of Efficient Diffusion Experts Through Automatic Interval and Sub-Network Selection</td>
<td>Alireza Ganjdanesh et al</td>
<td>2024</td>
<td>diffusion, sampling</td>
<td>Arxiv</td>
<td>This paper reduces the cost of sampling by pruning a pretrained diffusion model into a mixture of experts (MoE), one expert per time-step interval, using a routing agent that predicts the sub-network architecture from which each expert is generated.</td>
<td><a href="https://arxiv.org/pdf/2409.15557" target="_blank">Link</a></td>
</tr>
<tr>
<td>A Closer Look at Time Steps is Worthy of Triple Speed-Up for Diffusion Model Training</td>
<td>Kai Wang et al</td>
<td>2024</td>
<td>diffusion, sampling</td>
<td>Arxiv</td>
<td>This paper introduces SpeeD, a novel approach for accelerating the training of diffusion models without compromising performance. The authors analyze the diffusion process and identify three distinct areas: acceleration, deceleration, and convergence, each with different characteristics and importance for model training. Based on these insights, SpeeD implements two key components: asymmetric sampling, which reduces the sampling of less informative time steps in the convergence area, and change-aware weighting, which gives more importance to the rapidly changing areas between acceleration and deceleration. The authors' key insight is that not all time steps in the diffusion process are equally valuable for training, with the convergence area providing limited benefits despite occupying a large proportion of time steps, while the rapidly changing area between acceleration and deceleration is crucial but often undersampled. To address this, SpeeD introduces an asymmetric sampling strategy using a two-step probability function: $P(t) = \begin{cases} \frac{k}{T + \tau(k-1)}, & 0 < t \leq \tau \\ \frac{1}{T + \tau(k-1)}, & \tau < t \leq T \end{cases}$, where τ is a carefully selected threshold marking the beginning of the convergence area, k is a suppression intensity factor, T is the total number of time steps, and t is the current time step. This function increases sampling probability before τ and suppresses it after. Additionally, SpeeD employs a change-aware weighting scheme based on the gradient of the process increment's variance, assigning higher weights to time steps with faster changes. By combining these strategies, SpeeD aims to focus computational resources on the most informative parts of the diffusion process, potentially leading to significant speedups in training time without sacrificing model quality.</td>
<td><a href="https://arxiv.org/pdf/2405.17403" target="_blank">Link</a></td>
</tr>
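<tr>
<td colspan="7">A direct transcription of the two-step probability P(t) above into numpy (the values of T, τ, and k here are placeholders, not the paper's settings):
<pre><code>
import numpy as np

def speed_probs(T=1000, tau=700, k=5.0):
    """Asymmetric sampling: boost steps t <= tau, suppress the convergence area t > tau."""
    t = np.arange(1, T + 1)
    return np.where(t <= tau, k / (T + tau * (k - 1)), 1.0 / (T + tau * (k - 1)))

p = speed_probs()
print(p.sum())                                       # 1.0: the two-step function is normalized
rng = np.random.default_rng(0)
print(rng.choice(np.arange(1, 1001), size=8, p=p))   # samples concentrate at t <= 700
</code></pre>
</td>
</tr>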
<tr>
<td>HyperGAN: A Generative Model for Diverse, Performant Neural Networks</td>
<td>Neale Ratzlaff et al</td>
<td>2019</td>
<td>gan, ensemble</td>
<td>ICML</td>
<td>This paper introduces HyperGAN, a novel generative model designed to learn a distribution of neural network parameters, addressing the issue of overconfidence in standard neural networks when faced with out-of-distribution data. Unlike traditional approaches, HyperGAN doesn't require restrictive prior assumptions and can rapidly generate large, diverse ensembles of neural networks. The model employs a unique "mixer" component that projects prior samples into a correlated latent space, from which layer-specific generators create weights for a deep neural network. Experimental results show that HyperGAN can achieve competitive performance on datasets like MNIST and CIFAR-10 while providing improved uncertainty estimates for out-of-distribution and adversarial data compared to standard ensembles. NOTE: There has actually been a diffusion variant of this idea: https://arxiv.org/pdf/2402.13144</td>
<td><a href="https://arxiv.org/pdf/2405.17403" target="_blank">Link</a></td>
</tr>
<tr>
<td>Diffusion Models Already Have a Semantic Latent Space</td>
<td>Mingi Kwon et al</td>
<td>2023</td>
<td>diffusion, latent space</td>
<td>ICLR</td>
<td>This paper introduces Asymmetric Reverse Process (Asyrp), a method that discovers a semantic latent space (h-space) in pretrained diffusion models, enabling controlled image manipulation with desirable properties such as homogeneity, linearity, and consistency across timesteps, while also proposing a principled design for versatile editing and quality enhancement in the generative process. The authors propose Asymmetric Reverse Process (Asyrp). It modifies only the P_{t} term while preserving the D_{t} term in the reverse process. This makes sense because it a) breaks the destructive interference seen in previous methods, b) allows for controlled modification of the generation process towards target attributes, and c) maintains the overall structure and quality of the diffusion process.</td>
<td><a href="https://arxiv.org/pdf/2210.10960" target="_blank">Link</a></td>
</tr>
<tr>
<td>One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale</td>
<td>Fan Bao et al</td>
<td>2023</td>
<td>diffusion, multi-modal</td>
<td>ICML</td>
<td>The authors present a method of sampling from joint and conditional distributions using a small modification to diffusion models. UniDiffuser’s proposed method involves handling multiple modalities (such as images and text) within a single diffusion model. In general, here is what they do: 1. Perturb data in all modalities: For a given data point (x_0, y_0), where x_0 is an image and y_0 is text, UniDiffuser adds noise to both simultaneously. The noisy versions are represented as x_{t_x} and y_{t_y}, where t_x and t_y are the respective timesteps. 2. Use of individual timesteps for different modalities: Instead of using a single timestep t for both modalities, UniDiffuser uses separate timesteps t_x and t_y. This allows for more flexibility in handling the different characteristics of each modality. 3. Predicting noise for all modalities simultaneously: UniDiffuser uses a joint noise prediction network \epsilon_{\theta}(x_{t_x}, y_{t_y}, t_x, t_y) that takes in the noisy versions of both modalities and their respective timesteps. The network then outputs predicted noise for both modalities in one forward pass.</td>
<td><a href="https://arxiv.org/pdf/2303.06555" target="_blank">Link</a></td>
</tr>
<tr>
<td>Diffusion Models as a Representation Learner</td>
<td>Xingyi Yang et al</td>
<td>2023</td>
<td>diffusion, representation learner</td>
<td>ICCV</td>
<td>This paper investigates whether the representations learned by pretrained diffusion models can be reused for recognition. The authors propose RepFusion, which distills intermediate features from an off-the-shelf diffusion model into a student network as auxiliary supervision, using a reinforcement-learning-based strategy to pick which timestep's features to distill, and show gains on tasks such as image classification, semantic segmentation, and facial landmark detection.</td>
<td><a href="https://openaccess.thecvf.com/content/ICCV2023/papers/Gao_Masked_Diffusion_Transformer_is_a_Strong_Image_Synthesizer_ICCV_2023_paper.pdf" target="_blank">Link</a></td>
</tr>
<tr>
<td>Masked Diffusion Transformer is a Strong Image Synthesizer</td>
<td>Shanghua Gao et al</td>
<td>2023</td>
<td>diffusion, masking, transformer</td>
<td>ICCV</td>
<td>This paper (smartly!) notices that one of the major reasons for long training and poor results of diffusion models is the slow learning of relations among object parts. For instance, they remark on the model learning one eye of a dog long before the other. They propose to mask tokens of the input image in the latent space and add an auxiliary objective of reconstructing the masked tokens during denoising, which forces the model to learn these contextual relations. Brilliant!</td>
<td><a href="https://openaccess.thecvf.com/content/ICCV2023/papers/Gao_Masked_Diffusion_Transformer_is_a_Strong_Image_Synthesizer_ICCV_2023_paper.pdf" target="_blank">Link</a></td>
</tr>
<tr>
<td>Generative Modeling by Estimating Gradients of the Data Distribution</td>
<td>Yang Song et al</td>
<td>2019</td>
<td>diffusion, score matching</td>
<td>NeurIPS</td>
<td>This paper introduces Noise Conditional Score Networks (NCSNs), a novel approach to generative modeling that learns to estimate the score function of a data distribution at multiple noise levels. NCSNs are trained using score matching, avoiding the need to compute normalizing constants, and generate samples using annealed Langevin dynamics. The method addresses challenges in modeling complex, high-dimensional data distributions, particularly for data lying on or near low-dimensional manifolds.</td>
<td><a href="https://arxiv.org/pdf/1907.05600" target="_blank">Link</a></td>
</tr>
<tr>
<td>LAPTOP-Diff: Layer Pruning and Normalized Distillation for Compressing Diffusion Models</td>
<td>Dingkun Zhang et al</td>
<td>2024</td>
<td>diffusion, pruning</td>
<td>Arxiv</td>
<td>This paper proposes layer pruning and normalized distillation for pruning diffusion models. They use a surrogate function and show that their surrogate implies a property called "additivity", where the output distortion caused by many perturbations approximately equals the sum of the output distortion caused by each single perturbation. They then show that their computation can be formed as a 0-1 Knapsack problem. They then analyze what is the important objective for retraining, and see that there is an imbalance in previous feature distillation approaches employed in the retraining phase. They note that the L2-Norms of feature maps at the end of different stages and the values of different feature loss terms vary significantly, for instance, the highest loss term is ~10k times greater than the lowest one throughout the distillation process, and produces about 1k times larger gradients. This dilutes the gradients of the numerically insignificant feature loss terms. So, they opt to normalize the feature loss.</td>
<td><a href="https://arxiv.org/pdf/2404.11098" target="_blank">Link</a></td>
</tr>
<tr>
<td>Classifier-Free Diffusion Guidance</td>
<td>Jonathan Ho et al</td>
<td>2022</td>
<td>diffusion, guidance</td>
<td>NeurIPS</td>
<td>This paper introduces classifier-free guidance, a novel technique for improving sample quality in conditional diffusion models without using a separate classifier. Unlike traditional classifier guidance, which relies on gradients from an additional classifier model, classifier-free guidance achieves similar results by combining score estimates from jointly trained conditional and unconditional diffusion models. The method involves training a single neural network that can produce both conditional and unconditional score estimates, and then using a weighted combination of these estimates during the sampling process. This approach simplifies the training pipeline, avoids potential issues associated with training classifiers on noisy data, and eliminates the need for adversarial attacks on classifiers during sampling. The authors demonstrate that classifier-free guidance can achieve a similar trade-off between Fréchet Inception Distance (FID) and Inception Score (IS) as classifier guidance, effectively boosting sample quality while reducing diversity. The key difference is that classifier-free guidance operates purely within the generative model framework, without relying on external classifier gradients. This method provides an intuitive explanation for how guidance works: it increases conditional likelihood while decreasing unconditional likelihood, pushing generated samples towards more characteristic features of the desired condition.</td>
<td><a href="https://arxiv.org/pdf/2207.12598" target="_blank">Link</a></td>
</tr>
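<tr>
<td colspan="7">A minimal sketch of the sampling-time combination and the training-time condition dropping (illustrative; eps_cond and eps_uncond stand for the two score estimates produced by the single jointly trained network):
<pre><code>
import numpy as np

def cfg_eps(eps_cond, eps_uncond, w=2.0):
    """Classifier-free guidance: extrapolate the conditional prediction away from the
    unconditional one.  w = 0 recovers the plain conditional model; larger w trades
    diversity for samples with more characteristic features of the condition."""
    return (1 + w) * eps_cond - w * eps_uncond

rng = np.random.default_rng(0)

def maybe_drop_condition(cond, p_uncond=0.1):
    """During training the condition is dropped with some probability, so the same
    network learns both the conditional and the unconditional score."""
    return None if rng.random() < p_uncond else cond

eps_c = np.array([0.3, -0.1])
eps_u = np.array([0.1, 0.2])
print(cfg_eps(eps_c, eps_u))   # [0.7, -0.7]
</code></pre>
</td>
</tr>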
<tr>
<td>LD-Pruner: Efficient Pruning of Latent Diffusion Models using Task-Agnostic Insights</td>
<td>Thibault Castells et al</td>
<td>2024</td>
<td>pruning, diffusion, ldm</td>
<td>CVPR</td>
<td>This paper presents LD-Pruner. The main interesting part is how they frame the pruning problem. Basically, they define an "operator" (any fundamental building block of a net, like convolutional layers, activation functions, transformer blocks), and try to either 1) remove it or 2) replace it with a less demanding operation. As they operate on the latent space, this work can be applied to any generation task that uses diffusion (task agnostic). It is interesting to note their limitations: the approach does not extend to pruning the decoder, and it does not consider dependencies between operators (which is a big deal I think). Finally, their score function seems a bit arbitrary (maybe this could be learned?).</td>
<td><a href="https://openaccess.thecvf.com/content/CVPR2024W/EDGE/papers/Castells_LD-Pruner_Efficient_Pruning_of_Latent_Diffusion_Models_using_Task-Agnostic_Insights_CVPRW_2024_paper.pdf" target="_blank">Link</a></td>
</tr>
<tr>
<td>RoFormer: Enhanced Transformer with Rotary Position Embedding</td>
<td>Jianlin Su et al</td>
<td>2021</td>
<td>attention, positional embedding</td>
<td>Arxiv</td>
<td>This paper introduces Rotary Position Embedding (RoPE), a method for integrating positional information into transformer models by using a rotation matrix to encode absolute positions and incorporating relative position dependencies.</td>
<td><a href="https://arxiv.org/pdf/2104.09864" target="_blank">Link</a></td>
</tr>
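<tr>
<td colspan="7">A numpy sketch of the rotation (my illustration): consecutive dimension pairs of a query or key are rotated by position-dependent angles, which makes query-key dot products depend only on the relative offset.
<pre><code>
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate the (even, odd) dimension pairs of x by angles pos * theta_i."""
    d = x.shape[-1]
    theta = base ** (-2.0 * np.arange(d // 2) / d)   # per-pair frequencies
    ang = pos * theta
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * np.cos(ang) - x2 * np.sin(ang)
    out[1::2] = x1 * np.sin(ang) + x2 * np.cos(ang)
    return out

q, k = np.ones(8), np.ones(8)
# both pairs of positions have relative offset 4, so the two dot products are identical
print(np.dot(rope(q, 7), rope(k, 3)), np.dot(rope(q, 14), rope(k, 10)))
</code></pre>
</td>
</tr>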
<tr>
<td>GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models</td>
<td>Alex Nichol et al</td>
<td>2022</td>
<td>text-conditioned diffusion, inpainting</td>
<td>Arxiv</td>
<td>This paper explores text-conditional image synthesis using diffusion models, comparing CLIP guidance and classifier-free guidance, and finds that classifier-free guidance produces more photorealistic and caption-aligned images.</td>
<td><a href="https://arxiv.org/pdf/2112.10741" target="_blank">Link</a></td>
</tr>
<tr>
<td>LLM Inference Unveiled: Survey and Roofline Model Insights</td>
<td>Zhihang Yuan et al</td>
<td>2024</td>
<td>llms, survey</td>
<td>Arxiv</td>
<td>This paper surveys some recent advancements in efficient LLM inference, like speculative decoding and operator fusion. They also analyze the findings using the Roofline model, making this likely the first survey to do so for LLM inference. Good for checking out other papers that have recently been published.</td>
<td><a href="https://arxiv.org/pdf/2402.16363" target="_blank">Link</a></td>
</tr>
<tr>
<td>An Empirical Study of Mamba-based Language Models</td>
<td>Roger Waleffe et al</td>
<td>2024</td>
<td>mamba, llms, transformer</td>
<td>Arxiv</td>
<td>This paper compares Mamba-based, Transformer-based, and hybrid-based language models in a controlled setting where sizes and datasets are larger than the past (8B-params / 3.5T tokens). They find that Mamba and Mamba-2 lag behind Transformer models on copying and in-context learning tasks. They then see that a hybrid architecture of 43% Mamba, 7% self attention, and 50% MLP layers performs better than all others.</td>
<td><a href="https://arxiv.org/pdf/2406.07887" target="_blank">Link</a></td>
</tr>
<tr>
<td>Diffusion Models Beat GANs on Image Synthesis</td>
<td>Prafulla Dhariwal et al</td>
<td>2021</td>
<td>diffusion, gan</td>
<td>Arxiv</td>
<td>This work demonstrates that diffusion models surpass the current state-of-the-art generative models in image quality, achieved through architecture improvements and classifier guidance, which balances diversity and fidelity. The model attains FID scores of 2.97 on ImageNet 128×128 and 4.59 on ImageNet 256×256, matching BigGAN-deep with as few as 25 forward passes while maintaining better distribution coverage. Additionally, combining classifier guidance with upsampling diffusion models further enhances FID scores to 3.94 on ImageNet 256×256 and 3.85 on ImageNet 512×512.</td>
<td><a href="https://arxiv.org/pdf/2105.05233" target="_blank">Link</a></td>
</tr>
<tr>
<td>Progressive Distillation for Fast Sampling of Diffusion Models</td>
<td>Tim Salimans et al</td>
<td>2022</td>
<td>diffusion, distillation, sampling</td>
<td>ICLR</td>
<td>Diffusion models excel in generative modeling, surpassing GANs in perceptual quality and autoregressive models in density estimation, but they suffer from slow sampling times. This paper introduces two key contributions: new parameterizations that improve stability with fewer sampling steps and a distillation method that progressively reduces the number of required steps by half each time. Applied to benchmarks like CIFAR-10 and ImageNet, the approach distills models from 8192 steps down to as few as 4 steps, maintaining high image quality while offering a more efficient solution for both training and inference.</td>
<td><a href="https://arxiv.org/pdf/2202.00512" target="_blank">Link</a></td>
</tr>
<tr>
<td>On Distillation of Guided Diffusion Models</td>