<!DOCTYPE html>
<html lang="en"><head>
<link href="../../assets/favicon.svg" rel="icon" type="image/svg+xml">
<script src="../../site_libs/clipboard/clipboard.min.js"></script>
<script src="../../site_libs/quarto-html/tabby.min.js"></script>
<script src="../../site_libs/quarto-html/popper.min.js"></script>
<script src="../../site_libs/quarto-html/tippy.umd.min.js"></script>
<link href="../../site_libs/quarto-html/tippy.css" rel="stylesheet">
<link href="../../site_libs/quarto-html/light-border.css" rel="stylesheet">
<link href="../../site_libs/quarto-html/quarto-syntax-highlighting-3a1b321c56de4570634214b58c69b8f7.css" rel="stylesheet" id="quarto-text-highlighting-styles">
<script src="../../site_libs/quarto-contrib/iconify-2.1.0/iconify-icon.min.js"></script>
<link href="../../site_libs/quarto-contrib/fontawesome6-0.1.0/all.css" rel="stylesheet">
<link href="../../site_libs/quarto-contrib/fontawesome6-0.1.0/latex-fontsize.css" rel="stylesheet">
<script src="../../site_libs/quarto-contrib/glightbox/glightbox.min.js"></script>
<link href="../../site_libs/quarto-contrib/glightbox/glightbox.min.css" rel="stylesheet">
<link href="../../site_libs/quarto-contrib/glightbox/lightbox.css" rel="stylesheet"><meta charset="utf-8">
<meta name="generator" content="quarto-1.7.5">
<meta name="author" content="Sam Foreman">
<meta name="dcterms.date" content="2024-08-09">
<title>Sam Foreman – Training LLMs at Scale</title>
<meta name="apple-mobile-web-app-capable" content="yes">
<meta name="apple-mobile-web-app-status-bar-style" content="black-translucent">
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no, minimal-ui">
<link rel="stylesheet" href="../../site_libs/revealjs/dist/reset.css">
<link rel="stylesheet" href="../../site_libs/revealjs/dist/reveal.css">
<style>
code{white-space: pre-wrap;}
span.smallcaps{font-variant: small-caps;}
div.columns{display: flex; gap: min(4vw, 1.5em);}
div.column{flex: auto; overflow-x: auto;}
div.hanging-indent{margin-left: 1.5em; text-indent: -1.5em;}
ul.task-list{list-style: none;}
ul.task-list li input[type="checkbox"] {
width: 0.8em;
margin: 0 0.8em 0.2em -1em; /* quarto-specific, see https://github.com/quarto-dev/quarto-cli/issues/4556 */
vertical-align: middle;
}
/* CSS for syntax highlighting */
pre > code.sourceCode { white-space: pre; position: relative; }
pre > code.sourceCode > span { line-height: 1.25; }
pre > code.sourceCode > span:empty { height: 1.2em; }
.sourceCode { overflow: visible; }
code.sourceCode > span { color: inherit; text-decoration: inherit; }
div.sourceCode { margin: 1em 0; }
pre.sourceCode { margin: 0; }
@media screen {
div.sourceCode { overflow: auto; }
}
@media print {
pre > code.sourceCode { white-space: pre-wrap; }
pre > code.sourceCode > span { display: inline-block; text-indent: -5em; padding-left: 5em; }
}
pre.numberSource code
{ counter-reset: source-line 0; }
pre.numberSource code > span
{ position: relative; left: -4em; counter-increment: source-line; }
pre.numberSource code > span > a:first-child::before
{ content: counter(source-line);
position: relative; left: -1em; text-align: right; vertical-align: baseline;
border: none; display: inline-block;
-webkit-touch-callout: none; -webkit-user-select: none;
-khtml-user-select: none; -moz-user-select: none;
-ms-user-select: none; user-select: none;
padding: 0 4px; width: 4em;
}
pre.numberSource { margin-left: 3em; padding-left: 4px; }
div.sourceCode
{ color: #383a42; }
@media screen {
pre > code.sourceCode > span > a:first-child::before { text-decoration: underline; }
}
code span { color: #383a42; } /* Normal */
code span.al { color: #95da4c; background-color: #4d1f24; font-weight: bold; } /* Alert */
code span.an { color: #50a14f; } /* Annotation */
code span.at { color: #a626a4; } /* Attribute */
code span.bn { color: #986801; } /* BaseN */
code span.bu { color: #a626a4; } /* BuiltIn */
code span.cf { color: #a626a4; } /* ControlFlow */
code span.ch { color: #50a14f; } /* Char */
code span.cn { color: #986801; } /* Constant */
code span.co { color: #a0a1a7; font-style: italic; } /* Comment */
code span.cv { color: #e45649; font-style: italic; } /* CommentVar */
code span.do { color: #e45649; } /* Documentation */
code span.dt { color: #a626a4; } /* DataType */
code span.dv { color: #986801; } /* DecVal */
code span.er { color: #f44747; text-decoration: underline; } /* Error */
code span.ex { color: #4078f2; font-weight: bold; } /* Extension */
code span.fl { color: #986801; } /* Float */
code span.fu { color: #4078f2; } /* Function */
code span.im { color: #50a14f; } /* Import */
code span.in { color: #c45b00; } /* Information */
code span.kw { color: #a626a4; } /* Keyword */
code span.op { color: #a626a4; } /* Operator */
code span.ot { color: #27ae60; } /* Other */
code span.pp { color: #a626a4; } /* Preprocessor */
code span.re { color: #2980b9; background-color: #153042; } /* RegionMarker */
code span.sc { color: #0184bc; } /* SpecialChar */
code span.ss { color: #da4453; } /* SpecialString */
code span.st { color: #50a14f; } /* String */
code span.va { color: #e45649; } /* Variable */
code span.vs { color: #da4453; } /* VerbatimString */
code span.wa { color: #da4453; } /* Warning */
/* CSS for citations */
div.csl-bib-body { }
div.csl-entry {
clear: both;
margin-bottom: 0em;
}
.hanging-indent div.csl-entry {
margin-left:2em;
text-indent:-2em;
}
div.csl-left-margin {
min-width:2em;
float:left;
}
div.csl-right-inline {
margin-left:2em;
padding-left:1em;
}
div.csl-indent {
margin-left: 2em;
} </style>
<link rel="stylesheet" href="../../site_libs/revealjs/dist/theme/quarto-900700500f52478e259b5d0dc23713d5.css">
<link rel="stylesheet" href="../../css/custom.css">
<link rel="stylesheet" href="../../css/svgbob.css">
<link rel="stylesheet" href="../../css/ibm-plex.css">
<link rel="stylesheet" href="../../static/fonts/IosevkaAileSansQPss15/IosevkaAileSansQPss15.css">
<link rel="stylesheet" href="../../static/fonts/IosevkaSansTerminalss15Custom/IosevkaSansTerminalss15Custom.css">
<link rel="stylesheet" href="../../static/fonts/iosevka-custom/iosevka-custom.css">
<script>window.backupDefine = window.define; window.define = undefined;</script><script src="https://cdn.jsdelivr.net/npm/katex@latest/dist/katex.min.js"></script>
<script>document.addEventListener("DOMContentLoaded", function () {
var mathElements = document.getElementsByClassName("math");
var macros = [];
for (var i = 0; i < mathElements.length; i++) {
var texText = mathElements[i].firstChild;
if (mathElements[i].tagName == "SPAN") {
katex.render(texText.data, mathElements[i], {
displayMode: mathElements[i].classList.contains('display'),
throwOnError: false,
macros: macros,
fleqn: false
});
}}});
</script>
<script>window.define = window.backupDefine; window.backupDefine = undefined;</script><link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/katex@latest/dist/katex.min.css">
<script async="" src="https://www.googletagmanager.com/gtag/js?id=G-XVM2Y822Y1"></script>
<script type="text/javascript">
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'G-XVM2Y822Y1', { 'anonymize_ip': true});
</script>
<link href="../../site_libs/revealjs/plugin/quarto-line-highlight/line-highlight.css" rel="stylesheet">
<link href="../../site_libs/revealjs/plugin/reveal-menu/menu.css" rel="stylesheet">
<link href="../../site_libs/revealjs/plugin/reveal-menu/quarto-menu.css" rel="stylesheet">
<link href="../../site_libs/revealjs/plugin/reveal-chalkboard/font-awesome/css/all.css" rel="stylesheet">
<link href="../../site_libs/revealjs/plugin/reveal-chalkboard/style.css" rel="stylesheet">
<link href="../../site_libs/revealjs/plugin/quarto-support/footer.css" rel="stylesheet">
<style type="text/css">
.reveal div.sourceCode {
margin: 0;
overflow: auto;
}
.reveal div.hanging-indent {
margin-left: 1em;
text-indent: -1em;
}
.reveal .slide:not(.center) {
height: 100%;
overflow-y: auto;
}
.reveal .slide.scrollable {
overflow-y: auto;
}
.reveal .footnotes {
height: 100%;
overflow-y: auto;
}
.reveal .slide .absolute {
position: absolute;
display: block;
}
.reveal .footnotes ol {
counter-reset: ol;
list-style-type: none;
margin-left: 0;
}
.reveal .footnotes ol li:before {
counter-increment: ol;
content: counter(ol) ". ";
}
.reveal .footnotes ol li > p:first-child {
display: inline-block;
}
.reveal .slide ul,
.reveal .slide ol {
margin-bottom: 0.5em;
}
.reveal .slide ul li,
.reveal .slide ol li {
margin-top: 0.4em;
margin-bottom: 0.2em;
}
.reveal .slide ul[role="tablist"] li {
margin-bottom: 0;
}
.reveal .slide ul li > *:first-child,
.reveal .slide ol li > *:first-child {
margin-block-start: 0;
}
.reveal .slide ul li > *:last-child,
.reveal .slide ol li > *:last-child {
margin-block-end: 0;
}
.reveal .slide .columns:nth-child(3) {
margin-block-start: 0.8em;
}
.reveal blockquote {
box-shadow: none;
}
.reveal .tippy-content>* {
margin-top: 0.2em;
margin-bottom: 0.7em;
}
.reveal .tippy-content>*:last-child {
margin-bottom: 0.2em;
}
.reveal .slide > img.stretch.quarto-figure-center,
.reveal .slide > img.r-stretch.quarto-figure-center {
display: block;
margin-left: auto;
margin-right: auto;
}
.reveal .slide > img.stretch.quarto-figure-left,
.reveal .slide > img.r-stretch.quarto-figure-left {
display: block;
margin-left: 0;
margin-right: auto;
}
.reveal .slide > img.stretch.quarto-figure-right,
.reveal .slide > img.r-stretch.quarto-figure-right {
display: block;
margin-left: auto;
margin-right: 0;
}
</style>
<meta name="mermaid-theme" content="neutral">
<script src="../../site_libs/quarto-diagram/mermaid.min.js"></script>
<script src="../../site_libs/quarto-diagram/mermaid-init.js"></script>
<link href="../../site_libs/quarto-diagram/mermaid.css" rel="stylesheet">
<!-- Google Tag Manager -->
<script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src='https://www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);})(window,document,'script','dataLayer','GTM-TC329HJ');</script>
<!-- End Google Tag Manager -->
<link rel="preconnect" href="https://fonts.googleapis.com">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin="">
<link href="https://fonts.googleapis.com/css2?family=IBM+Plex+Mono:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;1,100;1,200;1,300;1,400;1,500;1,600;1,700&family=IBM+Plex+Sans+Condensed:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;1,100;1,200;1,300;1,400;1,500;1,600;1,700&family=IBM+Plex+Sans:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;1,100;1,200;1,300;1,400;1,500;1,600;1,700&family=IBM+Plex+Serif:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;1,100;1,200;1,300;1,400;1,500;1,600;1,700&display=swap" rel="stylesheet">
<link href="https://iosevka-webfonts.github.io/iosevka/Iosevka.css" rel="stylesheet">
<meta property="og:title" content="Training LLMs at Scale">
<meta property="og:description" content="Training LLMs at Scale">
<meta property="og:image" content="https://samforeman.me/talks/llms-at-scale/assets/thumbnail.png">
<meta property="og:site_name" content="Sam Foreman">
<meta property="og:image:height" content="1600">
<meta property="og:image:width" content="3840">
<meta name="twitter:title" content="Training LLMs at Scale">
<meta name="twitter:description" content="Training LLMs at Scale">
<meta name="twitter:image" content="https://samforeman.me/talks/llms-at-scale/assets/thumbnail.png">
<meta name="twitter:creator" content="saforem2">
<meta name="twitter:site" content="saforem2">
<meta name="twitter:card" content="summary">
<meta name="twitter:image-height" content="1600">
<meta name="twitter:image-width" content="3840">
<meta name="citation_title" content="Training LLMs at Scale">
<meta name="citation_author" content="Sam Foreman">
<meta name="citation_publication_date" content="2024-08-09">
<meta name="citation_cover_date" content="2024-08-09">
<meta name="citation_year" content="2024">
<meta name="citation_online_date" content="2024-08-09">
<meta name="citation_fulltext_html_url" content="https://samforeman.me/talks/llms-at-scale">
<meta name="citation_language" content="en">
<meta name="citation_reference" content="citation_title=Superconductivity of in and sn samples;,citation_author=George Deamont;,citation_author=Sam Foreman;,citation_publication_date=2014;,citation_cover_date=2014;,citation_year=2014;">
<meta name="citation_reference" content="citation_title=RG-inspired machine learning for lattice field theory;,citation_author=Sam Foreman;,citation_author=Joel Giedt;,citation_author=Yannick Meurice;,citation_author=Judah Unmuth-Yockey;,citation_publication_date=2018;,citation_cover_date=2018;,citation_year=2018;,citation_volume=175;,citation_conference_title=EPJ web of conferences;,citation_conference=EDP Sciences;">
<meta name="citation_reference" content="citation_title=Large energy density in three-plate nanocapacitors due to coulomb blockade;,citation_author=A Hubler;,citation_author=S Foreman;,citation_author=J Liu;,citation_author=L Wortsmann;,citation_publication_date=2018;,citation_cover_date=2018;,citation_year=2018;,citation_issue=10;,citation_volume=123;,citation_journal_title=Journal of Applied Physics;,citation_publisher=AIP Publishing;">
<meta name="citation_reference" content="citation_title=Examples of renormalization group transformations for image sets;,citation_author=Samuel Foreman;,citation_author=Joel Giedt;,citation_author=Yannick Meurice;,citation_author=Judah Unmuth-Yockey;,citation_publication_date=2018;,citation_cover_date=2018;,citation_year=2018;,citation_issue=5;,citation_volume=98;,citation_journal_title=Physical Review E;,citation_publisher=American Physical Society;">
<meta name="citation_reference" content="citation_title=Machine learning inspired analysis of the ising model transition;,citation_author=Samuel Foreman;,citation_author=Joel Giedt;,citation_author=Yannick Meurice;,citation_author=Judah Unmuth-Yockey;,citation_publication_date=2018;,citation_cover_date=2018;,citation_year=2018;,citation_conference_title=Lattice 2018;">
<meta name="citation_reference" content="citation_title=Learning better physics: A machine learning approach to lattice gauge theory;,citation_author=Samuel Alfred Foreman;,citation_publication_date=2019;,citation_cover_date=2019;,citation_year=2019;,citation_dissertation_institution=University of Iowa;">
<meta name="citation_reference" content="citation_title=Machine learning and neural networks for field theory;,citation_author=Sam Foreman;,citation_author=Xiao-Yong Jin;,citation_author=James C Osborn;,citation_publication_date=2020;,citation_cover_date=2020;,citation_year=2020;">
<meta name="citation_reference" content="citation_title=Deep learning hamiltonian monte carlo;,citation_author=Sam Foreman;,citation_author=Xiao-Yong Jin;,citation_author=James C Osborn;,citation_publication_date=2021;,citation_cover_date=2021;,citation_year=2021;,citation_journal_title=arXiv preprint arXiv:2105.03418;">
<meta name="citation_reference" content="citation_title=HMC with normalizing flows;,citation_author=Sam Foreman;,citation_author=Taku Izubuchi;,citation_author=Luchang Jin;,citation_author=Xiao-Yong Jin;,citation_author=James C Osborn;,citation_author=Akio Tomiya;,citation_publication_date=2021;,citation_cover_date=2021;,citation_year=2021;,citation_journal_title=arXiv preprint arXiv:2112.01586;">
<meta name="citation_reference" content="citation_title=LeapfrogLayers: A trainable framework for effective topological sampling;,citation_author=Sam Foreman;,citation_author=Xiao-Yong Jin;,citation_author=James C Osborn;,citation_publication_date=2021;,citation_cover_date=2021;,citation_year=2021;,citation_journal_title=arXiv preprint arXiv:2112.01582;">
<meta name="citation_reference" content="citation_title=Energy storage in quantum resonators;,citation_author=Jiaqi Liu;,citation_author=Alfred W Hubler;,citation_author=Samuel Alfred Foreman;,citation_author=Katharina Ott;,citation_publication_date=2017;,citation_cover_date=2017;,citation_year=2017;">
<meta name="citation_reference" content="citation_title=Applications of machine learning to lattice quantum field theory;,citation_author=Denis Boyda;,citation_author=Salvatore Calı̀;,citation_author=Sam Foreman;,citation_author=Lena Funcke;,citation_author=Daniel C Hackett;,citation_author=Yin Lin;,citation_author=Gert Aarts;,citation_author=Andrei Alexandru;,citation_author=Xiao-Yong Jin;,citation_author=Biagio Lucini;,citation_author=others;,citation_publication_date=2022;,citation_cover_date=2022;,citation_year=2022;,citation_journal_title=arXiv preprint arXiv:2202.05838;">
<meta name="citation_reference" content="citation_title=Lattice QCD and particle physics;,citation_author=Andreas S Kronfeld;,citation_author=Tanmoy Bhattacharya;,citation_author=Thomas Blum;,citation_author=Norman H Christ;,citation_author=Carleton DeTar;,citation_author=William Detmold;,citation_author=Robert Edwards;,citation_author=Anna Hasenfratz;,citation_author=Huey-Wen Lin;,citation_author=Swagato Mukherjee;,citation_author=others;,citation_publication_date=2022;,citation_cover_date=2022;,citation_year=2022;,citation_journal_title=arXiv preprint arXiv:2207.07641;">
<meta name="citation_reference" content="citation_title=GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics;,citation_author=Maxim Zvyagin;,citation_author=Alexander Brace;,citation_author=Kyle Hippe;,citation_author=Yuntian Deng;,citation_author=Bin Zhang;,citation_author=Cindy Orozco Bohorquez;,citation_author=Austin Clyde;,citation_author=Bharat Kale;,citation_author=Danilo Perez-Rivera;,citation_author=Heng Ma;,citation_author=others;,citation_publication_date=2023;,citation_cover_date=2023;,citation_year=2023;,citation_issue=6;,citation_volume=37;,citation_journal_title=The International Journal of High Performance Computing Applications;,citation_publisher=SAGE Publications Sage UK: London, England;">
<meta name="citation_reference" content="citation_title=MLMC: Machine learning monte carlo;,citation_author=Sam Foreman;,citation_publication_date=2023;,citation_cover_date=2023;,citation_year=2023;,citation_conference_title=The international symposium on lattice field theory;">
<meta name="citation_reference" content="citation_title=Superconductivity of in and sn samples;,citation_author=George Deamont;,citation_author=Sam Foreman;,citation_publication_date=2014;,citation_cover_date=2014;,citation_year=2014;">
<meta name="citation_reference" content="citation_title=A comprehensive performance study of large language models on novel AI accelerators;,citation_author=Murali Emani;,citation_author=Sam Foreman;,citation_author=Varuni Sastry;,citation_author=Zhen Xie;,citation_author=Siddhisanket Raskar;,citation_author=William Arnold;,citation_author=Rajeev Thakur;,citation_author=Venkatram Vishwanath;,citation_author=Michael E Papka;,citation_publication_date=2023;,citation_cover_date=2023;,citation_year=2023;,citation_journal_title=arXiv preprint arXiv:2310.04607;">
<meta name="citation_reference" content="citation_title=DeepSpeed4Science initiative: Enabling large-scale scientific discovery through sophisticated AI system technologies;,citation_author=Shuaiwen Leon Song;,citation_author=Bonnie Kruft;,citation_author=Minjia Zhang;,citation_author=Conglong Li;,citation_author=Shiyang Chen;,citation_author=Chengming Zhang;,citation_author=Masahiro Tanaka;,citation_author=Xiaoxia Wu;,citation_author=Jeff Rasley;,citation_author=Ammar Ahmad Awan;,citation_author=others;,citation_publication_date=2023;,citation_cover_date=2023;,citation_year=2023;,citation_journal_title=arXiv preprint arXiv:2310.04610;">
<meta name="citation_reference" content="citation_title=Protein generation via genome-scale language models with bio-physical scoring;,citation_author=Gautham Dharuman;,citation_author=Logan Ward;,citation_author=Heng Ma;,citation_author=Priyanka V Setty;,citation_author=Ozan Gokdemir;,citation_author=Sam Foreman;,citation_author=Murali Emani;,citation_author=Kyle Hippe;,citation_author=Alexander Brace;,citation_author=Kristopher Keipert;,citation_author=others;,citation_publication_date=2023;,citation_cover_date=2023;,citation_year=2023;,citation_conference_title=Proceedings of the SC’23 workshops of the international conference on high performance computing, network, storage, and analysis;">
<meta name="citation_reference" content="citation_title=MLMC: Machine learning monte carlo for lattice gauge theory;,citation_author=Sam Foreman;,citation_author=Xiao-Yong Jin;,citation_author=James C Osborn;,citation_publication_date=2023;,citation_cover_date=2023;,citation_year=2023;,citation_journal_title=arXiv preprint arXiv:2312.08936;">
<meta name="citation_reference" content="citation_title=Snowmass 2021 computational frontier CompF03 topical group report: Machine learning;,citation_author=Phiala Shanahan;,citation_author=Kazuhiro Terao;,citation_author=Daniel Whiteson;,citation_publication_date=2022;,citation_cover_date=2022;,citation_year=2022;,citation_journal_title=arXiv preprint arXiv:2209.07559;">
<meta name="citation_reference" content="citation_title=Thorough characterization and analysis of large transformer model training at-scale;,citation_author=Scott Cheng;,citation_author=Jun-Liang Lin;,citation_author=Murali Emani;,citation_author=Siddhisanket Raskar;,citation_author=Sam Foreman;,citation_author=Zhen Xie;,citation_author=Venkatram Vishwanath;,citation_author=Mahmut Taylan Kandemir;,citation_publication_date=2024;,citation_cover_date=2024;,citation_year=2024;,citation_issue=1;,citation_volume=8;,citation_journal_title=Proceedings of the ACM on Measurement and Analysis of Computing Systems;,citation_publisher=ACM New York, NY, USA;">
<meta name="citation_reference" content="citation_title=Communities through energy justice projects;,citation_author=Mary Ann Leung;,citation_author=Katharine Cahill;,citation_author=Rebecca Hartman-Baker;,citation_author=Paige Kinsley;,citation_author=Lois Curfman McInnes;,citation_author=Suzanne Parete-Koon;,citation_author=Subil Abraham;,citation_author=Lacy Beach Barrier;,citation_author=Gladys Chen;,citation_author=Lizanne DeStefano;,citation_author=others;,citation_publication_date=2024;,citation_cover_date=2024;,citation_year=2024;,citation_issue=1;,citation_volume=15;,citation_journal_title=Journal of Computational Science;">
<meta name="citation_reference" content="citation_title=Applications of a foundation model approach for weather and climate;,citation_author=Troy Arcomano;,citation_author=Alexander Wikner;,citation_author=Romit Maulik;,citation_author=Veerabhadra Rao Kotamarthi;,citation_author=Sam Foreman;,citation_publication_date=2023;,citation_cover_date=2023;,citation_year=2023;,citation_volume=2023;,citation_conference_title=AGU fall meeting abstracts;">
<meta name="citation_reference" content="citation_title=Toward a holistic performance evaluation of large language models across diverse ai accelerators;,citation_author=Murali Emani;,citation_author=Sam Foreman;,citation_author=Varuni Sastry;,citation_author=Zhen Xie;,citation_author=Siddhisanket Raskar;,citation_author=William Arnold;,citation_author=Rajeev Thakur;,citation_author=Venkatram Vishwanath;,citation_author=Michael E Papka;,citation_author=Sanjif Shanmugavelu;,citation_author=others;,citation_publication_date=2024;,citation_cover_date=2024;,citation_year=2024;,citation_conference_title=2024 IEEE international parallel and distributed processing symposium workshops (IPDPSW);,citation_conference=IEEE;">
<meta name="citation_reference" content="citation_title=Intro to HPC bootcamp: Engaging new communities through energy justice projects;,citation_author=Suzanne Parete-Koon;,citation_author=Michael Sandoval;,citation_author=Kellen Leland;,citation_author=Subil Abraham;,citation_author=Mary Ann Leung;,citation_author=Rebecca Hartman-Baker;,citation_author=Paige Kinsley;,citation_author=Lois McInnes;,citation_author=Sreeranjani Ramprakash;,citation_author=Lacy Beach Barrier;,citation_author=others;,citation_publication_date=2024;,citation_cover_date=2024;,citation_year=2024;,citation_issue=1;,citation_volume=15;,citation_journal_title=Journal of Computational Science Education;,citation_publisher=Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States);">
<meta name="citation_reference" content="citation_title=MProt-DPO: Breaking the ExaFLOPS barrier for multimodal protein design workflows with direct preference optimization;,citation_author=Gautham Dharuman;,citation_author=Kyle Hippe;,citation_author=Alexander Brace;,citation_author=Sam Foreman;,citation_author=Väinä Hatanpää;,citation_author=Varuni K Sastry;,citation_author=Huihuo Zheng;,citation_author=Logan Ward;,citation_author=Servesh Muralidharan;,citation_author=Archit Vasan;,citation_author=others;,citation_publication_date=2024;,citation_cover_date=2024;,citation_year=2024;,citation_conference_title=2024 SC24: International conference for high performance computing, networking, storage and analysis SC;,citation_conference=IEEE Computer Society;">
<meta name="citation_reference" content="citation_title=Emergent abilities of large language models;,citation_author=Jason Wei;,citation_author=Yi Tay;,citation_author=Rishi Bommasani;,citation_author=Colin Raffel;,citation_author=Barret Zoph;,citation_author=Sebastian Borgeaud;,citation_author=Dani Yogatama;,citation_author=Maarten Bosma;,citation_author=Denny Zhou;,citation_author=Donald Metzler;,citation_author=Ed H. Chi;,citation_author=Tatsunori Hashimoto;,citation_author=Oriol Vinyals;,citation_author=Percy Liang;,citation_author=Jeff Dean;,citation_author=William Fedus;,citation_publication_date=2022;,citation_cover_date=2022;,citation_year=2022;,citation_fulltext_html_url=https://arxiv.org/abs/2206.07682;">
<meta name="citation_reference" content="citation_title=DeepSpeed4Science initiative: Enabling large-scale scientific discovery through sophisticated AI system technologies;,citation_author=Shuaiwen Leon Song;,citation_author=Bonnie Kruft;,citation_author=Minjia Zhang;,citation_author=Conglong Li;,citation_author=Shiyang Chen;,citation_author=Chengming Zhang;,citation_author=Masahiro Tanaka;,citation_author=Xiaoxia Wu;,citation_author=Jeff Rasley;,citation_author=Ammar Ahmad Awan;,citation_author=Connor Holmes;,citation_author=Martin Cai;,citation_author=Adam Ghanem;,citation_author=Zhongzhu Zhou;,citation_author=Yuxiong He;,citation_author=Pete Luferenko;,citation_author=Divya Kumar;,citation_author=Jonathan Weyn;,citation_author=Ruixiong Zhang;,citation_author=Sylwester Klocek;,citation_author=Volodymyr Vragov;,citation_author=Mohammed AlQuraishi;,citation_author=Gustaf Ahdritz;,citation_author=Christina Floristean;,citation_author=Cristina Negri;,citation_author=Rao Kotamarthi;,citation_author=Venkatram Vishwanath;,citation_author=Arvind Ramanathan;,citation_author=Sam Foreman;,citation_author=Kyle Hippe;,citation_author=Troy Arcomano;,citation_author=Romit Maulik;,citation_author=Maxim Zvyagin;,citation_author=Alexander Brace;,citation_author=Bin Zhang;,citation_author=Cindy Orozco Bohorquez;,citation_author=Austin Clyde;,citation_author=Bharat Kale;,citation_author=Danilo Perez-Rivera;,citation_author=Heng Ma;,citation_author=Carla M. Mann;,citation_author=Michael Irvin;,citation_author=J. Gregory Pauloski;,citation_author=Logan Ward;,citation_author=Valerie Hayot;,citation_author=Murali Emani;,citation_author=Zhen Xie;,citation_author=Diangen Lin;,citation_author=Maulik Shukla;,citation_author=Ian Foster;,citation_author=James J. Davis;,citation_author=Michael E. Papka;,citation_author=Thomas Brettin;,citation_author=Prasanna Balaprakash;,citation_author=Gina Tourassi;,citation_author=John Gounley;,citation_author=Heidi Hanson;,citation_author=Thomas E Potok;,citation_author=Massimiliano Lupo Pasini;,citation_author=Kate Evans;,citation_author=Dan Lu;,citation_author=Dalton Lunga;,citation_author=Junqi Yin;,citation_author=Sajal Dash;,citation_author=Feiyi Wang;,citation_author=Mallikarjun Shankar;,citation_author=Isaac Lyngaas;,citation_author=Xiao Wang;,citation_author=Guojing Cong;,citation_author=Pei Zhang;,citation_author=Ming Fan;,citation_author=Siyan Liu;,citation_author=Adolfy Hoisie;,citation_author=Shinjae Yoo;,citation_author=Yihui Ren;,citation_author=William Tang;,citation_author=Kyle Felker;,citation_author=Alexey Svyatkovskiy;,citation_author=Hang Liu;,citation_author=Ashwin Aji;,citation_author=Angela Dalton;,citation_author=Michael Schulte;,citation_author=Karl Schulz;,citation_author=Yuntian Deng;,citation_author=Weili Nie;,citation_author=Josh Romero;,citation_author=Christian Dallago;,citation_author=Arash Vahdat;,citation_author=Chaowei Xiao;,citation_author=Thomas Gibbs;,citation_author=Anima Anandkumar;,citation_author=Rick Stevens;,citation_publication_date=2023;,citation_cover_date=2023;,citation_year=2023;,citation_fulltext_html_url=https://arxiv.org/abs/2310.04610;">
<meta name="citation_reference" content="citation_title=Emergent abilities of large language models;,citation_author=Jason Wei;,citation_author=Yi Tay;,citation_author=Rishi Bommasani;,citation_author=Colin Raffel;,citation_author=Barret Zoph;,citation_author=Sebastian Borgeaud;,citation_author=Dani Yogatama;,citation_author=Maarten Bosma;,citation_author=Denny Zhou;,citation_author=Donald Metzler;,citation_author=Ed H. Chi;,citation_author=Tatsunori Hashimoto;,citation_author=Oriol Vinyals;,citation_author=Percy Liang;,citation_author=Jeff Dean;,citation_author=William Fedus;,citation_publication_date=2022;,citation_cover_date=2022;,citation_year=2022;,citation_fulltext_html_url=https://arxiv.org/abs/2206.07682;">
<meta name="citation_reference" content="citation_title=The climate risk &amp;amp; resilience portal (ClimRR) metadata and data dictionary;,citation_author=C. Burdi;,citation_author=Wall. T Branham;,citation_publication_date=2023;,citation_cover_date=2023;,citation_year=2023;,citation_fulltext_html_url=https://dub.sh/ClimRR-Metadata;">
<meta name="citation_reference" content="citation_title=Progress on $(g-2)_\mu$ from lattice QCD;,citation_author=Hartmut Wittig;,citation_publication_date=2023;,citation_cover_date=2023;,citation_year=2023;,citation_fulltext_html_url=https://arxiv.org/abs/2306.04165;">
<meta name="citation_reference" content="citation_title=Hybrid Monte Carlo;,citation_author=S. Duane;,citation_author=A. D. Kennedy;,citation_author=B. J. Pendleton;,citation_author=D. Roweth;,citation_publication_date=1987;,citation_cover_date=1987;,citation_year=1987;,citation_doi=10.1016/0370-2693(87)91197-X;,citation_volume=195;,citation_journal_title=Phys. Lett. B;">
<meta name="citation_reference" content="citation_title=Snowmass 2021 Computational Frontier CompF03 Topical Group Report: Machine Learning;,citation_author=Phiala Shanahan;,citation_author=others;,citation_publication_date=2022-09;,citation_cover_date=2022-09;,citation_year=2022;,citation_fulltext_html_url=https://arxiv.org/abs/2209.07559;">
<meta name="citation_reference" content="citation_title=Applications of Machine Learning to Lattice Quantum Field Theory;,citation_author=Denis Boyda;,citation_author=others;,citation_publication_date=2022-02;,citation_cover_date=2022-02;,citation_year=2022;,citation_fulltext_html_url=https://arxiv.org/abs/2202.05838;,citation_conference_title=Snowmass 2021;">
<meta name="citation_reference" content="citation_title=HMC with Normalizing Flows;,citation_author=Sam Foreman;,citation_author=Taku Izubuchi;,citation_author=Luchang Jin;,citation_author=Xiao-Yong Jin;,citation_author=James C. Osborn;,citation_author=Akio Tomiya;,citation_publication_date=2022;,citation_cover_date=2022;,citation_year=2022;,citation_fulltext_html_url=https://arxiv.org/abs/2112.01586;,citation_doi=10.22323/1.396.0073;,citation_volume=LATTICE2021;,citation_journal_title=PoS;">
<meta name="citation_reference" content="citation_title=LeapfrogLayers: A Trainable Framework for Effective Topological Sampling;,citation_author=Sam Foreman;,citation_author=Xiao-Yong Jin;,citation_author=James C. Osborn;,citation_publication_date=2022;,citation_cover_date=2022;,citation_year=2022;,citation_fulltext_html_url=https://arxiv.org/abs/2112.01582;,citation_doi=10.22323/1.396.0508;,citation_volume=LATTICE2021;,citation_journal_title=PoS;">
<meta name="citation_reference" content="citation_title=Deep Learning Hamiltonian Monte Carlo;,citation_author=Sam Foreman;,citation_author=Xiao-Yong Jin;,citation_author=James C. Osborn;,citation_publication_date=2021-05;,citation_cover_date=2021-05;,citation_year=2021;,citation_fulltext_html_url=https://arxiv.org/abs/2105.03418;,citation_conference_title=9th International Conference on Learning Representations;">
<meta name="citation_reference" content="citation_title=Energy Justice Analysis of Climate Data with ClimRR;,citation_author=Sam Foreman;,citation_publication_date=2023-08-07;,citation_cover_date=2023-08-07;,citation_year=2023;,citation_fulltext_html_url=https://saforem2.github.io/climate-analysis;,citation_language=en;">
<meta name="citation_reference" content="citation_author=Sam Foreman;,citation_publication_date=2023-08-19;,citation_cover_date=2023-08-19;,citation_year=2023;,citation_fulltext_html_url=https://saforem2.github.io/l2hmc-qcd;,citation_language=en;">
<meta name="citation_reference" content="citation_title=Deep learning hamiltonian monte carlo;,citation_author=Sam Foreman;,citation_author=Xiao-Yong Jin;,citation_author=James C. Osborn;,citation_publication_date=2021;,citation_cover_date=2021;,citation_year=2021;,citation_fulltext_html_url=https://arxiv.org/abs/2105.03418;">
<meta name="citation_reference" content="citation_title=MLMC: Machine learning monte carlo for lattice gauge theory;,citation_author=Sam Foreman;,citation_author=Xiao-Yong Jin;,citation_author=James Osborn;,citation_publication_date=00;,citation_cover_date=00;,citation_year=0;,citation_conference_title=40th international symposium on lattice field theory (lattice 2023) (batavia, IL, united states, 07/31/2023 - 08/04/2023);">
<meta name="citation_reference" content="citation_title=Progress on $(g-2)_\mu$ from lattice QCD;,citation_author=Hartmut Wittig;,citation_publication_date=2023;,citation_cover_date=2023;,citation_year=2023;,citation_fulltext_html_url=https://arxiv.org/abs/2306.04165;">
<meta name="citation_reference" content="citation_title=Hybrid Monte Carlo;,citation_author=S. Duane;,citation_author=A. D. Kennedy;,citation_author=B. J. Pendleton;,citation_author=D. Roweth;,citation_publication_date=1987;,citation_cover_date=1987;,citation_year=1987;,citation_doi=10.1016/0370-2693(87)91197-X;,citation_volume=195;,citation_journal_title=Phys. Lett. B;">
<meta name="citation_reference" content="citation_title=Snowmass 2021 Computational Frontier CompF03 Topical Group Report: Machine Learning;,citation_author=Phiala Shanahan;,citation_author=others;,citation_publication_date=2022-09;,citation_cover_date=2022-09;,citation_year=2022;,citation_fulltext_html_url=https://arxiv.org/abs/2209.07559;">
<meta name="citation_reference" content="citation_title=Applications of Machine Learning to Lattice Quantum Field Theory;,citation_author=Denis Boyda;,citation_author=others;,citation_publication_date=2022-02;,citation_cover_date=2022-02;,citation_year=2022;,citation_fulltext_html_url=https://arxiv.org/abs/2202.05838;,citation_conference_title=Snowmass 2021;">
<meta name="citation_reference" content="citation_title=LeapfrogLayers: A Trainable Framework for Effective Topological Sampling;,citation_author=Sam Foreman;,citation_author=Xiao-Yong Jin;,citation_author=James C. Osborn;,citation_publication_date=2022-05;,citation_cover_date=2022-05;,citation_year=2022;,citation_fulltext_html_url=https://arxiv.org/abs/2112.01582;,citation_doi=10.22323/1.396.0508;,citation_volume=LATTICE2021;,citation_journal_title=PoS;">
<meta name="citation_reference" content="citation_title=HMC with Normalizing Flows;,citation_author=Sam Foreman;,citation_author=Taku Izubuchi;,citation_author=Luchang Jin;,citation_author=Xiao-Yong Jin;,citation_author=James C. Osborn;,citation_author=Akio Tomiya;,citation_publication_date=2022;,citation_cover_date=2022;,citation_year=2022;,citation_fulltext_html_url=https://arxiv.org/abs/2112.01586;,citation_doi=10.22323/1.396.0073;,citation_volume=LATTICE2021;,citation_journal_title=PoS;">
<meta name="citation_reference" content="citation_title=Deep Learning Hamiltonian Monte Carlo;,citation_author=Sam Foreman;,citation_author=Xiao-Yong Jin;,citation_author=James C. Osborn;,citation_publication_date=2021-05;,citation_cover_date=2021-05;,citation_year=2021;,citation_fulltext_html_url=https://arxiv.org/abs/2105.03418;,citation_conference_title=9th International Conference on Learning Representations;">
<meta name="citation_reference" content="citation_title=Mastering language models;,citation_author=Samuel Montgomery;,citation_publication_date=2023-10;,citation_cover_date=2023-10;,citation_year=2023;,citation_fulltext_html_url=https://towardsdatascience.com/mastering-language-models-32e1d891511a
;,citation_journal_title=Medium;,citation_publisher=Towards Data Science;">
<meta name="citation_reference" content="citation_title=Harnessing the power of LLMs in practice: A survey on ChatGPT and beyond;,citation_author=Jingfeng Yang;,citation_author=Hongye Jin;,citation_author=Ruixiang Tang;,citation_author=Xiaotian Han;,citation_author=Qizhang Feng;,citation_author=Haoming Jiang;,citation_author=Bing Yin;,citation_author=Xia Hu;,citation_publication_date=2023;,citation_cover_date=2023;,citation_year=2023;,citation_fulltext_html_url=https://arxiv.org/abs/2304.13712;">
<meta name="citation_reference" content="citation_title=Training tips for the transformer model;,citation_author=Martin Popel;,citation_author=Ondřej Bojar;,citation_publication_date=2018-04;,citation_cover_date=2018-04;,citation_year=2018;,citation_fulltext_html_url=https://doi.org/10.2478%2Fpralin-2018-0002;,citation_issue=1;,citation_doi=10.2478/pralin-2018-0002;,citation_volume=110;,citation_journal_title=The Prague Bulletin of Mathematical Linguistics;,citation_publisher=Charles University in Prague, Karolinum Press;">
<meta name="citation_reference" content="citation_title=Attention is all you need;,citation_author=Ashish Vaswani;,citation_author=Noam Shazeer;,citation_author=Niki Parmar;,citation_author=Jakob Uszkoreit;,citation_author=Llion Jones;,citation_author=Aidan N. Gomez;,citation_author=Lukasz Kaiser;,citation_author=Illia Polosukhin;,citation_publication_date=2017;,citation_cover_date=2017;,citation_year=2017;,citation_fulltext_html_url=https://arxiv.org/abs/1706.03762;">
<meta name="citation_reference" content="citation_title=Tree of thoughts: Deliberate problem solving with large language models;,citation_author=Shunyu Yao;,citation_author=Dian Yu;,citation_author=Jeffrey Zhao;,citation_author=Izhak Shafran;,citation_author=Thomas L. Griffiths;,citation_author=Yuan Cao;,citation_author=Karthik Narasimhan;,citation_publication_date=2023;,citation_cover_date=2023;,citation_year=2023;,citation_fulltext_html_url=https://arxiv.org/abs/2305.10601;">
<meta name="citation_reference" content="citation_title=GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics;,citation_abstract=We seek to transform how new and emergent variants of pandemiccausing viruses, specifically SARS-CoV-2, are identified and classified. By adapting large language models (LLMs) for genomic data, we build genome-scale language models (GenSLMs) which can learn the evolutionary landscape of SARS-CoV-2 genomes. By pretraining on over 110 million prokaryotic gene sequences and finetuning a SARS-CoV-2-specific model on 1.5 million genomes, we show that GenSLMs can accurately and rapidly identify variants of concern. Thus, to our knowledge, GenSLMs represents one of the first whole genome scale foundation models which can generalize to other prediction tasks. We demonstrate scaling of GenSLMs on GPU-based supercomputers and AI-hardware accelerators utilizing 1.63 Zettaflops in training runs with a sustained performance of 121 PFLOPS in mixed precision and peak of 850 PFLOPS. We present initial scientific insights from examining GenSLMs in tracking evolutionary dynamics of SARS-CoV-2, paving the path to realizing this on large biological data.Competing Interest StatementThe authors have declared no competing interest.;,citation_author=Maxim Zvyagin;,citation_author=Alexander Brace;,citation_author=Kyle Hippe;,citation_author=Yuntian Deng;,citation_author=Bin Zhang;,citation_author=Cindy Orozco Bohorquez;,citation_author=Austin Clyde;,citation_author=Bharat Kale;,citation_author=Danilo Perez-Rivera;,citation_author=Heng Ma;,citation_author=Carla M. Mann;,citation_author=Michael Irvin;,citation_author=J. Gregory Pauloski;,citation_author=Logan Ward;,citation_author=Valerie Hayot-Sasson;,citation_author=Murali Emani;,citation_author=Sam Foreman;,citation_author=Zhen Xie;,citation_author=Diangen Lin;,citation_author=Maulik Shukla;,citation_author=Weili Nie;,citation_author=Josh Romero;,citation_author=Christian Dallago;,citation_author=Arash Vahdat;,citation_author=Chaowei Xiao;,citation_author=Thomas Gibbs;,citation_author=Ian Foster;,citation_author=James J. Davis;,citation_author=Michael E. Papka;,citation_author=Thomas Brettin;,citation_author=Rick Stevens;,citation_author=Anima Anandkumar;,citation_author=Venkatram Vishwanath;,citation_author=Arvind Ramanathan;,citation_publication_date=2022;,citation_cover_date=2022;,citation_year=2022;,citation_fulltext_html_url=https://www.biorxiv.org/content/early/2022/11/23/2022.10.10.511571;,citation_doi=10.1101/2022.10.10.511571;,citation_journal_title=bioRxiv;,citation_publisher=Cold Spring Harbor Laboratory;">
</head>
<body class="quarto-light">
<!-- Google Tag Manager (noscript) -->
<noscript><iframe src="https://www.googletagmanager.com/ns.html?id=GTM-TC329HJ" height="0" width="0" style="display:none;visibility:hidden"></iframe></noscript>
<!-- End Google Tag Manager (noscript) -->
<div class="reveal">
<div class="slides">
<section id="title-slide" background-color="white" data-background-color="white" data-background-iframe="https://emilhvitfeldt.github.io/quarto-iframe-examples/colored-particles/index.html" data-background-size="contain" class="quarto-title-block center">
<div class="quarto-title-container" style="background-color: rgba(245,245,245, 0.875); border-radius: 10px; text-align:center; padding: 0px; padding-left: 1.5em; padding-right: 1.5em; max-width: min-content; min-width: max-content; margin-left: auto; margin-right: auto; padding-top: 0.2em; padding-bottom: 0.2em; line-height: 1.5em!important;">
<h1 class="title">Training LLMs at Scale</h1>
<p class="author">Sam Foreman</p>
<p class="date">2024-08-09</p>
<p class="location">@ ATPESC 2024</p>
<p class="slide-url"></p>
</div>
</section>
<section id="links" class="slide level2 center" data-background-color="white">
<h2>🔗 Links</h2>
<ul>
<li><p>🏡 <a href="https://samforeman.me">samforeman.me</a>:</p>
<ul>
<li>🦜 <a href="https://samforeman.me/talks/">Talks</a>:
<ul>
<li><a href="https://samforeman.me/talks/llms-at-scale/">Training LLMs at Scale</a> [<a href="https://samforeman.me/talks/llms-at-scale/slides.html">slides</a>]</li>
</ul></li>
<li>📦 <a href="https://github.com/saforem2/">Repos</a>:
<ul>
<li><a href="https://github.com/saforem2/ezpz">🍋 <code>saforem2/ezpz</code></a><br>
<span class="dim-text">Train your model across any number of arbitrary devices, ezpz.</span></li>
<li><a href="https://github.com/saforem2/wordplay">💬 <code>saforem2/wordplay</code></a><br>
<span class="dim-text">Playing with words.</span></li>
<li><a href="https://github.com/argonne-lcf/Megatron-DeepSpeed">🏎️ <code>argonne-lcf/Megatron-DeepSpeed</code></a><br>
<span class="dim-text">For only the largest of large language models.</span> </li>
</ul></li>
</ul></li>
</ul>
</section>
<section id="about-me" class="title-slide slide level1 center" data-background-color="white">
<h1>🧑🏻‍💻 About Me</h1>
<ul>
<li>Computational Scientist at Argonne National Laboratory (ALCF)</li>
<li>Interested in {AI, HPC} for science
<ul>
<li>working on scaling large (language, vision, multi-modal) models</li>
</ul></li>
</ul>
<p>As a member of the <a href="https://www.alcf.anl.gov/about/people/group/506">AI / ML Group</a> at <a href="https://alcf.anl.gov">ALCF</a>, I work on:</p>
<div class="flex-container">
<div class="flex-container">
<ul>
<li>🤖 🧪 <a href="https://github.com/saforem2/">AI + Science</a></li>
<li>🎲 <a href="https://github.com/saforem2/l2hmc-qcd">Building better sampling methods for Lattice QCD</a></li>
<li>🧬 <a href="https://www.biorxiv.org/content/10.1101/2022.10.10.511571v2">Genome-Scale Language Models</a>
<ul>
<li><a href="https://github.com/ramanathanlab/genslm"><iconify-icon role="img" inline="" icon="logos:github-octocat" aria-label="Icon github-octocat from logos Iconify.design set." title="Icon github-octocat from logos Iconify.design set."></iconify-icon> GenSLM</a></li>
<li>🥇 <a href="https://www.acm.org/media-center/2022/november/gordon-bell-special-prize-covid-research-2022">ACM Gordon Bell Special Prize</a></li>
</ul></li>
</ul>
</div>
<div class="flex-container">
<ul>
<li>🌍 <a href="https://saforem2.github.io/climate-analysis">Foundation models for long term climate forecasting</a></li>
<li>🏃‍♂️ <a href="https://github.com/argonne-lcf/Megatron-DeepSpeed">Scaling Large Language Models</a></li>
<li>🏎️ <a href="https://github.com/argonne-lcf/mlprof">Distributed training across thousands of GPUs</a></li>
</ul>
</div>
</div>
</section>
<section>
<section id="scaling-overview" class="title-slide slide level1 center scrollable" data-background-color="white">
<h1>🚀 Scaling: Overview</h1>
<ul>
<li>✅ <strong>Goal</strong>:
<ul>
<li>📈 Maximize: Performance </li>
<li>📉 Minimize: Cost<sup>1</sup>
<ul>
<li>or, equivalently, 📈 <strong>maximize</strong> data throughput<sup>2</sup></li>
</ul></li>
</ul></li>
</ul>
<aside><div>
<p><strong>Note</strong>: See <a href="https://huggingface.co/docs/transformers/v4.17.0/en/performance">🤗 Performance and Scalability: How To Fit a Bigger Model and Train It Faster</a> for more details</p>
</div><ol class="aside-footnotes"><li id="fn1"><p>Typically, the amount of time (💸) spent training</p></li><li id="fn2"><p>Typically want to utilize as much of GPU as possible</p></li></ol></aside></section>
<section id="ai-compute-historical" class="slide level2 centeredslide smaller center" data-background-color="white">
<h2>AI 🤝 Compute [Historical]</h2>
<div class="flex-container">
<div class="col1" style="font-size: 0.85em; width:35%;">
<ul>
<li><strong>First Era</strong>:
<ul>
<li>[1960 – 2012]</li>
<li><em>2 year</em> doubling (Moore’s law)
<ul>
<li><span class="math inline">\simeq 7\times</span> increase</li>
</ul></li>
</ul></li>
</ul>
<p> <br></p>
<ul>
<li><strong>Modern Era</strong>:
<ul>
<li>[2012 – present]</li>
<li><strong>3.4 month</strong> doubling
<ul>
<li><span class="math inline">\simeq \mathbf{300,000}\times</span> increase</li>
</ul></li>
</ul></li>
</ul>
</div>
<div class="quarto-figure quarto-figure-center">
<figure>
<p><a href="./assets/ai-and-compute-all.png" class="lightbox" data-gallery="quarto-lightbox-gallery-1" title="Source."><img data-src="./assets/ai-and-compute-all.png" alt="Source."></a></p>
<figcaption><a href="https://openai.com/research/ai-and-compute">Source.</a></figcaption>
</figure>
</div>
</div>
</section>
<section id="ai-compute-modern" class="slide level2 centeredslide smaller center" data-background-color="white">
<h2>AI 🤝 Compute [Modern]</h2>
<div class="flex-container">
<div class="col1" style="font-size: 0.85em; width:35%;">
<ul>
<li><strong>First Era</strong>:
<ul>
<li>[1960 – 2012]</li>
<li><em>2 year</em> doubling (Moore’s law)
<ul>
<li><span class="math inline">\simeq 7\times</span> increase</li>
</ul></li>
</ul></li>
</ul>
<p> <br></p>
<ul>
<li><strong>Modern Era</strong>:
<ul>
<li>[2012 – present]</li>
<li><strong>3.4 month</strong> doubling
<ul>
<li><span class="math inline">\simeq \mathbf{300,000}\times</span> increase</li>
</ul></li>
</ul></li>
</ul>
</div>
<div class="col2">
<div class="quarto-figure quarto-figure-center">
<figure>
<p><a href="./assets/ai-and-compute-modern-log.png" class="lightbox" data-gallery="quarto-lightbox-gallery-2" title="Source."><img data-src="./assets/ai-and-compute-modern-log.png" alt="Source."></a></p>
<figcaption><a href="https://openai.com/research/ai-and-compute">Source.</a></figcaption>
</figure>
</div>
</div>
</div>
</section></section>
<section id="parallelism-concepts" class="title-slide slide level1 scroll-container smaller center scrollable" data-background-color="white" data-scrollable="true" style="max-height: 700px; overflow-y: scroll;">
<h1>Parallelism Concepts</h1>
<div class="panel-tabset" style="font-size: 0.9em;">
<ul id="tabset-1" class="panel-tabset-tabby"><li><a data-tabby-default="" href="#tabset-1-1">Single GPU</a></li><li><a href="#tabset-1-2">Data Parallel (DP)</a></li><li><a href="#tabset-1-3">Tensor Parallel (TP)</a></li><li><a href="#tabset-1-4">Pipeline Parallel (PP)</a></li><li><a href="#tabset-1-5"><iconify-icon role="img" inline="" icon="logos:microsoft-icon" aria-label="Icon microsoft-icon from logos Iconify.design set." title="Icon microsoft-icon from logos Iconify.design set."></iconify-icon> ZeRO</a></li><li><a href="#tabset-1-6">FSDP</a></li></ul>
<div class="tab-content" style="font-size: 0.9em;">
<div id="tabset-1-1">
<div id="fig-single-gpu" class="quarto-float quarto-figure quarto-figure-center">
<figure class="quarto-float quarto-float-fig">
<div aria-describedby="fig-single-gpu-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<a href="./assets/single-gpu-step-1.drawio.svg" class="lightbox" data-gallery="quarto-lightbox-gallery-3" title="Figure 1: SLOW !! model size limited by GPU memory"><img data-src="./assets/single-gpu-step-1.drawio.svg"></a>
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-single-gpu-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure 1: <strong>SLOW</strong> !! model size limited by GPU memory
</figcaption>
</figure>
</div>
</div>
<div id="tabset-1-2">
<div class="flex-container">
<div class="col1" style="font-size: 0.85em; width:45%;">
<ul>
<li><p>The simplest and most common parallelism technique</p></li>
<li><p>Workers maintain <em>identical copies</em> of the <em>complete</em> model and work on a <em>subset of the data</em></p>
<ul>
<li>Multiple copies of <strong>the same setup</strong>
<ul>
<li>each copy gets fed <strong>unique</strong> data</li>
<li>all copies compute gradients w.r.t local model</li>
<li>everyone syncs up before updating weights</li>
</ul></li>
</ul></li>
<li><p>The processing is done in parallel and all setups are synchronized at the end of each training step (a minimal code sketch is shown below the figure).</p></li>
</ul>
</div>
<div id="fig-ddp-training" class="quarto-float quarto-figure quarto-figure-center">
<figure class="quarto-float quarto-float-fig">
<div aria-describedby="fig-ddp-training-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<a href="./assets/multi-gpu-ddp.drawio.svg" class="lightbox" data-gallery="quarto-lightbox-gallery-4" title="Figure 2: Data Parallel Training"><img data-src="./assets/multi-gpu-ddp.drawio.svg" style="width:90.0%"></a>
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-ddp-training-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure 2: Data Parallel Training
</figcaption>
</figure>
</div>
</div>
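<p style="font-size: 0.8em;"><span class="dim-text">A minimal, <em>illustrative</em> sketch of data-parallel training with PyTorch <code>DistributedDataParallel</code> (the model, sizes, and launch details below are placeholders, not the exact setup used here):</span></p>
<div class="sourceCode" style="font-size: 0.75em;"><pre class="sourceCode python"><code class="sourceCode python"># minimal DDP sketch; assumes launch via torchrun, which sets LOCAL_RANK etc.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")        # one process per GPU
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda()     # identical copy on every rank
model = DDP(model, device_ids=[local_rank])    # gradients are all-reduced for us

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(8, 1024).cuda()                # each rank gets a *unique* micro-batch
loss = model(x).sum()
loss.backward()                                # sync (all-reduce) happens here
optimizer.step()                               # all copies remain identical</code></pre></div>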
</div>
<div id="tabset-1-3">
<div class="flex-container">
<div class="col1" style="font-size: 0.85em; width:45%;">
<ul>
<li>Each tensor is split up into multiple chunks</li>
<li>So, instead of having the whole tensor reside on a single GPU, each shard of the tensor resides on its designated GPU
<ul>
<li>During processing, each shard is processed separately and in parallel on a different GPU, and the results are synced at the end of the step (see the sketch below).</li>
<li>This is sometimes called <em>horizontal</em> parallelism, since the splitting happens at the horizontal (intra-layer) level.</li>
</ul></li>
</ul>
</div>
<div id="fig-model-parallel-1" class="quarto-float quarto-figure quarto-figure-center">
<figure class="quarto-float quarto-float-fig">
<div aria-describedby="fig-model-parallel-1-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<a href="https://saforem2.github.io/distributed-training-slides/assets/model-parallel.svg" class="lightbox" data-gallery="quarto-lightbox-gallery-5" title="Figure 3: Tensor Parallel Training"><img data-src="https://saforem2.github.io/distributed-training-slides/assets/model-parallel.svg"></a>
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-model-parallel-1-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure 3: Tensor Parallel Training
</figcaption>
</figure>
</div>
</div>
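<p>A minimal single-process sketch of splitting a weight matrix column-wise into shards (sizes are hypothetical; in a real setup each shard would live on a different GPU):</p>
<div class="sourceCode"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">import torch

world_size = 4                                 # hypothetical: 4 GPUs
W = torch.randn(1024, 4096)                    # full weight matrix
shards = torch.chunk(W, world_size, dim=1)     # one column-shard per GPU

x = torch.randn(8, 1024)                       # input batch
partial = [x @ shard for shard in shards]      # each GPU computes its slice
y = torch.cat(partial, dim=1)                  # results synced at end of step
assert y.shape == (8, 4096)</code></pre></div>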
</div>
<div id="tabset-1-4">
<div class="flex-container">
<div class="col1" style="width:35%;">
<ul>
<li>Model is split up vertically (layer-level) across multiple GPUs, so that only one or a few layers of the model are placed on a single GPU
<ul>
<li>Each GPU processes a different stage of the pipeline in parallel, working on a small chunk of the batch (a minimal sketch follows the figure).</li>
</ul></li>
</ul>
</div>
<div id="fig-pipeline-parallelism" class="quarto-float quarto-figure quarto-figure-center">
<figure class="quarto-float quarto-float-fig">
<div aria-describedby="fig-pipeline-parallelism-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<a href="assets/pipeline_parallelism.png" class="lightbox" data-gallery="quarto-lightbox-gallery-6" title="Figure 4: Pipeline Parallelism"><img data-src="assets/pipeline_parallelism.png"></a>
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-pipeline-parallelism-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure 4: <a href="https://www.deepspeed.ai/tutorials/pipeline/">Pipeline Parallelism</a>
</figcaption>
</figure>
</div>
</div>
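<p>A naive sketch of the vertical (layer-level) split, assuming two GPUs are visible. Real pipeline engines (e.g. DeepSpeed's pipeline module) additionally split each batch into micro-batches so that all stages stay busy:</p>
<div class="sourceCode"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    """First half of the layers on cuda:0, second half on cuda:1."""
    def __init__(self):
        super().__init__()
        self.stage0 = nn.Sequential(nn.Linear(512, 512), nn.ReLU()).to("cuda:0")
        self.stage1 = nn.Sequential(nn.Linear(512, 10)).to("cuda:1")

    def forward(self, x):
        x = self.stage0(x.to("cuda:0"))
        return self.stage1(x.to("cuda:1"))     # activations move between GPUs

model = TwoStageModel()
out = model(torch.randn(32, 512))</code></pre></div>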
</div>
<div id="tabset-1-5">
<div id="fig-zero-stages" class="quarto-float quarto-figure quarto-figure-center">
<figure class="quarto-float quarto-float-fig">
<div aria-describedby="fig-zero-stages-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<a href="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/parallelism-zero.png" class="lightbox" data-gallery="quarto-lightbox-gallery-7" title="Figure 5: DeepSpeed + ZeRO"><img data-src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/parallelism-zero.png"></a>
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-zero-stages-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure 5: <a href="https://www.deepspeed.ai/">DeepSpeed</a> + <code>ZeRO</code>
</figcaption>
</figure>
</div>
<div class="flex-container">
<div class="col1" style="font-size: 0.75em;">
<ul>
<li><p>Shards tensors (~ similar to TP), <em>except</em>:</p>
<ul>
<li><p><strong>whole tensor</strong> gets reconstructed as needed</p></li>
<li><p>Doesn’t require model modifications !!</p></li>
</ul></li>
<li><p>Depending on the <code>ZeRO</code> stage (1, 2, 3), we can shard (partition) across workers:</p>
<ol type="1">
<li><p><strong>Stage 1</strong>: optimizer states</p></li>
<li><p><strong>Stage 2</strong>: gradients + opt. states</p></li>
<li><p><strong>Stage 3</strong>: model params + grads + opt. states</p></li>
</ol>
<p><span class="dim-text">with increasing <code>ZeRO</code> stage, we are able to free up increasing amounts of GPU memory</span></p></li>
</ul>
</div>
<div class="col2" style="font-size: 0.75em;">
<ul>
<li><p><code>ZeRO</code> Data Parallel</p>
<ul>
<li><code>ZeRO</code>-powered data parallelism is illustrated in the figure above</li>
</ul></li>
<li><p>It also supports various offloading techniques (CPU, NVMe) to compensate for limited GPU memory; a minimal configuration sketch follows below.</p></li>
<li><p>🔗 See also:</p>
<ul>
<li><a href="https://deepspeed.readthedocs.io/en/latest/zero3.html">ZeRO — DeepSpeed 0.14.5 documentation</a></li>
<li><a href="https://www.deepspeed.ai/tutorials/zero-offload/">ZeRO-Offload</a></li>
<li><a href="https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/">ZeRO & DeepSpeed: New system optimizations enable training models with over 100 billion parameters - Microsoft Research</a></li>
</ul></li>
</ul>
</div>
</div>
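<p>A minimal configuration sketch (values are illustrative and <code>model</code> is assumed to be an ordinary, unmodified <code>torch.nn.Module</code>); the <code>ZeRO</code> stage and optional offloading are selected entirely through the DeepSpeed config:</p>
<div class="sourceCode"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 3,                               # 1, 2, or 3 (see above)
        "offload_optimizer": {"device": "cpu"},   # optional CPU offload
    },
}

# DeepSpeed wraps the model and handles sharding + communication
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)</code></pre></div>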
</div>
<div id="tabset-1-6">
<div class="flex-container">
<div class="col1" style="width: 33%">
<ul>
<li>Instead of maintaining a per-GPU copy of <code>{params, grads, opt_states}</code>, FSDP shards (distributes) these across data-parallel workers (see the sketch below)
<ul>
<li>can optionally offload the sharded model params to CPU</li>
</ul></li>
</ul>
</div>
<div class="quarto-figure quarto-figure-center">
<figure>
<p><a href="assets/fsdp.png" class="lightbox" data-gallery="quarto-lightbox-gallery-8" title="FSDP Workflow. Source"><img data-src="assets/fsdp.png" alt="FSDP Workflow. Source"></a></p>
<figcaption>FSDP Workflow. <a href="https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/">Source</a></figcaption>
</figure>
</div>
</div>
<ul>
<li>🔗 See also:
<ul>
<li><a href="https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/">Introducing PyTorch Fully Sharded Data Parallel (FSDP) API | PyTorch</a></li>
</ul></li>
</ul>
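<p>A minimal sketch of wrapping an existing model with FSDP (assuming the distributed process group has already been initialized and a GPU has been selected):</p>
<div class="sourceCode"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# params, grads, and optimizer states are sharded across the data-parallel
# workers instead of being fully replicated on every GPU
model = FSDP(model, device_id=torch.cuda.current_device())

# build the optimizer *after* wrapping, so it sees the sharded parameters
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)</code></pre></div>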
</div>
</div>
</div>
<aside><div>
<p>See: <a href="https://huggingface.co/docs/transformers/v4.15.0/parallelism">🤗 Model Parallelism</a> for additional details</p>
</div></aside></section>
<section>
<section id="data-parallelism" class="title-slide slide level1 center" data-background-color="white">
<h1>Data Parallelism</h1>
</section>
<section id="data-parallel-training" class="slide level2 centeredslide smaller center" data-background-color="white" data-auto-animate="true">
<h2 data-id="quarto-animate-title">Data Parallel Training</h2>
<div class="flex-container">
<div class="col1" style="font-size: 0.85em; width:45%;">
<ul>
<li>Relatively simple to get up and running (minor modifications to code; see the sketch below)</li>
<li><i class="fa-brands fa-github" aria-label="github"></i> <a href="https://github.com/saforem2/ezpz"><code>saforem2/ezpz</code></a></li>
<li><a href="https://pytorch.org/docs/stable/notes/ddp.html">PyTorch – DDP</a></li>
<li><a href="https://www.deepspeed.ai/"><iconify-icon role="img" inline="" icon="logos:microsoft-icon" aria-label="Icon microsoft-icon from logos Iconify.design set." title="Icon microsoft-icon from logos Iconify.design set."></iconify-icon> DeepSpeed</a></li>
</ul>
</div>
<div id="fig-avgGrads" class="quarto-float quarto-figure quarto-figure-center">
<figure class="quarto-float quarto-float-fig">
<div aria-describedby="fig-avgGrads-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<a href="https://saforem2.github.io/distributed-training-slides/assets/avgGrads.svg" class="lightbox" data-gallery="quarto-lightbox-gallery-9" title="Figure 6: Data Parallelism"><img data-src="https://saforem2.github.io/distributed-training-slides/assets/avgGrads.svg"></a>
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-avgGrads-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure 6: Data Parallelism
</figcaption>
</figure>
</div>
</div>
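<p>A minimal DDP setup sketch (assuming the script is launched with <code>torchrun</code>, which sets the usual environment variables; <code>MyModel</code> is a placeholder for your own module):</p>
<div class="sourceCode"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")       # reads RANK / WORLD_SIZE from env
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = MyModel().to(local_rank)              # MyModel: your own nn.Module
model = DDP(model, device_ids=[local_rank])   # gradient averaging handled for you</code></pre></div>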
<aside><div>
<p>Also see: <a href="https://youtu.be/930yrXjNkgM">🎬 “Parallel Training Techniques”</a></p>
</div></aside></section>
<section id="deal-with-data" class="slide level2 smaller center scrollable" data-background-color="white">
<h2>Deal with Data</h2>
<ul>
<li><p>At each training step, we want to ensure that <strong>each worker receives unique data</strong></p></li>
<li><p>This can be done in one of two ways (a PyTorch sketch follows below):</p>
<ol type="1">
<li>Manually partition data (ahead of time) and assign different sections to different workers
<ol type="1">
<li>Each worker can only see their local portion of the data</li>
</ol></li>
<li>From each worker, randomly select a mini-batch
<ol type="1">
<li>Each worker can see the full dataset</li>
</ol></li>
</ol>
<div title="⚠️ Warning">
<div class="callout callout-warning no-icon callout-titled callout-style-default">
<div class="callout-body">
<div class="callout-title">
<p><strong>⚠️ Warning</strong></p>
</div>
<div class="callout-content">
<p>Don’t forget your seed!</p>
<p>When randomly selecting, it is important that each worker uses a different seed to ensure it receives unique data.</p>
</div>
</div>
</div>
</div></li>
</ul>
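<p>A sketch of how this is commonly handled in PyTorch with <code>DistributedSampler</code> (<code>dataset</code> and <code>epochs</code> are placeholders): every rank can see the full dataset, but the sampler deterministically assigns each rank a disjoint slice of the shuffled indices.</p>
<div class="sourceCode"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

sampler = DistributedSampler(dataset, shuffle=True, seed=42)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(epochs):
    sampler.set_epoch(epoch)   # re-shuffle each epoch, still disjoint per rank
    for batch in loader:
        ...</code></pre></div>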
</section>
<section id="broadcast-initial-state" class="slide level2 center" data-background-color="white">
<h2>Broadcast Initial State</h2>
<ul>
<li><p>At the start of training (or when loading from a checkpoint), we want all of our workers to be initialized consistently</p>
<ul>
<li><strong>Broadcast</strong> the model and optimizer states from the <code>rank() == 0</code> worker (sketched below)</li>
</ul></li>
</ul>
<div class="cell" data-reveal="true" data-layout-align="center">
<div class="cell-output-display">
<div>
<p></p><figure class=""><p></p>
<div>
<pre class="mermaid mermaid-js"> flowchart TD
0["GPU0"] --> 1["GPU 1"]
0 --> 2["GPU 2"]
0 -->|Model + Optimizer State| 3["GPU 3"]
0 --> ...
0 --> N["GPU N"]
</pre>
</div>
<p></p></figure><p></p>
</div>
</div>
</div>
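<p>A sketch of the broadcast with <code>torch.distributed</code> (note that <code>DistributedDataParallel</code> already broadcasts the module state from rank 0 when the model is wrapped):</p>
<div class="sourceCode"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">import torch.distributed as dist

def broadcast_initial_state(model, src=0):
    # after this call, every rank holds rank 0's parameters and buffers
    for tensor in list(model.parameters()) + list(model.buffers()):
        dist.broadcast(tensor.data, src=src)</code></pre></div>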
</section>
<section id="best-practices" class="slide level2 smaller center" data-background-color="white">
<h2>Best Practices</h2>
<div title="🤝 Keeping things in Sync">
<div class="callout callout-important no-icon callout-titled callout-style-default">
<div class="callout-body">
<div class="callout-title">
<p><strong>🤝 Keeping things in Sync</strong></p>
</div>
<div class="callout-content">
<p><strong>Computation stalls during communication !!</strong></p>
<p>Keeping the communication to computation ratio small is important for effective scaling.</p>
</div>
</div>
</div>
</div>
<ul>
<li>Use parallel IO whenever possible
<ul>
<li>Feed each rank from different files</li>
<li>Use MPI IO to have each rank read its own batch from a file</li>
<li>Use several ranks to read data, MPI to scatter to remaining ranks
<ul>
<li>Most practical in big <em>at-scale</em> training</li>
</ul></li>
</ul></li>
<li>Take advantage of data storage
<ul>
<li>Use <a href="https://wiki.lustre.org/Configuring_Lustre_File_Striping">striping on Lustre</a></li>
<li>Use the right optimizations for Aurora, Polaris, etc.</li>
</ul></li>
<li>Preload data when possible
<ul>
<li>Offloading to a GPU frees CPU cycles for loading the next batch of data
<ul>
<li><strong>minimize IO latency this way</strong></li>
</ul></li>
</ul></li>
</ul>
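<p>For the last point, a sketch of overlapping data loading with GPU compute using PyTorch's <code>DataLoader</code> (worker counts are illustrative; <code>dataset</code> is a placeholder):</p>
<div class="sourceCode"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,            # background CPU workers prepare upcoming batches
    pin_memory=True,          # page-locked host memory, faster host-to-device copies
    prefetch_factor=2,        # batches pre-fetched per worker
    persistent_workers=True,
)

for batch in loader:
    batch = batch.to("cuda", non_blocking=True)   # overlap the copy with compute
    ...</code></pre></div>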
</section>
<section id="why-distributed-training" class="slide level2 center scrollable" data-background-color="white">
<h2>Why Distributed Training?</h2>
<ul>
<li>Splitting data across workers <span class="math inline">\longrightarrow</span> larger batch size<sup>1</sup>
<ul>
<li>[<code>micro_batch_size = 1</code>] <span class="math inline">\times</span> [<code>N</code> GPUs] <span class="math inline">\rightarrow</span> [<b><code>global_batch_size = N</code></b>]</li>
</ul></li>
<li>Smoother loss landscape</li>
<li>Improved gradient estimators</li>
<li>Fewer iterations needed for the same number of epochs
<ul>
<li>May need to train for more epochs if no other change is made</li>
<li>e.g. scaling the learning rate (worked example below)</li>
</ul></li>
<li>See <a href="https://arxiv.org/abs/1708.03888">Large Batch Training of Convolutional Networks</a></li>
</ul>
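<p>A quick worked example of the commonly used linear scaling heuristic for the learning rate (the numbers are illustrative):</p>
<div class="sourceCode"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"># learning rate scaled in proportion to the global batch size
base_lr = 0.1            # tuned for micro_batch_size = 32 on a single GPU
micro_batch_size = 32
world_size = 16          # number of GPUs / workers

global_batch_size = micro_batch_size * world_size       # 32 * 16 = 512
lr = base_lr * world_size                                # 0.1 * 16 = 1.6</code></pre></div>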
<aside><ol class="aside-footnotes"><li id="fn3"><p><code>micro_batch_size</code> = batch_size <strong>per</strong> GPU</p></li></ol></aside></section>
<section id="recent-progress" class="slide level2 center" data-background-color="white">
<h2>Recent Progress</h2>
<div style="display: -webkit-inline-box; max-width: -webkit-fill-available; overflow: auto; font-size: 0.7em; font-family: monospace;">
<table class="caption-top">
<colgroup>
<col style="width: 7%">
<col style="width: 9%">
<col style="width: 11%">
<col style="width: 11%">
<col style="width: 23%">
<col style="width: 24%">
<col style="width: 12%">
</colgroup>
<thead>
<tr class="header">
<th style="text-align: center;">Year</th>
<th style="text-align: center;">Author</th>
<th style="text-align: center;">Batch Size</th>
<th style="text-align: center;">GPU</th>
<th style="text-align: center;"># GPUs</th>
<th style="text-align: center;">TIME</th>
<th style="text-align: center;">ACC</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: center;">2016</td>
<td style="text-align: center;">He</td>
<td style="text-align: center;">256</td>
<td style="text-align: center;">P100</td>
<td style="text-align: center;"><span class="red-text">8</span></td>
<td style="text-align: center;"><span class="red-text">29 Hour</span></td>
<td style="text-align: center;">75.30%</td>
</tr>
<tr class="even">
<td style="text-align: center;">2019</td>
<td style="text-align: center;">Yamazaki</td>
<td style="text-align: center;">81,920</td>
<td style="text-align: center;">V100</td>
<td style="text-align: center;"><span class="blue-text">2048</span></td>
<td style="text-align: center;"><span class="blue-text">1.2 Min</span></td>
<td style="text-align: center;">75.08%</td>
</tr>
</tbody>
</table>
</div>
</section></section>
<section id="deciding-on-a-parallelism-strategy" class="title-slide slide level1 smaller center scrollable" data-background-color="white">
<h1>Deciding on a Parallelism Strategy</h1>
<div class="panel-tabset">
<ul id="tabset-2" class="panel-tabset-tabby"><li><a data-tabby-default="" href="#tabset-2-1">Single GPU</a></li><li><a href="#tabset-2-2">Single Node / Multi-GPU</a></li><li><a href="#tabset-2-3">Multi-Node / Multi-GPU</a></li></ul>
<div class="tab-content">
<div id="tabset-2-1">
<ul>
<li>Model fits onto a single GPU:
<ul>
<li>Normal use</li>
</ul></li>
<li>Model <strong>DOES NOT</strong> fit on a single GPU:
<ul>
<li><code>ZeRO</code> + Offload CPU (or, optionally, <code>NVMe</code>)</li>
</ul></li>
<li>Largest layer <strong>DOES NOT</strong> fit on a single GPU:
<ul>
<li><code>ZeRO</code> + Enable <a href="https://deepspeed.readthedocs.io/en/latest/zero3.html#memory-centric-tiling">Memory Centric Tiling (MCT)</a>
<ul>
<li>MCT allows running arbitrarily large layers by automatically splitting them and executing them sequentially.</li>
</ul></li>
</ul></li>
</ul>
</div>
<div id="tabset-2-2">
<ul>
<li><p>Model fits onto a single GPU</p>
<ul>
<li><a href="https://pytorch.org/docs/stable/notes/ddp.html"><code>DDP</code></a></li>
<li><a href="https://deepspeed.readthedocs.io/en/latest/zero3.html"><code>ZeRO</code></a></li>
</ul></li>
<li><p>Model <strong>DOES NOT</strong> fit onto a single GPU</p>
<ol type="1">
<li><a href="https://www.deepspeed.ai/tutorials/pipeline/">Pipeline Parallelism (<code>PP</code>)</a></li>
<li><a href="https://deepspeed.readthedocs.io/en/latest/zero3.html"><code>ZeRO</code></a></li>
<li><a href="https://pytorch.org/docs/stable/distributed.tensor.parallel.html">Tensor Parallelism (<code>TP</code>)</a></li>
</ol></li>
<li><p>With sufficiently fast connectivity between GPUs (e.g. NVLink / NVSwitch), these three strategies should be comparable.</p>
<p>Otherwise, <code>PP</code> <span class="math inline">></span> <code>ZeRO</code> <span class="math inline">\simeq</span> <code>TP</code>.</p></li>
</ul>
</div>
<div id="tabset-2-3">
<ul>
<li><p>When you have fast inter-node connectivity:</p>
<ul>
<li><code>ZeRO</code> (virtually <strong>NO</strong> modifications)</li>
<li><code>PP</code> + <code>ZeRO</code> + <code>TP</code> + <code>DP</code> (less communication, at the cost of <strong>MAJOR</strong> modifications)
<ul>
<li><p>when you have slow inter-node connectivity and are still low on GPU memory:</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode numberSource bash number-lines code-with-copy"><code class="sourceCode bash"><span id="cb1-1"><a></a><span class="ex">DP</span> + PP + TP + ZeRO-1</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div></li>
</ul></li>
<li><strong>NOTE</strong>: <code>TP</code> is almost <em>always</em> used within a single node, i.e. <code>TP <= GPUS_PER_NODE</code></li>
</ul></li>
</ul>
</div>
</div>
</div>
</section>
<section>
<section id="tensor-model-parallel-training-example" class="title-slide slide level1 smaller center scrollable" data-background-color="white">
<h1>Tensor (/ Model) Parallel Training: Example</h1>
<p><span class="math display">
\begin{align*}
y &= \sum_{i} w_{i} * x_{i} \\
&= w_0 * x_0 + w_1 * x_1 + w_2 * x_2
\end{align*}
</span></p>
<ol type="1">
<li>Compute <span class="math inline">y_{0} = w_{0} * x_{0}</span> and send to <span class="math inline">\longrightarrow</span> <code>GPU1</code></li>
<li>Compute <span class="math inline">y_{1} = y_{0} + w_{1} * x_{1}</span> and send to <span class="math inline">\longrightarrow</span> <code>GPU2</code></li>
<li>Compute <span class="math inline">y = y_{1} + w_{2} * x_{2}</span> ✅</li>
</ol>
<div class="cell" data-reveal="true" data-layout-align="center">
<div class="cell-output-display">
<div>
<p></p><figure class=""><p></p>
<div>
<pre class="mermaid mermaid-js">flowchart LR
subgraph X0["GPU0"]
direction LR
a["w0"]
end
subgraph X1["GPU1"]
direction LR
b["w1"]
end
subgraph X2["GPU2"]
direction LR
c["w2"]
end
X1 & X0 <--> X2
X0 <--> X1
x["x0, x1, x2"] --> X0
</pre>
</div>
<p></p></figure><p></p>
</div>
</div>
</div>
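<p>A sketch of the three steps above using point-to-point communication (assumes <code>torch.distributed</code> is initialized with 3 ranks and that each rank already holds its own <code>w</code> and <code>x</code>):</p>
<div class="sourceCode"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">import torch
import torch.distributed as dist

def partial_sum_step(w, x):
    """Each rank adds its w*x term to the running sum and forwards it."""
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    y = torch.zeros(1)
    if rank != 0:
        dist.recv(y, src=rank - 1)       # receive running sum from previous GPU
    y = y + w * x                        # add the local contribution
    if rank + 1 != world_size:
        dist.send(y, dst=rank + 1)       # pass it along to the next GPU
    return y                             # the last rank holds the final y</code></pre></div>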
</section>
<section id="model-parallel-training" class="slide level2 center" data-background-color="white">
<h2>Model Parallel Training</h2>
<div>
</div>
<div class="quarto-layout-panel" data-layout="[60,40]">
<div class="quarto-layout-row">
<div class="col1 quarto-layout-cell" style="flex-basis: 60.0%;justify-content: flex-start;">
<ul>
<li>Split up network over multiple workers
<ul>
<li>Each receives disjoint subset</li>
<li>All communication associated with subsets are distributed</li>
</ul></li>
<li>Communication whenever dataflow between two subsets</li>
<li>Typically <strong>more complicated</strong> to implement than data parallel training</li>
<li>Suitable when the model is too large to fit onto a single device (CPU / GPU)</li>
<li><i class="fa-brands fa-github" aria-label="github"></i> <a href="https://github.com/argonne-lcf/Megatron-DeepSpeed"><code>argonne-lcf/Megatron-DeepSpeed</code></a></li>
<li>🤗 <a href="https://github.com/huggingface/nanotron"><code>huggingface/nanotron</code></a></li>
</ul>
</div>
<div class="quarto-layout-cell" style="flex-basis: 40.0%;justify-content: flex-start;">
<div id="fig-model-parallel-1" class="quarto-float quarto-figure quarto-figure-center">
<figure class="quarto-float quarto-float-fig">
<div aria-describedby="fig-model-parallel-1-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<a href="https://saforem2.github.io/distributed-training-slides/assets/model-parallel.svg" class="lightbox" data-gallery="quarto-lightbox-gallery-10" title="Figure 7: "><img data-src="https://saforem2.github.io/distributed-training-slides/assets/model-parallel.svg" id="fig-model-parallel-1"></a>
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig quarto-uncaptioned" id="fig-model-parallel-1-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure 7
</figcaption>
</figure>
</div>
</div>
</div>
</div>
</section>
<section id="tensor-model-parallelismefficient-large-scale" class="slide level2 center scrollable" data-background-color="white">
<h2>Tensor (Model) Parallelism<sup>1</sup></h2>
<ul>
<li><p>In <strong>Tensor Parallelism</strong>, each GPU processes only a slice of a tensor and only aggregates the full tensor for operations that require the whole thing.</p>
<ul>
<li><p>The main building block of any transformer is a fully connected <code>nn.Linear</code> layer followed by a nonlinear activation (<code>GeLU</code>).</p>
<ul>
<li><code>Y = GeLU(XA)</code>, where X and Y are the input and output vectors, and A is the weight matrix.</li>
</ul></li>