Commit

docs: remove old mention of LuxDeviceUtils
avik-pal committed Sep 24, 2024
1 parent e5a6e7d commit a808aa8
Showing 1 changed file with 1 addition and 8 deletions.
9 changes: 1 addition & 8 deletions docs/src/api/Accelerator_Support/MLDataDevices.md
@@ -5,14 +5,7 @@ CollapsedDocStrings = true
 # [MLDataDevices](@id MLDataDevices-API)
 
 `MLDataDevices.jl` is a lightweight package defining rules for transferring data across
-devices. Most users should directly use Lux.jl instead.
-
-!!! note "Transitioning from `LuxDeviceUtils.jl`"
-
-    `LuxDeviceUtils.jl` was renamed to `MLDataDevices.jl` in v1.0 as a part of allowing
-    these packages to have broader adoption outsize the Lux community. However, Lux
-    currently still uses `LuxDeviceUtils.jl` internally. This is supposed to change with
-    the transition of Lux to `v1.0`.
+devices.
 
 ## Preferences
 
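The page edited in this commit documents `MLDataDevices.jl`, which defines rules for transferring data across devices. The snippet below is a minimal sketch of that transfer, assuming MLDataDevices.jl v1.x and its exported `cpu_device`, `gpu_device`, and `get_device` functions. It is not taken from the commit. Without a functional GPU backend loaded (for example via `using CUDA`), `gpu_device()` is expected to warn and fall back to the CPU, so the sketch should also run on a CPU-only machine.

```julia
using MLDataDevices  # assumption: MLDataDevices.jl v1.x is installed

x = rand(Float32, 4, 4)

cdev = cpu_device()
# Selects a functional GPU backend if one is loaded; otherwise this warns
# and returns the CPU device instead.
gdev = gpu_device()

x_dev = gdev(x)          # devices are callable: move `x` to the selected device
x_host = x_dev |> cdev   # piping also works: copy the data back to host memory

# Reports which device currently holds the data.
println(get_device(x_dev))
```

Because devices are plain callable objects, the same `gdev`/`cdev` pair can be applied to model parameters and data batches alike.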

1 comment on commit a808aa8

@github-actions
Contributor

Lux Benchmarks

Benchmark suite · Current: a808aa8 · Previous: 5cb86b3 · Ratio (Current / Previous; values below 1 mean the benchmark ran faster at a808aa8)
Dense(512 => 512, identity)(512 x 128)/forward/CPU/2 thread(s) 414875 ns 411270.5 ns 1.01
Dense(512 => 512, identity)(512 x 128)/forward/CPU/4 thread(s) 321479 ns 321459 ns 1.00
Dense(512 => 512, identity)(512 x 128)/forward/CPU/8 thread(s) 322521 ns 243229 ns 1.33
Dense(512 => 512, identity)(512 x 128)/forward/CPU/1 thread(s) 740000 ns 739583 ns 1.00
Dense(512 => 512, identity)(512 x 128)/forward/GPU/CUDA 40861 ns 41187 ns 0.99
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/2 thread(s) 1343250 ns 1293854.5 ns 1.04
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/4 thread(s) 2434250 ns 2409166.5 ns 1.01
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/8 thread(s) 474937.5 ns 16158416 ns 0.029392577836837474
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/1 thread(s) 2252271 ns 2244124.5 ns 1.00
Dense(512 => 512, identity)(512 x 128)/zygote/GPU/CUDA 182562 ns 186717.5 ns 0.98
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/2 thread(s) 1328292 ns 1386416 ns 0.96
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/4 thread(s) 2620521 ns 2592167 ns 1.01
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/8 thread(s) 610500 ns 16442917 ns 0.03712844868097309
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/1 thread(s) 2229562.5 ns 2224250 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1765917 ns 1760437.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1031334 ns 1084209 ns 0.95
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1365416 ns 1521520.5 ns 0.90
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2818125 ns 2926125 ns 0.96
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA 204521 ns 205511.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12152917 ns 12138333 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 8828833 ns 8825083.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9300834 ns 9173333 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18599875 ns 18597250 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1492272 ns 1482675 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17275187 ns 17291625 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 13914875 ns 13944583.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14281833 ns 14521250 ns 0.98
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21819042 ns 21811145.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 250296521 ns 249976666.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 148101750 ns 148133250 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 148130792 ns 116316853.5 ns 1.27
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 448565625 ns 449124167 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5496241 ns 5482800 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 1226292708 ns 1230523917 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 930446334 ns 928555292 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 443990041 ns 832972542 ns 0.53
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 1653613542 ns 1627591875 ns 1.02
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 35420264 ns 35593150.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1147479875 ns 1137967916 ns 1.01
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 996058750 ns 992382854.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 629339646 ns 1336473333 ns 0.47
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 1740843604 ns 1743275104 ns 1.00
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) 1116250 ns 1092042 ns 1.02
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) 1624229.5 ns 1598250 ns 1.02
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) 1206375.5 ns 3369875 ns 0.36
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) 782041 ns 781500 ns 1.00
lenet(28, 28, 1, 32)/forward/GPU/CUDA 260633 ns 252534 ns 1.03
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) 2984374.5 ns 2977458 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) 4127166 ns 4116250 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) 3295208.5 ns 9642292 ns 0.34
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) 3137625 ns 3142958 ns 1.00
lenet(28, 28, 1, 32)/zygote/GPU/CUDA 1049614 ns 1029320 ns 1.02
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 2315396 ns 2326083 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1424437 ns 1424667 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1685208 ns 1560084 ns 1.08
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 4196250 ns 4056750 ns 1.03
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 208669.5 ns 208748 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 19413145.5 ns 19404250 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 16084375 ns 16063062.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 17133041.5 ns 17250542 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 25866542 ns 25830521 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1576194 ns 1568775 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 34217167 ns 34513208 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 30754459 ns 30762479.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 31341542 ns 31225958 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 37132709 ns 36930167 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 4525792 ns 4524979.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2744125 ns 2763937.5 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2881375 ns 2673854.5 ns 1.08
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 8371458 ns 8377291.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 423036 ns 421945 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 38892667 ns 39044209 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 32085104.5 ns 32104791.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 32057770.5 ns 32250208 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 52159979.5 ns 51814146 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2618584 ns 2619508 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 89172458 ns 88649667 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 113776875 ns 113984812.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 62985709 ns 225519896 ns 0.28
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 74986500 ns 74452916.5 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 268884125 ns 268719375 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 159000000 ns 159201750 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 158925750 ns 123545124.5 ns 1.29
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 486715083 ns 491504917 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 6941165 ns 6875777.5 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 1474467645.5 ns 1477656896 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 1134657750 ns 1178240000 ns 0.96
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 687890791.5 ns 1062550813 ns 0.65
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 2033574500 ns 2026628729 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 33495275 ns 33057912 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1720167208 ns 1716757416 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 1551435312.5 ns 1531392041.5 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 1147814729 ns 1872548250 ns 0.61
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 2245015792 ns 2230422791 ns 1.01
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) 2039500 ns 2018375 ns 1.01
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) 3006583 ns 3022624.5 ns 0.99
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) 1618791.5 ns 7958917 ns 0.20
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) 2424854.5 ns 2489958 ns 0.97
lenet(28, 28, 1, 128)/forward/GPU/CUDA 258194 ns 253828.5 ns 1.02
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) 9325667 ns 9615917 ns 0.97
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) 11994291.5 ns 11905479 ns 1.01
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) 7128604 ns 24859333 ns 0.29
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) 11753792 ns 11285437.5 ns 1.04
lenet(28, 28, 1, 128)/zygote/GPU/CUDA 1096609.5 ns 1089672 ns 1.01
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) 380363500.5 ns 382104291.5 ns 1.00
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) 286893354 ns 288697958.5 ns 0.99
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) 129833291 ns 263937500 ns 0.49
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) 456069146 ns 453025562.5 ns 1.01
vgg16(32, 32, 3, 32)/forward/GPU/CUDA 5018425 ns 4955655.5 ns 1.01
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) 1154815958 ns 1159602875 ns 1.00
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) 934037667 ns 937068625 ns 1.00
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) 609039458 ns 1116269958 ns 0.55
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) 1585642292 ns 1586148458 ns 1.00
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA 19065478 ns 18262229 ns 1.04
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) 1049833.5 ns 1053583 ns 1.00
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) 2073542 ns 2074479.5 ns 1.00
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) 1348479.5 ns 6685541 ns 0.20
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) 1287021 ns 1294959 ns 0.99
lenet(28, 28, 1, 64)/forward/GPU/CUDA 259724.5 ns 256727 ns 1.01
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) 6258792 ns 6501125 ns 0.96
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) 12411416 ns 12392708 ns 1.00
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) 4953146 ns 19126833.5 ns 0.26
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) 6086709 ns 6062209 ns 1.00
lenet(28, 28, 1, 64)/zygote/GPU/CUDA 1149352.5 ns 1151411.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 70546083 ns 70475667 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 43491792 ns 43577354.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 37811479.5 ns 39785333 ns 0.95
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 134717229.5 ns 132525000 ns 1.02
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1859024 ns 1859935 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 355554354.5 ns 356597312.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 270317625 ns 270033292 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 146113896 ns 254165791.5 ns 0.57
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 537066979.5 ns 541858104.5 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 12142155.5 ns 12225780 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 396257791 ns 395590417 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 404428375.5 ns 407040083 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 302176729 ns 686687708 ns 0.44
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 712116709 ns 711343459 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) 1190477625 ns 1189721417 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) 689814958.5 ns 694763792 ns 0.99
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) 404795334 ns 639415416.5 ns 0.63
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) 1876404250 ns 1863138250 ns 1.01
vgg16(32, 32, 3, 128)/forward/GPU/CUDA 12324333 ns 12305225 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) 3610008479.5 ns 3693716416.5 ns 0.98
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) 2831662833 ns 2822411042 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) 1516977229.5 ns 2715785292 ns 0.56
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) 5143819000 ns 5075752792 ns 1.01
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA 50066391.5 ns 50096815 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3345708.5 ns 3427000 ns 0.98
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2078625 ns 2064167 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2287083 ns 2526875 ns 0.91
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 6026917 ns 6023271 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 330146 ns 343351 ns 0.96
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 25733291.5 ns 26112416 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 18989125 ns 19044125 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 19553792 ns 19200250 ns 1.02
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 39739583.5 ns 39317542 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2459398 ns 2472756 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 54593479 ns 53212375 ns 1.03
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 78905375 ns 86527562.5 ns 0.91
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 29660083.5 ns 174565833 ns 0.17
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 45812146 ns 45612312 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1660583.5 ns 1781334 ns 0.93
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1105770.5 ns 1100708 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1392229 ns 1559041 ns 0.89
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 3035959 ns 3031000 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 210818 ns 209992.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12525958.5 ns 12526708.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9221375 ns 9202062.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9699583 ns 9594042 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 19002416.5 ns 18995083.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1509113 ns 1520295 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17662604.5 ns 17654187.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14311479 ns 14319770.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14590875 ns 14542209 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 22225541 ns 22178916 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 70524583 ns 70529958 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 43452458 ns 43518812.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 37882479.5 ns 39678750 ns 0.95
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 132685187 ns 132567854.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1859436 ns 1864543.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 359287667 ns 362025167 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 347693812.5 ns 346626458 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 197401167 ns 304297792 ns 0.65
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 730607333 ns 726309042 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 13254127 ns 13257700 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 420436292 ns 417934646 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 419235583 ns 420406333 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 310533750 ns 710093000 ns 0.44
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 718184500 ns 717186500 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) 1442542 ns 1662000 ns 0.87
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) 1346416.5 ns 1348708 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) 1331812.5 ns 1039209 ns 1.28
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) 2403021 ns 2446333 ns 0.98
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA 549048 ns 549327.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) 8851250 ns 8831979 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) 12939667 ns 12827437 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) 5552708 ns 32684875 ns 0.17
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) 9880416.5 ns 9840916 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA 1258951 ns 1223627 ns 1.03
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) 16575062 ns 16482375 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) 20954208 ns 22352396 ns 0.94
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) 13338833 ns 48470938 ns 0.28
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) 13092416 ns 13131854 ns 1.00
Dense(512 => 512, relu)(512 x 128)/forward/CPU/2 thread(s) 822708 ns 786625 ns 1.05
Dense(512 => 512, relu)(512 x 128)/forward/CPU/4 thread(s) 528084 ns 549645.5 ns 0.96
Dense(512 => 512, relu)(512 x 128)/forward/CPU/8 thread(s) 71146 ns 1064854.5 ns 0.06681288382591237
Dense(512 => 512, relu)(512 x 128)/forward/CPU/1 thread(s) 725750 ns 725104.5 ns 1.00
Dense(512 => 512, relu)(512 x 128)/forward/GPU/CUDA 46414.5 ns 45187 ns 1.03
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/2 thread(s) 1506500 ns 1494083 ns 1.01
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/4 thread(s) 1020854 ns 1045666.5 ns 0.98
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/8 thread(s) 323833 ns 1424458 ns 0.23
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/1 thread(s) 2281417 ns 2291291 ns 1.00
Dense(512 => 512, relu)(512 x 128)/zygote/GPU/CUDA 211160.5 ns 207948.5 ns 1.02
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/2 thread(s) 1512416 ns 1496875 ns 1.01
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/4 thread(s) 1090125 ns 1011416 ns 1.08
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/8 thread(s) 446562.5 ns 1769209 ns 0.25
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/1 thread(s) 2259375 ns 2257000 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3176750 ns 3409312.5 ns 0.93
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2053979 ns 2052875 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2268708 ns 2516667 ns 0.90
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 6008875 ns 5998083 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA 282441.5 ns 281525 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 24059292 ns 24109729 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 17235458 ns 17182959 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 16956292 ns 17113146 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 37778228.5 ns 37468750.5 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2390107 ns 2398593 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 52955708.5 ns 52554812 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 84900333 ns 84392000 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 27496312.5 ns 170326000 ns 0.16
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 44513375.5 ns 44580125 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 250307750 ns 250367042 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 148084625 ns 147848500 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 148444250 ns 116105042 ns 1.28
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 455285000 ns 454852833 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5327018 ns 5326699.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 1102117541 ns 1103578584 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 856978792 ns 855911416.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 437778208 ns 831258666.5 ns 0.53
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 1768146583 ns 1772502666 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 33525724 ns 33341175 ns 1.01
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1027855937 ns 1010431750 ns 1.02
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 965570792 ns 965660000 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 584455270.5 ns 1276218416 ns 0.46
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 1726926104.5 ns 1718974833.5 ns 1.00
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) 1135584 ns 1245354 ns 0.91
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) 989209 ns 938208 ns 1.05
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) 923667 ns 685500 ns 1.35
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) 2052500 ns 2004042 ns 1.02
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA 548882.5 ns 548493.5 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) 5867833 ns 5771604 ns 1.02
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) 6531896 ns 6597750 ns 0.99
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) 2613541.5 ns 25936583 ns 0.10
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) 7097417 ns 7098375 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA 1222578 ns 1220210 ns 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) 9683896 ns 9431791 ns 1.03
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) 13118666 ns 13114104.5 ns 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) 6497583 ns 33204521 ns 0.20
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) 7614083.5 ns 7606042 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/2 thread(s) 512667 ns 430541 ns 1.19
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/4 thread(s) 391292 ns 381021 ns 1.03
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/8 thread(s) 32750 ns 3043792 ns 0.010759605124134632
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/1 thread(s) 87812.5 ns 89542 ns 0.98
Dense(128 => 128, gelu)(128 x 128)/forward/GPU/CUDA 25759 ns 25679 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/2 thread(s) 382125 ns 354583 ns 1.08
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/4 thread(s) 444875 ns 443833 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/8 thread(s) 160875 ns 4158750 ns 0.03868349864743012
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/1 thread(s) 258750 ns 258750 ns 1
Dense(128 => 128, gelu)(128 x 128)/zygote/GPU/CUDA 188723 ns 188553 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/2 thread(s) 420291.5 ns 385584 ns 1.09
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/4 thread(s) 475750 ns 474625 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/8 thread(s) 194375 ns 4412750 ns 0.04404849583592998
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/1 thread(s) 270958 ns 271208 ns 1.00
Dense(128 => 128, relu)(128 x 128)/forward/CPU/2 thread(s) 461312.5 ns 376687.5 ns 1.22
Dense(128 => 128, relu)(128 x 128)/forward/CPU/4 thread(s) 326666.5 ns 325000 ns 1.01
Dense(128 => 128, relu)(128 x 128)/forward/CPU/8 thread(s) 14792 ns 771479 ns 0.019173561432002686
Dense(128 => 128, relu)(128 x 128)/forward/CPU/1 thread(s) 54145.5 ns 54583 ns 0.99
Dense(128 => 128, relu)(128 x 128)/forward/GPU/CUDA 26082 ns 26029 ns 1.00
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/2 thread(s) 340312 ns 303687.5 ns 1.12
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/4 thread(s) 342500 ns 341166.5 ns 1.00
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/8 thread(s) 25958.5 ns 893375 ns 0.029056667133062822
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/1 thread(s) 151625 ns 151583.5 ns 1.00
Dense(128 => 128, relu)(128 x 128)/zygote/GPU/CUDA 181930 ns 180458.5 ns 1.01
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/2 thread(s) 357792 ns 316479 ns 1.13
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/4 thread(s) 357833 ns 355958.5 ns 1.01
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/8 thread(s) 46437.5 ns 833687.5 ns 0.055701326936052176
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/1 thread(s) 151209 ns 150917 ns 1.00
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) 602226667 ns 603139000 ns 1.00
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) 427648645.5 ns 430379625 ns 0.99
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) 207084708 ns 380417500 ns 0.54
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) 882976625 ns 876424750 ns 1.01
vgg16(32, 32, 3, 64)/forward/GPU/CUDA 6984740 ns 7025391.5 ns 0.99
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) 1997486771 ns 2008872896 ns 0.99
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) 1621644791.5 ns 1619900021 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) 856167166 ns 1577697458 ns 0.54
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) 2637178042 ns 2622523542 ns 1.01
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA 26468421.5 ns 26902392.5 ns 0.98
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/2 thread(s) 520062.5 ns 535729 ns 0.97
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/4 thread(s) 429271 ns 431416.5 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/8 thread(s) 166000 ns 2478833.5 ns 0.06696698265534978
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/1 thread(s) 866083 ns 866124.5 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/forward/GPU/CUDA 46206 ns 44614 ns 1.04
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/2 thread(s) 1874625 ns 1911229 ns 0.98
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/4 thread(s) 2508792 ns 2468667 ns 1.02
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/8 thread(s) 1021958 ns 16401666 ns 0.062308182595597304
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/1 thread(s) 2650063 ns 2768395.5 ns 0.96
Dense(512 => 512, gelu)(512 x 128)/zygote/GPU/CUDA 217141.5 ns 210772.5 ns 1.03
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/2 thread(s) 1862417 ns 1986854 ns 0.94
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/4 thread(s) 5033959 ns 5052500 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/8 thread(s) 1161917 ns 16457750 ns 0.07059999088575292
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/1 thread(s) 2752500 ns 2773062.5 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) 1462229 ns 1594542 ns 0.92
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) 1192834 ns 1175979 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) 1192667 ns 932458.5 ns 1.28
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) 2221791 ns 2307417 ns 0.96
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA 550464 ns 543745 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) 5883792 ns 5990542 ns 0.98
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) 4676563 ns 5767687 ns 0.81
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) 2871000 ns 25963208 ns 0.11
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) 7325000.5 ns 7322042 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA 1196239.5 ns 1137388.5 ns 1.05
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) 11670958.5 ns 11669000 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) 16372334 ns 16638833.5 ns 0.98
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) 8780584 ns 38505541 ns 0.23
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) 9544250 ns 9523103.5 ns 1.00
Dense(16 => 16, relu)(16 x 128)/forward/CPU/2 thread(s) 2458 ns 2770.5 ns 0.89
Dense(16 => 16, relu)(16 x 128)/forward/CPU/4 thread(s) 2542 ns 2541 ns 1.00
Dense(16 => 16, relu)(16 x 128)/forward/CPU/8 thread(s) 2875 ns 3542 ns 0.81
Dense(16 => 16, relu)(16 x 128)/forward/CPU/1 thread(s) 4625 ns 2167 ns 2.13
Dense(16 => 16, relu)(16 x 128)/forward/GPU/CUDA 22670 ns 21552 ns 1.05
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/2 thread(s) 6916 ns 7167 ns 0.96
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/4 thread(s) 7083 ns 7083 ns 1
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/8 thread(s) 7250 ns 7250 ns 1
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/1 thread(s) 7333 ns 7250 ns 1.01
Dense(16 => 16, relu)(16 x 128)/zygote/GPU/CUDA 180475.5 ns 173171.5 ns 1.04
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/2 thread(s) 8250 ns 8167 ns 1.01
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/4 thread(s) 8292 ns 8208 ns 1.01
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/8 thread(s) 8542 ns 8584 ns 1.00
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/1 thread(s) 6125 ns 5979.5 ns 1.02
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/2 thread(s) 10916.5 ns 10959 ns 1.00
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/4 thread(s) 12625 ns 13437.5 ns 0.94
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/8 thread(s) 10459 ns 10250 ns 1.02
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/1 thread(s) 9729 ns 7208 ns 1.35
Dense(16 => 16, gelu)(16 x 128)/forward/GPU/CUDA 22420 ns 21706 ns 1.03
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/2 thread(s) 19916 ns 19916 ns 1
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/4 thread(s) 19875 ns 20000 ns 0.99
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/8 thread(s) 19958 ns 20209 ns 0.99
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/1 thread(s) 20000 ns 19854.5 ns 1.01
Dense(16 => 16, gelu)(16 x 128)/zygote/GPU/CUDA 195313 ns 188017 ns 1.04
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/2 thread(s) 23542 ns 23666 ns 0.99
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/4 thread(s) 23541 ns 23625 ns 1.00
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/8 thread(s) 27125 ns 23708 ns 1.14
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/1 thread(s) 21334 ns 21334 ns 1
Dense(128 => 128, identity)(128 x 128)/forward/CPU/2 thread(s) 28834 ns 28625 ns 1.01
Dense(128 => 128, identity)(128 x 128)/forward/CPU/4 thread(s) 28708 ns 28875 ns 0.99
Dense(128 => 128, identity)(128 x 128)/forward/CPU/8 thread(s) 29042 ns 28208 ns 1.03
Dense(128 => 128, identity)(128 x 128)/forward/CPU/1 thread(s) 46291 ns 45875 ns 1.01
Dense(128 => 128, identity)(128 x 128)/forward/GPU/CUDA 23925 ns 23317 ns 1.03
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/2 thread(s) 224750 ns 233812.5 ns 0.96
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/4 thread(s) 276542 ns 277666 ns 1.00
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/8 thread(s) 44250 ns 3990583 ns 0.011088605349143221
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/1 thread(s) 145000 ns 145083 ns 1.00
Dense(128 => 128, identity)(128 x 128)/zygote/GPU/CUDA 197967 ns 191945 ns 1.03
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/2 thread(s) 242125 ns 250666.5 ns 0.97
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/4 thread(s) 293916 ns 295459 ns 0.99
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/8 thread(s) 68604.5 ns 4148750 ns 0.01653618559807171
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/1 thread(s) 145584 ns 145562.5 ns 1.00
Dense(16 => 16, identity)(16 x 128)/forward/CPU/2 thread(s) 1583 ns 2041 ns 0.78
Dense(16 => 16, identity)(16 x 128)/forward/CPU/4 thread(s) 2166 ns 1916 ns 1.13
Dense(16 => 16, identity)(16 x 128)/forward/CPU/8 thread(s) 2166.5 ns 2584 ns 0.84
Dense(16 => 16, identity)(16 x 128)/forward/CPU/1 thread(s) 4333.5 ns 1625 ns 2.67
Dense(16 => 16, identity)(16 x 128)/forward/GPU/CUDA 20975.5 ns 20024 ns 1.05
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/2 thread(s) 5084 ns 5375 ns 0.95
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/4 thread(s) 5125 ns 5125 ns 1
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/8 thread(s) 5209 ns 5250 ns 0.99
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/1 thread(s) 5500 ns 5084 ns 1.08
Dense(16 => 16, identity)(16 x 128)/zygote/GPU/CUDA 234449.5 ns 238397 ns 0.98
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/2 thread(s) 7375 ns 7541 ns 0.98
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/4 thread(s) 7458 ns 7416 ns 1.01
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/8 thread(s) 8125 ns 7750 ns 1.05
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/1 thread(s) 5459 ns 5250 ns 1.04
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 80045708 ns 79842084 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 49037958.5 ns 49100250 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 42791749.5 ns 43191750 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 151490583 ns 151456000 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 2680013 ns 2712652 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 606632959 ns 472190041 ns 1.28
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 411440583 ns 413693042 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 292411917 ns 397758813 ns 0.74
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 737907354 ns 737522187.5 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 16971190.5 ns 16943151 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 714524875 ns 710270771 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 672104708 ns 668321833 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 580514646 ns 1002011792 ns 0.58
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 1012152875 ns 997156208 ns 1.02

This comment was automatically generated by workflow using github-action-benchmark.