fix!: remove deprecations for 1.0 release #82

Merged 7 commits on Aug 30, 2024

Conversation

avik-pal (Member) commented Jul 10, 2024

codecov bot commented Jul 10, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 80.41%. Comparing base (8dc51b0) to head (e47e8ba).
Report is 7 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #82      +/-   ##
==========================================
- Coverage   83.68%   80.41%   -3.28%     
==========================================
  Files          38       38              
  Lines        1900     1899       -1     
==========================================
- Hits         1590     1527      -63     
- Misses        310      372      +62     
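
As a rough cross-check of the figures above, the sketch below recomputes the two coverage percentages from the Hits and Lines rows, assuming coverage is simply hits divided by lines. The helper name `coverage_percent` is hypothetical and is not part of Codecov or of this repository.

```python
def coverage_percent(hits: int, lines: int) -> float:
    """Return line coverage as a percentage, assuming coverage = hits / lines."""
    if lines <= 0:
        raise ValueError("lines must be positive")
    return 100.0 * hits / lines


# Values copied from the coverage diff: base is main (8dc51b0), head is #82 (e47e8ba).
base_hits, base_misses, base_lines = 1590, 310, 1900
head_hits, head_misses, head_lines = 1527, 372, 1899

# Hits and misses should add up to the total line count on each side.
assert base_hits + base_misses == base_lines
assert head_hits + head_misses == head_lines

base = coverage_percent(base_hits, base_lines)
head = coverage_percent(head_hits, head_lines)

print(f"base coverage: {base:.2f}%")  # base coverage: 83.68%
print(f"head coverage: {head:.2f}%")  # head coverage: 80.41%
```

Both results match the Coverage row of the report, which suggests the drop comes from 63 fewer hit lines rather than from a change in the total line count.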

☔ View full report in Codecov by Sentry.

github-actions bot (Contributor) left a comment

LuxLib Benchmarks
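
Each row of the table that follows gives a benchmark name, the current timing (e47e8ba), the previous timing (8dc51b0), and their ratio. The values are consistent with Ratio = Current / Previous, so a ratio above 1 means the current run was slower. The sketch below recomputes the ratio for three rows copied from the table and lists those above a cutoff. The cutoff of 1.5 and the helper names are assumptions made for illustration, not settings taken from this PR or its CI configuration.

```python
# Sample rows copied from the table: (benchmark name, current ns, previous ns).
rows = [
    ("layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)", 5583, 5833),
    ("layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal", 3361583, 730042),
    ("bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s)", 2958.5, 1666),
]

# Hypothetical cutoff chosen for illustration only.
THRESHOLD = 1.5


def ratio(current_ns: float, previous_ns: float) -> float:
    """Current time divided by previous time; above 1 means the current run was slower."""
    return current_ns / previous_ns


def slower_than(rows, threshold):
    """Return (name, rounded ratio) for every row whose ratio exceeds the threshold."""
    return [
        (name, round(ratio(cur, prev), 2))
        for name, cur, prev in rows
        if ratio(cur, prev) > threshold
    ]


flagged = slower_than(rows, THRESHOLD)
for name, r in flagged:
    print(f"{r:.2f}  {name}")
```

With these three rows, the Metal layernorm forward pass (4.60) and the 8-thread bias_activation forward pass (1.78) are listed. Single runs at the microsecond scale can be noisy, so a large ratio is a prompt to re-run the benchmark rather than proof of a regression.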

Benchmark suite Current: e47e8ba Previous: 8dc51b0 Ratio
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5583 ns 5833 ns 0.96
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5958 ns 6209 ns 0.96
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 7209 ns 6500 ns 1.11
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 6708 ns 6333 ns 1.06
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 117750 ns 118732 ns 0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI 2860850 ns 2968100 ns 0.96
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal 3361583 ns 730042 ns 4.60
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 421144 ns 417444 ns 1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 9916.5 ns 9834 ns 1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9833 ns 9937.5 ns 0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9917 ns 10083 ns 0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9625 ns 10083 ns 0.95
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 553140 ns 577266 ns 0.96
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI 18595297 ns 19534378 ns 0.95
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal 2382917 ns 2672542 ns 0.89
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 696305 ns 679157 ns 1.03
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) 1625 ns 1583 ns 1.03
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) 1688 ns 1875 ns 0.90
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) 2958.5 ns 1666 ns 1.78
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) 1437.5 ns 1583.5 ns 0.91
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA 21723.5 ns 21231 ns 1.02
bias_activation(32, act=relu)(32 x 128)/forward/GPU/oneAPI 1340515 ns 1454941.5 ns 0.92
bias_activation(32, act=relu)(32 x 128)/forward/GPU/Metal 208292 ns 209312 ns 1.00
bias_activation(32, act=relu)(32 x 128)/forward/GPU/AMDGPU 37181 ns 30810.5 ns 1.21
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 3750 ns 4125 ns 0.91
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 4167 ns 4083 ns 1.02
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 4291.5 ns 4375 ns 0.98
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 4459 ns 4083 ns 1.09
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA 145687 ns 141204.5 ns 1.03
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/oneAPI 8279576 ns 8535587 ns 0.97
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/Metal 1490500 ns 1628312.5 ns 0.92
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/AMDGPU 148211.5 ns 151661 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57667 ns 57875 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 39750 ns 40125 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 39958 ns 39792 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83083 ns 82833 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 37422.5 ns 36293 ns 1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 578922.5 ns 561260.5 ns 1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1029729.5 ns 992500 ns 1.04
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 78625.5 ns 82050.5 ns 0.96
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2019625 ns 2036834 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2085458 ns 2075792 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2085375 ns 2052042 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2000666 ns 1989479.5 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 231656 ns 223552.5 ns 1.04
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 7765871 ns 8096655 ns 0.96
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 7650583 ns 7643042 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1504421 ns 1110381 ns 1.35
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 149083.5 ns 145625 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 146917 ns 154708.5 ns 0.95
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 150000 ns 174688 ns 0.86
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 147250 ns 154145.5 ns 0.96
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 165605 ns 165157 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 7579764 ns 7006708 ns 1.08
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1671333 ns 1598583 ns 1.05
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 185332 ns 185621 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1120208 ns 1111542 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1112249.5 ns 1113792 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1119979 ns 1117416 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1115458.5 ns 1116375 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 697776 ns 667136.5 ns 1.05
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 31396155 ns 33531181.5 ns 0.94
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 6206875 ns 6722500 ns 0.92
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1041688 ns 916229 ns 1.14
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 5125 ns 4208.5 ns 1.22
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4750 ns 5083 ns 0.93
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 5583 ns 4875 ns 1.15
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 4208 ns 4375 ns 0.96
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 93533.5 ns 88783 ns 1.05
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI 5284344 ns 5683274 ns 0.93
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal 465584 ns 465729 ns 1.00
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 59600 ns 71591 ns 0.83
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8625 ns 8708 ns 0.99
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8875 ns 8625 ns 1.03
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8833 ns 8625 ns 1.02
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8459 ns 9042 ns 0.94
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 598346 ns 582943 ns 1.03
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI 33477623 ns 37889755 ns 0.88
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal 6013458.5 ns 5975583.5 ns 1.01
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 390743 ns 389426 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17354.5 ns 18250.5 ns 0.95
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 17520.5 ns 18833.5 ns 0.93
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 18916 ns 18541 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 18708.5 ns 18021 ns 1.04
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 67316 ns 65828.5 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 2841743 ns 2797898 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1301187.5 ns 1292083.5 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 75270.5 ns 77641 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 211750 ns 212959 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 221250 ns 212416 ns 1.04
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 212979.5 ns 223375 ns 0.95
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 220959 ns 219958 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 355350 ns 345170 ns 1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 12887660 ns 13000126.5 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 5578875 ns 5618187 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 475568.5 ns 472507 ns 1.01
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) 583.5 ns 625 ns 0.93
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) 625 ns 625 ns 1
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) 792 ns 750 ns 1.06
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) 709 ns 708 ns 1.00
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA 20658 ns 20278 ns 1.02
bias_activation(2, act=relu)(2 x 128)/forward/GPU/oneAPI 1117791 ns 1174845 ns 0.95
bias_activation(2, act=relu)(2 x 128)/forward/GPU/Metal 293750 ns 284041.5 ns 1.03
bias_activation(2, act=relu)(2 x 128)/forward/GPU/AMDGPU 32570 ns 34141 ns 0.95
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 1417 ns 1417 ns 1
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 1437.5 ns 1375 ns 1.05
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 1500 ns 1458 ns 1.03
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 1334 ns 1375 ns 0.97
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA 125407.5 ns 122996.5 ns 1.02
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/oneAPI 8435986 ns 8936472 ns 0.94
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/Metal 1520937 ns 1545542 ns 0.98
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/AMDGPU 124981 ns 128652 ns 0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7375 ns 7292 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5334 ns 5416 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5458 ns 5334 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10292 ns 10125 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 24335 ns 23494.5 ns 1.04
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1244509 ns 1206688.5 ns 1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 613771 ns 352291.5 ns 1.74
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 46950 ns 48921 ns 0.96
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 221791 ns 265208 ns 0.84
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 263562.5 ns 228583 ns 1.15
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 267459 ns 268375 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 258042 ns 220208 ns 1.17
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 191390.5 ns 191406 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 31212923 ns 34275082 ns 0.91
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 9028021 ns 9545416 ns 0.95
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 615105 ns 615580 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4125 ns 4084 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4125 ns 4125 ns 1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4166 ns 4125 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4084 ns 4084 ns 1
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA 23747 ns 23388 ns 1.02
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/oneAPI 2059889 ns 1884551 ns 1.09
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/Metal 224375 ns 222625 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/AMDGPU 48710.5 ns 50581 ns 0.96
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16417 ns 16500 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16666 ns 16541 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 17166.5 ns 16666 ns 1.03
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16500 ns 16500 ns 1
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA 196190.5 ns 191032 ns 1.03
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/oneAPI 10575444.5 ns 9654050 ns 1.10
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/Metal 1220604 ns 1315416 ns 0.93
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/AMDGPU 178941 ns 179153 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 511375 ns 511083 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 332250 ns 332542 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 331958 ns 332750 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 865541 ns 865000 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA 113960 ns 113564 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/oneAPI 396373 ns 397782 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/Metal 455458 ns 399542 ns 1.14
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/AMDGPU 247962 ns 249264 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 2265542 ns 2268937 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1741145.5 ns 1755645.5 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1750125 ns 1746583 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 3194667 ns 3196292 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA 240998 ns 236643 ns 1.02
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/oneAPI 12033885 ns 9269331 ns 1.30
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/Metal 1913833 ns 1892000 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 763086 ns 761836.5 ns 1.00
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6625 ns 6167 ns 1.07
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6104 ns 6250 ns 0.98
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 7709 ns 7875 ns 0.98
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 6709 ns 6292 ns 1.07
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 90571.5 ns 90951 ns 1.00
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI 5303527.5 ns 5183601.5 ns 1.02
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal 773833.5 ns 790084 ns 0.98
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 60371 ns 60171 ns 1.00
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 11104.5 ns 9729.5 ns 1.14
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 11541.5 ns 11833.5 ns 0.98
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 11583.5 ns 10709 ns 1.08
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 11041 ns 11250 ns 0.98
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 621523 ns 631820 ns 0.98
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI 38208156 ns 38968720 ns 0.98
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal 5786083 ns 5635041.5 ns 1.03
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 413623 ns 413756 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) 541 ns 541 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) 542 ns 541 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) 541 ns 500 ns 1.08
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) 541 ns 541 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA 23874 ns 22959 ns 1.04
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/oneAPI 2259468 ns 2250193 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/Metal 228959 ns 229979.5 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/AMDGPU 51460 ns 51060 ns 1.01
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2125 ns 2084 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2167 ns 2084 ns 1.04
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2209 ns 2083 ns 1.06
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2083 ns 2125 ns 0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA 218615 ns 238043 ns 0.92
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/oneAPI 10999156 ns 12339690 ns 0.89
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/Metal 1993375 ns 1997542 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/AMDGPU 179811 ns 176033 ns 1.02
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 9375 ns 8458 ns 1.11
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 9709 ns 8604.5 ns 1.13
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 11667 ns 10250 ns 1.14
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 9083 ns 8458 ns 1.07
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 107558 ns 111812 ns 0.96
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI 3162682 ns 2954218 ns 1.07
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal 851875 ns 809875 ns 1.05
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 77041 ns 75421 ns 1.02
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 17958.5 ns 17729.5 ns 1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 19042 ns 17854 ns 1.07
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 18833 ns 18479.5 ns 1.02
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 17541.5 ns 17500 ns 1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 597196.5 ns 612415.5 ns 0.98
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI 17134824 ns 16447833 ns 1.04
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal 5474187 ns 5303292 ns 1.03
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 387393 ns 386655 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 500 ns 459 ns 1.09
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 583 ns 459 ns 1.27
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 584 ns 625 ns 0.93
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 583 ns 542 ns 1.08
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 35659 ns 35148 ns 1.01
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI 1185739.5 ns 1185387 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal 293083 ns 379167 ns 0.77
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 47871 ns 45811 ns 1.04
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9583.5 ns 8625.5 ns 1.11
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9333.5 ns 9625 ns 0.97
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9270.5 ns 9833 ns 0.94
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 8937.5 ns 8979.5 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 259605 ns 266322.5 ns 0.97
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI 18447546 ns 19024975 ns 0.97
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal 5011104 ns 5023625 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 374128 ns 376345 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) 398875 ns 398458 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) 215584 ns 215375 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) 215083 ns 215625 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) 755958 ns 756084 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA 111970 ns 110416.5 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/oneAPI 332768 ns 325801 ns 1.02
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/Metal 386416 ns 380603.5 ns 1.02
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/AMDGPU 79430 ns 78551 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 1388333 ns 1395208.5 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 857833 ns 859166.5 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 858042 ns 860417 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 2356750 ns 2356542 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA 207644 ns 203387 ns 1.02
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/oneAPI 8781675 ns 10253444.5 ns 0.86
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/Metal 1598729 ns 1668583 ns 0.96
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/AMDGPU 322992.5 ns 324309.5 ns 1.00
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7312.5 ns 7521 ns 0.97
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7709 ns 7208 ns 1.07
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8375 ns 7937.5 ns 1.06
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 7125 ns 7354.5 ns 0.97
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 143151 ns 146147.5 ns 0.98
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI 6385139 ns 5499314 ns 1.16
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal 448750 ns 448604 ns 1.00
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 61490 ns 60691 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 15020.5 ns 14937.5 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14291 ns 13604.5 ns 1.05
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 14625 ns 13667 ns 1.07
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 15854 ns 15375.5 ns 1.03
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 972405 ns 955436.5 ns 1.02
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI 49449537.5 ns 43131702 ns 1.15
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal 5975895.5 ns 5899125.5 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 437174 ns 433397 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 25083 ns 24125 ns 1.04
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 25125 ns 24708.5 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 28500 ns 28229 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 25687.5 ns 24895.5 ns 1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 201048 ns 196723 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 8216742 ns 7737736 ns 1.06
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1213500 ns 1117208 ns 1.09
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 118471 ns 117742 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 145292 ns 103770.5 ns 1.40
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 147625 ns 117375.5 ns 1.26
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 113812.5 ns 147541.5 ns 0.77
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 148562.5 ns 159541 ns 0.93
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1082636 ns 1058384 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 42070250 ns 44485069 ns 0.95
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 5751979 ns 5929750 ns 0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 601145 ns 590519 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 74542 ns 75041 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 76959 ns 75021 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 80042 ns 76729.5 ns 1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 74000 ns 85708 ns 0.86
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 210048 ns 203053 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7715013 ns 7420031.5 ns 1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 543000 ns 532041.5 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 125206 ns 125262 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 301875 ns 274937.5 ns 1.10
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 284709 ns 306333 ns 0.93
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 211791 ns 314500 ns 0.67
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 299770.5 ns 291333 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1128416 ns 1113767.5 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 43276791 ns 41752889.5 ns 1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6601000 ns 6339625 ns 1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 702246 ns 696159 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 16958 ns 16375 ns 1.04
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 17500 ns 17166 ns 1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 18667 ns 17708.5 ns 1.05
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 16292 ns 16500 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 148481 ns 149324.5 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI 5705815.5 ns 5632259 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal 524625 ns 451041 ns 1.16
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 239632 ns 238583.5 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 27313 ns 25041.5 ns 1.09
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 28000 ns 27458.5 ns 1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 26959 ns 27208 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 25250.5 ns 27417 ns 0.92
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 984571.5 ns 967445 ns 1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI 39101861 ns 42171811 ns 0.93
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal 6098270.5 ns 5985271 ns 1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 714026 ns 714285 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 11084 ns 10541 ns 1.05
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 11417 ns 10708 ns 1.07
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 13500 ns 12167 ns 1.11
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 11083 ns 10416 ns 1.06
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 126073 ns 124817.5 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI 3886871.5 ns 3419436 ns 1.14
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal 831084 ns 811375 ns 1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 239702 ns 240213 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 21542 ns 21834 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 22083 ns 21917 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 21958 ns 22667 ns 0.97
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 21667 ns 22875 ns 0.95
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 706566 ns 693267 ns 1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI 20349369 ns 20616607 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal 5568417 ns 5554812 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 687945 ns 675879 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 62437.5 ns 63875 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 63291.5 ns 65458 ns 0.97
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 66458 ns 68750 ns 0.97
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 62750 ns 63291 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 107293.5 ns 106862 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3660159 ns 3236758 ns 1.13
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1323062.5 ns 1339728.5 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 239692 ns 237523 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 490792 ns 436417 ns 1.12
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 443541 ns 449729 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 450500 ns 447750 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 437917 ns 486125 ns 0.90
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 517439 ns 515853 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 21639679 ns 20960752 ns 1.03
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6076541.5 ns 6146771 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 731271.5 ns 715734 ns 1.02
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7208.5 ns 7250.5 ns 0.99
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7521 ns 7041.5 ns 1.07
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 9042 ns 8708 ns 1.04
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 7125 ns 6916.5 ns 1.03
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 146603.5 ns 146046 ns 1.00
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI 6392989 ns 5957852 ns 1.07
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal 458687.5 ns 454417 ns 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 59111 ns 59271 ns 1.00
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14875 ns 14771 ns 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 15062.5 ns 15479 ns 0.97
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 14937.5 ns 15062 ns 0.99
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 15125 ns 14000 ns 1.08
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 952699.5 ns 942484.5 ns 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI 40798622 ns 39118999 ns 1.04
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal 5781813 ns 5667729 ns 1.02
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 405764 ns 407846 ns 0.99
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) 6157292 ns 6158125.5 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) 3225250 ns 3218166 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) 3226625 ns 3227708 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) 11915500 ns 11925375 ns 1.00
batchedmm(512, Bsize=4)/forward/GPU/CUDA 350478 ns 351461 ns 1.00
batchedmm(512, Bsize=4)/forward/GPU/AMDGPU 296627.5 ns 299264 ns 0.99
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) 19132270.5 ns 19150312.5 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) 11022312 ns 11075104 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) 11088416 ns 11106625 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) 36416791.5 ns 36514875 ns 1.00
batchedmm(512, Bsize=4)/zygote/GPU/CUDA 1067365 ns 1053961 ns 1.01
batchedmm(512, Bsize=4)/zygote/GPU/AMDGPU 1157365 ns 1154031 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 917 ns 958 ns 0.96
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1042 ns 1000 ns 1.04
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1000 ns 1041 ns 0.96
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 959 ns 958 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA 23582 ns 23131 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/oneAPI 2239100 ns 2063993 ns 1.08
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/Metal 288125 ns 232479 ns 1.24
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/AMDGPU 214282 ns 213903 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 3667 ns 3708 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 3750 ns 3667 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 3750 ns 3709 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 3625 ns 3667 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA 282213 ns 280249 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/oneAPI 11258687 ns 11110622 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/Metal 2144000 ns 2136458 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 648245 ns 645129 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 8625 ns 8250 ns 1.05
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 8875 ns 7791.5 ns 1.14
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 9667 ns 9125 ns 1.06
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 8667 ns 8396 ns 1.03
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 122810 ns 121638.5 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI 3572398.5 ns 3248289.5 ns 1.10
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal 748125 ns 788916 ns 0.95
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 69920 ns 67611 ns 1.03
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 11875 ns 11729.5 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 13208.5 ns 12271 ns 1.08
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 13292 ns 13459 ns 0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 11791.5 ns 11770.5 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 647865 ns 639448 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI 22137350 ns 20615290 ns 1.07
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal 4443083 ns 5086271 ns 0.87
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 365428 ns 366630 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) 333 ns 292 ns 1.14
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) 333 ns 291 ns 1.14
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) 292 ns 292 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) 291 ns 292 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA 22406 ns 22523 ns 0.99
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/oneAPI 2171056 ns 2092333 ns 1.04
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/Metal 227500 ns 223666.5 ns 1.02
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/AMDGPU 53190 ns 52621 ns 1.01
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2958 ns 2875 ns 1.03
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 3042 ns 2959 ns 1.03
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 3416 ns 3042 ns 1.12
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2875 ns 2875 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA 204810 ns 203283 ns 1.01
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/oneAPI 10200655 ns 9008155 ns 1.13
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/Metal 1731833 ns 1643667 ns 1.05
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/AMDGPU 161696.5 ns 171352 ns 0.94
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 11833 ns 10209 ns 1.16
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 11792 ns 11875 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 13333 ns 13000 ns 1.03
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 11000.5 ns 11291 ns 0.97
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 123504 ns 122118 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI 3318538 ns 3370469 ns 0.98
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal 955979 ns 932041 ns 1.03
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 238252 ns 239973.5 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 23083.5 ns 20833 ns 1.11
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 22292 ns 20771 ns 1.07
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 21917 ns 21541.5 ns 1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 21417 ns 22729 ns 0.94
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 601694.5 ns 592817 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI 21113099.5 ns 20103668 ns 1.05
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal 4708208 ns 4792708 ns 0.98
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 667256 ns 667099 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4417 ns 4416 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4417 ns 4417 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4417 ns 4458 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4416 ns 4417 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA 24698 ns 24053 ns 1.03
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/oneAPI 2115886.5 ns 2139501 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/Metal 222604 ns 223416 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/AMDGPU 54070 ns 54331 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16167 ns 16292 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16417 ns 16375 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16459 ns 16375 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16229.5 ns 16312.5 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA 332326 ns 328788 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/oneAPI 12612468 ns 12357389.5 ns 1.02
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/Metal 1596583 ns 1610333 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/AMDGPU 216202 ns 214938 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 2084 ns 2042 ns 1.02
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 2166 ns 2042 ns 1.06
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 2083 ns 2208 ns 0.94
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 2042 ns 2000 ns 1.02
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 36354 ns 36532 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI 1200010 ns 1144768 ns 1.05
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal 315833 ns 338417 ns 0.93
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 208602 ns 206372 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 17583 ns 17708.5 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 17687.5 ns 19145.5 ns 0.92
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 17500 ns 18687.5 ns 0.94
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 17375 ns 17583.5 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 295580 ns 294488 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI 20027287 ns 21056777.5 ns 0.95
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal 5448166 ns 4806541.5 ns 1.13
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 694207 ns 704000 ns 0.99
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) 58875 ns 61291.5 ns 0.96
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) 60625 ns 60708 ns 1.00
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) 61083 ns 61791 ns 0.99
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) 51792 ns 51625 ns 1.00
batchedmm(16, Bsize=512)/forward/GPU/CUDA 66673 ns 66466 ns 1.00
batchedmm(16, Bsize=512)/forward/GPU/AMDGPU 94791 ns 97471 ns 0.97
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) 159167 ns 193333 ns 0.82
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) 144395.5 ns 132604 ns 1.09
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) 135645.5 ns 153021 ns 0.89
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) 218917 ns 255166.5 ns 0.86
batchedmm(16, Bsize=512)/zygote/GPU/CUDA 218558 ns 218241 ns 1.00
batchedmm(16, Bsize=512)/zygote/GPU/AMDGPU 584845 ns 583953 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 85042 ns 83208 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 82792 ns 82958 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 84000 ns 87041.5 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 80708 ns 86458 ns 0.93
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 190714 ns 191093 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5317512.5 ns 5412302 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 2004833 ns 1964604.5 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 171982 ns 170373 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1918646 ns 1871250 ns 1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1920146.5 ns 1923625 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1922500 ns 1926625 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1926417 ns 1695083 ns 1.14
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 537463 ns 533673 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 25759235 ns 27973144 ns 0.92
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 8892541.5 ns 8716646 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1086039.5 ns 1083959 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) 292 ns 292 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns 291 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 292 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) 291 ns 292 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA 21729 ns 21925 ns 0.99
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/oneAPI 2156001.5 ns 2103570 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/Metal 340438 ns 323625 ns 1.05
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/AMDGPU 46710 ns 45100 ns 1.04
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 1792 ns 1792 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 1833 ns 1791 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 1833 ns 1834 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 1792 ns 1792 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA 254431 ns 253156 ns 1.01
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/oneAPI 9970916 ns 9676564 ns 1.03
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/Metal 1487229 ns 1486062.5 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/AMDGPU 187061 ns 183853 ns 1.02
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 8583 ns 9250 ns 0.93
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 10709 ns 8791 ns 1.22
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 11104 ns 11541.5 ns 0.96
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 8625 ns 9833 ns 0.88
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 120562 ns 119759.5 ns 1.01
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI 3392235 ns 3304871 ns 1.03
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal 896875 ns 911583 ns 0.98
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 237707 ns 242563 ns 0.98
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 10541 ns 9084 ns 1.16
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 10708 ns 9083 ns 1.18
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 10145.5 ns 9875 ns 1.03
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9583 ns 10542 ns 0.91
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 533774 ns 528445 ns 1.01
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI 21000394 ns 20929851 ns 1.00
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal 4276334 ns 4465959 ns 0.96
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 627816 ns 649598 ns 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 58000 ns 58187.5 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 39375 ns 39583 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 39645.5 ns 39833 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83709 ns 83167 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 40383 ns 39718.5 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1402434 ns 1335928 ns 1.05
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1146125 ns 1144792 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 78591 ns 76941 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1924459 ns 1876458.5 ns 1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1974917 ns 1982000 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1974604 ns 1975334 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1871479 ns 1876084 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 223824 ns 223366 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 33746868 ns 33121559 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 11596333 ns 11069000 ns 1.05
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1032529 ns 1033133.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 418208 ns 419792 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 419792 ns 419416 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 425479 ns 420417 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 424000 ns 417833 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 214170 ns 209830.5 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7394482 ns 7621895 ns 0.97
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 544334 ns 539709 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 285852 ns 287624 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 760250 ns 670083 ns 1.13
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 674416 ns 762791.5 ns 0.88
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 735937.5 ns 739541 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 698562.5 ns 764667 ns 0.91
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1053639 ns 1045546 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 44748386 ns 42506282 ns 1.05
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6735604 ns 6380125 ns 1.06
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 919018.5 ns 921656 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 3464521 ns 3366854.5 ns 1.03
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 3422854 ns 3432979 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 3395334 ns 3458292 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 3468645.5 ns 3357375 ns 1.03
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 175338 ns 176639 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8033050 ns 8129736 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1413708.5 ns 1393270.5 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 428514 ns 423500.5 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 6215854.5 ns 6223146 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 6089541.5 ns 6217459 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 6227000 ns 6240625 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 6184187.5 ns 6221312.5 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1007413 ns 997179 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 51999371 ns 50529292.5 ns 1.03
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 7858354.5 ns 8164709 ns 0.96
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1564094 ns 1566429.5 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 474958 ns 473000 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 253500 ns 254042 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 253250 ns 254542 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 901666 ns 902333 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA 46720 ns 46242.5 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/oneAPI 389374 ns 825428 ns 0.47
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/Metal 425291 ns 517333 ns 0.82
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/AMDGPU 250522 ns 250313 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 2250542 ns 2279166.5 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1761750 ns 1761750 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1761166 ns 1764396 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 3198959 ns 3193125 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA 269668 ns 268875.5 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/oneAPI 8958333 ns 13207390 ns 0.68
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/Metal 2163500 ns 2166292 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 786867 ns 784110 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57792 ns 57375 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 39333 ns 39292 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 39791 ns 39541 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83417 ns 83667 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 28664 ns 28000 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 978884 ns 1420961.5 ns 0.69
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1153500 ns 1133895.5 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 75511 ns 78041 ns 0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2028042 ns 1783500 ns 1.14
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2056020.5 ns 2087458 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2086000 ns 2091417 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1945229 ns 1973375 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 236349.5 ns 235065 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 38351191 ns 34323841 ns 1.12
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 11477729 ns 11467646 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1054329.5 ns 1053243 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 58541 ns 57500 ns 1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 39875 ns 39791 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 40167 ns 39875 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82792 ns 83333 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 50484.5 ns 49753 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 809837 ns 807009.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1111291 ns 1110750 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 78201 ns 71821 ns 1.09
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1881500 ns 1870083 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1941229.5 ns 1974791.5 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1971250 ns 1975458.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1896833 ns 1719417 ns 1.10
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 242923.5 ns 242025 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 18110920.5 ns 17950511 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 9855750 ns 9840104.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 930267 ns 928181 ns 1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 292 ns 292 ns 1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 375 ns 333 ns 1.13
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 292 ns 292 ns 1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 35373 ns 35044 ns 1.01
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI 1269712 ns 1224421 ns 1.04
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal 440854.5 ns 279916 ns 1.57
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 47780 ns 50520 ns 0.95
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6791.5 ns 6083 ns 1.12
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6667 ns 7041 ns 0.95
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6959 ns 7542 ns 0.92
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6958 ns 6583 ns 1.06
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 215041.5 ns 212138.5 ns 1.01
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI 20712502.5 ns 20858604.5 ns 0.99
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal 4803791.5 ns 4933020.5 ns 0.97
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 375493 ns 377125 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) 250 ns 250 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns 250 ns 1.17
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 292 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) 250 ns 250 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA 32582 ns 32102 ns 1.01
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/oneAPI 1261691 ns 1246143 ns 1.01
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/Metal 254541.5 ns 252500 ns 1.01
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/AMDGPU 43691 ns 40121 ns 1.09
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 2958 ns 3250 ns 0.91
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 3167 ns 2833 ns 1.12
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 2958 ns 3417 ns 0.87
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 2834 ns 3166 ns 0.90
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA 189747.5 ns 187793.5 ns 1.01
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/oneAPI 8052543 ns 7423467 ns 1.08
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/Metal 938459 ns 930666 ns 1.01
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/AMDGPU 162191.5 ns 159252 ns 1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 450145.5 ns 426395.5 ns 1.06
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 446959 ns 423458 ns 1.06
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 429042 ns 453437.5 ns 0.95
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 422229 ns 422541.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 138250 ns 138012 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 6236248.5 ns 6078596 ns 1.03
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 2128729 ns 2105875 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 373698.5 ns 351154 ns 1.06
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 3793417 ns 3627187.5 ns 1.05
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 3811000 ns 3781646 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 3814875 ns 3818708.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3787042 ns 3816750.5 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 714820 ns 714220.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 33262062 ns 32708263 ns 1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10779334 ns 10437208 ns 1.03
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1498493.5 ns 1330337 ns 1.13
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) 49901250 ns 49952500 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) 25981417 ns 25992042 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) 25983500 ns 25974771 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) 97079479.5 ns 97060375 ns 1.00
batchedmm(512, Bsize=32)/forward/GPU/CUDA 1594678 ns 1609718.5 ns 0.99
batchedmm(512, Bsize=32)/forward/GPU/AMDGPU 1014749 ns 1005437.5 ns 1.01
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) 154541375 ns 154751187.5 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) 88793000 ns 88411625 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) 88530458 ns 89142125 ns 0.99
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) 294936604.5 ns 295023146 ns 1.00
batchedmm(512, Bsize=32)/zygote/GPU/CUDA 6471554 ns 6525541 ns 0.99
batchedmm(512, Bsize=32)/zygote/GPU/AMDGPU 5536819 ns 5541499 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) 17979 ns 17458.5 ns 1.03
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) 15459 ns 15417 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) 13000 ns 13916 ns 0.93
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) 15146 ns 15187 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA 20648 ns 20963 ns 0.98
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/oneAPI 1156334 ns 1029086 ns 1.12
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/Metal 224875 ns 221417 ns 1.02
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/AMDGPU 27171 ns 27290 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) 11125 ns 10625 ns 1.05
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) 7729 ns 7687.5 ns 1.01
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) 7854.5 ns 7895.5 ns 0.99
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) 17250 ns 17333.5 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA 263885.5 ns 262988 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/oneAPI 9825365 ns 11032315 ns 0.89
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/Metal 1608208 ns 1558750 ns 1.03
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/AMDGPU 152662 ns 153002 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 9000 ns 7917 ns 1.14
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 8771 ns 8333.5 ns 1.05
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 10396 ns 11125 ns 0.93
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 8833.5 ns 8250 ns 1.07
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 116927 ns 116148 ns 1.01
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI 3541234 ns 3496720 ns 1.01
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal 800667 ns 797854 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 240932 ns 240663 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9729.5 ns 10021 ns 0.97
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10187.5 ns 10083.5 ns 1.01
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10167 ns 10791.5 ns 0.94
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 9604 ns 10584 ns 0.91
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 626219.5 ns 627842 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI 26581249 ns 22890536.5 ns 1.16
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal 5185125.5 ns 4718917 ns 1.10
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 668776 ns 670993.5 ns 1.00
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 10187.5 ns 9271 ns 1.10
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 9792 ns 9541 ns 1.03
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 10917 ns 10875 ns 1.00
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 9292 ns 9270.5 ns 1.00
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 124324.5 ns 122880.5 ns 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI 3445385.5 ns 3253148 ns 1.06
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal 931250 ns 918333 ns 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 72601 ns 73381 ns 0.99
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 13791 ns 15083 ns 0.91
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 15042 ns 14167 ns 1.06
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 16750 ns 17042 ns 0.98
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 14875 ns 14667 ns 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 599265 ns 595348 ns 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI 19837253 ns 19444920 ns 1.02
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal 4467250 ns 4763896 ns 0.94
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 354288 ns 353084 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 500 ns 500 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 666 ns 459 ns 1.45
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 584 ns 625 ns 0.93
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 459 ns 459 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 35015 ns 35417 ns 0.99
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI 1242520 ns 1186574 ns 1.05
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal 426916.5 ns 416604 ns 1.02
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 208692 ns 209112 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 8937.5 ns 8979 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9375 ns 10292 ns 0.91
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9000 ns 10416.5 ns 0.86
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 8917 ns 8729.5 ns 1.02
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 236066 ns 233445.5 ns 1.01
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI 21882713 ns 21282401 ns 1.03
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal 5349250 ns 5435416.5 ns 0.98
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 656976 ns 676048 ns 0.97
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 15792 ns 15708 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 13416 ns 14583 ns 0.92
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 12583.5 ns 12416 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 10979 ns 9937 ns 1.10
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA 22506 ns 21468 ns 1.05
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/oneAPI 1159535 ns 1188974 ns 0.98
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/Metal 197166 ns 204687 ns 0.96
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/AMDGPU 187121.5 ns 182912 ns 1.02
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 32125 ns 32083 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 32125 ns 31979 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 32125 ns 32583 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 32042 ns 31916 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA 281987.5 ns 277811 ns 1.02
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/oneAPI 11256812 ns 11129104.5 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/Metal 1704270.5 ns 1607584 ns 1.06
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 602756 ns 603987 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 439208 ns 443583 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 440312 ns 441395.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 442437.5 ns 443312.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 452250 ns 439937.5 ns 1.03
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 194353 ns 194190 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 6187612 ns 5958005 ns 1.04
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 2010833.5 ns 1994958 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 371658.5 ns 350285 ns 1.06
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 3836125 ns 3816917 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 3828416.5 ns 3836875 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 3833375 ns 3840729.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3804958 ns 3801625 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 543630.5 ns 546260 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 28316634 ns 28675319 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 9576666 ns 9200208 ns 1.04
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1217281 ns 1220685 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) 781279958 ns 783919458 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) 418024250 ns 415090937.5 ns 1.01
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) 415003958 ns 416149396 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) 1553302312.5 ns 1556394646 ns 1.00
batchedmm(512, Bsize=512)/forward/GPU/CUDA 22534687 ns 22758802.5 ns 0.99
batchedmm(512, Bsize=512)/forward/GPU/AMDGPU 14053357 ns 14026629 ns 1.00
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) 2540355333 ns 2531412125 ns 1.00
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) 1525674250 ns 1503429375 ns 1.01
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) 1510867083 ns 1511972625 ns 1.00
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) 5211355166 ns 5238183333 ns 0.99
batchedmm(512, Bsize=512)/zygote/GPU/CUDA 372139138 ns 341968825.5 ns 1.09
batchedmm(512, Bsize=512)/zygote/GPU/AMDGPU 88484108 ns 89112141 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 76833.5 ns 76084 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 80438 ns 77666 ns 1.04
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 79375 ns 79333 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 77000 ns 88708 ns 0.87
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 210860 ns 209926 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7939257.5 ns 7624963 ns 1.04
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 556833 ns 538459 ns 1.03
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 110121 ns 111431 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 194333 ns 193479 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 196250 ns 195396 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 280604.5 ns 255791 ns 1.10
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 209625 ns 263084 ns 0.80
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1044977.5 ns 1056306 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 44213920 ns 42921068 ns 1.03
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6328458 ns 6096396 ns 1.04
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 635831 ns 638587 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) 200007250 ns 199996979.5 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) 103851687 ns 104048375 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) 103904750 ns 103857041 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) 388866500 ns 389154708 ns 1.00
batchedmm(512, Bsize=128)/forward/GPU/CUDA 5820988 ns 5838520 ns 1.00
batchedmm(512, Bsize=128)/forward/GPU/AMDGPU 3429802 ns 3416961 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) 621011562.5 ns 619738166.5 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) 351243917 ns 352609750 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) 354523166 ns 353140208 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) 1184086167 ns 1179908250 ns 1.00
batchedmm(512, Bsize=128)/zygote/GPU/CUDA 26473310 ns 26709121 ns 0.99
batchedmm(512, Bsize=128)/zygote/GPU/AMDGPU 21855057 ns 21908376.5 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7167 ns 7250 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5416 ns 5292 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5375 ns 5375 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10083 ns 9958 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 28684 ns 27949 ns 1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1197511 ns 1220698 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 675667 ns 445770.5 ns 1.52
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 48700 ns 49951 ns 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 215000 ns 214667 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 221959 ns 222250 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 221958.5 ns 222083.5 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 215917 ns 217354.5 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 219373 ns 226060 ns 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 33465057 ns 31786029 ns 1.05
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 9200792 ns 9164667 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 538594 ns 535067 ns 1.01
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 9271 ns 8208.5 ns 1.13
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 8708 ns 7375 ns 1.18
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 10417 ns 10708 ns 0.97
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 7562.5 ns 8021 ns 0.94
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 117805.5 ns 118426 ns 0.99
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI 3416067 ns 3289610 ns 1.04
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal 906500 ns 894959 ns 1.01
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 74050 ns 75861 ns 0.98
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 8500 ns 8458.5 ns 1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9021 ns 8458 ns 1.07
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 11583 ns 11000 ns 1.05
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 8875 ns 8625 ns 1.03
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 511809 ns 525710 ns 0.97
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI 18851406 ns 19476117 ns 0.97
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal 4467000 ns 4580750 ns 0.98
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 318773 ns 323803.5 ns 0.98
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 625 ns 583 ns 1.07
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 709 ns 542 ns 1.31
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 667 ns 709 ns 0.94
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 625 ns 500 ns 1.25
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 25714 ns 26694 ns 0.96
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI 1273031 ns 1232825 ns 1.03
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal 450458.5 ns 334666 ns 1.35
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 48810 ns 51101 ns 0.96
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 16084 ns 12833 ns 1.25
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 12146 ns 11125 ns 1.09
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 12125 ns 12708 ns 0.95
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 11334 ns 11375 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 250965 ns 255932 ns 0.98
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI 23303938.5 ns 23574064 ns 0.99
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal 5365646 ns 5957187 ns 0.90
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 389799 ns 393244 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 106291 ns 106916 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 84625 ns 84416 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 86166 ns 85416 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 146500 ns 146729 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA 24955 ns 24228 ns 1.03
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/oneAPI 1163867 ns 1231419 ns 0.95
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/Metal 262458 ns 259958 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/AMDGPU 185231.5 ns 188842 ns 0.98
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 479417 ns 478583.5 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 519521 ns 479354.5 ns 1.08
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 481771 ns 479416 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 504437.5 ns 522125 ns 0.97
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA 230703 ns 234731 ns 0.98
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/oneAPI 11688664.5 ns 11445580 ns 1.02
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/Metal 2205416 ns 2175521 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 617466 ns 622217.5 ns 0.99
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) 5875 ns 5208 ns 1.13
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) 7333 ns 6958 ns 1.05
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) 7000 ns 7167 ns 0.98
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) 6312.5 ns 5020.5 ns 1.26
batchedmm(16, Bsize=32)/forward/GPU/CUDA 15960 ns 17348 ns 0.92
batchedmm(16, Bsize=32)/forward/GPU/AMDGPU 79085.5 ns 79231 ns 1.00
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) 13208 ns 12875 ns 1.03
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) 10667 ns 12083 ns 0.88
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) 11167 ns 12667 ns 0.88
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) 17083.5 ns 17708.5 ns 0.96
batchedmm(16, Bsize=32)/zygote/GPU/CUDA 211295 ns 217078.5 ns 0.97
batchedmm(16, Bsize=32)/zygote/GPU/AMDGPU 373923 ns 388595 ns 0.96
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) 39209 ns 39625 ns 0.99
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) 50708 ns 50625 ns 1.00
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) 51083 ns 51000 ns 1.00
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) 13541.5 ns 13666.5 ns 0.99
batchedmm(16, Bsize=128)/forward/GPU/CUDA 21656 ns 20461 ns 1.06
batchedmm(16, Bsize=128)/forward/GPU/AMDGPU 80316 ns 83341 ns 0.96
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) 37833 ns 38542 ns 0.98
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) 32083 ns 29917 ns 1.07
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) 30083 ns 31417 ns 0.96
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) 57250 ns 66000 ns 0.87
batchedmm(16, Bsize=128)/zygote/GPU/CUDA 189684 ns 195656.5 ns 0.97
batchedmm(16, Bsize=128)/zygote/GPU/AMDGPU 406643.5 ns 398885 ns 1.02
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) 1916.5 ns 1770.5 ns 1.08
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) 1875 ns 1625 ns 1.15
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) 2125 ns 2292 ns 0.93
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) 1791.5 ns 1729.5 ns 1.04
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA 20698 ns 21146 ns 0.98
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/oneAPI 1151725 ns 1123716.5 ns 1.02
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/Metal 318416.5 ns 302958 ns 1.05
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/AMDGPU 30320 ns 28491 ns 1.06
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) 2208.5 ns 2229.5 ns 0.99
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) 2167 ns 2416 ns 0.90
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) 2375 ns 2375 ns 1.00
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) 2125 ns 2166 ns 0.98
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA 201882 ns 205300 ns 0.98
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/oneAPI 8916839 ns 9074561 ns 0.98
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/Metal 1544750 ns 1516937.5 ns 1.02
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/AMDGPU 138336.5 ns 138212 ns 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6271 ns 5645.5 ns 1.11
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4854.5 ns 4771 ns 1.02
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6312.5 ns 6604 ns 0.96
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5042 ns 4979.5 ns 1.01
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 143284 ns 147775 ns 0.97
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI 5900971 ns 6128313 ns 0.96
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal 567291.5 ns 450875 ns 1.26
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 62250 ns 62371 ns 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8958.5 ns 8958 ns 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9250 ns 8750 ns 1.06
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 9438 ns 9125 ns 1.03
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8687.5 ns 9625 ns 0.90
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 864087 ns 883717 ns 0.98
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI 38587534 ns 41518756 ns 0.93
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal 5770625 ns 5658500 ns 1.02
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 391113 ns 388034 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 56750 ns 56709 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 56916 ns 56833 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 57000 ns 56917 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 58250 ns 58292 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 37169 ns 38043 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1177949 ns 1221995 ns 0.96
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 363395.5 ns 611541 ns 0.59
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 207172 ns 207452 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 451792 ns 450937.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 467500 ns 466917 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 466312.5 ns 468562.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 441500 ns 473167 ns 0.93
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 264130.5 ns 271371 ns 0.97
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 27819058 ns 26618792 ns 1.05
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 8272125 ns 8082167 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 816138 ns 807824 ns 1.01
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) 3324375 ns 3309813 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) 1763125 ns 1763625 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) 1769958 ns 1772167 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) 6313291.5 ns 6307500 ns 1.00
batchedmm(128, Bsize=128)/forward/GPU/CUDA 205480 ns 206270.5 ns 1.00
batchedmm(128, Bsize=128)/forward/GPU/AMDGPU 205832 ns 211692.5 ns 0.97
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) 11532979 ns 11489208 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) 6556937.5 ns 6543312.5 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) 6559812.5 ns 6593875 ns 0.99
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) 21146833 ns 21174666.5 ns 1.00
batchedmm(128, Bsize=128)/zygote/GPU/CUDA 740245 ns 735714 ns 1.01
batchedmm(128, Bsize=128)/zygote/GPU/AMDGPU 1073505 ns 1071922.5 ns 1.00
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 6083 ns 6437 ns 0.95
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 4917 ns 5125 ns 0.96
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6000 ns 7604.5 ns 0.79
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5209 ns 6021 ns 0.87
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 137073 ns 141217 ns 0.97
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI 5763070.5 ns 5736528 ns 1.00
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal 761979.5 ns 743958 ns 1.02
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 58111 ns 58020 ns 1.00
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7166 ns 7750 ns 0.92
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 13625 ns 8791 ns 1.55
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7416 ns 7417 ns 1.00
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7146 ns 8084 ns 0.88
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 748959 ns 759240 ns 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI 37270377 ns 35174267 ns 1.06
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal 5566708.5 ns 5288042 ns 1.05
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 379893 ns 379024.5 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 97625 ns 97583 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 95708 ns 101959 ns 0.94
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 97708 ns 127542 ns 0.77
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 122083 ns 96084 ns 1.27
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 149717.5 ns 153040 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5815106.5 ns 5764876 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 2046458 ns 2076375 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 186092 ns 184732 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2032063 ns 1822416 ns 1.12
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2035417 ns 2035833.5 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2025750 ns 2031521 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2034812.5 ns 2029667 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 699402 ns 712381 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 33359753 ns 32235030 ns 1.03
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10780625 ns 10817667 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1123756 ns 1119068 ns 1.00
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) 32895.5 ns 32771 ns 1.00
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) 35958 ns 34958 ns 1.03
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) 32292 ns 33834 ns 0.95
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) 625 ns 584 ns 1.07
batchedmm(2, Bsize=4)/forward/GPU/CUDA 15283 ns 16070 ns 0.95
batchedmm(2, Bsize=4)/forward/GPU/AMDGPU 80500 ns 80701 ns 1.00
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) 2625 ns 2645.5 ns 0.99
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) 3291 ns 4250 ns 0.77
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) 3042 ns 3083 ns 0.99
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) 2334 ns 2979.5 ns 0.78
batchedmm(2, Bsize=4)/zygote/GPU/CUDA 137331.5 ns 140484 ns 0.98
batchedmm(2, Bsize=4)/zygote/GPU/AMDGPU 346583 ns 362954 ns 0.95
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7166 ns 7250 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5292 ns 5333 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5416 ns 5375 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 9958 ns 10167 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 36591 ns 37558 ns 0.97
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1255336 ns 1203117.5 ns 1.04
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 361167 ns 351958 ns 1.03
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 48660 ns 50591 ns 0.96
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 212187 ns 215229 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 221041.5 ns 223042 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 221312 ns 221041.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 206750 ns 216292 ns 0.96
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 242479 ns 247737.5 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 26291133.5 ns 28210154.5 ns 0.93
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 8127688 ns 7826917 ns 1.04
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 523004.5 ns 518941 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3917 ns 4000 ns 0.98
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3959 ns 3958 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3959 ns 3959 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3958 ns 3958 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA 21550 ns 22280 ns 0.97
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/oneAPI 2157769 ns 2135337 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/Metal 248541 ns 244750 ns 1.02
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/AMDGPU 45911 ns 45821 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 14667 ns 14708 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 14708 ns 14708 ns 1
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 14750 ns 14750 ns 1
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 14708 ns 14708 ns 1
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA 304620 ns 313766.5 ns 0.97
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/oneAPI 10977767 ns 11565919.5 ns 0.95
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/Metal 1043625 ns 996417 ns 1.05
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/AMDGPU 199612 ns 196698 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 106000 ns 102375 ns 1.04
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 98291.5 ns 98375 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 102542 ns 130667 ns 0.78
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 128833 ns 101541 ns 1.27
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 135179 ns 142696 ns 0.95
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5987975 ns 6012180 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 2089312.5 ns 2060042 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 186671 ns 185242 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1921917 ns 1678708 ns 1.14
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1911646 ns 1919562.5 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1921417 ns 1925646 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1918250 ns 1715750 ns 1.12
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 683849 ns 697882 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 32200299 ns 32586423 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10813854.5 ns 10270770.5 ns 1.05
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1072735 ns 1227914 ns 0.87
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17875 ns 20125 ns 0.89
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 18042 ns 18666 ns 0.97
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 21499.5 ns 20125 ns 1.07
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 18541 ns 19041.5 ns 0.97
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 107857 ns 111256 ns 0.97
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3405654 ns 3316785.5 ns 1.03
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1342500 ns 1342375 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 81381 ns 77136 ns 1.06
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 216416.5 ns 216708 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 226729 ns 217270.5 ns 1.04
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 217666.5 ns 217000 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 228458.5 ns 257500 ns 0.89
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 513449.5 ns 522548.5 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 19313650 ns 19703098 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 5992125 ns 6106875 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 473434 ns 495696 ns 0.96
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) 23937.5 ns 23625 ns 1.01
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) 28875 ns 28917 ns 1.00
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) 26500 ns 27167 ns 0.98
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) 1416 ns 1542 ns 0.92
batchedmm(16, Bsize=4)/forward/GPU/CUDA 15770 ns 16593 ns 0.95
batchedmm(16, Bsize=4)/forward/GPU/AMDGPU 82311 ns 83321 ns 0.99
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) 4833 ns 4937.5 ns 0.98
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) 5000 ns 4709 ns 1.06
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) 5166 ns 5125 ns 1.01
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) 4625 ns 5479 ns 0.84
batchedmm(16, Bsize=4)/zygote/GPU/CUDA 205185 ns 210967 ns 0.97
batchedmm(16, Bsize=4)/zygote/GPU/AMDGPU 383233 ns 384204.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 307000 ns 304709 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 307333 ns 305417 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 309000.5 ns 307312.5 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 306959 ns 304999.5 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 227362 ns 231440.5 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7656672 ns 7899776.5 ns 0.97
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 650375 ns 1048666.5 ns 0.62
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 275572 ns 278713 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 537562 ns 531667 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 532667 ns 537916 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 535625 ns 559833 ns 0.96
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 542458 ns 535042 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1070334 ns 1077983 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 44223125 ns 46672590 ns 0.95
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6462771 ns 6185542 ns 1.04
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 869258 ns 867079 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 19500 ns 21000 ns 0.93
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 19958 ns 19792 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 23375 ns 21333.5 ns 1.10
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 19958 ns 20125 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 112543.5 ns 115430.5 ns 0.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3491900 ns 3543630 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 1414625 ns 1426729 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 77381 ns 77991 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 213208 ns 212667 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 213479 ns 214292 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 215042 ns 213916 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 214958 ns 219958 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 750498 ns 758463 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 24681158 ns 25339852 ns 0.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7223895.5 ns 7150812.5 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 544035 ns 549146 ns 0.99
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6417 ns 6666 ns 0.96
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6875 ns 7000.5 ns 0.98
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 8292 ns 8374.5 ns 0.99
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 6500 ns 6396 ns 1.02
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 138363 ns 144368 ns 0.96
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI 5558527 ns 5600145 ns 0.99
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal 777187 ns 781083 ns 1.00
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 69061 ns 69300 ns 1.00
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 10375 ns 10917 ns 0.95
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 10291 ns 10041.5 ns 1.02
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 10708.5 ns 10791 ns 0.99
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 9709 ns 11250.5 ns 0.86
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 819633 ns 829126 ns 0.99
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI 39220243 ns 38035335 ns 1.03
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal 5518708 ns 5400125 ns 1.02
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 385803 ns 389489 ns 0.99
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5958 ns 6333 ns 0.94
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 4875 ns 5291 ns 0.92
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6917 ns 7042 ns 0.98
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 4729.5 ns 4562.5 ns 1.04
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 142280 ns 146644 ns 0.97
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI 5842056.5 ns 5614464 ns 1.04
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal 769459 ns 767750 ns 1.00
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 59561 ns 60400 ns 0.99
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7209 ns 7583 ns 0.95
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7666 ns 7750 ns 0.99
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7917 ns 7625 ns 1.04
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7583 ns 8625 ns 0.88
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 778165 ns 788273 ns 0.99
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI 39496840.5 ns 39532384.5 ns 1.00
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal 5854688 ns 5788792 ns 1.01
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 404144 ns 390144 ns 1.04
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) 14575750 ns 14512959 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) 7731500 ns 7746083 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) 7698583 ns 7719437.5 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) 27811541 ns 27824167 ns 1.00
batchedmm(128, Bsize=512)/forward/GPU/CUDA 535283 ns 532712 ns 1.00
batchedmm(128, Bsize=512)/forward/GPU/AMDGPU 407233 ns 405110 ns 1.01
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) 46519437 ns 46254125 ns 1.01
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) 26552479.5 ns 26514813 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) 26436334 ns 26596375 ns 0.99
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) 85626334 ns 85595417 ns 1.00
batchedmm(128, Bsize=512)/zygote/GPU/CUDA 2913979.5 ns 2648732 ns 1.10
batchedmm(128, Bsize=512)/zygote/GPU/AMDGPU 3300841 ns 3291677 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 66500 ns 69916 ns 0.95
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 66709 ns 66666.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 68312.5 ns 67604 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 67500 ns 69812.5 ns 0.97
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 105648 ns 119643.5 ns 0.88
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3451369.5 ns 3502655.5 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 1470250.5 ns 1447479.5 ns 1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 234332 ns 236773 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 440250 ns 480313 ns 0.92
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 441125 ns 447125 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 445625 ns 447937.5 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 442624.5 ns 444459 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 729654 ns 735182 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 26852660 ns 27836501.5 ns 0.96
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7754417 ns 7344541.5 ns 1.06
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 803477.5 ns 795239 ns 1.01
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 542 ns 500 ns 1.08
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 583 ns 500 ns 1.17
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 625 ns 625 ns 1
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 500 ns 625 ns 0.80
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 32133 ns 32854 ns 0.98
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI 1168342.5 ns 1222475 ns 0.96
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal 351479 ns 464063 ns 0.76
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 49250 ns 50950 ns 0.97
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 8875 ns 8250 ns 1.08
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9271 ns 8687.5 ns 1.07
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 8667 ns 9646 ns 0.90
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 9083 ns 15771 ns 0.58
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 283467 ns 289332 ns 0.98
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI 22237807 ns 22396972 ns 0.99
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal 5030812.5 ns 5647520.5 ns 0.89
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 384844 ns 389255 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 9834 ns 9792 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 9834 ns 9875 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 9875 ns 9875 ns 1
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 9834 ns 9791 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA 23024 ns 23549 ns 0.98
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/oneAPI 2093062 ns 2127803 ns 0.98
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/Metal 223166 ns 223688 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/AMDGPU 216402 ns 215812 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 45750 ns 45583 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 45583 ns 45833 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 46000 ns 45834 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 45750 ns 45792 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA 285399.5 ns 292557 ns 0.98
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/oneAPI 9799339 ns 11637949 ns 0.84
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/Metal 968750 ns 1005416 ns 0.96
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 625876 ns 620161.5 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 56250 ns 56250 ns 1
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 56458 ns 56375 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 56459 ns 56458 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 57917 ns 57750 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 28644 ns 29238.5 ns 0.98
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1187526 ns 1197390 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 631292 ns 658208 ns 0.96
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 205262 ns 204172 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 459458 ns 451791.5 ns 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 465375 ns 471500 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 497666.5 ns 468000 ns 1.06
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 476896 ns 441791.5 ns 1.08
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 243906 ns 250364.5 ns 0.97
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 33120338 ns 32745444 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 9379625 ns 10042062.5 ns 0.93
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 852638 ns 848179.5 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 586000 ns 581125.5 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 645146 ns 649645.5 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 591042 ns 657583 ns 0.90
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 660999.5 ns 614250 ns 1.08
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 206101 ns 209963 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8668014 ns 8555661.5 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1370250 ns 1375959 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 238977 ns 264153 ns 0.90
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2245208 ns 2243542 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2238291.5 ns 2233479 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2233812.5 ns 2247312 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2238042 ns 2249041 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 956767.5 ns 981693 ns 0.97
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 48728968 ns 47646947 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 7240916.5 ns 7438458 ns 0.97
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1384018 ns 1260099 ns 1.10
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 19458 ns 25000 ns 0.78
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 19687.5 ns 19625.5 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 22667 ns 21959 ns 1.03
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 31250 ns 19167 ns 1.63
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 111255 ns 114255 ns 0.97
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3455101.5 ns 3641620.5 ns 0.95
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 1420333.5 ns 1425646 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 78081 ns 82081 ns 0.95
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 225333 ns 256541.5 ns 0.88
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 221583 ns 220250 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 228792 ns 221687.5 ns 1.03
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 232292 ns 221750 ns 1.05
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 721610 ns 733642 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 25857434.5 ns 27659496.5 ns 0.93
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7765792 ns 7468958 ns 1.04
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 566570 ns 559661.5 ns 1.01
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 500 ns 541 ns 0.92
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 583 ns 584 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 625 ns 583 ns 1.07
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 583 ns 542 ns 1.08
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 22767 ns 23294 ns 0.98
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI 1215266.5 ns 1199626 ns 1.01
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal 452375 ns 380395.5 ns 1.19
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 52441 ns 50321 ns 1.04
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9500 ns 9083.5 ns 1.05
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 10271 ns 10167 ns 1.01
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 10625 ns 10271 ns 1.03
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 9875 ns 11333 ns 0.87
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 265487 ns 269037 ns 0.99
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI 25062592 ns 25065409.5 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal 6018666 ns 5606334 ns 1.07
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 420344 ns 414904 ns 1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 10500 ns 8583 ns 1.22
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 8062.5 ns 8458 ns 0.95
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 10292 ns 10458 ns 0.98
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 9042 ns 7625 ns 1.19
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 118808 ns 121505 ns 0.98
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI 3418649 ns 3438400 ns 0.99
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal 885187.5 ns 884250 ns 1.00
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 71665.5 ns 69061 ns 1.04
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7542 ns 7333.5 ns 1.03
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7875 ns 7542 ns 1.04
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7833 ns 7916.5 ns 0.99
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7417 ns 8000 ns 0.93
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 505134.5 ns 512016 ns 0.99
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI 17565081 ns 18614285.5 ns 0.94
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal 4294625 ns 4265271 ns 1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 329113 ns 331073.5 ns 0.99
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1541 ns 1334 ns 1.16
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1708 ns 1625 ns 1.05
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1895.5 ns 2000 ns 0.95
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1375 ns 1458 ns 0.94
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA 21715 ns 20878 ns 1.04
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/oneAPI 1184887 ns 1144746 ns 1.04
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/Metal 308875 ns 305042 ns 1.01
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/AMDGPU 187651 ns 191532 ns 0.98
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 3375 ns 3375 ns 1
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 3375 ns 3375 ns 1
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 3542 ns 3708.5 ns 0.96
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 3291 ns 3458 ns 0.95
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA 219260 ns 220885.5 ns 0.99
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/oneAPI 10744223.5 ns 10272002 ns 1.05
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/Metal 1724187.5 ns 1658437.5 ns 1.04
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 593095 ns 594146 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) 148145.5 ns 149042 ns 0.99
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) 106334 ns 106104 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) 107187.5 ns 107459 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) 233354 ns 225625 ns 1.03
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA 23884 ns 24697 ns 0.97
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/oneAPI 1182223 ns 1197055 ns 0.99
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/Metal 300000 ns 300625 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/AMDGPU 36950 ns 38181 ns 0.97
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) 144520.5 ns 144084 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) 87687 ns 100709 ns 0.87
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) 87792 ns 87937.5 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) 251833 ns 263895.5 ns 0.95
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA 216029 ns 219366 ns 0.98
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/oneAPI 10660743 ns 11143376 ns 0.96
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/Metal 2107666.5 ns 2064125 ns 1.02
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/AMDGPU 239532 ns 226117.5 ns 1.06
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7250 ns 7167 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5292 ns 5333 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5334 ns 5334 ns 1
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10167 ns 10292 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 32714 ns 33744 ns 0.97
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1156018.5 ns 1208626.5 ns 0.96
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 352875 ns 394645.5 ns 0.89
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 53221 ns 50650 ns 1.05
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 220062.5 ns 220458.5 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 228729.5 ns 236458 ns 0.97
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 228833 ns 229542 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 224271 ns 213437 ns 1.05
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 260760 ns 266362.5 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 27980428 ns 26810792 ns 1.04
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 8578229 ns 8119062.5 ns 1.06
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 534445 ns 532916 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 14917 ns 15250 ns 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 15312.5 ns 14812.5 ns 1.03
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 16708 ns 16792 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 14834 ns 15292 ns 0.97
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 139169 ns 142309 ns 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI 5708443 ns 5521569 ns 1.03
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal 797834 ns 788458 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 241323 ns 239123 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 23646 ns 23209 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 23812 ns 24208 ns 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 23958 ns 24104.5 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 23709 ns 23500 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 860230 ns 874682 ns 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI 38347185 ns 39249650.5 ns 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal 5856625 ns 5835021 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 700376.5 ns 702463 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 8834 ns 10062.5 ns 0.88
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 9792 ns 9792 ns 1
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 11083 ns 11375 ns 0.97
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 8916 ns 9250 ns 0.96
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 122964 ns 124966.5 ns 0.98
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI 3472918 ns 3573835 ns 0.97
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal 908125 ns 826250 ns 1.10
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 74011 ns 71705.5 ns 1.03
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14500 ns 13250 ns 1.09
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14146 ns 14021 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 15354 ns 14833 ns 1.04
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14125 ns 14250 ns 0.99
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 660283 ns 673097 ns 0.98
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI 21362028 ns 21882593 ns 0.98
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal 5340833 ns 5231334 ns 1.02
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 375084 ns 372554 ns 1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 9687.5 ns 10083.5 ns 0.96
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 10520.5 ns 9333 ns 1.13
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 11625 ns 10917 ns 1.06
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 9583 ns 9791 ns 0.98
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 121557 ns 124389 ns 0.98
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI 3413782 ns 3411999.5 ns 1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal 932500 ns 932500 ns 1
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 71111 ns 71241 ns 1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 12791 ns 12625 ns 1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 12500 ns 12625 ns 0.99
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13708 ns 13313 ns 1.03
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 12166 ns 12375 ns 0.98
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 548626 ns 557333 ns 0.98
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI 20221743.5 ns 19402473 ns 1.04
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal 4648729 ns 4633542 ns 1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 350268.5 ns 348154 ns 1.01
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) 29604.5 ns 29708 ns 1.00
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) 31542 ns 31750 ns 0.99
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) 30375 ns 29667 ns 1.02
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) 1833 ns 1834 ns 1.00
batchedmm(2, Bsize=128)/forward/GPU/CUDA 15946 ns 16586 ns 0.96
batchedmm(2, Bsize=128)/forward/GPU/AMDGPU 74191 ns 74511 ns 1.00
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) 5125 ns 5292 ns 0.97
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) 4791.5 ns 4542 ns 1.05
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) 5291.5 ns 5375 ns 0.98
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) 6375 ns 6667 ns 0.96
batchedmm(2, Bsize=128)/zygote/GPU/CUDA 138206 ns 142234 ns 0.97
batchedmm(2, Bsize=128)/zygote/GPU/AMDGPU 374353 ns 371734 ns 1.01
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 291 ns 292 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 375 ns 292 ns 1.28
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 292 ns 291 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 25381 ns 26130 ns 0.97
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI 1154493.5 ns 1255720 ns 0.92
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal 446500 ns 468750 ns 0.95
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 48930 ns 48500 ns 1.01
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6209 ns 6542 ns 0.95
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6792 ns 6542 ns 1.04
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6708 ns 6583 ns 1.02
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6625 ns 6167 ns 1.07
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 184823.5 ns 190203.5 ns 0.97
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI 23775568 ns 23758213 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal 5401167 ns 5392792 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 393993 ns 393904 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 2000 ns 1958 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 2000 ns 2000 ns 1
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 2083 ns 2084 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 2000 ns 1958 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 25927 ns 27189 ns 0.95
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI 1174327 ns 1199767 ns 0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal 313812.5 ns 312750.5 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 208652 ns 208272 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 16875 ns 15916.5 ns 1.06
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 16666 ns 16291 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 16291.5 ns 16979 ns 0.96
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 16687.5 ns 16312.5 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 271953.5 ns 276740.5 ns 0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI 28538231.5 ns 24755518 ns 1.15
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal 5705375 ns 5979167 ns 0.95
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 711016.5 ns 715538 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 148084 ns 180833 ns 0.82
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 164437 ns 151333.5 ns 1.09
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 150583.5 ns 179000 ns 0.84
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 184958 ns 147562.5 ns 1.25
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 198930 ns 207596 ns 0.96
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 7772893 ns 7810338 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1453625 ns 1464083.5 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 196832 ns 195132 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1306854 ns 1308625 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1304812.5 ns 1320417 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1334500.5 ns 1326167 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1335563 ns 1318250 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 896336.5 ns 915789.5 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 44103385 ns 47829317 ns 0.92
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 6551250 ns 6477041 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1123231 ns 1020372 ns 1.10
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 26000 ns 26333 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 25229 ns 24750 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 27479.5 ns 27709 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 24791 ns 29458.5 ns 0.84
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 235714.5 ns 237299.5 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 8360389 ns 7668370 ns 1.09
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 618125 ns 1182167 ns 0.52
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 106221 ns 121321 ns 0.88
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 180291.5 ns 181812.5 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 119292 ns 118083 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 119104.5 ns 129000 ns 0.92
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 133396 ns 118458 ns 1.13
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1061050 ns 1085787 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 47965532 ns 43559074 ns 1.10
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6177667 ns 6188875 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 624876 ns 603482 ns 1.04
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 291 ns 333 ns 0.87
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 375 ns 375 ns 1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 334 ns 1.12
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 292 ns 292 ns 1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 22572 ns 23112 ns 0.98
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI 1251092 ns 1222588 ns 1.02
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal 324125 ns 395646 ns 0.82
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 48860.5 ns 48781 ns 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6458 ns 6042 ns 1.07
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6708.5 ns 6833 ns 0.98
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 7020.5 ns 6729.5 ns 1.04
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6479.5 ns 6354 ns 1.02
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 201712.5 ns 206261.5 ns 0.98
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI 24190754 ns 24411973 ns 0.99
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal 5519229 ns 5650084 ns 0.98
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 393274 ns 392024.5 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 7083 ns 6166 ns 1.15
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6479.5 ns 5417 ns 1.20
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 8375 ns 8250 ns 1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 6334 ns 6416 ns 0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 144445.5 ns 148283 ns 0.97
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI 5784772 ns 5523038 ns 1.05
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal 451791 ns 469750 ns 0.96
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 237723 ns 237302 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9708.5 ns 10354 ns 0.94
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10354.5 ns 10166 ns 1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10312.5 ns 10291 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 9958 ns 10125 ns 0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 894173 ns 909984 ns 0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI 41046274 ns 43302207 ns 0.95
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal 6098750 ns 5927833 ns 1.03
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 677436.5 ns 689088 ns 0.98
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 667 ns 625 ns 1.07
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 625 ns 708 ns 0.88
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 667 ns 667 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 708 ns 625 ns 1.13
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA 22187 ns 22992 ns 0.96
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/oneAPI 2028286 ns 2053209 ns 0.99
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/Metal 228479.5 ns 227000 ns 1.01
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/AMDGPU 214902 ns 215913 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 4583 ns 4584 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 4625 ns 4708 ns 0.98
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 4875 ns 4625 ns 1.05
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 4584 ns 4583 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA 222465 ns 228362.5 ns 0.97
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/oneAPI 10522858 ns 10246488 ns 1.03
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/Metal 1645875 ns 1762500 ns 0.93
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 596496 ns 596946 ns 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 8229.5 ns 8791.5 ns 0.94
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 9083.5 ns 8021 ns 1.13
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 10208.5 ns 10208.5 ns 1
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 7834 ns 8625 ns 0.91
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 121070 ns 123762 ns 0.98
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI 3530511 ns 3582537 ns 0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal 831083 ns 795292 ns 1.05
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 70060 ns 70171 ns 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8500 ns 8500 ns 1
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8958 ns 9084 ns 0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 9333 ns 9291 ns 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8375 ns 8250 ns 1.02
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 586511 ns 599222 ns 0.98
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI 21323888.5 ns 22439265 ns 0.95
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal 4802708.5 ns 4920229 ns 0.98
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 354733 ns 352418.5 ns 1.01
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) 125292 ns 126375 ns 0.99
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) 96708 ns 96167 ns 1.01
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) 97250 ns 96396 ns 1.01
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) 183416 ns 183208 ns 1.00
batchedmm(128, Bsize=4)/forward/GPU/CUDA 45670 ns 46448 ns 0.98
batchedmm(128, Bsize=4)/forward/GPU/AMDGPU 99990.5 ns 94021 ns 1.06
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) 302791 ns 302354.5 ns 1.00
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) 168083 ns 168625 ns 1.00
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) 166833 ns 178500 ns 0.93
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) 607229.5 ns 568625 ns 1.07
batchedmm(128, Bsize=4)/zygote/GPU/CUDA 189831.5 ns 193426.5 ns 0.98
batchedmm(128, Bsize=4)/zygote/GPU/AMDGPU 489695 ns 485945.5 ns 1.01
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) 398375 ns 398500 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) 215333 ns 214958 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) 215125 ns 215459 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) 756459 ns 755958 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA 43130 ns 43652 ns 0.99
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/oneAPI 1398407.5 ns 1354730.5 ns 1.03
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/Metal 412042 ns 489291.5 ns 0.84
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/AMDGPU 83571 ns 83401 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 1405604.5 ns 1416708 ns 0.99
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 863250 ns 861208 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 861479.5 ns 863229.5 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 2358542 ns 2359083 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA 249090 ns 249519.5 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/oneAPI 10996775 ns 11581786 ns 0.95
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/Metal 1820250 ns 1843542 ns 0.99
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/AMDGPU 355383 ns 354834 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 611208 ns 651104 ns 0.94
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 648500 ns 636792 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 648812 ns 662104.5 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 662875 ns 581792 ns 1.14
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 194388.5 ns 204117 ns 0.95
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8240834 ns 7983269 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1397562 ns 1360250 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 254103 ns 255778 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2466021 ns 2460458 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2458875 ns 2454583 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2463604.5 ns 2468375 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2452250 ns 2463875 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 981852.5 ns 992828 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 51226623.5 ns 53061666.5 ns 0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 7566875 ns 7675854 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1486799.5 ns 1495551 ns 0.99
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) 32542 ns 32562.5 ns 1.00
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) 34750 ns 34584 ns 1.00
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) 32229.5 ns 32583.5 ns 0.99
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) 917 ns 833 ns 1.10
batchedmm(2, Bsize=32)/forward/GPU/CUDA 15560 ns 15923 ns 0.98
batchedmm(2, Bsize=32)/forward/GPU/AMDGPU 78491 ns 73991 ns 1.06
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) 3083 ns 3145.5 ns 0.98
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) 3479.5 ns 3416 ns 1.02
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) 3334 ns 3458.5 ns 0.96
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) 3125 ns 3084 ns 1.01
batchedmm(2, Bsize=32)/zygote/GPU/CUDA 136477.5 ns 139769 ns 0.98
batchedmm(2, Bsize=32)/zygote/GPU/AMDGPU 359243 ns 346409 ns 1.04
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 407250 ns 407417 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 402125 ns 401791 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 401833 ns 401916 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 421584 ns 421167 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 43081.5 ns 43360 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1443601.5 ns 1424417 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1160541.5 ns 1149708 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 242377.5 ns 244183 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 3877250 ns 3883958 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 3991438 ns 3996708.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 3995500 ns 3992125 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3778791.5 ns 3780895.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 240481 ns 246111 ns 0.98
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 36046095 ns 36934379 ns 0.98
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 11740520.5 ns 11631750 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1247192 ns 1246158.5 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3916 ns 3958 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3958 ns 3958 ns 1
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3959 ns 3917 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3917 ns 3916 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA 33151 ns 33757 ns 0.98
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/oneAPI 1246525 ns 1234748.5 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/Metal 178083 ns 181500.5 ns 0.98
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/AMDGPU 40841 ns 43060 ns 0.95
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 15459 ns 15500 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 15708 ns 15583 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 15792 ns 15666 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 15584 ns 15541 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA 250190 ns 256020 ns 0.98
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/oneAPI 9448692 ns 10686428 ns 0.88
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/Metal 867250 ns 870458 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/AMDGPU 170041 ns 178281 ns 0.95
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) 404041 ns 404000 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) 221437.5 ns 220792 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) 221041 ns 221375 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) 760667 ns 760833 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA 112818 ns 113651 ns 0.99
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/oneAPI 1033657 ns 1020025 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/Metal 396500 ns 412687.5 ns 0.96
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/AMDGPU 90181 ns 91036 ns 0.99
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 1428625 ns 1438417 ns 0.99
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 887375 ns 887125 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 886333 ns 888167 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 2382896 ns 2384958 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA 235417.5 ns 242637 ns 0.97
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/oneAPI 9699881 ns 9528776 ns 1.02
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/Metal 1899708.5 ns 1851667 ns 1.03
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/AMDGPU 356683 ns 357334 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 500 ns 500 ns 1
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 583 ns 542 ns 1.08
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 583 ns 583 ns 1
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 500 ns 459 ns 1.09
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 25300 ns 25949.5 ns 0.97
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI 1213264.5 ns 1192514 ns 1.02
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal 303583 ns 296583.5 ns 1.02
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 208412 ns 211622 ns 0.98
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 7208 ns 7083 ns 1.02
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 7833 ns 8000 ns 0.98
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 7750 ns 7854.5 ns 0.99
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 7625 ns 7333 ns 1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 208958.5 ns 216752 ns 0.96
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI 26254851.5 ns 24950983 ns 1.05
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal 5627271 ns 5888042 ns 0.96
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 691516 ns 701642.5 ns 0.99
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) 828417 ns 813667 ns 1.02
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) 465812.5 ns 465792 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) 471166.5 ns 467791 ns 1.01
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) 1541979 ns 1544375 ns 1.00
batchedmm(128, Bsize=32)/forward/GPU/CUDA 130118 ns 132054 ns 0.99
batchedmm(128, Bsize=32)/forward/GPU/AMDGPU 178896.5 ns 162431 ns 1.10
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) 2704041 ns 2686208 ns 1.01
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) 1527521 ns 1528708 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) 1546750 ns 1538542 ns 1.01
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) 4937042 ns 4933917 ns 1.00
batchedmm(128, Bsize=32)/zygote/GPU/CUDA 239281 ns 240514 ns 0.99
batchedmm(128, Bsize=32)/zygote/GPU/AMDGPU 775748 ns 859970 ns 0.90
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 292 ns 333 ns 0.88
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 375 ns 292 ns 1.28
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 292 ns 292 ns 1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 31418 ns 32094 ns 0.98
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI 1183060 ns 1252325 ns 0.94
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal 307687.5 ns 323021 ns 0.95
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 48455.5 ns 48681 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6125 ns 5917 ns 1.04
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6708.5 ns 6333 ns 1.06
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6562.5 ns 6792 ns 0.97
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6312.5 ns 6083 ns 1.04
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 222050 ns 223941.5 ns 0.99
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI 22089679 ns 23466112 ns 0.94
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal 5038937.5 ns 5053625 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 368804 ns 369274 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 2384708 ns 2397083 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 2406334 ns 2379291 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 2401187.5 ns 2394625 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 2400334 ns 2379250 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 198668 ns 200806.5 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 7953848.5 ns 8223452 ns 0.97
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1483208 ns 1521917 ns 0.97
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 357533 ns 359128.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4652749.5 ns 4667500 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4657895.5 ns 4598667 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4677042 ns 4663834 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4656375 ns 4654084 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 891976 ns 896769 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 50384184.5 ns 49138075.5 ns 1.03
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 6325542 ns 6734812.5 ns 0.94
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1259107 ns 1407615 ns 0.89
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) 6792 ns 7479.5 ns 0.91
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) 7000 ns 7125 ns 0.98
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) 6917 ns 7125 ns 0.97
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) 7375.5 ns 8020.5 ns 0.92
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA 23575 ns 23691.5 ns 1.00
bias_activation(512, act=relu)(512 x 128)/forward/GPU/oneAPI 1197552 ns 1204234 ns 0.99
bias_activation(512, act=relu)(512 x 128)/forward/GPU/Metal 263166 ns 260979.5 ns 1.01
bias_activation(512, act=relu)(512 x 128)/forward/GPU/AMDGPU 37341 ns 33710 ns 1.11
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 52604 ns 44792 ns 1.17
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 45604 ns 33042 ns 1.38
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 49875.5 ns 33459 ns 1.49
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 66000.5 ns 71791.5 ns 0.92
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA 219530 ns 217114 ns 1.01
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/oneAPI 11373243 ns 10571611 ns 1.08
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/Metal 2074458 ns 2004833 ns 1.03
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/AMDGPU 240673 ns 241352 ns 1.00
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) 20750 ns 20458.5 ns 1.01
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) 24750 ns 24625 ns 1.01
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) 22083.5 ns 22625 ns 0.98
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) 5958 ns 6041 ns 0.99
batchedmm(2, Bsize=512)/forward/GPU/CUDA 16981 ns 17905 ns 0.95
batchedmm(2, Bsize=512)/forward/GPU/AMDGPU 86491 ns 86031 ns 1.01
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) 12041 ns 11958 ns 1.01
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) 9333 ns 9417 ns 0.99
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) 9625 ns 9500 ns 1.01
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) 18083 ns 18000 ns 1.00
batchedmm(2, Bsize=512)/zygote/GPU/CUDA 230179 ns 230114.5 ns 1.00
batchedmm(2, Bsize=512)/zygote/GPU/AMDGPU 380594 ns 377559 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) 406208 ns 406625 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) 223541 ns 223250 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) 223145.5 ns 223833 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) 762667 ns 762833 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA 46914 ns 46575 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/oneAPI 1384166 ns 1399200.5 ns 0.99
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/Metal 415875 ns 406583 ns 1.02
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/AMDGPU 89301 ns 89521 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 1427084 ns 1445834 ns 0.99
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 891979 ns 892854.5 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 891958 ns 893333 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 2386312.5 ns 2385770.5 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA 287696.5 ns 281827 ns 1.02
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/oneAPI 12534990 ns 11465517 ns 1.09
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/Metal 2042416 ns 2034937.5 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/AMDGPU 375789 ns 378964 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 433959 ns 434333 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 430208 ns 430667 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 430208 ns 430166 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 447500 ns 447292 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 55750 ns 55027 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 987998 ns 1009771.5 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1135146 ns 1109791.5 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 236932 ns 236872 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 3911708 ns 3915542 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4023250 ns 4022187.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4023416 ns 4023854 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3815521 ns 3802354 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 261796.5 ns 265046 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 33894952 ns 31022310 ns 1.09
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10609979 ns 10484042 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1239582 ns 1238903.5 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 8750 ns 8792 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 6875 ns 6916 ns 0.99
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 6916 ns 6875 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 12459 ns 12458 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA 23476 ns 23854 ns 0.98
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/oneAPI 2195009 ns 2189159 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/Metal 226375 ns 227167 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/AMDGPU 218422 ns 216382 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 44667 ns 44833 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 44875 ns 45000 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 45416 ns 45083 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 44834 ns 44750 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA 341928 ns 339090 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/oneAPI 11449609 ns 13813520 ns 0.83
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/Metal 1758917 ns 1746834 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 662766 ns 671917 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 85687.5 ns 87063 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 82125 ns 92271 ns 0.89
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 88250 ns 125250 ns 0.70
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 87750.5 ns 88396 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 190673 ns 189900.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5914536.5 ns 5870133 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1998792 ns 1961729.5 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 208012 ns 204047 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2027062.5 ns 2028417 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2018395.5 ns 2022208.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2022916.5 ns 2025000 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2027750 ns 2024000 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 529341.5 ns 536109.5 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 27514758 ns 30231842.5 ns 0.91
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 9494875 ns 9333542 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1104271 ns 1104742 ns 1.00

This comment was automatically generated by workflow using github-action-benchmark.
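
As a reading aid: the Ratio column in the table above is consistent with Current divided by Previous, rounded to two decimals, so values above 1 mean the current commit was slower. The sketch below recomputes it for four rows copied from the table. The `ratio` and `flag_regressions` helpers and the 1.05 threshold are illustrative assumptions, not part of github-action-benchmark or LuxLib.

```python
# Rows copied from the benchmark table: (name, current ns, previous ns).
ROWS = [
    ("batchedmm(2, Bsize=512)/forward/CPU/2 thread(s)", 20750.0, 20458.5),
    ("dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/oneAPI", 11449609.0, 13813520.0),
    ("groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)", 88250.0, 125250.0),
    ("dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/oneAPI", 12534990.0, 11465517.0),
]

# Hypothetical cutoff chosen for this example only.
REGRESSION_THRESHOLD = 1.05


def ratio(current_ns: float, previous_ns: float) -> float:
    """Current time divided by previous time, rounded as in the table."""
    return round(current_ns / previous_ns, 2)


def flag_regressions(rows, threshold=REGRESSION_THRESHOLD):
    """Return (name, ratio) pairs whose ratio exceeds the threshold."""
    flagged = []
    for name, current_ns, previous_ns in rows:
        r = ratio(current_ns, previous_ns)
        if r > threshold:
            flagged.append((name, r))
    return flagged


if __name__ == "__main__":
    for name, r in flag_regressions(ROWS):
        print(f"{name}: {r}")
```

With this cutoff only the oneAPI `dense(512, ...)` zygote row is flagged. Differences of a few percent in either direction are probably run-to-run noise rather than an effect of this pull request.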

@avik-pal avik-pal force-pushed the ap/1.0 branch 2 times, most recently from 1f16397 to 1a3d7fa on August 21, 2024 14:59
@avik-pal avik-pal force-pushed the ap/1.0 branch 4 times, most recently from 790b513 to 38f9941 on August 29, 2024 19:10
@avik-pal avik-pal merged commit ef784ed into main Aug 30, 2024
74 of 75 checks passed
@avik-pal avik-pal deleted the ap/1.0 branch August 30, 2024 21:45