fix!: remove deprecations for 1.0 release #82
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## main #82 +/- ##
==========================================
- Coverage 83.68% 80.41% -3.28%
==========================================
Files 38 38
Lines 1900 1899 -1
==========================================
- Hits 1590 1527 -63
- Misses 310 372 +62
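The coverage percentages in the report can be reproduced from the Hits and Lines rows. The snippet below is a minimal sketch of that arithmetic, not Codecov's own implementation; the helper name `coverage_percent` is hypothetical.

```python
def coverage_percent(hits: int, lines: int) -> float:
    """Line coverage as a percentage (hypothetical helper, not a Codecov API)."""
    if lines <= 0:
        raise ValueError("lines must be positive")
    return 100.0 * hits / lines


# Figures taken from the coverage diff above.
base = coverage_percent(1590, 1900)  # main
head = coverage_percent(1527, 1899)  # this PR

print(f"main: {base:.2f}%  PR: {head:.2f}%")

# Hits and Misses should add up to Lines on both sides.
assert 1590 + 310 == 1900
assert 1527 + 372 == 1899
```

Subtracting the two percentages this way gives a drop of roughly 3.27 points, so the reported -3.28% presumably reflects a different rounding step on Codecov's side.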
aeaaaf9 to dba7835
ae5f2ad to a156e06
19ae927 to 259549d
LuxLib Benchmarks
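In the table that follows, the Ratio column appears to be the Current time divided by the Previous time, so values above 1 mean the current commit is slower. The sketch below checks that reading against three rows copied from the table and lists the ones above a cutoff. The function names and the 1.5 cutoff are assumptions made for illustration, not settings taken from the benchmark bot.

```python
def ratio(current_ns: float, previous_ns: float) -> float:
    """Current / Previous, the apparent definition of the Ratio column."""
    if previous_ns <= 0:
        raise ValueError("previous_ns must be positive")
    return current_ns / previous_ns


def flag_regressions(rows, threshold=1.5):
    """Return (name, rounded ratio) for rows slower than `threshold`.

    `threshold` is an illustrative cutoff, not a value used by the project.
    """
    flagged = []
    for name, current_ns, previous_ns in rows:
        r = ratio(current_ns, previous_ns)
        if r > threshold:
            flagged.append((name, round(r, 2)))
    return flagged


# Three rows copied from the table: (benchmark, current ns, previous ns).
rows = [
    ("layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)", 5583, 5833),
    ("layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal", 3361583, 730042),
    ("bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s)", 2958.5, 1666),
]

flagged = flag_regressions(rows)
for name, r in flagged:
    print(f"{r:.2f}x  {name}")
```

Single-run timings in the nanosecond range are noisy, so a flagged row is a prompt to rerun the benchmark rather than proof of a regression.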
Benchmark suite | Current: e47e8ba | Previous: 8dc51b0 | Ratio |
---|---|---|---|
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5583 ns | 5833 ns | 0.96 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5958 ns | 6209 ns | 0.96 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 7209 ns | 6500 ns | 1.11 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 6708 ns | 6333 ns | 1.06 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 117750 ns | 118732 ns | 0.99 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI | 2860850 ns | 2968100 ns | 0.96 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal | 3361583 ns | 730042 ns | 4.60 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 421144 ns | 417444 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 9916.5 ns | 9834 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 9833 ns | 9937.5 ns | 0.99 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9917 ns | 10083 ns | 0.98 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9625 ns | 10083 ns | 0.95 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 553140 ns | 577266 ns | 0.96 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 18595297 ns | 19534378 ns | 0.95 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal | 2382917 ns | 2672542 ns | 0.89 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 696305 ns | 679157 ns | 1.03 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 1625 ns | 1583 ns | 1.03 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 1688 ns | 1875 ns | 0.90 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 2958.5 ns | 1666 ns | 1.78 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 1437.5 ns | 1583.5 ns | 0.91 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA | 21723.5 ns | 21231 ns | 1.02 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/oneAPI | 1340515 ns | 1454941.5 ns | 0.92 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/Metal | 208292 ns | 209312 ns | 1.00 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/AMDGPU | 37181 ns | 30810.5 ns | 1.21 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 3750 ns | 4125 ns | 0.91 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 4167 ns | 4083 ns | 1.02 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 4291.5 ns | 4375 ns | 0.98 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 4459 ns | 4083 ns | 1.09 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA | 145687 ns | 141204.5 ns | 1.03 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/oneAPI | 8279576 ns | 8535587 ns | 0.97 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/Metal | 1490500 ns | 1628312.5 ns | 0.92 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/AMDGPU | 148211.5 ns | 151661 ns | 0.98 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57667 ns | 57875 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 39750 ns | 40125 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 39958 ns | 39792 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83083 ns | 82833 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 37422.5 ns | 36293 ns | 1.03 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 578922.5 ns | 561260.5 ns | 1.03 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1029729.5 ns | 992500 ns | 1.04 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 78625.5 ns | 82050.5 ns | 0.96 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2019625 ns | 2036834 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2085458 ns | 2075792 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2085375 ns | 2052042 ns | 1.02 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2000666 ns | 1989479.5 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 231656 ns | 223552.5 ns | 1.04 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 7765871 ns | 8096655 ns | 0.96 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 7650583 ns | 7643042 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1504421 ns | 1110381 ns | 1.35 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 149083.5 ns | 145625 ns | 1.02 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 146917 ns | 154708.5 ns | 0.95 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 150000 ns | 174688 ns | 0.86 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 147250 ns | 154145.5 ns | 0.96 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 165605 ns | 165157 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 7579764 ns | 7006708 ns | 1.08 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1671333 ns | 1598583 ns | 1.05 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 185332 ns | 185621 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1120208 ns | 1111542 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1112249.5 ns | 1113792 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1119979 ns | 1117416 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1115458.5 ns | 1116375 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 697776 ns | 667136.5 ns | 1.05 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 31396155 ns | 33531181.5 ns | 0.94 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 6206875 ns | 6722500 ns | 0.92 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1041688 ns | 916229 ns | 1.14 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 5125 ns | 4208.5 ns | 1.22 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 4750 ns | 5083 ns | 0.93 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 5583 ns | 4875 ns | 1.15 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 4208 ns | 4375 ns | 0.96 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 93533.5 ns | 88783 ns | 1.05 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI | 5284344 ns | 5683274 ns | 0.93 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal | 465584 ns | 465729 ns | 1.00 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 59600 ns | 71591 ns | 0.83 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8625 ns | 8708 ns | 0.99 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8875 ns | 8625 ns | 1.03 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8833 ns | 8625 ns | 1.02 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8459 ns | 9042 ns | 0.94 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 598346 ns | 582943 ns | 1.03 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 33477623 ns | 37889755 ns | 0.88 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal | 6013458.5 ns | 5975583.5 ns | 1.01 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 390743 ns | 389426 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 17354.5 ns | 18250.5 ns | 0.95 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 17520.5 ns | 18833.5 ns | 0.93 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 18916 ns | 18541 ns | 1.02 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 18708.5 ns | 18021 ns | 1.04 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 67316 ns | 65828.5 ns | 1.02 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 2841743 ns | 2797898 ns | 1.02 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1301187.5 ns | 1292083.5 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 75270.5 ns | 77641 ns | 0.97 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 211750 ns | 212959 ns | 0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 221250 ns | 212416 ns | 1.04 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 212979.5 ns | 223375 ns | 0.95 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 220959 ns | 219958 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 355350 ns | 345170 ns | 1.03 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 12887660 ns | 13000126.5 ns | 0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 5578875 ns | 5618187 ns | 0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 475568.5 ns | 472507 ns | 1.01 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 583.5 ns | 625 ns | 0.93 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 625 ns | 625 ns | 1 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 792 ns | 750 ns | 1.06 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 709 ns | 708 ns | 1.00 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA | 20658 ns | 20278 ns | 1.02 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/oneAPI | 1117791 ns | 1174845 ns | 0.95 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/Metal | 293750 ns | 284041.5 ns | 1.03 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/AMDGPU | 32570 ns | 34141 ns | 0.95 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 1417 ns | 1417 ns | 1 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 1437.5 ns | 1375 ns | 1.05 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 1500 ns | 1458 ns | 1.03 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 1334 ns | 1375 ns | 0.97 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA | 125407.5 ns | 122996.5 ns | 1.02 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/oneAPI | 8435986 ns | 8936472 ns | 0.94 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/Metal | 1520937 ns | 1545542 ns | 0.98 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/AMDGPU | 124981 ns | 128652 ns | 0.97 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7375 ns | 7292 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5334 ns | 5416 ns | 0.98 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5458 ns | 5334 ns | 1.02 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10292 ns | 10125 ns | 1.02 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 24335 ns | 23494.5 ns | 1.04 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1244509 ns | 1206688.5 ns | 1.03 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 613771 ns | 352291.5 ns | 1.74 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 46950 ns | 48921 ns | 0.96 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 221791 ns | 265208 ns | 0.84 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 263562.5 ns | 228583 ns | 1.15 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 267459 ns | 268375 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 258042 ns | 220208 ns | 1.17 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 191390.5 ns | 191406 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 31212923 ns | 34275082 ns | 0.91 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 9028021 ns | 9545416 ns | 0.95 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 615105 ns | 615580 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 4125 ns | 4084 ns | 1.01 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 4125 ns | 4125 ns | 1 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 4166 ns | 4125 ns | 1.01 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 4084 ns | 4084 ns | 1 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA | 23747 ns | 23388 ns | 1.02 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/oneAPI | 2059889 ns | 1884551 ns | 1.09 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/Metal | 224375 ns | 222625 ns | 1.01 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/AMDGPU | 48710.5 ns | 50581 ns | 0.96 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16417 ns | 16500 ns | 0.99 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16666 ns | 16541 ns | 1.01 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 17166.5 ns | 16666 ns | 1.03 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16500 ns | 16500 ns | 1 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA | 196190.5 ns | 191032 ns | 1.03 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/oneAPI | 10575444.5 ns | 9654050 ns | 1.10 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/Metal | 1220604 ns | 1315416 ns | 0.93 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/AMDGPU | 178941 ns | 179153 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 511375 ns | 511083 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 332250 ns | 332542 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 331958 ns | 332750 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 865541 ns | 865000 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA | 113960 ns | 113564 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/oneAPI | 396373 ns | 397782 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/Metal | 455458 ns | 399542 ns | 1.14 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/AMDGPU | 247962 ns | 249264 ns | 0.99 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 2265542 ns | 2268937 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1741145.5 ns | 1755645.5 ns | 0.99 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1750125 ns | 1746583 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 3194667 ns | 3196292 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA | 240998 ns | 236643 ns | 1.02 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/oneAPI | 12033885 ns | 9269331 ns | 1.30 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/Metal | 1913833 ns | 1892000 ns | 1.01 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/AMDGPU | 763086 ns | 761836.5 ns | 1.00 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6625 ns | 6167 ns | 1.07 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6104 ns | 6250 ns | 0.98 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 7709 ns | 7875 ns | 0.98 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 6709 ns | 6292 ns | 1.07 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 90571.5 ns | 90951 ns | 1.00 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI | 5303527.5 ns | 5183601.5 ns | 1.02 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal | 773833.5 ns | 790084 ns | 0.98 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 60371 ns | 60171 ns | 1.00 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 11104.5 ns | 9729.5 ns | 1.14 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 11541.5 ns | 11833.5 ns | 0.98 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 11583.5 ns | 10709 ns | 1.08 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 11041 ns | 11250 ns | 0.98 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 621523 ns | 631820 ns | 0.98 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 38208156 ns | 38968720 ns | 0.98 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal | 5786083 ns | 5635041.5 ns | 1.03 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 413623 ns | 413756 ns | 1.00 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 541 ns | 541 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 542 ns | 541 ns | 1.00 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 541 ns | 500 ns | 1.08 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 541 ns | 541 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA | 23874 ns | 22959 ns | 1.04 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/oneAPI | 2259468 ns | 2250193 ns | 1.00 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/Metal | 228959 ns | 229979.5 ns | 1.00 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/AMDGPU | 51460 ns | 51060 ns | 1.01 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2125 ns | 2084 ns | 1.02 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2167 ns | 2084 ns | 1.04 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 2209 ns | 2083 ns | 1.06 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2083 ns | 2125 ns | 0.98 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA | 218615 ns | 238043 ns | 0.92 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/oneAPI | 10999156 ns | 12339690 ns | 0.89 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/Metal | 1993375 ns | 1997542 ns | 1.00 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/AMDGPU | 179811 ns | 176033 ns | 1.02 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 9375 ns | 8458 ns | 1.11 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 9709 ns | 8604.5 ns | 1.13 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 11667 ns | 10250 ns | 1.14 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 9083 ns | 8458 ns | 1.07 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 107558 ns | 111812 ns | 0.96 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI | 3162682 ns | 2954218 ns | 1.07 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal | 851875 ns | 809875 ns | 1.05 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 77041 ns | 75421 ns | 1.02 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 17958.5 ns | 17729.5 ns | 1.01 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 19042 ns | 17854 ns | 1.07 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 18833 ns | 18479.5 ns | 1.02 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 17541.5 ns | 17500 ns | 1.00 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 597196.5 ns | 612415.5 ns | 0.98 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 17134824 ns | 16447833 ns | 1.04 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal | 5474187 ns | 5303292 ns | 1.03 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 387393 ns | 386655 ns | 1.00 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 500 ns | 459 ns | 1.09 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 583 ns | 459 ns | 1.27 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 584 ns | 625 ns | 0.93 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 583 ns | 542 ns | 1.08 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 35659 ns | 35148 ns | 1.01 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI | 1185739.5 ns | 1185387 ns | 1.00 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal | 293083 ns | 379167 ns | 0.77 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 47871 ns | 45811 ns | 1.04 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 9583.5 ns | 8625.5 ns | 1.11 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 9333.5 ns | 9625 ns | 0.97 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9270.5 ns | 9833 ns | 0.94 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 8937.5 ns | 8979.5 ns | 1.00 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 259605 ns | 266322.5 ns | 0.97 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI | 18447546 ns | 19024975 ns | 0.97 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal | 5011104 ns | 5023625 ns | 1.00 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 374128 ns | 376345 ns | 0.99 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 398875 ns | 398458 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 215584 ns | 215375 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 215083 ns | 215625 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 755958 ns | 756084 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA | 111970 ns | 110416.5 ns | 1.01 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/oneAPI | 332768 ns | 325801 ns | 1.02 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/Metal | 386416 ns | 380603.5 ns | 1.02 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/AMDGPU | 79430 ns | 78551 ns | 1.01 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 1388333 ns | 1395208.5 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 857833 ns | 859166.5 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 858042 ns | 860417 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 2356750 ns | 2356542 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA | 207644 ns | 203387 ns | 1.02 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/oneAPI | 8781675 ns | 10253444.5 ns | 0.86 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/Metal | 1598729 ns | 1668583 ns | 0.96 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/AMDGPU | 322992.5 ns | 324309.5 ns | 1.00 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 7312.5 ns | 7521 ns | 0.97 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 7709 ns | 7208 ns | 1.07 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8375 ns | 7937.5 ns | 1.06 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 7125 ns | 7354.5 ns | 0.97 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 143151 ns | 146147.5 ns | 0.98 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI | 6385139 ns | 5499314 ns | 1.16 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal | 448750 ns | 448604 ns | 1.00 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 61490 ns | 60691 ns | 1.01 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 15020.5 ns | 14937.5 ns | 1.01 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 14291 ns | 13604.5 ns | 1.05 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 14625 ns | 13667 ns | 1.07 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 15854 ns | 15375.5 ns | 1.03 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 972405 ns | 955436.5 ns | 1.02 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 49449537.5 ns | 43131702 ns | 1.15 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal | 5975895.5 ns | 5899125.5 ns | 1.01 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 437174 ns | 433397 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 25083 ns | 24125 ns | 1.04 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 25125 ns | 24708.5 ns | 1.02 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 28500 ns | 28229 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 25687.5 ns | 24895.5 ns | 1.03 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 201048 ns | 196723 ns | 1.02 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 8216742 ns | 7737736 ns | 1.06 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1213500 ns | 1117208 ns | 1.09 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 118471 ns | 117742 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 145292 ns | 103770.5 ns | 1.40 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 147625 ns | 117375.5 ns | 1.26 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 113812.5 ns | 147541.5 ns | 0.77 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 148562.5 ns | 159541 ns | 0.93 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1082636 ns | 1058384 ns | 1.02 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 42070250 ns | 44485069 ns | 0.95 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 5751979 ns | 5929750 ns | 0.97 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 601145 ns | 590519 ns | 1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 74542 ns | 75041 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 76959 ns | 75021 ns | 1.03 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 80042 ns | 76729.5 ns | 1.04 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 74000 ns | 85708 ns | 0.86 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 210048 ns | 203053 ns | 1.03 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7715013 ns | 7420031.5 ns | 1.04 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 543000 ns | 532041.5 ns | 1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 125206 ns | 125262 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 301875 ns | 274937.5 ns | 1.10 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 284709 ns | 306333 ns | 0.93 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 211791 ns | 314500 ns | 0.67 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 299770.5 ns | 291333 ns | 1.03 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1128416 ns | 1113767.5 ns | 1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 43276791 ns | 41752889.5 ns | 1.04 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6601000 ns | 6339625 ns | 1.04 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 702246 ns | 696159 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 16958 ns | 16375 ns | 1.04 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 17500 ns | 17166 ns | 1.02 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 18667 ns | 17708.5 ns | 1.05 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 16292 ns | 16500 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 148481 ns | 149324.5 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI | 5705815.5 ns | 5632259 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal | 524625 ns | 451041 ns | 1.16 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 239632 ns | 238583.5 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 27313 ns | 25041.5 ns | 1.09 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 28000 ns | 27458.5 ns | 1.02 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 26959 ns | 27208 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 25250.5 ns | 27417 ns | 0.92 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 984571.5 ns | 967445 ns | 1.02 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 39101861 ns | 42171811 ns | 0.93 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal | 6098270.5 ns | 5985271 ns | 1.02 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 714026 ns | 714285 ns | 1.00 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 11084 ns | 10541 ns | 1.05 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 11417 ns | 10708 ns | 1.07 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 13500 ns | 12167 ns | 1.11 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 11083 ns | 10416 ns | 1.06 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 126073 ns | 124817.5 ns | 1.01 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI | 3886871.5 ns | 3419436 ns | 1.14 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal | 831084 ns | 811375 ns | 1.02 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 239702 ns | 240213 ns | 1.00 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 21542 ns | 21834 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 22083 ns | 21917 ns | 1.01 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 21958 ns | 22667 ns | 0.97 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 21667 ns | 22875 ns | 0.95 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 706566 ns | 693267 ns | 1.02 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 20349369 ns | 20616607 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal | 5568417 ns | 5554812 ns | 1.00 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 687945 ns | 675879 ns | 1.02 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 62437.5 ns | 63875 ns | 0.98 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 63291.5 ns | 65458 ns | 0.97 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 66458 ns | 68750 ns | 0.97 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 62750 ns | 63291 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 107293.5 ns | 106862 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3660159 ns | 3236758 ns | 1.13 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1323062.5 ns | 1339728.5 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 239692 ns | 237523 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 490792 ns | 436417 ns | 1.12 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 443541 ns | 449729 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 450500 ns | 447750 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 437917 ns | 486125 ns | 0.90 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 517439 ns | 515853 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 21639679 ns | 20960752 ns | 1.03 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6076541.5 ns | 6146771 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 731271.5 ns | 715734 ns | 1.02 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 7208.5 ns | 7250.5 ns | 0.99 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 7521 ns | 7041.5 ns | 1.07 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 9042 ns | 8708 ns | 1.04 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 7125 ns | 6916.5 ns | 1.03 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 146603.5 ns | 146046 ns | 1.00 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI | 6392989 ns | 5957852 ns | 1.07 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal | 458687.5 ns | 454417 ns | 1.01 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 59111 ns | 59271 ns | 1.00 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14875 ns | 14771 ns | 1.01 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 15062.5 ns | 15479 ns | 0.97 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 14937.5 ns | 15062 ns | 0.99 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 15125 ns | 14000 ns | 1.08 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 952699.5 ns | 942484.5 ns | 1.01 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI | 40798622 ns | 39118999 ns | 1.04 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal | 5781813 ns | 5667729 ns | 1.02 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 405764 ns | 407846 ns | 0.99 |
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) | 6157292 ns | 6158125.5 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) | 3225250 ns | 3218166 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) | 3226625 ns | 3227708 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) | 11915500 ns | 11925375 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/GPU/CUDA | 350478 ns | 351461 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/GPU/AMDGPU | 296627.5 ns | 299264 ns | 0.99 |
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) | 19132270.5 ns | 19150312.5 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) | 11022312 ns | 11075104 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) | 11088416 ns | 11106625 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) | 36416791.5 ns | 36514875 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/GPU/CUDA | 1067365 ns | 1053961 ns | 1.01 |
batchedmm(512, Bsize=4)/zygote/GPU/AMDGPU | 1157365 ns | 1154031 ns | 1.00 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 917 ns | 958 ns | 0.96 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1042 ns | 1000 ns | 1.04 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1000 ns | 1041 ns | 0.96 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 959 ns | 958 ns | 1.00 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA | 23582 ns | 23131 ns | 1.02 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/oneAPI | 2239100 ns | 2063993 ns | 1.08 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/Metal | 288125 ns | 232479 ns | 1.24 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 214282 ns | 213903 ns | 1.00 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 3667 ns | 3708 ns | 0.99 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 3750 ns | 3667 ns | 1.02 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 3750 ns | 3709 ns | 1.01 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 3625 ns | 3667 ns | 0.99 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA | 282213 ns | 280249 ns | 1.01 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/oneAPI | 11258687 ns | 11110622 ns | 1.01 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/Metal | 2144000 ns | 2136458 ns | 1.00 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 648245 ns | 645129 ns | 1.00 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 8625 ns | 8250 ns | 1.05 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 8875 ns | 7791.5 ns | 1.14 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 9667 ns | 9125 ns | 1.06 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 8667 ns | 8396 ns | 1.03 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 122810 ns | 121638.5 ns | 1.01 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI | 3572398.5 ns | 3248289.5 ns | 1.10 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal | 748125 ns | 788916 ns | 0.95 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 69920 ns | 67611 ns | 1.03 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 11875 ns | 11729.5 ns | 1.01 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 13208.5 ns | 12271 ns | 1.08 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 13292 ns | 13459 ns | 0.99 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 11791.5 ns | 11770.5 ns | 1.00 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 647865 ns | 639448 ns | 1.01 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 22137350 ns | 20615290 ns | 1.07 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal | 4443083 ns | 5086271 ns | 0.87 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 365428 ns | 366630 ns | 1.00 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 333 ns | 292 ns | 1.14 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 333 ns | 291 ns | 1.14 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 292 ns | 1 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 291 ns | 292 ns | 1.00 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA | 22406 ns | 22523 ns | 0.99 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/oneAPI | 2171056 ns | 2092333 ns | 1.04 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/Metal | 227500 ns | 223666.5 ns | 1.02 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/AMDGPU | 53190 ns | 52621 ns | 1.01 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2958 ns | 2875 ns | 1.03 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 3042 ns | 2959 ns | 1.03 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 3416 ns | 3042 ns | 1.12 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2875 ns | 2875 ns | 1 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA | 204810 ns | 203283 ns | 1.01 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/oneAPI | 10200655 ns | 9008155 ns | 1.13 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/Metal | 1731833 ns | 1643667 ns | 1.05 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/AMDGPU | 161696.5 ns | 171352 ns | 0.94 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 11833 ns | 10209 ns | 1.16 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 11792 ns | 11875 ns | 0.99 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 13333 ns | 13000 ns | 1.03 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 11000.5 ns | 11291 ns | 0.97 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 123504 ns | 122118 ns | 1.01 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI | 3318538 ns | 3370469 ns | 0.98 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal | 955979 ns | 932041 ns | 1.03 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 238252 ns | 239973.5 ns | 0.99 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 23083.5 ns | 20833 ns | 1.11 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 22292 ns | 20771 ns | 1.07 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 21917 ns | 21541.5 ns | 1.02 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 21417 ns | 22729 ns | 0.94 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 601694.5 ns | 592817 ns | 1.01 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 21113099.5 ns | 20103668 ns | 1.05 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal | 4708208 ns | 4792708 ns | 0.98 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 667256 ns | 667099 ns | 1.00 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 4417 ns | 4416 ns | 1.00 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 4417 ns | 4417 ns | 1 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 4417 ns | 4458 ns | 0.99 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 4416 ns | 4417 ns | 1.00 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA | 24698 ns | 24053 ns | 1.03 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/oneAPI | 2115886.5 ns | 2139501 ns | 0.99 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/Metal | 222604 ns | 223416 ns | 1.00 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/AMDGPU | 54070 ns | 54331 ns | 1.00 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16167 ns | 16292 ns | 0.99 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16417 ns | 16375 ns | 1.00 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16459 ns | 16375 ns | 1.01 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16229.5 ns | 16312.5 ns | 0.99 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA | 332326 ns | 328788 ns | 1.01 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/oneAPI | 12612468 ns | 12357389.5 ns | 1.02 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/Metal | 1596583 ns | 1610333 ns | 0.99 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/AMDGPU | 216202 ns | 214938 ns | 1.01 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 2084 ns | 2042 ns | 1.02 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 2166 ns | 2042 ns | 1.06 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 2083 ns | 2208 ns | 0.94 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 2042 ns | 2000 ns | 1.02 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 36354 ns | 36532 ns | 1.00 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI | 1200010 ns | 1144768 ns | 1.05 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal | 315833 ns | 338417 ns | 0.93 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 208602 ns | 206372 ns | 1.01 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 17583 ns | 17708.5 ns | 0.99 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 17687.5 ns | 19145.5 ns | 0.92 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 17500 ns | 18687.5 ns | 0.94 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 17375 ns | 17583.5 ns | 0.99 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 295580 ns | 294488 ns | 1.00 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 20027287 ns | 21056777.5 ns | 0.95 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal | 5448166 ns | 4806541.5 ns | 1.13 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 694207 ns | 704000 ns | 0.99 |
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) | 58875 ns | 61291.5 ns | 0.96 |
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) | 60625 ns | 60708 ns | 1.00 |
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) | 61083 ns | 61791 ns | 0.99 |
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) | 51792 ns | 51625 ns | 1.00 |
batchedmm(16, Bsize=512)/forward/GPU/CUDA | 66673 ns | 66466 ns | 1.00 |
batchedmm(16, Bsize=512)/forward/GPU/AMDGPU | 94791 ns | 97471 ns | 0.97 |
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) | 159167 ns | 193333 ns | 0.82 |
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) | 144395.5 ns | 132604 ns | 1.09 |
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) | 135645.5 ns | 153021 ns | 0.89 |
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) | 218917 ns | 255166.5 ns | 0.86 |
batchedmm(16, Bsize=512)/zygote/GPU/CUDA | 218558 ns | 218241 ns | 1.00 |
batchedmm(16, Bsize=512)/zygote/GPU/AMDGPU | 584845 ns | 583953 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 85042 ns | 83208 ns | 1.02 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 82792 ns | 82958 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 84000 ns | 87041.5 ns | 0.97 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 80708 ns | 86458 ns | 0.93 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 190714 ns | 191093 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 5317512.5 ns | 5412302 ns | 0.98 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2004833 ns | 1964604.5 ns | 1.02 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 171982 ns | 170373 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1918646 ns | 1871250 ns | 1.03 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1920146.5 ns | 1923625 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1922500 ns | 1926625 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1926417 ns | 1695083 ns | 1.14 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 537463 ns | 533673 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 25759235 ns | 27973144 ns | 0.92 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 8892541.5 ns | 8716646 ns | 1.02 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1086039.5 ns | 1083959 ns | 1.00 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 292 ns | 1 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 291 ns | 1.00 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 292 ns | 1 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 291 ns | 292 ns | 1.00 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA | 21729 ns | 21925 ns | 0.99 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/oneAPI | 2156001.5 ns | 2103570 ns | 1.02 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/Metal | 340438 ns | 323625 ns | 1.05 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/AMDGPU | 46710 ns | 45100 ns | 1.04 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 1792 ns | 1792 ns | 1 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 1833 ns | 1791 ns | 1.02 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 1833 ns | 1834 ns | 1.00 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 1792 ns | 1792 ns | 1 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA | 254431 ns | 253156 ns | 1.01 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/oneAPI | 9970916 ns | 9676564 ns | 1.03 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/Metal | 1487229 ns | 1486062.5 ns | 1.00 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/AMDGPU | 187061 ns | 183853 ns | 1.02 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 8583 ns | 9250 ns | 0.93 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 10709 ns | 8791 ns | 1.22 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 11104 ns | 11541.5 ns | 0.96 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 8625 ns | 9833 ns | 0.88 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 120562 ns | 119759.5 ns | 1.01 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI | 3392235 ns | 3304871 ns | 1.03 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal | 896875 ns | 911583 ns | 0.98 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 237707 ns | 242563 ns | 0.98 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 10541 ns | 9084 ns | 1.16 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 10708 ns | 9083 ns | 1.18 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 10145.5 ns | 9875 ns | 1.03 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9583 ns | 10542 ns | 0.91 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 533774 ns | 528445 ns | 1.01 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 21000394 ns | 20929851 ns | 1.00 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal | 4276334 ns | 4465959 ns | 0.96 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 627816 ns | 649598 ns | 0.97 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58000 ns | 58187.5 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 39375 ns | 39583 ns | 0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 39645.5 ns | 39833 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83709 ns | 83167 ns | 1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 40383 ns | 39718.5 ns | 1.02 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 1402434 ns | 1335928 ns | 1.05 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1146125 ns | 1144792 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 78591 ns | 76941 ns | 1.02 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1924459 ns | 1876458.5 ns | 1.03 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1974917 ns | 1982000 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1974604 ns | 1975334 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1871479 ns | 1876084 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 223824 ns | 223366 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 33746868 ns | 33121559 ns | 1.02 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 11596333 ns | 11069000 ns | 1.05 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1032529 ns | 1033133.5 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 418208 ns | 419792 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 419792 ns | 419416 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 425479 ns | 420417 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 424000 ns | 417833 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 214170 ns | 209830.5 ns | 1.02 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7394482 ns | 7621895 ns | 0.97 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 544334 ns | 539709 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 285852 ns | 287624 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 760250 ns | 670083 ns | 1.13 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 674416 ns | 762791.5 ns | 0.88 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 735937.5 ns | 739541 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 698562.5 ns | 764667 ns | 0.91 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1053639 ns | 1045546 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 44748386 ns | 42506282 ns | 1.05 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6735604 ns | 6380125 ns | 1.06 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 919018.5 ns | 921656 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 3464521 ns | 3366854.5 ns | 1.03 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 3422854 ns | 3432979 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 3395334 ns | 3458292 ns | 0.98 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 3468645.5 ns | 3357375 ns | 1.03 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 175338 ns | 176639 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 8033050 ns | 8129736 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1413708.5 ns | 1393270.5 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 428514 ns | 423500.5 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 6215854.5 ns | 6223146 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 6089541.5 ns | 6217459 ns | 0.98 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 6227000 ns | 6240625 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 6184187.5 ns | 6221312.5 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1007413 ns | 997179 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 51999371 ns | 50529292.5 ns | 1.03 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 7858354.5 ns | 8164709 ns | 0.96 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1564094 ns | 1566429.5 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 474958 ns | 473000 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 253500 ns | 254042 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 253250 ns | 254542 ns | 0.99 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 901666 ns | 902333 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA | 46720 ns | 46242.5 ns | 1.01 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/oneAPI | 389374 ns | 825428 ns | 0.47 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/Metal | 425291 ns | 517333 ns | 0.82 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/AMDGPU | 250522 ns | 250313 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 2250542 ns | 2279166.5 ns | 0.99 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1761750 ns | 1761750 ns | 1 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1761166 ns | 1764396 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 3198959 ns | 3193125 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA | 269668 ns | 268875.5 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/oneAPI | 8958333 ns | 13207390 ns | 0.68 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/Metal | 2163500 ns | 2166292 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/AMDGPU | 786867 ns | 784110 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57792 ns | 57375 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 39333 ns | 39292 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 39791 ns | 39541 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83417 ns | 83667 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 28664 ns | 28000 ns | 1.02 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 978884 ns | 1420961.5 ns | 0.69 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1153500 ns | 1133895.5 ns | 1.02 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 75511 ns | 78041 ns | 0.97 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2028042 ns | 1783500 ns | 1.14 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2056020.5 ns | 2087458 ns | 0.98 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2086000 ns | 2091417 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1945229 ns | 1973375 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 236349.5 ns | 235065 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 38351191 ns | 34323841 ns | 1.12 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 11477729 ns | 11467646 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1054329.5 ns | 1053243 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58541 ns | 57500 ns | 1.02 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 39875 ns | 39791 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 40167 ns | 39875 ns | 1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 82792 ns | 83333 ns | 0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 50484.5 ns | 49753 ns | 1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 809837 ns | 807009.5 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1111291 ns | 1110750 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 78201 ns | 71821 ns | 1.09 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1881500 ns | 1870083 ns | 1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1941229.5 ns | 1974791.5 ns | 0.98 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1971250 ns | 1975458.5 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1896833 ns | 1719417 ns | 1.10 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 242923.5 ns | 242025 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 18110920.5 ns | 17950511 ns | 1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 9855750 ns | 9840104.5 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 930267 ns | 928181 ns | 1.00 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 292 ns | 292 ns | 1 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 333 ns | 1.13 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 292 ns | 1 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 35373 ns | 35044 ns | 1.01 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI | 1269712 ns | 1224421 ns | 1.04 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal | 440854.5 ns | 279916 ns | 1.57 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 47780 ns | 50520 ns | 0.95 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6791.5 ns | 6083 ns | 1.12 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6667 ns | 7041 ns | 0.95 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6959 ns | 7542 ns | 0.92 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6958 ns | 6583 ns | 1.06 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 215041.5 ns | 212138.5 ns | 1.01 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI | 20712502.5 ns | 20858604.5 ns | 0.99 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal | 4803791.5 ns | 4933020.5 ns | 0.97 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 375493 ns | 377125 ns | 1.00 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) |
250 ns |
250 ns |
1 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) |
292 ns |
250 ns |
1.17 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) |
292 ns |
292 ns |
1 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) |
250 ns |
250 ns |
1 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA |
32582 ns |
32102 ns |
1.01 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/oneAPI |
1261691 ns |
1246143 ns |
1.01 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/Metal |
254541.5 ns |
252500 ns |
1.01 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/AMDGPU |
43691 ns |
40121 ns |
1.09 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) |
2958 ns |
3250 ns |
0.91 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) |
3167 ns |
2833 ns |
1.12 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) |
2958 ns |
3417 ns |
0.87 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) |
2834 ns |
3166 ns |
0.90 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA |
189747.5 ns |
187793.5 ns |
1.01 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/oneAPI |
8052543 ns |
7423467 ns |
1.08 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/Metal |
938459 ns |
930666 ns |
1.01 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/AMDGPU |
162191.5 ns |
159252 ns |
1.02 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
450145.5 ns |
426395.5 ns |
1.06 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
446959 ns |
423458 ns |
1.06 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
429042 ns |
453437.5 ns |
0.95 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
422229 ns |
422541.5 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
138250 ns |
138012 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
6236248.5 ns |
6078596 ns |
1.03 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal |
2128729 ns |
2105875 ns |
1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
373698.5 ns |
351154 ns |
1.06 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
3793417 ns |
3627187.5 ns |
1.05 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
3811000 ns |
3781646 ns |
1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
3814875 ns |
3818708.5 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
3787042 ns |
3816750.5 ns |
0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
714820 ns |
714220.5 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
33262062 ns |
32708263 ns |
1.02 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
10779334 ns |
10437208 ns |
1.03 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1498493.5 ns |
1330337 ns |
1.13 |
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) |
49901250 ns |
49952500 ns |
1.00 |
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) |
25981417 ns |
25992042 ns |
1.00 |
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) |
25983500 ns |
25974771 ns |
1.00 |
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) |
97079479.5 ns |
97060375 ns |
1.00 |
batchedmm(512, Bsize=32)/forward/GPU/CUDA |
1594678 ns |
1609718.5 ns |
0.99 |
batchedmm(512, Bsize=32)/forward/GPU/AMDGPU |
1014749 ns |
1005437.5 ns |
1.01 |
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) |
154541375 ns |
154751187.5 ns |
1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) |
88793000 ns |
88411625 ns |
1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) |
88530458 ns |
89142125 ns |
0.99 |
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) |
294936604.5 ns |
295023146 ns |
1.00 |
batchedmm(512, Bsize=32)/zygote/GPU/CUDA |
6471554 ns |
6525541 ns |
0.99 |
batchedmm(512, Bsize=32)/zygote/GPU/AMDGPU |
5536819 ns |
5541499 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) |
17979 ns |
17458.5 ns |
1.03 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) |
15459 ns |
15417 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) |
13000 ns |
13916 ns |
0.93 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) |
15146 ns |
15187 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA |
20648 ns |
20963 ns |
0.98 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/oneAPI |
1156334 ns |
1029086 ns |
1.12 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/Metal |
224875 ns |
221417 ns |
1.02 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/AMDGPU |
27171 ns |
27290 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) |
11125 ns |
10625 ns |
1.05 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) |
7729 ns |
7687.5 ns |
1.01 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) |
7854.5 ns |
7895.5 ns |
0.99 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) |
17250 ns |
17333.5 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA |
263885.5 ns |
262988 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/oneAPI |
9825365 ns |
11032315 ns |
0.89 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/Metal |
1608208 ns |
1558750 ns |
1.03 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/AMDGPU |
152662 ns |
153002 ns |
1.00 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
9000 ns |
7917 ns |
1.14 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
8771 ns |
8333.5 ns |
1.05 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
10396 ns |
11125 ns |
0.93 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
8833.5 ns |
8250 ns |
1.07 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA |
116927 ns |
116148 ns |
1.01 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI |
3541234 ns |
3496720 ns |
1.01 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal |
800667 ns |
797854 ns |
1.00 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU |
240932 ns |
240663 ns |
1.00 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
9729.5 ns |
10021 ns |
0.97 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
10187.5 ns |
10083.5 ns |
1.01 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
10167 ns |
10791.5 ns |
0.94 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
9604 ns |
10584 ns |
0.91 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA |
626219.5 ns |
627842 ns |
1.00 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI |
26581249 ns |
22890536.5 ns |
1.16 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal |
5185125.5 ns |
4718917 ns |
1.10 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
668776 ns |
670993.5 ns |
1.00 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
10187.5 ns |
9271 ns |
1.10 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
9792 ns |
9541 ns |
1.03 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
10917 ns |
10875 ns |
1.00 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
9292 ns |
9270.5 ns |
1.00 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA |
124324.5 ns |
122880.5 ns |
1.01 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI |
3445385.5 ns |
3253148 ns |
1.06 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal |
931250 ns |
918333 ns |
1.01 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU |
72601 ns |
73381 ns |
0.99 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
13791 ns |
15083 ns |
0.91 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
15042 ns |
14167 ns |
1.06 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
16750 ns |
17042 ns |
0.98 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
14875 ns |
14667 ns |
1.01 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA |
599265 ns |
595348 ns |
1.01 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI |
19837253 ns |
19444920 ns |
1.02 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal |
4467250 ns |
4763896 ns |
0.94 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU |
354288 ns |
353084 ns |
1.00 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
500 ns |
500 ns |
1 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
666 ns |
459 ns |
1.45 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
584 ns |
625 ns |
0.93 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
459 ns |
459 ns |
1 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA |
35015 ns |
35417 ns |
0.99 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI |
1242520 ns |
1186574 ns |
1.05 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal |
426916.5 ns |
416604 ns |
1.02 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU |
208692 ns |
209112 ns |
1.00 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
8937.5 ns |
8979 ns |
1.00 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
9375 ns |
10292 ns |
0.91 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
9000 ns |
10416.5 ns |
0.86 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
8917 ns |
8729.5 ns |
1.02 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA |
236066 ns |
233445.5 ns |
1.01 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI |
21882713 ns |
21282401 ns |
1.03 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal |
5349250 ns |
5435416.5 ns |
0.98 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
656976 ns |
676048 ns |
0.97 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) |
15792 ns |
15708 ns |
1.01 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) |
13416 ns |
14583 ns |
0.92 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) |
12583.5 ns |
12416 ns |
1.01 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) |
10979 ns |
9937 ns |
1.10 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA |
22506 ns |
21468 ns |
1.05 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/oneAPI |
1159535 ns |
1188974 ns |
0.98 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/Metal |
197166 ns |
204687 ns |
0.96 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/AMDGPU |
187121.5 ns |
182912 ns |
1.02 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) |
32125 ns |
32083 ns |
1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) |
32125 ns |
31979 ns |
1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) |
32125 ns |
32583 ns |
0.99 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) |
32042 ns |
31916 ns |
1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA |
281987.5 ns |
277811 ns |
1.02 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/oneAPI |
11256812 ns |
11129104.5 ns |
1.01 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/Metal |
1704270.5 ns |
1607584 ns |
1.06 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/AMDGPU |
602756 ns |
603987 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
439208 ns |
443583 ns |
0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
440312 ns |
441395.5 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
442437.5 ns |
443312.5 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
452250 ns |
439937.5 ns |
1.03 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
194353 ns |
194190 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
6187612 ns |
5958005 ns |
1.04 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal |
2010833.5 ns |
1994958 ns |
1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
371658.5 ns |
350285 ns |
1.06 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
3836125 ns |
3816917 ns |
1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
3828416.5 ns |
3836875 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
3833375 ns |
3840729.5 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
3804958 ns |
3801625 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
543630.5 ns |
546260 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
28316634 ns |
28675319 ns |
0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
9576666 ns |
9200208 ns |
1.04 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1217281 ns |
1220685 ns |
1.00 |
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) |
781279958 ns |
783919458 ns |
1.00 |
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) |
418024250 ns |
415090937.5 ns |
1.01 |
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) |
415003958 ns |
416149396 ns |
1.00 |
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) |
1553302312.5 ns |
1556394646 ns |
1.00 |
batchedmm(512, Bsize=512)/forward/GPU/CUDA |
22534687 ns |
22758802.5 ns |
0.99 |
batchedmm(512, Bsize=512)/forward/GPU/AMDGPU |
14053357 ns |
14026629 ns |
1.00 |
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) |
2540355333 ns |
2531412125 ns |
1.00 |
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) |
1525674250 ns |
1503429375 ns |
1.01 |
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) |
1510867083 ns |
1511972625 ns |
1.00 |
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) |
5211355166 ns |
5238183333 ns |
0.99 |
batchedmm(512, Bsize=512)/zygote/GPU/CUDA |
372139138 ns |
341968825.5 ns |
1.09 |
batchedmm(512, Bsize=512)/zygote/GPU/AMDGPU |
88484108 ns |
89112141 ns |
0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
76833.5 ns |
76084 ns |
1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
80438 ns |
77666 ns |
1.04 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
79375 ns |
79333 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
77000 ns |
88708 ns |
0.87 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
210860 ns |
209926 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
7939257.5 ns |
7624963 ns |
1.04 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal |
556833 ns |
538459 ns |
1.03 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
110121 ns |
111431 ns |
0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
194333 ns |
193479 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
196250 ns |
195396 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
280604.5 ns |
255791 ns |
1.10 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
209625 ns |
263084 ns |
0.80 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
1044977.5 ns |
1056306 ns |
0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
44213920 ns |
42921068 ns |
1.03 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
6328458 ns |
6096396 ns |
1.04 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
635831 ns |
638587 ns |
1.00 |
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) |
200007250 ns |
199996979.5 ns |
1.00 |
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) |
103851687 ns |
104048375 ns |
1.00 |
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) |
103904750 ns |
103857041 ns |
1.00 |
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) |
388866500 ns |
389154708 ns |
1.00 |
batchedmm(512, Bsize=128)/forward/GPU/CUDA |
5820988 ns |
5838520 ns |
1.00 |
batchedmm(512, Bsize=128)/forward/GPU/AMDGPU |
3429802 ns |
3416961 ns |
1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) |
621011562.5 ns |
619738166.5 ns |
1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) |
351243917 ns |
352609750 ns |
1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) |
354523166 ns |
353140208 ns |
1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) |
1184086167 ns |
1179908250 ns |
1.00 |
batchedmm(512, Bsize=128)/zygote/GPU/CUDA |
26473310 ns |
26709121 ns |
0.99 |
batchedmm(512, Bsize=128)/zygote/GPU/AMDGPU |
21855057 ns |
21908376.5 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
7167 ns |
7250 ns |
0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
5416 ns |
5292 ns |
1.02 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
5375 ns |
5375 ns |
1 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
10083 ns |
9958 ns |
1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
28684 ns |
27949 ns |
1.03 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
1197511 ns |
1220698 ns |
0.98 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal |
675667 ns |
445770.5 ns |
1.52 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
48700 ns |
49951 ns |
0.97 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
215000 ns |
214667 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
221959 ns |
222250 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
221958.5 ns |
222083.5 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
215917 ns |
217354.5 ns |
0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
219373 ns |
226060 ns |
0.97 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
33465057 ns |
31786029 ns |
1.05 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
9200792 ns |
9164667 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
538594 ns |
535067 ns |
1.01 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
9271 ns |
8208.5 ns |
1.13 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
8708 ns |
7375 ns |
1.18 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
10417 ns |
10708 ns |
0.97 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
7562.5 ns |
8021 ns |
0.94 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA |
117805.5 ns |
118426 ns |
0.99 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI |
3416067 ns |
3289610 ns |
1.04 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal |
906500 ns |
894959 ns |
1.01 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU |
74050 ns |
75861 ns |
0.98 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
8500 ns |
8458.5 ns |
1.00 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
9021 ns |
8458 ns |
1.07 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
11583 ns |
11000 ns |
1.05 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
8875 ns |
8625 ns |
1.03 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA |
511809 ns |
525710 ns |
0.97 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI |
18851406 ns |
19476117 ns |
0.97 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal |
4467000 ns |
4580750 ns |
0.98 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
318773 ns |
323803.5 ns |
0.98 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
625 ns |
583 ns |
1.07 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
709 ns |
542 ns |
1.31 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
667 ns |
709 ns |
0.94 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
625 ns |
500 ns |
1.25 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA |
25714 ns |
26694 ns |
0.96 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI |
1273031 ns |
1232825 ns |
1.03 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal |
450458.5 ns |
334666 ns |
1.35 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU |
48810 ns |
51101 ns |
0.96 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
16084 ns |
12833 ns |
1.25 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
12146 ns |
11125 ns |
1.09 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
12125 ns |
12708 ns |
0.95 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
11334 ns |
11375 ns |
1.00 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA |
250965 ns |
255932 ns |
0.98 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI |
23303938.5 ns |
23574064 ns |
0.99 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal |
5365646 ns |
5957187 ns |
0.90 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU |
389799 ns |
393244 ns |
0.99 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) |
106291 ns |
106916 ns |
0.99 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) |
84625 ns |
84416 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) |
86166 ns |
85416 ns |
1.01 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) |
146500 ns |
146729 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA |
24955 ns |
24228 ns |
1.03 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/oneAPI |
1163867 ns |
1231419 ns |
0.95 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/Metal |
262458 ns |
259958 ns |
1.01 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/AMDGPU |
185231.5 ns |
188842 ns |
0.98 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) |
479417 ns |
478583.5 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) |
519521 ns |
479354.5 ns |
1.08 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) |
481771 ns |
479416 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) |
504437.5 ns |
522125 ns |
0.97 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA |
230703 ns |
234731 ns |
0.98 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/oneAPI |
11688664.5 ns |
11445580 ns |
1.02 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/Metal |
2205416 ns |
2175521 ns |
1.01 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/AMDGPU |
617466 ns |
622217.5 ns |
0.99 |
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) |
5875 ns |
5208 ns |
1.13 |
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) |
7333 ns |
6958 ns |
1.05 |
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) |
7000 ns |
7167 ns |
0.98 |
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) |
6312.5 ns |
5020.5 ns |
1.26 |
batchedmm(16, Bsize=32)/forward/GPU/CUDA |
15960 ns |
17348 ns |
0.92 |
batchedmm(16, Bsize=32)/forward/GPU/AMDGPU |
79085.5 ns |
79231 ns |
1.00 |
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) |
13208 ns |
12875 ns |
1.03 |
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) |
10667 ns |
12083 ns |
0.88 |
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) |
11167 ns |
12667 ns |
0.88 |
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) |
17083.5 ns |
17708.5 ns |
0.96 |
batchedmm(16, Bsize=32)/zygote/GPU/CUDA |
211295 ns |
217078.5 ns |
0.97 |
batchedmm(16, Bsize=32)/zygote/GPU/AMDGPU |
373923 ns |
388595 ns |
0.96 |
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) |
39209 ns |
39625 ns |
0.99 |
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) |
50708 ns |
50625 ns |
1.00 |
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) |
51083 ns |
51000 ns |
1.00 |
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) |
13541.5 ns |
13666.5 ns |
0.99 |
batchedmm(16, Bsize=128)/forward/GPU/CUDA |
21656 ns |
20461 ns |
1.06 |
batchedmm(16, Bsize=128)/forward/GPU/AMDGPU |
80316 ns |
83341 ns |
0.96 |
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) |
37833 ns |
38542 ns |
0.98 |
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) |
32083 ns |
29917 ns |
1.07 |
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) |
30083 ns |
31417 ns |
0.96 |
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) |
57250 ns |
66000 ns |
0.87 |
batchedmm(16, Bsize=128)/zygote/GPU/CUDA |
189684 ns |
195656.5 ns |
0.97 |
batchedmm(16, Bsize=128)/zygote/GPU/AMDGPU |
406643.5 ns |
398885 ns |
1.02 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) |
1916.5 ns |
1770.5 ns |
1.08 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) |
1875 ns |
1625 ns |
1.15 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) |
2125 ns |
2292 ns |
0.93 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) |
1791.5 ns |
1729.5 ns |
1.04 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA |
20698 ns |
21146 ns |
0.98 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/oneAPI |
1151725 ns |
1123716.5 ns |
1.02 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/Metal |
318416.5 ns |
302958 ns |
1.05 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/AMDGPU |
30320 ns |
28491 ns |
1.06 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) |
2208.5 ns |
2229.5 ns |
0.99 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) |
2167 ns |
2416 ns |
0.90 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) |
2375 ns |
2375 ns |
1 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) |
2125 ns |
2166 ns |
0.98 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA |
201882 ns |
205300 ns |
0.98 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/oneAPI |
8916839 ns |
9074561 ns |
0.98 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/Metal |
1544750 ns |
1516937.5 ns |
1.02 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/AMDGPU |
138336.5 ns |
138212 ns |
1.00 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
6271 ns |
5645.5 ns |
1.11 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
4854.5 ns |
4771 ns |
1.02 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
6312.5 ns |
6604 ns |
0.96 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
5042 ns |
4979.5 ns |
1.01 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA |
143284 ns |
147775 ns |
0.97 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI |
5900971 ns |
6128313 ns |
0.96 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal |
567291.5 ns |
450875 ns |
1.26 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU |
62250 ns |
62371 ns |
1.00 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
8958.5 ns |
8958 ns |
1.00 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
9250 ns |
8750 ns |
1.06 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
9438 ns |
9125 ns |
1.03 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
8687.5 ns |
9625 ns |
0.90 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA |
864087 ns |
883717 ns |
0.98 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI |
38587534 ns |
41518756 ns |
0.93 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal |
5770625 ns |
5658500 ns |
1.02 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
391113 ns |
388034 ns |
1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 56750 ns | 56709 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 56916 ns | 56833 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 57000 ns | 56917 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 58250 ns | 58292 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 37169 ns | 38043 ns | 0.98 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1177949 ns | 1221995 ns | 0.96 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 363395.5 ns | 611541 ns | 0.59 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 207172 ns | 207452 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 451792 ns | 450937.5 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 467500 ns | 466917 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 466312.5 ns | 468562.5 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 441500 ns | 473167 ns | 0.93 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 264130.5 ns | 271371 ns | 0.97 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 27819058 ns | 26618792 ns | 1.05 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 8272125 ns | 8082167 ns | 1.02 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 816138 ns | 807824 ns | 1.01 |
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) | 3324375 ns | 3309813 ns | 1.00 |
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) | 1763125 ns | 1763625 ns | 1.00 |
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) | 1769958 ns | 1772167 ns | 1.00 |
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) | 6313291.5 ns | 6307500 ns | 1.00 |
batchedmm(128, Bsize=128)/forward/GPU/CUDA | 205480 ns | 206270.5 ns | 1.00 |
batchedmm(128, Bsize=128)/forward/GPU/AMDGPU | 205832 ns | 211692.5 ns | 0.97 |
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) | 11532979 ns | 11489208 ns | 1.00 |
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) | 6556937.5 ns | 6543312.5 ns | 1.00 |
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) | 6559812.5 ns | 6593875 ns | 0.99 |
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) | 21146833 ns | 21174666.5 ns | 1.00 |
batchedmm(128, Bsize=128)/zygote/GPU/CUDA | 740245 ns | 735714 ns | 1.01 |
batchedmm(128, Bsize=128)/zygote/GPU/AMDGPU | 1073505 ns | 1071922.5 ns | 1.00 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 6083 ns | 6437 ns | 0.95 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 4917 ns | 5125 ns | 0.96 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6000 ns | 7604.5 ns | 0.79 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5209 ns | 6021 ns | 0.87 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 137073 ns | 141217 ns | 0.97 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI | 5763070.5 ns | 5736528 ns | 1.00 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal | 761979.5 ns | 743958 ns | 1.02 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 58111 ns | 58020 ns | 1.00 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7166 ns | 7750 ns | 0.92 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 13625 ns | 8791 ns | 1.55 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7416 ns | 7417 ns | 1.00 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7146 ns | 8084 ns | 0.88 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 748959 ns | 759240 ns | 0.99 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI | 37270377 ns | 35174267 ns | 1.06 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal | 5566708.5 ns | 5288042 ns | 1.05 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 379893 ns | 379024.5 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 97625 ns | 97583 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 95708 ns | 101959 ns | 0.94 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 97708 ns | 127542 ns | 0.77 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 122083 ns | 96084 ns | 1.27 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 149717.5 ns | 153040 ns | 0.98 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 5815106.5 ns | 5764876 ns | 1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2046458 ns | 2076375 ns | 0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 186092 ns | 184732 ns | 1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2032063 ns | 1822416 ns | 1.12 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2035417 ns | 2035833.5 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2025750 ns | 2031521 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2034812.5 ns | 2029667 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 699402 ns | 712381 ns | 0.98 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 33359753 ns | 32235030 ns | 1.03 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 10780625 ns | 10817667 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1123756 ns | 1119068 ns | 1.00 |
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) | 32895.5 ns | 32771 ns | 1.00 |
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) | 35958 ns | 34958 ns | 1.03 |
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) | 32292 ns | 33834 ns | 0.95 |
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) | 625 ns | 584 ns | 1.07 |
batchedmm(2, Bsize=4)/forward/GPU/CUDA | 15283 ns | 16070 ns | 0.95 |
batchedmm(2, Bsize=4)/forward/GPU/AMDGPU | 80500 ns | 80701 ns | 1.00 |
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) | 2625 ns | 2645.5 ns | 0.99 |
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) | 3291 ns | 4250 ns | 0.77 |
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) | 3042 ns | 3083 ns | 0.99 |
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) | 2334 ns | 2979.5 ns | 0.78 |
batchedmm(2, Bsize=4)/zygote/GPU/CUDA | 137331.5 ns | 140484 ns | 0.98 |
batchedmm(2, Bsize=4)/zygote/GPU/AMDGPU | 346583 ns | 362954 ns | 0.95 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7166 ns | 7250 ns | 0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5292 ns | 5333 ns | 0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5416 ns | 5375 ns | 1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 9958 ns | 10167 ns | 0.98 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 36591 ns | 37558 ns | 0.97 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1255336 ns | 1203117.5 ns | 1.04 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 361167 ns | 351958 ns | 1.03 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 48660 ns | 50591 ns | 0.96 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 212187 ns | 215229 ns | 0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 221041.5 ns | 223042 ns | 0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 221312 ns | 221041.5 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 206750 ns | 216292 ns | 0.96 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 242479 ns | 247737.5 ns | 0.98 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 26291133.5 ns | 28210154.5 ns | 0.93 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 8127688 ns | 7826917 ns | 1.04 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 523004.5 ns | 518941 ns | 1.01 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3917 ns | 4000 ns | 0.98 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3959 ns | 3958 ns | 1.00 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3959 ns | 3959 ns | 1 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3958 ns | 3958 ns | 1 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA | 21550 ns | 22280 ns | 0.97 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/oneAPI | 2157769 ns | 2135337 ns | 1.01 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/Metal | 248541 ns | 244750 ns | 1.02 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/AMDGPU | 45911 ns | 45821 ns | 1.00 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 14667 ns | 14708 ns | 1.00 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 14708 ns | 14708 ns | 1 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 14750 ns | 14750 ns | 1 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 14708 ns | 14708 ns | 1 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA | 304620 ns | 313766.5 ns | 0.97 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/oneAPI | 10977767 ns | 11565919.5 ns | 0.95 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/Metal | 1043625 ns | 996417 ns | 1.05 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/AMDGPU | 199612 ns | 196698 ns | 1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 106000 ns | 102375 ns | 1.04 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 98291.5 ns | 98375 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 102542 ns | 130667 ns | 0.78 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 128833 ns | 101541 ns | 1.27 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 135179 ns | 142696 ns | 0.95 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 5987975 ns | 6012180 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2089312.5 ns | 2060042 ns | 1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 186671 ns | 185242 ns | 1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1921917 ns | 1678708 ns | 1.14 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1911646 ns | 1919562.5 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1921417 ns | 1925646 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1918250 ns | 1715750 ns | 1.12 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 683849 ns | 697882 ns | 0.98 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 32200299 ns | 32586423 ns | 0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 10813854.5 ns | 10270770.5 ns | 1.05 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1072735 ns | 1227914 ns | 0.87 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 17875 ns | 20125 ns | 0.89 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 18042 ns | 18666 ns | 0.97 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 21499.5 ns | 20125 ns | 1.07 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 18541 ns | 19041.5 ns | 0.97 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 107857 ns | 111256 ns | 0.97 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3405654 ns | 3316785.5 ns | 1.03 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1342500 ns | 1342375 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 81381 ns | 77136 ns | 1.06 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 216416.5 ns | 216708 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 226729 ns | 217270.5 ns | 1.04 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 217666.5 ns | 217000 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 228458.5 ns | 257500 ns | 0.89 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 513449.5 ns | 522548.5 ns | 0.98 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 19313650 ns | 19703098 ns | 0.98 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 5992125 ns | 6106875 ns | 0.98 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 473434 ns | 495696 ns | 0.96 |
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) | 23937.5 ns | 23625 ns | 1.01 |
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) | 28875 ns | 28917 ns | 1.00 |
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) | 26500 ns | 27167 ns | 0.98 |
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) | 1416 ns | 1542 ns | 0.92 |
batchedmm(16, Bsize=4)/forward/GPU/CUDA | 15770 ns | 16593 ns | 0.95 |
batchedmm(16, Bsize=4)/forward/GPU/AMDGPU | 82311 ns | 83321 ns | 0.99 |
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) | 4833 ns | 4937.5 ns | 0.98 |
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) | 5000 ns | 4709 ns | 1.06 |
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) | 5166 ns | 5125 ns | 1.01 |
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) | 4625 ns | 5479 ns | 0.84 |
batchedmm(16, Bsize=4)/zygote/GPU/CUDA | 205185 ns | 210967 ns | 0.97 |
batchedmm(16, Bsize=4)/zygote/GPU/AMDGPU | 383233 ns | 384204.5 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 307000 ns | 304709 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 307333 ns | 305417 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 309000.5 ns | 307312.5 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 306959 ns | 304999.5 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 227362 ns | 231440.5 ns | 0.98 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7656672 ns | 7899776.5 ns | 0.97 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 650375 ns | 1048666.5 ns | 0.62 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 275572 ns | 278713 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 537562 ns | 531667 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 532667 ns | 537916 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 535625 ns | 559833 ns | 0.96 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 542458 ns | 535042 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1070334 ns | 1077983 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 44223125 ns | 46672590 ns | 0.95 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6462771 ns | 6185542 ns | 1.04 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 869258 ns | 867079 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 19500 ns | 21000 ns | 0.93 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 19958 ns | 19792 ns | 1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 23375 ns | 21333.5 ns | 1.10 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 19958 ns | 20125 ns | 0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 112543.5 ns | 115430.5 ns | 0.97 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3491900 ns | 3543630 ns | 0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1414625 ns | 1426729 ns | 0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 77381 ns | 77991 ns | 0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 213208 ns | 212667 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 213479 ns | 214292 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 215042 ns | 213916 ns | 1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 214958 ns | 219958 ns | 0.98 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 750498 ns | 758463 ns | 0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 24681158 ns | 25339852 ns | 0.97 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 7223895.5 ns | 7150812.5 ns | 1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 544035 ns | 549146 ns | 0.99 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6417 ns | 6666 ns | 0.96 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6875 ns | 7000.5 ns | 0.98 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 8292 ns | 8374.5 ns | 0.99 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 6500 ns | 6396 ns | 1.02 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 138363 ns | 144368 ns | 0.96 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI | 5558527 ns | 5600145 ns | 0.99 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal | 777187 ns | 781083 ns | 1.00 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 69061 ns | 69300 ns | 1.00 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 10375 ns | 10917 ns | 0.95 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 10291 ns | 10041.5 ns | 1.02 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 10708.5 ns | 10791 ns | 0.99 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 9709 ns | 11250.5 ns | 0.86 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 819633 ns | 829126 ns | 0.99 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI | 39220243 ns | 38035335 ns | 1.03 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal | 5518708 ns | 5400125 ns | 1.02 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 385803 ns | 389489 ns | 0.99 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5958 ns | 6333 ns | 0.94 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 4875 ns | 5291 ns | 0.92 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6917 ns | 7042 ns | 0.98 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 4729.5 ns | 4562.5 ns | 1.04 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 142280 ns | 146644 ns | 0.97 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI | 5842056.5 ns | 5614464 ns | 1.04 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal | 769459 ns | 767750 ns | 1.00 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 59561 ns | 60400 ns | 0.99 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7209 ns | 7583 ns | 0.95 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7666 ns | 7750 ns | 0.99 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7917 ns | 7625 ns | 1.04 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7583 ns | 8625 ns | 0.88 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 778165 ns | 788273 ns | 0.99 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 39496840.5 ns | 39532384.5 ns | 1.00 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal | 5854688 ns | 5788792 ns | 1.01 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 404144 ns | 390144 ns | 1.04 |
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) | 14575750 ns | 14512959 ns | 1.00 |
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) | 7731500 ns | 7746083 ns | 1.00 |
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) | 7698583 ns | 7719437.5 ns | 1.00 |
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) | 27811541 ns | 27824167 ns | 1.00 |
batchedmm(128, Bsize=512)/forward/GPU/CUDA | 535283 ns | 532712 ns | 1.00 |
batchedmm(128, Bsize=512)/forward/GPU/AMDGPU | 407233 ns | 405110 ns | 1.01 |
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) | 46519437 ns | 46254125 ns | 1.01 |
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) | 26552479.5 ns | 26514813 ns | 1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) | 26436334 ns | 26596375 ns | 0.99 |
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) | 85626334 ns | 85595417 ns | 1.00 |
batchedmm(128, Bsize=512)/zygote/GPU/CUDA | 2913979.5 ns | 2648732 ns | 1.10 |
batchedmm(128, Bsize=512)/zygote/GPU/AMDGPU | 3300841 ns | 3291677 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 66500 ns | 69916 ns | 0.95 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 66709 ns | 66666.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 68312.5 ns | 67604 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 67500 ns | 69812.5 ns | 0.97 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 105648 ns | 119643.5 ns | 0.88 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3451369.5 ns | 3502655.5 ns | 0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1470250.5 ns | 1447479.5 ns | 1.02 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 234332 ns | 236773 ns | 0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 440250 ns | 480313 ns | 0.92 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 441125 ns | 447125 ns | 0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 445625 ns | 447937.5 ns | 0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 442624.5 ns | 444459 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 729654 ns | 735182 ns | 0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 26852660 ns | 27836501.5 ns | 0.96 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 7754417 ns | 7344541.5 ns | 1.06 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 803477.5 ns | 795239 ns | 1.01 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 542 ns | 500 ns | 1.08 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 583 ns | 500 ns | 1.17 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 625 ns | 1 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 500 ns | 625 ns | 0.80 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 32133 ns | 32854 ns | 0.98 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI | 1168342.5 ns | 1222475 ns | 0.96 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal | 351479 ns | 464063 ns | 0.76 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 49250 ns | 50950 ns | 0.97 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 8875 ns | 8250 ns | 1.08 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 9271 ns | 8687.5 ns | 1.07 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 8667 ns | 9646 ns | 0.90 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 9083 ns | 15771 ns | 0.58 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 283467 ns | 289332 ns | 0.98 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 22237807 ns | 22396972 ns | 0.99 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal | 5030812.5 ns | 5647520.5 ns | 0.89 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 384844 ns | 389255 ns | 0.99 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 9834 ns | 9792 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 9834 ns | 9875 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 9875 ns | 9875 ns | 1 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 9834 ns | 9791 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA | 23024 ns | 23549 ns | 0.98 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/oneAPI | 2093062 ns | 2127803 ns | 0.98 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/Metal | 223166 ns | 223688 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 216402 ns | 215812 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 45750 ns | 45583 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 45583 ns | 45833 ns | 0.99 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 46000 ns | 45834 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 45750 ns | 45792 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA | 285399.5 ns | 292557 ns | 0.98 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/oneAPI | 9799339 ns | 11637949 ns | 0.84 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/Metal | 968750 ns | 1005416 ns | 0.96 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 625876 ns | 620161.5 ns | 1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 56250 ns | 56250 ns | 1 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 56458 ns | 56375 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 56459 ns | 56458 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 57917 ns | 57750 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 28644 ns | 29238.5 ns | 0.98 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1187526 ns | 1197390 ns | 0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 631292 ns | 658208 ns | 0.96 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 205262 ns | 204172 ns | 1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 459458 ns | 451791.5 ns | 1.02 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 465375 ns | 471500 ns | 0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 497666.5 ns | 468000 ns | 1.06 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 476896 ns | 441791.5 ns | 1.08 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 243906 ns | 250364.5 ns | 0.97 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 33120338 ns | 32745444 ns | 1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 9379625 ns | 10042062.5 ns | 0.93 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 852638 ns | 848179.5 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 586000 ns | 581125.5 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 645146 ns | 649645.5 ns | 0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 591042 ns | 657583 ns | 0.90 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 660999.5 ns | 614250 ns | 1.08 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 206101 ns | 209963 ns | 0.98 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 8668014 ns | 8555661.5 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1370250 ns | 1375959 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 238977 ns | 264153 ns | 0.90 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2245208 ns | 2243542 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2238291.5 ns | 2233479 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2233812.5 ns | 2247312 ns | 0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2238042 ns | 2249041 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 956767.5 ns | 981693 ns | 0.97 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 48728968 ns | 47646947 ns | 1.02 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 7240916.5 ns | 7438458 ns | 0.97 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1384018 ns | 1260099 ns | 1.10 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 19458 ns | 25000 ns | 0.78 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 19687.5 ns | 19625.5 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 22667 ns | 21959 ns | 1.03 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 31250 ns | 19167 ns | 1.63 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 111255 ns | 114255 ns | 0.97 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3455101.5 ns | 3641620.5 ns | 0.95 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1420333.5 ns | 1425646 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 78081 ns | 82081 ns | 0.95 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 225333 ns | 256541.5 ns | 0.88 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 221583 ns | 220250 ns | 1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 228792 ns | 221687.5 ns | 1.03 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 232292 ns | 221750 ns | 1.05 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 721610 ns | 733642 ns | 0.98 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 25857434.5 ns | 27659496.5 ns | 0.93 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 7765792 ns | 7468958 ns | 1.04 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 566570 ns | 559661.5 ns | 1.01 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 500 ns | 541 ns | 0.92 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 583 ns | 584 ns | 1.00 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 583 ns | 1.07 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 583 ns | 542 ns | 1.08 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 22767 ns | 23294 ns | 0.98 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI | 1215266.5 ns | 1199626 ns | 1.01 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal | 452375 ns | 380395.5 ns | 1.19 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 52441 ns | 50321 ns | 1.04 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9500 ns | 9083.5 ns | 1.05 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 10271 ns | 10167 ns | 1.01 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 10625 ns | 10271 ns | 1.03 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9875 ns | 11333 ns | 0.87 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 265487 ns | 269037 ns | 0.99 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 25062592 ns | 25065409.5 ns | 1.00 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal | 6018666 ns | 5606334 ns | 1.07 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 420344 ns | 414904 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 10500 ns | 8583 ns | 1.22 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 8062.5 ns | 8458 ns | 0.95 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 10292 ns | 10458 ns | 0.98 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 9042 ns | 7625 ns | 1.19 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 118808 ns | 121505 ns | 0.98 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI | 3418649 ns | 3438400 ns | 0.99 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal | 885187.5 ns | 884250 ns | 1.00 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 71665.5 ns | 69061 ns | 1.04 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7542 ns | 7333.5 ns | 1.03 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7875 ns | 7542 ns | 1.04 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7833 ns | 7916.5 ns | 0.99 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7417 ns | 8000 ns | 0.93 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 505134.5 ns | 512016 ns | 0.99 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI | 17565081 ns | 18614285.5 ns | 0.94 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal | 4294625 ns | 4265271 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 329113 ns | 331073.5 ns | 0.99 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1541 ns | 1334 ns | 1.16 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1708 ns | 1625 ns | 1.05 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1895.5 ns | 2000 ns | 0.95 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1375 ns | 1458 ns | 0.94 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA | 21715 ns | 20878 ns | 1.04 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/oneAPI | 1184887 ns | 1144746 ns | 1.04 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/Metal | 308875 ns | 305042 ns | 1.01 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 187651 ns | 191532 ns | 0.98 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 3375 ns | 3375 ns | 1 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 3375 ns | 3375 ns | 1 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 3542 ns | 3708.5 ns | 0.96 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 3291 ns | 3458 ns | 0.95 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA | 219260 ns | 220885.5 ns | 0.99 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/oneAPI | 10744223.5 ns | 10272002 ns | 1.05 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/Metal | 1724187.5 ns | 1658437.5 ns | 1.04 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 593095 ns | 594146 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) | 148145.5 ns | 149042 ns | 0.99 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) | 106334 ns | 106104 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) | 107187.5 ns | 107459 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) | 233354 ns | 225625 ns | 1.03 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA | 23884 ns | 24697 ns | 0.97 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/oneAPI | 1182223 ns | 1197055 ns | 0.99 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/Metal | 300000 ns | 300625 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/AMDGPU | 36950 ns | 38181 ns | 0.97 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) | 144520.5 ns | 144084 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) | 87687 ns | 100709 ns | 0.87 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) | 87792 ns | 87937.5 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) | 251833 ns | 263895.5 ns | 0.95 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA | 216029 ns | 219366 ns | 0.98 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/oneAPI | 10660743 ns | 11143376 ns | 0.96 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/Metal | 2107666.5 ns | 2064125 ns | 1.02 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/AMDGPU | 239532 ns | 226117.5 ns | 1.06 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7250 ns | 7167 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5292 ns | 5333 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5334 ns | 5334 ns | 1 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10167 ns | 10292 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 32714 ns | 33744 ns | 0.97 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1156018.5 ns | 1208626.5 ns | 0.96 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 352875 ns | 394645.5 ns | 0.89 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 53221 ns | 50650 ns | 1.05 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 220062.5 ns | 220458.5 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 228729.5 ns | 236458 ns | 0.97 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 228833 ns | 229542 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 224271 ns | 213437 ns | 1.05 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 260760 ns | 266362.5 ns | 0.98 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 27980428 ns | 26810792 ns | 1.04 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 8578229 ns | 8119062.5 ns | 1.06 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 534445 ns | 532916 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 14917 ns | 15250 ns | 0.98 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 15312.5 ns | 14812.5 ns | 1.03 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 16708 ns | 16792 ns | 0.99 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 14834 ns | 15292 ns | 0.97 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 139169 ns | 142309 ns | 0.98 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI | 5708443 ns | 5521569 ns | 1.03 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal | 797834 ns | 788458 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 241323 ns | 239123 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 23646 ns | 23209 ns | 1.02 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 23812 ns | 24208 ns | 0.98 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 23958 ns | 24104.5 ns | 0.99 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 23709 ns | 23500 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 860230 ns | 874682 ns | 0.98 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 38347185 ns | 39249650.5 ns | 0.98 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal | 5856625 ns | 5835021 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 700376.5 ns | 702463 ns | 1.00 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 8834 ns | 10062.5 ns | 0.88 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 9792 ns | 9792 ns | 1 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 11083 ns | 11375 ns | 0.97 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 8916 ns | 9250 ns | 0.96 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 122964 ns | 124966.5 ns | 0.98 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI | 3472918 ns | 3573835 ns | 0.97 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal | 908125 ns | 826250 ns | 1.10 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 74011 ns | 71705.5 ns | 1.03 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14500 ns | 13250 ns | 1.09 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 14146 ns | 14021 ns | 1.01 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 15354 ns | 14833 ns | 1.04 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14125 ns | 14250 ns | 0.99 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 660283 ns | 673097 ns | 0.98 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI | 21362028 ns | 21882593 ns | 0.98 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal | 5340833 ns | 5231334 ns | 1.02 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 375084 ns | 372554 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 9687.5 ns | 10083.5 ns | 0.96 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 10520.5 ns | 9333 ns | 1.13 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 11625 ns | 10917 ns | 1.06 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 9583 ns | 9791 ns | 0.98 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 121557 ns | 124389 ns | 0.98 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI | 3413782 ns | 3411999.5 ns | 1.00 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal | 932500 ns | 932500 ns | 1 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 71111 ns | 71241 ns | 1.00 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 12791 ns | 12625 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 12500 ns | 12625 ns | 0.99 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13708 ns | 13313 ns | 1.03 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 12166 ns | 12375 ns | 0.98 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 548626 ns | 557333 ns | 0.98 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI | 20221743.5 ns | 19402473 ns | 1.04 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal | 4648729 ns | 4633542 ns | 1.00 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 350268.5 ns | 348154 ns | 1.01 |
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) | 29604.5 ns | 29708 ns | 1.00 |
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) | 31542 ns | 31750 ns | 0.99 |
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) | 30375 ns | 29667 ns | 1.02 |
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) | 1833 ns | 1834 ns | 1.00 |
batchedmm(2, Bsize=128)/forward/GPU/CUDA | 15946 ns | 16586 ns | 0.96 |
batchedmm(2, Bsize=128)/forward/GPU/AMDGPU | 74191 ns | 74511 ns | 1.00 |
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) | 5125 ns | 5292 ns | 0.97 |
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) | 4791.5 ns | 4542 ns | 1.05 |
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) | 5291.5 ns | 5375 ns | 0.98 |
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) | 6375 ns | 6667 ns | 0.96 |
batchedmm(2, Bsize=128)/zygote/GPU/CUDA | 138206 ns | 142234 ns | 0.97 |
batchedmm(2, Bsize=128)/zygote/GPU/AMDGPU | 374353 ns | 371734 ns | 1.01 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 291 ns | 292 ns | 1.00 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 292 ns | 1.28 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 291 ns | 1.00 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 25381 ns | 26130 ns | 0.97 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI | 1154493.5 ns | 1255720 ns | 0.92 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal | 446500 ns | 468750 ns | 0.95 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 48930 ns | 48500 ns | 1.01 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6209 ns | 6542 ns | 0.95 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6792 ns | 6542 ns | 1.04 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6708 ns | 6583 ns | 1.02 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6625 ns | 6167 ns | 1.07 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 184823.5 ns | 190203.5 ns | 0.97 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI | 23775568 ns | 23758213 ns | 1.00 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal | 5401167 ns | 5392792 ns | 1.00 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 393993 ns | 393904 ns | 1.00 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 2000 ns | 1958 ns | 1.02 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 2000 ns | 2000 ns | 1 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 2083 ns | 2084 ns | 1.00 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 2000 ns | 1958 ns | 1.02 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 25927 ns | 27189 ns | 0.95 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI | 1174327 ns | 1199767 ns | 0.98 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal | 313812.5 ns | 312750.5 ns | 1.00 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 208652 ns | 208272 ns | 1.00 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 16875 ns | 15916.5 ns | 1.06 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 16666 ns | 16291 ns | 1.02 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 16291.5 ns | 16979 ns | 0.96 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 16687.5 ns | 16312.5 ns | 1.02 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 271953.5 ns | 276740.5 ns | 0.98 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 28538231.5 ns | 24755518 ns | 1.15 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal | 5705375 ns | 5979167 ns | 0.95 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 711016.5 ns | 715538 ns | 0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 148084 ns | 180833 ns | 0.82 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 164437 ns | 151333.5 ns | 1.09 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 150583.5 ns | 179000 ns | 0.84 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 184958 ns | 147562.5 ns | 1.25 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 198930 ns | 207596 ns | 0.96 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 7772893 ns | 7810338 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1453625 ns | 1464083.5 ns | 0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 196832 ns | 195132 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1306854 ns | 1308625 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1304812.5 ns | 1320417 ns | 0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1334500.5 ns | 1326167 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1335563 ns | 1318250 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 896336.5 ns | 915789.5 ns | 0.98 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 44103385 ns | 47829317 ns | 0.92 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 6551250 ns | 6477041 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1123231 ns | 1020372 ns | 1.10 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 26000 ns | 26333 ns | 0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 25229 ns | 24750 ns | 1.02 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 27479.5 ns | 27709 ns | 0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 24791 ns | 29458.5 ns | 0.84 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 235714.5 ns | 237299.5 ns | 0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 8360389 ns | 7668370 ns | 1.09 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 618125 ns | 1182167 ns | 0.52 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 106221 ns | 121321 ns | 0.88 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 180291.5 ns | 181812.5 ns | 0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 119292 ns | 118083 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 119104.5 ns | 129000 ns | 0.92 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 133396 ns | 118458 ns | 1.13 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1061050 ns | 1085787 ns | 0.98 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 47965532 ns | 43559074 ns | 1.10 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6177667 ns | 6188875 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 624876 ns | 603482 ns | 1.04 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 291 ns | 333 ns | 0.87 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 375 ns | 1 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 334 ns | 1.12 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 292 ns | 1 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 22572 ns | 23112 ns | 0.98 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI | 1251092 ns | 1222588 ns | 1.02 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal | 324125 ns | 395646 ns | 0.82 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 48860.5 ns | 48781 ns | 1.00 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6458 ns | 6042 ns | 1.07 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6708.5 ns | 6833 ns | 0.98 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 7020.5 ns | 6729.5 ns | 1.04 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6479.5 ns | 6354 ns | 1.02 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 201712.5 ns | 206261.5 ns | 0.98 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 24190754 ns | 24411973 ns | 0.99 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal | 5519229 ns | 5650084 ns | 0.98 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 393274 ns | 392024.5 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 7083 ns | 6166 ns | 1.15 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 6479.5 ns | 5417 ns | 1.20 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 8375 ns | 8250 ns | 1.02 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 6334 ns | 6416 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 144445.5 ns | 148283 ns | 0.97 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI | 5784772 ns | 5523038 ns | 1.05 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal | 451791 ns | 469750 ns | 0.96 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 237723 ns | 237302 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9708.5 ns | 10354 ns | 0.94 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10354.5 ns | 10166 ns | 1.02 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10312.5 ns | 10291 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 9958 ns | 10125 ns | 0.98 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 894173 ns | 909984 ns | 0.98 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 41046274 ns | 43302207 ns | 0.95 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal | 6098750 ns | 5927833 ns | 1.03 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 677436.5 ns | 689088 ns | 0.98 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 667 ns | 625 ns | 1.07 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 625 ns | 708 ns | 0.88 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 667 ns | 667 ns | 1 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 708 ns | 625 ns | 1.13 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA | 22187 ns | 22992 ns | 0.96 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/oneAPI | 2028286 ns | 2053209 ns | 0.99 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/Metal | 228479.5 ns | 227000 ns | 1.01 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 214902 ns | 215913 ns | 1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 4583 ns | 4584 ns | 1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 4625 ns | 4708 ns | 0.98 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 4875 ns | 4625 ns | 1.05 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 4584 ns | 4583 ns | 1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA | 222465 ns | 228362.5 ns | 0.97 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/oneAPI | 10522858 ns | 10246488 ns | 1.03 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/Metal | 1645875 ns | 1762500 ns | 0.93 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 596496 ns | 596946 ns | 1.00 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 8229.5 ns | 8791.5 ns | 0.94 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 9083.5 ns | 8021 ns | 1.13 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 10208.5 ns | 10208.5 ns | 1 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 7834 ns | 8625 ns | 0.91 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 121070 ns | 123762 ns | 0.98 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI | 3530511 ns | 3582537 ns | 0.99 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal | 831083 ns | 795292 ns | 1.05 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 70060 ns | 70171 ns | 1.00 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8500 ns | 8500 ns | 1 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8958 ns | 9084 ns | 0.99 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 9333 ns | 9291 ns | 1.00 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8375 ns | 8250 ns | 1.02 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 586511 ns | 599222 ns | 0.98 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI | 21323888.5 ns | 22439265 ns | 0.95 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal | 4802708.5 ns | 4920229 ns | 0.98 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 354733 ns | 352418.5 ns | 1.01 |
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) | 125292 ns | 126375 ns | 0.99 |
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) | 96708 ns | 96167 ns | 1.01 |
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) | 97250 ns | 96396 ns | 1.01 |
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) | 183416 ns | 183208 ns | 1.00 |
batchedmm(128, Bsize=4)/forward/GPU/CUDA | 45670 ns | 46448 ns | 0.98 |
batchedmm(128, Bsize=4)/forward/GPU/AMDGPU | 99990.5 ns | 94021 ns | 1.06 |
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) | 302791 ns | 302354.5 ns | 1.00 |
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) | 168083 ns | 168625 ns | 1.00 |
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) | 166833 ns | 178500 ns | 0.93 |
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) | 607229.5 ns | 568625 ns | 1.07 |
batchedmm(128, Bsize=4)/zygote/GPU/CUDA | 189831.5 ns | 193426.5 ns | 0.98 |
batchedmm(128, Bsize=4)/zygote/GPU/AMDGPU | 489695 ns | 485945.5 ns | 1.01 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 398375 ns | 398500 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 215333 ns | 214958 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 215125 ns | 215459 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 756459 ns | 755958 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA | 43130 ns | 43652 ns | 0.99 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/oneAPI | 1398407.5 ns | 1354730.5 ns | 1.03 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/Metal | 412042 ns | 489291.5 ns | 0.84 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/AMDGPU | 83571 ns | 83401 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 1405604.5 ns | 1416708 ns | 0.99 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 863250 ns | 861208 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 861479.5 ns | 863229.5 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 2358542 ns | 2359083 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA | 249090 ns | 249519.5 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/oneAPI | 10996775 ns | 11581786 ns | 0.95 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/Metal | 1820250 ns | 1843542 ns | 0.99 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/AMDGPU | 355383 ns | 354834 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 611208 ns | 651104 ns | 0.94 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 648500 ns | 636792 ns | 1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 648812 ns | 662104.5 ns | 0.98 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 662875 ns | 581792 ns | 1.14 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 194388.5 ns | 204117 ns | 0.95 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 8240834 ns | 7983269 ns | 1.03 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1397562 ns | 1360250 ns | 1.03 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 254103 ns | 255778 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2466021 ns | 2460458 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2458875 ns | 2454583 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2463604.5 ns | 2468375 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2452250 ns | 2463875 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 981852.5 ns | 992828 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 51226623.5 ns | 53061666.5 ns | 0.97 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 7566875 ns | 7675854 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1486799.5 ns | 1495551 ns | 0.99 |
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) | 32542 ns | 32562.5 ns | 1.00 |
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) | 34750 ns | 34584 ns | 1.00 |
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) | 32229.5 ns | 32583.5 ns | 0.99 |
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) | 917 ns | 833 ns | 1.10 |
batchedmm(2, Bsize=32)/forward/GPU/CUDA | 15560 ns | 15923 ns | 0.98 |
batchedmm(2, Bsize=32)/forward/GPU/AMDGPU | 78491 ns | 73991 ns | 1.06 |
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) | 3083 ns | 3145.5 ns | 0.98 |
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) | 3479.5 ns | 3416 ns | 1.02 |
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) | 3334 ns | 3458.5 ns | 0.96 |
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) | 3125 ns | 3084 ns | 1.01 |
batchedmm(2, Bsize=32)/zygote/GPU/CUDA | 136477.5 ns | 139769 ns | 0.98 |
batchedmm(2, Bsize=32)/zygote/GPU/AMDGPU | 359243 ns | 346409 ns | 1.04 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 407250 ns | 407417 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 402125 ns | 401791 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 401833 ns | 401916 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 421584 ns | 421167 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 43081.5 ns | 43360 ns | 0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 1443601.5 ns | 1424417 ns | 1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1160541.5 ns | 1149708 ns | 1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 242377.5 ns | 244183 ns | 0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 3877250 ns | 3883958 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 3991438 ns | 3996708.5 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 3995500 ns | 3992125 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 3778791.5 ns | 3780895.5 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 240481 ns | 246111 ns | 0.98 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 36046095 ns | 36934379 ns | 0.98 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 11740520.5 ns | 11631750 ns | 1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1247192 ns | 1246158.5 ns | 1.00 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) |
3916 ns |
3958 ns |
0.99 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) |
3958 ns |
3958 ns |
1 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) |
3959 ns |
3917 ns |
1.01 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) |
3917 ns |
3916 ns |
1.00 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA |
33151 ns |
33757 ns |
0.98 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/oneAPI |
1246525 ns |
1234748.5 ns |
1.01 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/Metal |
178083 ns |
181500.5 ns |
0.98 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/AMDGPU |
40841 ns |
43060 ns |
0.95 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) |
15459 ns |
15500 ns |
1.00 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) |
15708 ns |
15583 ns |
1.01 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) |
15792 ns |
15666 ns |
1.01 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) |
15584 ns |
15541 ns |
1.00 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA |
250190 ns |
256020 ns |
0.98 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/oneAPI |
9448692 ns |
10686428 ns |
0.88 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/Metal |
867250 ns |
870458 ns |
1.00 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/AMDGPU |
170041 ns |
178281 ns |
0.95 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) |
404041 ns |
404000 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) |
221437.5 ns |
220792 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) |
221041 ns |
221375 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) |
760667 ns |
760833 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA |
112818 ns |
113651 ns |
0.99 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/oneAPI |
1033657 ns |
1020025 ns |
1.01 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/Metal |
396500 ns |
412687.5 ns |
0.96 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/AMDGPU |
90181 ns |
91036 ns |
0.99 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) |
1428625 ns |
1438417 ns |
0.99 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) |
887375 ns |
887125 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) |
886333 ns |
888167 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) |
2382896 ns |
2384958 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA |
235417.5 ns |
242637 ns |
0.97 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/oneAPI |
9699881 ns |
9528776 ns |
1.02 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/Metal |
1899708.5 ns |
1851667 ns |
1.03 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/AMDGPU |
356683 ns |
357334 ns |
1.00 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
500 ns |
500 ns |
1 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
583 ns |
542 ns |
1.08 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
583 ns |
583 ns |
1 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
500 ns |
459 ns |
1.09 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA |
25300 ns |
25949.5 ns |
0.97 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI |
1213264.5 ns |
1192514 ns |
1.02 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal |
303583 ns |
296583.5 ns |
1.02 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU |
208412 ns |
211622 ns |
0.98 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
7208 ns |
7083 ns |
1.02 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
7833 ns |
8000 ns |
0.98 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
7750 ns |
7854.5 ns |
0.99 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
7625 ns |
7333 ns |
1.04 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA |
208958.5 ns |
216752 ns |
0.96 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI |
26254851.5 ns |
24950983 ns |
1.05 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal |
5627271 ns |
5888042 ns |
0.96 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
691516 ns |
701642.5 ns |
0.99 |
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) |
828417 ns |
813667 ns |
1.02 |
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) |
465812.5 ns |
465792 ns |
1.00 |
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) |
471166.5 ns |
467791 ns |
1.01 |
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) |
1541979 ns |
1544375 ns |
1.00 |
batchedmm(128, Bsize=32)/forward/GPU/CUDA |
130118 ns |
132054 ns |
0.99 |
batchedmm(128, Bsize=32)/forward/GPU/AMDGPU |
178896.5 ns |
162431 ns |
1.10 |
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) |
2704041 ns |
2686208 ns |
1.01 |
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) |
1527521 ns |
1528708 ns |
1.00 |
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) |
1546750 ns |
1538542 ns |
1.01 |
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) |
4937042 ns |
4933917 ns |
1.00 |
batchedmm(128, Bsize=32)/zygote/GPU/CUDA |
239281 ns |
240514 ns |
0.99 |
batchedmm(128, Bsize=32)/zygote/GPU/AMDGPU |
775748 ns |
859970 ns |
0.90 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
292 ns |
333 ns |
0.88 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
375 ns |
292 ns |
1.28 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
375 ns |
375 ns |
1 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
292 ns |
292 ns |
1 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA |
31418 ns |
32094 ns |
0.98 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI |
1183060 ns |
1252325 ns |
0.94 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal |
307687.5 ns |
323021 ns |
0.95 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU |
48455.5 ns |
48681 ns |
1.00 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
6125 ns |
5917 ns |
1.04 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
6708.5 ns |
6333 ns |
1.06 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
6562.5 ns |
6792 ns |
0.97 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
6312.5 ns |
6083 ns |
1.04 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA |
222050 ns |
223941.5 ns |
0.99 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI |
22089679 ns |
23466112 ns |
0.94 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal |
5038937.5 ns |
5053625 ns |
1.00 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
368804 ns |
369274 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
2384708 ns |
2397083 ns |
0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
2406334 ns |
2379291 ns |
1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
2401187.5 ns |
2394625 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
2400334 ns |
2379250 ns |
1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
198668 ns |
200806.5 ns |
0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
7953848.5 ns |
8223452 ns |
0.97 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1483208 ns |
1521917 ns |
0.97 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
357533 ns |
359128.5 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
4652749.5 ns |
4667500 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
4657895.5 ns |
4598667 ns |
1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
4677042 ns |
4663834 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
4656375 ns |
4654084 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
891976 ns |
896769 ns |
0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
50384184.5 ns |
49138075.5 ns |
1.03 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
6325542 ns |
6734812.5 ns |
0.94 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1259107 ns |
1407615 ns |
0.89 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) |
6792 ns |
7479.5 ns |
0.91 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) |
7000 ns |
7125 ns |
0.98 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) |
6917 ns |
7125 ns |
0.97 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) |
7375.5 ns |
8020.5 ns |
0.92 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA |
23575 ns |
23691.5 ns |
1.00 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/oneAPI |
1197552 ns |
1204234 ns |
0.99 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/Metal |
263166 ns |
260979.5 ns |
1.01 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/AMDGPU |
37341 ns |
33710 ns |
1.11 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) |
52604 ns |
44792 ns |
1.17 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) |
45604 ns |
33042 ns |
1.38 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) |
49875.5 ns |
33459 ns |
1.49 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) |
66000.5 ns |
71791.5 ns |
0.92 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA |
219530 ns |
217114 ns |
1.01 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/oneAPI |
11373243 ns |
10571611 ns |
1.08 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/Metal |
2074458 ns |
2004833 ns |
1.03 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/AMDGPU |
240673 ns |
241352 ns |
1.00 |
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) |
20750 ns |
20458.5 ns |
1.01 |
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) |
24750 ns |
24625 ns |
1.01 |
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) |
22083.5 ns |
22625 ns |
0.98 |
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) |
5958 ns |
6041 ns |
0.99 |
batchedmm(2, Bsize=512)/forward/GPU/CUDA |
16981 ns |
17905 ns |
0.95 |
batchedmm(2, Bsize=512)/forward/GPU/AMDGPU |
86491 ns |
86031 ns |
1.01 |
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) |
12041 ns |
11958 ns |
1.01 |
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) |
9333 ns |
9417 ns |
0.99 |
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) |
9625 ns |
9500 ns |
1.01 |
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) |
18083 ns |
18000 ns |
1.00 |
batchedmm(2, Bsize=512)/zygote/GPU/CUDA |
230179 ns |
230114.5 ns |
1.00 |
batchedmm(2, Bsize=512)/zygote/GPU/AMDGPU |
380594 ns |
377559 ns |
1.01 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) |
406208 ns |
406625 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) |
223541 ns |
223250 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) |
223145.5 ns |
223833 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) |
762667 ns |
762833 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA |
46914 ns |
46575 ns |
1.01 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/oneAPI |
1384166 ns |
1399200.5 ns |
0.99 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/Metal |
415875 ns |
406583 ns |
1.02 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/AMDGPU |
89301 ns |
89521 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) |
1427084 ns |
1445834 ns |
0.99 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) |
891979 ns |
892854.5 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) |
891958 ns |
893333 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) |
2386312.5 ns |
2385770.5 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA |
287696.5 ns |
281827 ns |
1.02 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/oneAPI |
12534990 ns |
11465517 ns |
1.09 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/Metal |
2042416 ns |
2034937.5 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/AMDGPU |
375789 ns |
378964 ns |
0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
433959 ns |
434333 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
430208 ns |
430667 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
430208 ns |
430166 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
447500 ns |
447292 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
55750 ns |
55027 ns |
1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
987998 ns |
1009771.5 ns |
0.98 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1135146 ns |
1109791.5 ns |
1.02 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
236932 ns |
236872 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
3911708 ns |
3915542 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
4023250 ns |
4022187.5 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
4023416 ns |
4023854 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
3815521 ns |
3802354 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
261796.5 ns |
265046 ns |
0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
33894952 ns |
31022310 ns |
1.09 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
10609979 ns |
10484042 ns |
1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1239582 ns |
1238903.5 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) |
8750 ns |
8792 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) |
6875 ns |
6916 ns |
0.99 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) |
6916 ns |
6875 ns |
1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) |
12459 ns |
12458 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA |
23476 ns |
23854 ns |
0.98 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/oneAPI |
2195009 ns |
2189159 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/Metal |
226375 ns |
227167 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/AMDGPU |
218422 ns |
216382 ns |
1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) |
44667 ns |
44833 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) |
44875 ns |
45000 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) |
45416 ns |
45083 ns |
1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) |
44834 ns |
44750 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA |
341928 ns |
339090 ns |
1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/oneAPI |
11449609 ns |
13813520 ns |
0.83 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/Metal |
1758917 ns |
1746834 ns |
1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/AMDGPU |
662766 ns |
671917 ns |
0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
85687.5 ns |
87063 ns |
0.98 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
82125 ns |
92271 ns |
0.89 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
88250 ns |
125250 ns |
0.70 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
87750.5 ns |
88396 ns |
0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
190673 ns |
189900.5 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
5914536.5 ns |
5870133 ns |
1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1998792 ns |
1961729.5 ns |
1.02 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
208012 ns |
204047 ns |
1.02 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2027062.5 ns |
2028417 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2018395.5 ns |
2022208.5 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2022916.5 ns |
2025000 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2027750 ns |
2024000 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
529341.5 ns |
536109.5 ns |
0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
27514758 ns |
30231842.5 ns |
0.91 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
9494875 ns |
9333542 ns |
1.02 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1104271 ns |
1104742 ns |
1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
- Need to remove the Manifest before merging
- relies on feat!: 1.0 release LuxCore.jl#43
- persistent tasks will pass once the above PR is merged
- handle "Error on computing gradients when training isa Val{false}" #98 --> we explicitly mention removing SemVer in this case; we will remove support for that case once "Check if function is being called inside autodiff" EnzymeAD/Enzyme.jl#1761 is tagged.