test: try re-enabling enzyme testing on 0.13.14 #1042
base: main
Benchmark Results (ASV)
Benchmark Plots: A plot of the benchmark results has been uploaded as an artifact to the workflow run for this PR.
Some of the tests also need to be re-enabled manually in LuxLib.
Lux Benchmarks
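The Ratio column in the table below appears to be the Current time divided by the Previous time, so values above 1 mean the current commit is slower. A minimal Python sketch of that arithmetic follows, using three rows copied from the table. The `ratio` and `flag_regressions` helpers and the 1.5 cutoff are hypothetical and are not part of the benchmark tooling.

```python
def ratio(current_ns: float, previous_ns: float) -> float:
    """Current time divided by previous time; above 1 means slower."""
    return current_ns / previous_ns


# (benchmark name, current ns, previous ns), copied from three table rows.
rows = [
    ("layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)",
     4084.0, 4270.5),
    ("bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s)",
     3125.0, 1000.0),
    ("layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)",
     207417.0, 320458.0),
]

# Hypothetical cutoff chosen for illustration only.
THRESHOLD = 1.5


def flag_regressions(rows, threshold=THRESHOLD):
    """Return the names of benchmarks whose ratio exceeds the threshold."""
    return [name for name, cur, prev in rows if ratio(cur, prev) > threshold]


if __name__ == "__main__":
    for name, cur, prev in rows:
        print(f"{ratio(cur, prev):.2f}  {name}")
    print("flagged:", flag_regressions(rows))
```

Many of the sub-microsecond CPU rows swing by 10 to 20 percent in both directions between runs, so a single ratio on a small benchmark is probably best read as noise unless it repeats.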
Benchmark suite | Current: 239783f | Previous: 900c21c | Ratio |
---|---|---|---|
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 4084 ns | 4270.5 ns | 0.96 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 4542 ns | 4000 ns | 1.14 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6166.5 ns | 5875 ns | 1.05 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 4625 ns | 4895.5 ns | 0.94 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 61042.5 ns | 59833 ns | 1.02 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 10812.5 ns | 10375 ns | 1.04 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 9958 ns | 9958 ns | 1 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 10167 ns | 10792 ns | 0.94 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 10666 ns | 10125 ns | 1.05 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 426414 ns | 422438 ns | 1.01 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 1041 ns | 1083 ns | 0.96 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 3125 ns | 1000 ns | 3.13 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 1334 ns | 1417 ns | 0.94 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 1292 ns | 1125 ns | 1.15 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA | 18161 ns | 18109 ns | 1.00 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 4125 ns | 4166 ns | 0.99 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 3958 ns | 4125 ns | 0.96 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 4209 ns | 4187.5 ns | 1.01 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 4208 ns | 4042 ns | 1.04 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA | 109971 ns | 109209 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57292 ns | 57645.5 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46667 ns | 47000 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46375 ns | 38125 ns | 1.22 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83666 ns | 82084 ns | 1.02 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 36679 ns | 37455 ns | 0.98 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2034750 ns | 1973687 ns | 1.03 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2090625.5 ns | 2089416 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1873875 ns | 2085625 ns | 0.90 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2023041 ns | 1985813 ns | 1.02 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 194726.5 ns | 195917 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 149333 ns | 146416.5 ns | 1.02 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 144312.5 ns | 147020.5 ns | 0.98 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 147584 ns | 145667 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 143708 ns | 145604.5 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 167008 ns | 166391 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1121958 ns | 1129209 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1117416.5 ns | 1126375 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1106854 ns | 1147667 ns | 0.96 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1120042 ns | 1104209 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 526749 ns | 521058.5 ns | 1.01 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 3500 ns | 3416.5 ns | 1.02 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 3834 ns | 3333 ns | 1.15 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 7083 ns | 6333 ns | 1.12 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 3750 ns | 3250 ns | 1.15 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 70801 ns | 66594 ns | 1.06 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9395.5 ns | 8792 ns | 1.07 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 9375 ns | 9291 ns | 1.01 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 9500 ns | 9250 ns | 1.03 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 9959 ns | 9292 ns | 1.07 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 496293.5 ns | 493812 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 15270.5 ns | 14750 ns | 1.04 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 16125 ns | 15458 ns | 1.04 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 19042 ns | 19167 ns | 0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 15666.5 ns | 16437.5 ns | 0.95 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 52891 ns | 53833 ns | 0.98 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 213709 ns | 215416.5 ns | 0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 226958 ns | 213208.5 ns | 1.06 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 214875 ns | 214271 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 214500 ns | 227104 ns | 0.94 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 269454 ns | 271460 ns | 0.99 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 458 ns | 542 ns | 0.85 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 583 ns | 625 ns | 0.93 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 750 ns | 792 ns | 0.95 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 667 ns | 583 ns | 1.14 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA | 17740 ns | 17470 ns | 1.02 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 1542 ns | 1750 ns | 0.88 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 1458 ns | 1417 ns | 1.03 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 1542 ns | 1709 ns | 0.90 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 1542 ns | 1645.5 ns | 0.94 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA | 101893 ns | 101826.5 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7292 ns | 7250 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5792 ns | 5916 ns | 0.98 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5792 ns | 5292 ns | 1.09 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10584 ns | 10000 ns | 1.06 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 23410 ns | 23857.5 ns | 0.98 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 225104 ns | 226895.5 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 232958 ns | 230375 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 229417 ns | 231584 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 222145.5 ns | 258625 ns | 0.86 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 167116 ns | 167659 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 3875 ns | 3875 ns | 1 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 3875 ns | 3875 ns | 1 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 3875 ns | 3916 ns | 0.99 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 3916 ns | 3833 ns | 1.02 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA | 23806 ns | 23468 ns | 1.01 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16750 ns | 16750 ns | 1 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16667 ns | 17042 ns | 0.98 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16875 ns | 17000 ns | 0.99 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16750 ns | 16625 ns | 1.01 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA | 162865 ns | 160597 ns | 1.01 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 577167 ns | 572166 ns | 1.01 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 599291 ns | 575000 ns | 1.04 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 576875 ns | 587458 ns | 0.98 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 580250 ns | 578334 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA | 112788 ns | 113397 ns | 0.99 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 1418584 ns | 1421708 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1445541.5 ns | 1420125 ns | 1.02 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1419458 ns | 1430083 ns | 0.99 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 1427395.5 ns | 1413292 ns | 1.01 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA | 213449 ns | 209669.5 ns | 1.02 |
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) | 1072354 ns | 1074458 ns | 1.00 |
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) | 961396 ns | 958250.5 ns | 1.00 |
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) | 1354458 ns | 1334396 ns | 1.02 |
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) | 1292937.5 ns | 1310875 ns | 0.99 |
lenet(28, 28, 1, 64)/forward/GPU/CUDA | 272319.5 ns | 269120.5 ns | 1.01 |
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) | 5988583 ns | 5769437 ns | 1.04 |
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) | 4525584 ns | 4470625 ns | 1.01 |
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) | 4903583 ns | 4941021 ns | 0.99 |
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) | 5547917 ns | 5552042 ns | 1.00 |
lenet(28, 28, 1, 64)/zygote/GPU/CUDA | 1062959 ns | 1066489 ns | 1.00 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 500 ns | 500 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 500 ns | 542 ns | 0.92 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 542 ns | 542 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 583 ns | 500 ns | 1.17 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA | 23622 ns | 23585 ns | 1.00 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2125 ns | 2083 ns | 1.02 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2125 ns | 2167 ns | 0.98 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 2166 ns | 2250 ns | 0.96 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2208 ns | 2125 ns | 1.04 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA | 175399.5 ns | 169900 ns | 1.03 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 5833.5 ns | 4084 ns | 1.43 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 4833.5 ns | 6250 ns | 0.77 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 7291 ns | 7209 ns | 1.01 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 4709 ns | 6125 ns | 0.77 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 65926.5 ns | 64199 ns | 1.03 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 11875 ns | 11083 ns | 1.07 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 11792 ns | 11625 ns | 1.01 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 12084 ns | 12000 ns | 1.01 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 11417 ns | 10917 ns | 1.05 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 449188 ns | 446167.5 ns | 1.01 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 7458 ns | 6042 ns | 1.23 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 7000 ns | 7042 ns | 0.99 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8083 ns | 8833 ns | 0.92 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 6875 ns | 7250 ns | 0.95 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 50587 ns | 51074.5 ns | 0.99 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 16500 ns | 17292 ns | 0.95 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 18291 ns | 18334 ns | 1.00 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 18167 ns | 18083 ns | 1.00 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 17709 ns | 17229.5 ns | 1.03 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 297864.5 ns | 299895.5 ns | 0.99 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 500 ns | 459 ns | 1.09 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 500 ns | 542 ns | 0.92 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 542 ns | 1.15 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 583 ns | 500 ns | 1.17 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 32016 ns | 32630 ns | 0.98 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 8833 ns | 8458 ns | 1.04 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 8833.5 ns | 9041 ns | 0.98 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9541 ns | 9166 ns | 1.04 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 8959 ns | 8459 ns | 1.06 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 156346 ns | 158907 ns | 0.98 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 64625 ns | 64625 ns | 1 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 64542 ns | 64250 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 64542 ns | 65000 ns | 0.99 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 65042 ns | 64667 ns | 1.01 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA | 111541.5 ns | 111460 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 292625 ns | 289667 ns | 1.01 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 280333 ns | 279750 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 277312.5 ns | 289625 ns | 0.96 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 290709 ns | 281250 ns | 1.03 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA | 186964 ns | 184453.5 ns | 1.01 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) | 3370271 ns | 3347125 ns | 1.01 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) | 3013792 ns | 3015520.5 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) | 3016250 ns | 2792979 ns | 1.08 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) | 4038791.5 ns | 4064520.5 ns | 0.99 |
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA | 573610.5 ns | 588037 ns | 0.98 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) | 7604583 ns | 7500166 ns | 1.01 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) | 7463541 ns | 7470229.5 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) | 7455125 ns | 7393937.5 ns | 1.01 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) | 8289167 ns | 8209000 ns | 1.01 |
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA | 1371280 ns | 1331630 ns | 1.03 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) | 18784458 ns | 19529541 ns | 0.96 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) | 19115209 ns | 19142959 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) | 19129208 ns | 19022708 ns | 1.01 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) | 15938333 ns | 15703750 ns | 1.01 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 23481833 ns | 23617083 ns | 0.99 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 34089291.5 ns | 33598208 ns | 1.01 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 37198333 ns | 41100666 ns | 0.91 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 35465499.5 ns | 35022333 ns | 1.01 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 1848724 ns | 1855178.5 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 190170125 ns | 189352250 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 164549583.5 ns | 163568208 ns | 1.01 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 152546334 ns | 158452896 ns | 0.96 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 449539583 ns | 438607167 ns | 1.02 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 13896040 ns | 13925600.5 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 290337958 ns | 287704167 ns | 1.01 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 337255125 ns | 337952937.5 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 298835500 ns | 291466708 ns | 1.03 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 410982833 ns | 395696000 ns | 1.04 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 24167 ns | 21334 ns | 1.13 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 23208 ns | 24375 ns | 0.95 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 25917 ns | 25771 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 22042 ns | 23584 ns | 0.93 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 96595 ns | 95861 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 104041 ns | 103625 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 103625 ns | 103708 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 104916 ns | 104625 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 104708 ns | 103479.5 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 511860.5 ns | 510517.5 ns | 1.00 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 6750 ns | 5750 ns | 1.17 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 6250 ns | 7208 ns | 0.87 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8083.5 ns | 7666.5 ns | 1.05 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 6291 ns | 7166 ns | 0.88 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 68925 ns | 68604 ns | 1.00 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 15125 ns | 14708 ns | 1.03 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 15750 ns | 15916 ns | 0.99 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 16208 ns | 16666 ns | 0.97 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 15458 ns | 14667 ns | 1.05 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 484262 ns | 483804.5 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 3013771 ns | 2876500 ns | 1.05 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 2072375 ns | 2063833 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 2277479 ns | 2288208 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 4847333 ns | 4870416 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 587397 ns | 587700 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 23555625 ns | 23421375 ns | 1.01 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 17948833 ns | 17990750 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 17933229 ns | 18312792 ns | 0.98 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 35848896 ns | 35646292 ns | 1.01 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 3108253 ns | 3104605 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 33244416.5 ns | 33240625 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 27673458 ns | 27662417 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 27373125 ns | 27837459 ns | 0.98 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 42417062.5 ns | 41788833 ns | 1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 77250 ns | 72083 ns | 1.07 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 72541 ns | 78729 ns | 0.92 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 76417 ns | 75729.5 ns | 1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 73125 ns | 72459 ns | 1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 100621 ns | 100762.5 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 292708 ns | 204458 ns | 1.43 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 308687.5 ns | 219041 ns | 1.41 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 207417 ns | 320458 ns | 0.65 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 215916.5 ns | 205312.5 ns | 1.05 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 544548.5 ns | 541454.5 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 12667 ns | 11333 ns | 1.12 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 11833 ns | 12416 ns | 0.95 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 13667 ns | 13834 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 12333 ns | 13125 ns | 0.94 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 71416 ns | 69856.5 ns | 1.02 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 27083 ns | 26520.5 ns | 1.02 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 27042 ns | 27458 ns | 0.98 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 27708 ns | 28291 ns | 0.98 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 26833 ns | 26500 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 479407 ns | 473341 ns | 1.01 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 13020.5 ns | 11833 ns | 1.10 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 12500 ns | 12750 ns | 0.98 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 14541 ns | 14333 ns | 1.01 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 12687.5 ns | 13375 ns | 0.95 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 51771 ns | 51587 ns | 1.00 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 26500 ns | 26375 ns | 1.00 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 25583 ns | 26583 ns | 0.96 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 26416 ns | 26666 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 26417 ns | 26417 ns | 1 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 301146.5 ns | 302777.5 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 180292 ns | 178666.5 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 180542 ns | 180292 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 185667 ns | 184416.5 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 179958 ns | 179709 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 55412 ns | 55677 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 586395.5 ns | 591146.5 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 587271 ns | 588583 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 583750 ns | 593062 ns | 0.98 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 583562.5 ns | 582708.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 283263 ns | 285027 ns | 0.99 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 7083 ns | 5667 ns | 1.25 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 5958 ns | 7167 ns | 0.83 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8187.5 ns | 7895.5 ns | 1.04 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 5959 ns | 7291 ns | 0.82 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 70367 ns | 69657.5 ns | 1.01 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14500 ns | 14167 ns | 1.02 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 14459 ns | 14958 ns | 0.97 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 15417 ns | 15854.5 ns | 0.97 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14250 ns | 14583 ns | 0.98 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 466069 ns | 460443 ns | 1.01 |
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) | 1219208 ns | 1194208.5 ns | 1.02 |
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) | 1347625 ns | 1216792 ns | 1.11 |
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) | 1286250 ns | 1262604 ns | 1.02 |
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) | 1314000 ns | 1318166.5 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/GPU/CUDA | 301441 ns | 301559 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) | 4137938 ns | 4098416 ns | 1.01 |
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) | 4337750 ns | 4352937.5 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) | 4589792 ns | 4631875 ns | 0.99 |
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) | 4675291 ns | 4436562.5 ns | 1.05 |
batchedmm(512, Bsize=4)/zygote/GPU/CUDA | 1048315 ns | 1042661.5 ns | 1.01 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1833 ns | 1750 ns | 1.05 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1792 ns | 1833 ns | 0.98 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1833 ns | 1834 ns | 1.00 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1833 ns | 1875 ns | 0.98 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA | 23638.5 ns | 23523 ns | 1.00 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 4958 ns | 4792 ns | 1.03 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 4834 ns | 4875 ns | 0.99 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 4875 ns | 4916 ns | 0.99 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 4916 ns | 4875 ns | 1.01 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA | 189956.5 ns | 187370 ns | 1.01 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6375 ns | 5500 ns | 1.16 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 6166.5 ns | 6334 ns | 0.97 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 7917 ns | 8604 ns | 0.92 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 6291.5 ns | 7292 ns | 0.86 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 53487.5 ns | 54466 ns | 0.98 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 11292 ns | 10958 ns | 1.03 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10834 ns | 11792 ns | 0.92 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 11709 ns | 11708.5 ns | 1.00 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 11458 ns | 11166 ns | 1.03 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 327436.5 ns | 330839 ns | 0.99 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 292 ns | 1 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 333 ns | 0.88 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 292 ns | 1 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 333 ns | 333 ns | 1 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA | 23234.5 ns | 22873.5 ns | 1.02 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2667 ns | 2708 ns | 0.98 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2709 ns | 2959 ns | 0.92 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 2959 ns | 3042 ns | 0.97 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2875 ns | 2750 ns | 1.05 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA | 160479.5 ns | 157537.5 ns | 1.02 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 14208 ns | 10750 ns | 1.32 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 11709 ns | 13708 ns | 0.85 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 15833 ns | 14958 ns | 1.06 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 11854 ns | 14583 ns | 0.81 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 55512.5 ns | 55574.5 ns | 1.00 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 25062.5 ns | 25209 ns | 0.99 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 24167 ns | 25250 ns | 0.96 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 25416 ns | 25375 ns | 1.00 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 24917 ns | 24979.5 ns | 1.00 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 288292 ns | 292656 ns | 0.99 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 4166 ns | 4208 ns | 0.99 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 4125 ns | 4125 ns | 1 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 4167 ns | 4167 ns | 1 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 4208 ns | 4167 ns | 1.01 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA | 24783 ns | 24774 ns | 1.00 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16209 ns | 16333 ns | 0.99 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16250 ns | 16125 ns | 1.01 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16208 ns | 16125 ns | 1.01 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16125 ns | 16084 ns | 1.00 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA | 197133 ns | 195031.5 ns | 1.01 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 5750 ns | 5708 ns | 1.01 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 5667 ns | 5750 ns | 0.99 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 5750 ns | 5750 ns | 1 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5750 ns | 5709 ns | 1.01 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 33101 ns | 33326 ns | 0.99 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 20916 ns | 21125 ns | 0.99 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 20750 ns | 20875 ns | 0.99 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 21542 ns | 21583 ns | 1.00 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 20833 ns | 21500 ns | 0.97 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 175254.5 ns | 175195.5 ns | 1.00 |
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) | 399125 ns | 415708 ns | 0.96 |
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) | 370396 ns | 376667 ns | 0.98 |
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) | 496375 ns | 471499.5 ns | 1.05 |
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) | 503625 ns | 523500 ns | 0.96 |
batchedmm(16, Bsize=512)/forward/GPU/CUDA | 66515.5 ns | 66680.5 ns | 1.00 |
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) | 991583.5 ns | 924750.5 ns | 1.07 |
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) | 871520.5 ns | 849291 ns | 1.03 |
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) | 1231125 ns | 1217521 ns | 1.01 |
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) | 1330834 ns | 1302292 ns | 1.02 |
batchedmm(16, Bsize=512)/zygote/GPU/CUDA | 189653 ns | 189339 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 80500 ns | 79792 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 79416 ns | 82667 ns | 0.96 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 85042 ns | 84208 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 80708 ns | 82833 ns | 0.97 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 192887 ns | 193132 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1919750 ns | 1917625.5 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1915625 ns | 1915292 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1925188 ns | 1940917 ns | 0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1929833 ns | 1896541 ns | 1.02 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 390355 ns | 395963 ns | 0.99 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 292 ns | 1 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 292 ns | 1 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 292 ns | 1 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 292 ns | 333 ns | 0.88 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA | 21755 ns | 21798 ns | 1.00 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 1792 ns | 1792 ns | 1 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 1791 ns | 1875 ns | 0.96 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 1833 ns | 1834 ns | 1.00 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 1833 ns | 1792 ns | 1.02 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA | 167888 ns | 167505 ns | 1.00 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 7333.5 ns | 5834 ns | 1.26 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 7125 ns | 7500 ns | 0.95 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 9292 ns | 9958 ns | 0.93 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 7292 ns | 6875 ns | 1.06 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 55708 ns | 58244.5 ns | 0.96 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 9375 ns | 9375 ns | 1 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 8833 ns | 9333 ns | 0.95 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9208 ns | 9354.5 ns | 0.98 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9500 ns | 9625 ns | 0.99 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 296941.5 ns | 302935 ns | 0.98 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 120225916.5 ns | 119443416.5 ns | 1.01 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 173999042 ns | 173896250 ns | 1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 147946708 ns | 155811625 ns | 0.95 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 104239875 ns | 108054541 ns | 0.96 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5472185 ns | 5469386 ns | 1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 615397103.5 ns | 616746166.5 ns | 1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 556088583 ns | 555745625 ns | 1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 451781708 ns | 468855125 ns | 0.96 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 773212500 ns | 760571396 ns | 1.02 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 34986769 ns | 34956216 ns | 1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 652621166 ns | 648663875 ns | 1.01 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 666689625 ns | 664591146 ns | 1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 587828354 ns | 601178041.5 ns | 0.98 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 742101875 ns | 746069334 ns | 0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 59458 ns | 59458 ns | 1 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 47625 ns | 47083 ns | 1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 47875 ns | 39166 ns | 1.22 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 84208 ns | 83208 ns | 1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 37180 ns | 37582 ns | 0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1716167 ns | 1926708 ns | 0.89 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1973500 ns | 1983042 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1980542 ns | 1986937.5 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1912021 ns | 1850250 ns | 1.03 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 172659 ns | 173017.5 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 267500 ns | 265187.5 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 268541 ns | 267959 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 285124.5 ns | 276771 ns | 1.03 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 267208 ns | 266917 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 122124.5 ns | 128834.5 ns | 0.95 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 675208.5 ns | 604083 ns | 1.12 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 686125 ns | 692833.5 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 692625 ns | 705709 ns | 0.98 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 595813 ns | 590291.5 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 668007 ns | 683429 ns | 0.98 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 2110520.5 ns | 2195333 ns | 0.96 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 2200312.5 ns | 2225625 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 2218999.5 ns | 2230583 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 2164062.5 ns | 2183333 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 132950.5 ns | 133325.5 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5484729 ns | 5480833 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5495166.5 ns | 5508958 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5513833 ns | 5585895.5 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5576458 ns | 5490125 ns | 1.02 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 723408 ns | 766206 ns | 0.94 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 643583 ns | 646750 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 660084 ns | 660250 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 644334 ns | 642917 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 655417 ns | 647375 ns | 1.01 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA | 46996 ns | 47306 ns | 0.99 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 1817209 ns | 1828875 ns | 0.99 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1755334 ns | 1721042 ns | 1.02 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1724416.5 ns | 1665209 ns | 1.04 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 2099167 ns | 2097000 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA | 223477 ns | 223896.5 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58167 ns | 58667 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46042 ns | 47750 ns | 0.96 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 45583 ns | 38958 ns | 1.17 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 85292 ns | 82750 ns | 1.03 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 28503.5 ns | 29191 ns | 0.98 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1997875 ns | 2029083.5 ns | 0.98 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2084292 ns | 2091166 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2093020.5 ns | 2107249.5 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2018417 ns | 1994854.5 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 188016.5 ns | 190986 ns | 0.98 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 13344438 ns | 13371291 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 12452083 ns | 12436583.5 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 12494604 ns | 12675625 ns | 0.99 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 15015250.5 ns | 15146959 ns | 0.99 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 517943 ns | 517535.5 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 47201916.5 ns | 47259416 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 41812125 ns | 41746209 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 40768708.5 ns | 41384750 ns | 0.99 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 59301625 ns | 58440500 ns | 1.01 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 3218709 ns | 3203835 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 74145812.5 ns | 73984667 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 68216667 ns | 91223791.5 ns | 0.75 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 90449292 ns | 90609938 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 99506521 ns | 77234000 ns | 1.29 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58750 ns | 59000 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 47291 ns | 47417 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 47167 ns | 38917 ns | 1.21 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 84500 ns | 81125 ns | 1.04 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 46723 ns | 47741 ns | 0.98 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1914208.5 ns | 1911646 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1974166 ns | 1970541 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1975625 ns | 1976417 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1910312 ns | 1882083 ns | 1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 190300.5 ns | 195868.5 ns | 0.97 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 292 ns | 292 ns | 1 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 333 ns | 375 ns | 0.89 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 416 ns | 375 ns | 1.11 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 416 ns | 333 ns | 1.25 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 31677 ns | 32615 ns | 0.97 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6125 ns | 6500 ns | 0.94 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6125 ns | 6375 ns | 0.96 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6625 ns | 6750 ns | 0.98 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6584 ns | 6375 ns | 1.03 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 168934.5 ns | 176818 ns | 0.96 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 250 ns | 292 ns | 0.86 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 250 ns | 292 ns | 0.86 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 291 ns | 292 ns | 1.00 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 292 ns | 250 ns | 1.17 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA | 32058 ns | 32102 ns | 1.00 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 2625 ns | 2625 ns | 1 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 2625 ns | 2875 ns | 0.91 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 2834 ns | 2916 ns | 0.97 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 2791 ns | 2625 ns | 1.06 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA | 159386 ns | 164236.5 ns | 0.97 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 287071478.5 ns | 286096229 ns | 1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 341626334 ns | 339570541 ns | 1.01 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 313518479.5 ns | 321242167 ns | 0.98 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 275420708 ns | 271493208 ns | 1.01 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 7107755 ns | 7111512 ns | 1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 997227041 ns | 987492667 ns | 1.01 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 942763125 ns | 939040416 ns | 1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 859332667 ns | 868433209 ns | 0.99 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 1180396417 ns | 1162204042 ns | 1.02 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 34002052 ns | 34040446 ns | 1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 1309634583.5 ns | 1310851000.5 ns | 1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 1365354146 ns | 1685402625 ns | 0.81 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 1654700958 ns | 1648347125 ns | 1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 1647288250 ns | 1310788750 ns | 1.26 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1416125 ns | 1412625 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1460020.5 ns | 1412041.5 ns | 1.03 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1415625 ns | 1424625 ns | 0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1423417 ns | 1408334 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 127102 ns | 128501 ns | 0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5015083 ns | 5028875 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5026958.5 ns | 5030104 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5019812.5 ns | 5062042 ns | 0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5047500 ns | 5014021 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 494309 ns | 597004.5 ns | 0.83 |
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) | 168408291 ns | 168008834 ns | 1.00 |
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) | 132051562.5 ns | 130299417 ns | 1.01 |
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) | 129591875 ns | 148283479 ns | 0.87 |
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) | 159716042 ns | 161948354 ns | 0.99 |
vgg16(32, 32, 3, 32)/forward/GPU/CUDA | 4883020 ns | 5052268 ns | 0.97 |
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) | 830365834 ns | 662817209 ns | 1.25 |
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) | 642499125 ns | 492884417 ns | 1.30 |
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) | 508881375 ns | 507367709 ns | 1.00 |
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) | 844604875 ns | 678320708 ns | 1.25 |
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA | 16052527 ns | 17294527 ns | 0.93 |
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) | 8935625 ns | 8884604 ns | 1.01 |
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) | 8710770.5 ns | 8801959 ns | 0.99 |
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) | 7873916.5 ns | 8221541.5 ns | 0.96 |
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) | 10411083 ns | 10127167 ns | 1.03 |
batchedmm(512, Bsize=32)/forward/GPU/CUDA | 1593477 ns | 1611762 ns | 0.99 |
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) | 36657875 ns | 36027125 ns | 1.02 |
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) | 36688708 ns | 36933063 ns | 0.99 |
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) | 33442646 ns | 34547750 ns | 0.97 |
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) | 40180854 ns | 38824854 ns | 1.03 |
batchedmm(512, Bsize=32)/zygote/GPU/CUDA | 6470804 ns | 6452267 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) | 47416.5 ns | 47375 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) | 47500 ns | 47250 ns | 1.01 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) | 47792 ns | 47542 ns | 1.01 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) | 47459 ns | 47333 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA | 19446 ns | 19020 ns | 1.02 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) | 50291 ns | 50312.5 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) | 53000 ns | 50500 ns | 1.05 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) | 50500 ns | 50958.5 ns | 0.99 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) | 50500 ns | 50333 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA | 178483 ns | 226580 ns | 0.79 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 7375 ns | 6542 ns | 1.13 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 7416 ns | 7187.5 ns | 1.03 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 8667 ns | 9083 ns | 0.95 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 7292 ns | 8625 ns | 0.85 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 79017 ns | 117383.5 ns | 0.67 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10292 ns | 9625 ns | 1.07 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 9833 ns | 10208 ns | 0.96 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10959 ns | 10333.5 ns | 1.06 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10208 ns | 10209 ns | 1.00 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 488962 ns | 723908.5 ns | 0.68 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 8250 ns | 6083 ns | 1.36 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6667 ns | 8250 ns | 0.81 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 9208 ns | 9417 ns | 0.98 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 6687.5 ns | 8375 ns | 0.80 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 84730.5 ns | 157024.5 ns | 0.54 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 12875 ns | 13292 ns | 0.97 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 13375 ns | 13792 ns | 0.97 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13375 ns | 13708 ns | 0.98 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 13500 ns | 12834 ns | 1.05 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 422778.5 ns | 618769 ns | 0.68 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 1042 ns | 1042 ns | 1 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 1000 ns | 1042 ns | 0.96 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 1083 ns | 1042 ns | 1.04 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 1042 ns | 1083 ns | 0.96 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 32216.5 ns | 32863 ns | 0.98 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 8000 ns | 7875 ns | 1.02 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7917 ns | 8000 ns | 0.99 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8208 ns | 8208 ns | 1 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 8000 ns | 8250 ns | 0.97 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 191222.5 ns | 246953.5 ns | 0.77 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 23395.5 ns | 25062.5 ns | 0.93 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 23250 ns | 23291.5 ns | 1.00 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 25041 ns | 23542 ns | 1.06 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 23250 ns | 23250 ns | 1 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA | 18950 ns | 18661 ns | 1.02 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 52084 ns | 52625 ns | 0.99 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 52958 ns | 52833 ns | 1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 52500 ns | 52875 ns | 0.99 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 52708.5 ns | 52333 ns | 1.01 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA | 230217.5 ns | 364018 ns | 0.63 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1452958 ns | 1403750 ns | 1.04 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1446334 ns | 1451354 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1398770.5 ns | 1407542 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1450833 ns | 1406458 ns | 1.03 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 195688 ns | 196760 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5010500 ns | 5023250 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5014041 ns | 5018687.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4927959 ns | 5042125 ns | 0.98 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5050083 ns | 5001750 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 535939 ns | 766930 ns | 0.70 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 3066250 ns | 3048708 ns | 1.01 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 2084000 ns | 2082646 ns | 1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 2295333 ns | 2300125 ns | 1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 4884812.5 ns | 4855000 ns | 1.01 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 585473 ns | 583278 ns | 1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 24408167 ns | 24263250 ns | 1.01 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 18817958 ns | 18905459 ns | 1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 18750208 ns | 19193375 ns | 0.98 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 37356750 ns | 36575416 ns | 1.02 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 3182445 ns | 3216229 ns | 0.99 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 33999792 ns | 34013563 ns | 1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 28397375 ns | 28342229 ns | 1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 27978125 ns | 28436750 ns | 0.98 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 42315042 ns | 43339875 ns | 0.98 |
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) | 144296292 ns | 144288959 ns | 1.00 |
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) | 142454750 ns | 142279583 ns | 1.00 |
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) | 123902542 ns | 126469000.5 ns | 0.98 |
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) | 167495375 ns | 168866000 ns | 0.99 |
batchedmm(512, Bsize=512)/forward/GPU/CUDA | 22537902 ns | 22582893 ns | 1.00 |
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) | 1415751333 ns | 1275599313 ns | 1.11 |
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) | 1144590833 ns | 1058487228.5 ns | 1.08 |
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) | 680940541 ns | 712851209 ns | 0.96 |
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) | 679303041 ns | 668538250 ns | 1.02 |
batchedmm(512, Bsize=512)/zygote/GPU/CUDA | 118837359 ns | 119108875 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 74729 ns | 83125 ns | 0.90 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 79104.5 ns | 76208 ns | 1.04 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 77500 ns | 78125 ns | 0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 85020.5 ns | 72729 ns | 1.17 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 189527.5 ns | 365097 ns | 0.52 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 204042 ns | 189959 ns | 1.07 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 282250 ns | 287792 ns | 0.98 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 291917 ns | 268875 ns | 1.09 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 293625 ns | 189583.5 ns | 1.55 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1050472 ns | 1559670.5 ns | 0.67 |
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) | 35430937.5 ns | 35476167 ns | 1.00 |
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) | 35413625 ns | 35447729.5 ns | 1.00 |
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) | 32256541.5 ns | 32304459 ns | 1.00 |
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) | 41546833.5 ns | 40935146 ns | 1.01 |
batchedmm(512, Bsize=128)/forward/GPU/CUDA | 5844177 ns | 5843273 ns | 1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) | 148391375 ns | 147875542 ns | 1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) | 152657770.5 ns | 152751312.5 ns | 1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) | 135627417 ns | 139824437 ns | 0.97 |
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) | 229003479.5 ns | 287719375 ns | 0.80 |
batchedmm(512, Bsize=128)/zygote/GPU/CUDA | 34891742 ns | 34882914 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 120913500 ns | 120880395.5 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 174393833 ns | 174358791 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 148088167 ns | 155429791 ns | 0.95 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 102952000 ns | 106966959 ns | 0.96 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5489216 ns | 5456342 ns | 1.01 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 469422083 ns | 470623375 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 467661083 ns | 466918000 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 440419292 ns | 456589562.5 ns | 0.96 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 759917916 ns | 742113834 ns | 1.02 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 32237167 ns | 32255425 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 708309833.5 ns | 706243291.5 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 653178271 ns | 652697541.5 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 576002271 ns | 591007625 ns | 0.97 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 865545375 ns | 851805375 ns | 1.02 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) | 1342979.5 ns | 1320583.5 ns | 1.02 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) | 979124.5 ns | 965875 ns | 1.01 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) | 976458 ns | 736687.5 ns | 1.33 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) | 2106791 ns | 1944666.5 ns | 1.08 |
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA | 578569 ns | 564187.5 ns | 1.03 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) | 2944667 ns | 2971708.5 ns | 0.99 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) | 2608958 ns | 2620334 ns | 1.00 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) | 2626167 ns | 2535604 ns | 1.04 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) | 3768167 ns | 3604083.5 ns | 1.05 |
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA | 1697045 ns | 1878347.5 ns | 0.90 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) | 6633209 ns | 6649958 ns | 1.00 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) | 6526458 ns | 6493042 ns | 1.01 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) | 6498709 ns | 6437479.5 ns | 1.01 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) | 4513208 ns | 4435750 ns | 1.02 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7375 ns | 7375 ns | 1 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6000 ns | 6208 ns | 0.97 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6125 ns | 5375 ns | 1.14 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10542 ns | 9916 ns | 1.06 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 25661 ns | 25400 ns | 1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 212354.5 ns | 213645.5 ns | 0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 220458 ns | 221833 ns | 0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 220875 ns | 221250 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 216000 ns | 205875 ns | 1.05 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 236310 ns | 293719.5 ns | 0.80 |
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) | 314574854.5 ns | 301604437.5 ns | 1.04 |
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) | 222487417 ns | 221356625 ns | 1.01 |
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) | 197721500 ns | 223278083.5 ns | 0.89 |
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) | 317671875 ns | 312163250 ns | 1.02 |
vgg16(32, 32, 3, 64)/forward/GPU/CUDA | 7677705 ns | 7672763 ns | 1.00 |
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) | 1081788271 ns | 1078062604.5 ns | 1.00 |
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) | 918557958.5 ns | 896268771 ns | 1.02 |
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) | 817573542 ns | 880668729 ns | 0.93 |
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) | 1180088625 ns | 1161143188 ns | 1.02 |
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA | 26749727.5 ns | 26517571 ns | 1.01 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 6958 ns | 5500 ns | 1.27 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 6583.5 ns | 5750 ns | 1.14 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 8791.5 ns | 9437.5 ns | 0.93 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 6312.5 ns | 5875 ns | 1.07 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 136942.5 ns | 201555 ns | 0.68 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7208 ns | 7500 ns | 0.96 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7417 ns | 7458 ns | 0.99 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7375 ns | 7750 ns | 0.95 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7500 ns | 7041.5 ns | 1.07 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 559843 ns | 699933.5 ns | 0.80 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 541 ns | 500 ns | 1.08 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 541 ns | 500 ns | 1.08 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 584 ns | 583 ns | 1.00 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 542 ns | 500 ns | 1.08 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 23894.5 ns | 23724.5 ns | 1.01 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9125 ns | 9208 ns | 0.99 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 9417 ns | 9625 ns | 0.98 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 9250 ns | 9604.5 ns | 0.96 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9791 ns | 9042 ns | 1.08 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 191795.5 ns | 234828.5 ns | 0.82 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 353916.5 ns | 351500 ns | 1.01 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 352604 ns | 350896 ns | 1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 351500 ns | 354624.5 ns | 0.99 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 353834 ns | 351708 ns | 1.01 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA | 21305 ns | 20984 ns | 1.02 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 826667 ns | 775417 ns | 1.07 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 805417 ns | 824916 ns | 0.98 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 827124.5 ns | 830958 ns | 1.00 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 828792 ns | 823958 ns | 1.01 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA | 254167.5 ns | 306663 ns | 0.83 |
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) | 332542 ns | 338083 ns | 0.98 |
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) | 338541 ns | 341500 ns | 0.99 |
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) | 451417 ns | 443667 ns | 1.02 |
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) | 309709 ns | 325667 ns | 0.95 |
batchedmm(16, Bsize=32)/forward/GPU/CUDA | 18587 ns | 17821 ns | 1.04 |
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) | 692020.5 ns | 696042 ns | 0.99 |
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) | 740708.5 ns | 739416.5 ns | 1.00 |
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) | 1027750 ns | 1042874.5 ns | 0.99 |
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) | 692021 ns | 692645.5 ns | 1.00 |
batchedmm(16, Bsize=32)/zygote/GPU/CUDA | 221446.5 ns | 273141.5 ns | 0.81 |
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) | 347854 ns | 358458.5 ns | 0.97 |
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) | 350209 ns | 349125 ns | 1.00 |
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) | 435375 ns | 431291.5 ns | 1.01 |
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) | 348708 ns | 370875 ns | 0.94 |
batchedmm(16, Bsize=128)/forward/GPU/CUDA | 22822 ns | 22357.5 ns | 1.02 |
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) | 757750 ns | 756625 ns | 1.00 |
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) | 744625 ns | 744208.5 ns | 1.00 |
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) | 1078291 ns | 1073250 ns | 1.00 |
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) | 821875 ns | 818125.5 ns | 1.00 |
batchedmm(16, Bsize=128)/zygote/GPU/CUDA | 214919 ns | 221398.5 ns | 0.97 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) | 3417 ns | 3459 ns | 0.99 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) | 3667 ns | 3541 ns | 1.04 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) | 3584 ns | 3792 ns | 0.95 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) | 3625 ns | 3291 ns | 1.10 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA | 18532 ns | 17956 ns | 1.03 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) | 4333 ns | 4208 ns | 1.03 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) | 4250 ns | 4208 ns | 1.01 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) | 4292 ns | 4416 ns | 0.97 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) | 4500 ns | 4125 ns | 1.09 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA | 212438.5 ns | 275839.5 ns | 0.77 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 4250 ns | 3792 ns | 1.12 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 3750 ns | 3375 ns | 1.11 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 6709 ns | 6750 ns | 0.99 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 4291 ns | 6625 ns | 0.65 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 163501.5 ns | 205448.5 ns | 0.80 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8833 ns | 8334 ns | 1.06 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8166 ns | 8459 ns | 0.97 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8583 ns | 8500 ns | 1.01 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8542 ns | 8541 ns | 1.00 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 983714 ns | 1183984 ns | 0.83 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 203854.5 ns | 202625 ns | 1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 209583 ns | 210416 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 210000 ns | 209292 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 201667 ns | 200000 ns | 1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 34834 ns | 34588 ns | 1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 648000.5 ns | 603792 ns | 1.07 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 620292 ns | 670625 ns | 0.92 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 666375 ns | 630958 ns | 1.06 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 593417 ns | 631187.5 ns | 0.94 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 288050.5 ns | 352652 ns | 0.82 |
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) | 961208 ns | 967521 ns | 0.99 |
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) | 937667 ns | 927063 ns | 1.01 |
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) | 959875 ns | 964437.5 ns | 1.00 |
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) | 1295041 ns | 1281853.5 ns | 1.01 |
batchedmm(128, Bsize=128)/forward/GPU/CUDA | 208012 ns | 207244 ns | 1.00 |
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) | 4450542 ns | 4451771 ns | 1.00 |
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) | 4458333 ns | 4482750 ns | 0.99 |
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) | 4312625 ns | 4474208 ns | 0.96 |
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) | 6617292 ns | 6201166 ns | 1.07 |
batchedmm(128, Bsize=128)/zygote/GPU/CUDA | 982525 ns | 945549 ns | 1.04 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 3833 ns | 3604.5 ns | 1.06 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 3458 ns | 3167 ns | 1.09 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 5333.5 ns | 6792 ns | 0.79 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 3750 ns | 3167 ns | 1.18 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 164779 ns | 233201 ns | 0.71 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7583 ns | 7500 ns | 1.01 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7292 ns | 7375 ns | 0.99 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7167 ns | 7291 ns | 0.98 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7833 ns | 7083 ns | 1.11 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 887301 ns | 1014881 ns | 0.87 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1637583.5 ns | 1602833.5 ns | 1.02 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1191458.5 ns | 1187916 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1370916 ns | 1364062 ns | 1.01 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2374125.5 ns | 2343729.5 ns | 1.01 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 215947 ns | 212955.5 ns | 1.01 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12337083 ns | 12334792 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 9576709 ns | 9602042 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9305083 ns | 9404958 ns | 0.99 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 18247292 ns | 17966833 ns | 1.02 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 1951736 ns | 1949853 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17394459 ns | 17347084 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14383042 ns | 14365000 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14364687.5 ns | 14512666 ns | 0.99 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21105250 ns | 21005479.5 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 136000 ns | 89791 ns | 1.51 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 90333.5 ns | 91729.5 ns | 0.98 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 93458 ns | 94291 ns | 0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 134291.5 ns | 117416.5 ns | 1.14 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 125598 ns | 126285 ns | 0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2030104.5 ns | 2023917 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2023020.5 ns | 2013416.5 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2027042 ns | 2058875 ns | 0.98 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2053083 ns | 2027875 ns | 1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 972659 ns | 1031286 ns | 0.94 |
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) | 343042 ns | 346791.5 ns | 0.99 |
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) | 349374.5 ns | 343583.5 ns | 1.02 |
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) | 396791.5 ns | 412250 ns | 0.96 |
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) | 290500 ns | 306166 ns | 0.95 |
batchedmm(2, Bsize=4)/forward/GPU/CUDA | 15157 ns | 16010 ns | 0.95 |
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) | 704250 ns | 702291 ns | 1.00 |
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) | 724709 ns | 728979.5 ns | 0.99 |
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) | 1023958.5 ns | 1025458 ns | 1.00 |
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) | 648417 ns | 639875 ns | 1.01 |
batchedmm(2, Bsize=4)/zygote/GPU/CUDA | 186915 ns | 193209 ns | 0.97 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7250 ns | 7292 ns | 0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5917 ns | 6083 ns | 0.97 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5958 ns | 5334 ns | 1.12 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10334 ns | 10000 ns | 1.03 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 33673 ns | 33620 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 225000 ns | 220479.5 ns | 1.02 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 220041 ns | 231958 ns | 0.95 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 229792 ns | 232041 ns | 0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 220125 ns | 220500 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 287431 ns | 311751 ns | 0.92 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3708 ns | 3708 ns | 1 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3667 ns | 3708 ns | 0.99 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3667 ns | 3709 ns | 0.99 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3709 ns | 3667 ns | 1.01 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA | 22815 ns | 22440 ns | 1.02 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 14459 ns | 14500 ns | 1.00 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 14417 ns | 14417 ns | 1 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 14375 ns | 14167 ns | 1.01 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 14208 ns | 14291 ns | 0.99 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA | 456593.5 ns | 468658 ns | 0.97 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 140625 ns | 95166 ns | 1.48 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 95270.5 ns | 138021 ns | 0.69 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 97229 ns | 99167 ns | 0.98 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 136833.5 ns | 122458 ns | 1.12 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 125090.5 ns | 125691 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1923208 ns | 1931875 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1928583.5 ns | 1954979 ns | 0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1922854.5 ns | 1946854 ns | 0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1940500.5 ns | 1923729.5 ns | 1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 921068 ns | 940251.5 ns | 0.98 |
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) | 869000 ns | 880500 ns | 0.99 |
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) | 818833.5 ns | 815125 ns | 1.00 |
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) | 1217542 ns | 1172292 ns | 1.04 |
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) | 938229 ns | 960167 ns | 0.98 |
lenet(28, 28, 1, 32)/forward/GPU/CUDA | 270679 ns | 270704 ns | 1.00 |
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) | 2817334 ns | 2803000 ns | 1.01 |
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) | 2503458 ns | 2526833 ns | 0.99 |
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) | 3351896 ns | 3361333 ns | 1.00 |
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) | 3415250 ns | 3405875 ns | 1.00 |
lenet(28, 28, 1, 32)/zygote/GPU/CUDA | 1555869 ns | 1569154 ns | 0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 16354 ns | 15146 ns | 1.08 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 15417 ns | 18000 ns | 0.86 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 21604 ns | 21666 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 15416 ns | 18125 ns | 0.85 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 140857.5 ns | 141811.5 ns | 0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 225833 ns | 217083 ns | 1.04 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 217062.5 ns | 229375 ns | 0.95 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 229688 ns | 257396 ns | 0.89 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 261833 ns | 215833 ns | 1.21 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 617305 ns | 635765.5 ns | 0.97 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 222458 ns | 219750 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 220354.5 ns | 221500 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 226020.5 ns | 226021 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 221146 ns | 223937.5 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 266148.5 ns | 270450 ns | 0.98 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 558417 ns | 509917 ns | 1.10 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 508937.5 ns | 557729 ns | 0.91 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 563833 ns | 549792 ns | 1.03 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 510625 ns | 555791 ns | 0.92 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1284278 ns | 1308245 ns | 0.98 |
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) | 327750 ns | 333479 ns | 0.98 |
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) | 333249.5 ns | 335541.5 ns | 0.99 |
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) | 389229 ns | 437333 ns | 0.89 |
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) | 297896 ns | 319417 ns | 0.93 |
batchedmm(16, Bsize=4)/forward/GPU/CUDA | 16594 ns | 16583 ns | 1.00 |
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) | 714667 ns | 715333 ns | 1.00 |
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) | 729250 ns | 730292 ns | 1.00 |
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) | 1018708 ns | 1025458.5 ns | 0.99 |
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) | 660938 ns | 655792 ns | 1.01 |
batchedmm(16, Bsize=4)/zygote/GPU/CUDA | 192223 ns | 193313 ns | 0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 17458.5 ns | 17625 ns | 0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 17667 ns | 17625 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 22542 ns | 20437.5 ns | 1.10 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 17833 ns | 18000 ns | 0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 143390.5 ns | 144711.5 ns | 0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 224250 ns | 216667 ns | 1.03 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 212209 ns | 224083 ns | 0.95 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 251666 ns | 226625 ns | 1.11 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 218708 ns | 223417 ns | 0.98 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 870740.5 ns | 903796 ns | 0.96 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6500 ns | 4625 ns | 1.41 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 4541 ns | 6750 ns | 0.67 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 7833 ns | 7438 ns | 1.05 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 4666 ns | 6625 ns | 0.70 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 181869 ns | 174159.5 ns | 1.04 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 11000 ns | 10437.5 ns | 1.05 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 10542 ns | 10750 ns | 0.98 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 10584 ns | 10770.5 ns | 0.98 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 11083 ns | 10833 ns | 1.02 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 1004419 ns | 1024421 ns | 0.98 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 3395.5 ns | 3646 ns | 0.93 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 3625 ns | 3334 ns | 1.09 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6792 ns | 5625 ns | 1.21 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 4125 ns | 3500 ns | 1.18 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 202818 ns | 231660 ns | 0.88 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7458 ns | 7708 ns | 0.97 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7625 ns | 7792 ns | 0.98 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7833 ns | 7625 ns | 1.03 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7750 ns | 7167 ns | 1.08 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 1021526.5 ns | 1037611.5 ns | 0.98 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 23737042 ns | 23838833 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 34849729 ns | 33990646 ns | 1.03 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 37801875 ns | 41585708 ns | 0.91 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 36279417 ns | 34896229 ns | 1.04 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 1853795 ns | 1839186 ns | 1.01 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 184527250 ns | 184662833 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 159658833 ns | 159634000 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 146432958.5 ns | 151746084 ns | 0.96 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 422145604.5 ns | 415075875 ns | 1.02 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 16509006 ns | 16506413 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 427907292 ns | 427351833 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 254738437 ns | 251624521 ns | 1.01 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 232050291.5 ns | 233926312.5 ns | 0.99 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 497717146 ns | 484091542 ns | 1.03 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 182792 ns | 181666 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 183000 ns | 183416.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 189042 ns | 186125 ns | 1.02 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 184208 ns | 183834 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 166826.5 ns | 173529.5 ns | 0.96 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 632937.5 ns | 587541 ns | 1.08 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 631145.5 ns | 600458 ns | 1.05 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 629416 ns | 632375 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 603979 ns | 631354 ns | 0.96 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 995462.5 ns | 1005977 ns | 0.99 |
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) | 3817645.5 ns | 3816041.5 ns | 1.00 |
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) | 3624583 ns | 3637833 ns | 1.00 |
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) | 3484416 ns | 3539646 ns | 0.98 |
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) | 5475291 ns | 5351396 ns | 1.02 |
batchedmm(128, Bsize=512)/forward/GPU/CUDA | 538100 ns | 554127 ns | 0.97 |
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) | 17364000 ns | 17372333 ns | 1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) | 17178979 ns | 17218458.5 ns | 1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) | 16580708.5 ns | 16979478.5 ns | 0.98 |
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) | 23174500 ns | 22177625 ns | 1.04 |
batchedmm(128, Bsize=512)/zygote/GPU/CUDA | 2617022.5 ns | 2616933 ns | 1.00 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 500 ns | 583 ns | 0.86 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 542 ns | 542 ns | 1 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 542 ns | 1.15 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 542 ns | 459 ns | 1.18 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 31794 ns | 32036 ns | 0.99 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 9500 ns | 9667 ns | 0.98 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 8709 ns | 9750 ns | 0.89 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9729.5 ns | 10125 ns | 0.96 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 9916.5 ns | 9291 ns | 1.07 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 262089 ns | 260858 ns | 1.00 |
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) | 499603584 ns | 506491042 ns | 0.99 |
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) | 432109771 ns | 428949104 ns | 1.01 |
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) | 428755896 ns | 474815000 ns | 0.90 |
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) | 676396458.5 ns | 671461979 ns | 1.01 |
vgg16(32, 32, 3, 128)/forward/GPU/CUDA | 12478713 ns | 12484614.5 ns | 1.00 |
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) | 2045189437.5 ns | 2043435104.5 ns | 1.00 |
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) | 1633734250 ns | 1631358667 ns | 1.00 |
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) | 1493383584 ns | 1546812271 ns | 0.97 |
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) | 2223823937.5 ns | 2216473375.5 ns | 1.00 |
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA | 49038762 ns | 49204869.5 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1639166 ns | 1642542 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1203250 ns | 1194625 ns | 1.01 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1395479 ns | 1380791 ns | 1.01 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2461958 ns | 2487084 ns | 0.99 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 219168 ns | 215546 ns | 1.02 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12694958.5 ns | 12711687.5 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 9921500 ns | 9927625 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9744938 ns | 9788604.5 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 18522333 ns | 18464437.5 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 2024843 ns | 1995889.5 ns | 1.01 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17671042 ns | 17669166.5 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14707250 ns | 14709437.5 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14647166.5 ns | 14807645.5 ns | 0.99 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21550271 ns | 21465708 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 26209 ns | 26250 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 26458 ns | 26250 ns | 1.01 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 26208 ns | 26291 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 26292 ns | 26167 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA | 24425 ns | 23873 ns | 1.02 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 66958 ns | 66917 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 67166 ns | 67333 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 67000 ns | 67083 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 67042 ns | 66833 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA | 390613 ns | 382426 ns | 1.02 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 203958 ns | 203834 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 209500 ns | 209542 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 209084 ns | 209584 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 199708 ns | 199584 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 26539.5 ns | 26132 ns | 1.02 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 612750 ns | 613833.5 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 634104 ns | 636667 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 675250 ns | 671166.5 ns | 1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 590333 ns | 628229.5 ns | 0.94 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 314844.5 ns | 308600 ns | 1.02 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 680396 ns | 671687.5 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 636917 ns | 645937.5 ns | 0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 650020.5 ns | 644791.5 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 634875 ns | 676334 ns | 0.94 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 132029.5 ns | 131667 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2232958 ns | 2241875 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2236166.5 ns | 2192250 ns | 1.02 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2233334 ns | 2297042 ns | 0.97 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2287187 ns | 2246249.5 ns | 1.02 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1125648.5 ns | 1114838 ns | 1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 17854 ns | 16791 ns | 1.06 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 17520.5 ns | 17500 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 23334 ns | 20958 ns | 1.11 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 18583 ns | 16770.5 ns | 1.11 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 144775 ns | 143001 ns | 1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 229875 ns | 230375 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 226666.5 ns | 231791.5 ns | 0.98 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 270583.5 ns | 266208 ns | 1.02 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 263854 ns | 260728.5 ns | 1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 959972 ns | 959584 ns | 1.00 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 500 ns | 500 ns | 1 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 459 ns | 542 ns | 0.85 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 583 ns | 542 ns | 1.08 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 542 ns | 500 ns | 1.08 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 23699 ns | 23163 ns | 1.02 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9834 ns | 9604.5 ns | 1.02 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 9833 ns | 10292 ns | 0.96 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 10000 ns | 10625 ns | 0.94 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 10250 ns | 9584 ns | 1.07 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 260024.5 ns | 255611 ns | 1.02 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5458.5 ns | 5416.5 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 6291 ns | 5750 ns | 1.09 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 7875 ns | 9458 ns | 0.83 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 7041.5 ns | 5708 ns | 1.23 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 189594.5 ns | 219432 ns | 0.86 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6834 ns | 7833 ns | 0.87 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7458 ns | 7750 ns | 0.96 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7750 ns | 7709 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7750 ns | 7000 ns | 1.11 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 744104.5 ns | 764584 ns | 0.97 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 2125 ns | 1959 ns | 1.08 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 2145.5 ns | 2083 ns | 1.03 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 2209 ns | 2417 ns | 0.91 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 2250 ns | 2208 ns | 1.02 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA | 18220 ns | 17893 ns | 1.02 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 6333 ns | 6875 ns | 0.92 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 6834 ns | 6542 ns | 1.04 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 6625 ns | 6583 ns | 1.01 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 6667 ns | 6291 ns | 1.06 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA | 313257 ns | 320459 ns | 0.98 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) | 748417 ns | 747709 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) | 746854.5 ns | 749833 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) | 749958 ns | 754999.5 ns | 0.99 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) | 753104 ns | 749375 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA | 22145 ns | 21357 ns | 1.04 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) | 793187.5 ns | 774854 ns | 1.02 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) | 777167 ns | 792687.5 ns | 0.98 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) | 818333 ns | 817042 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) | 814666 ns | 811166 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA | 296915 ns | 295013.5 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7333 ns | 7334 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6000 ns | 6000 ns | 1 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5875 ns | 5208.5 ns | 1.13 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10583 ns | 10166 ns | 1.04 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 32972 ns | 33519 ns | 0.98 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 257708 ns | 219666 ns | 1.17 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 227458 ns | 268125 ns | 0.85 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 277979.5 ns | 252000.5 ns | 1.10 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 226000 ns | 213562 ns | 1.06 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 320926.5 ns | 354278 ns | 0.91 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 12000 ns | 10875 ns | 1.10 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 10520.5 ns | 11833 ns | 0.89 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 13291 ns | 12770.5 ns | 1.04 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 11084 ns | 12000 ns | 0.92 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 209902.5 ns | 238132.5 ns | 0.88 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 24625 ns | 24708 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 24708 ns | 24584 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 24458 ns | 25292 ns | 0.97 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 25667 ns | 24500 ns | 1.05 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 1060096 ns | 1094067.5 ns | 0.97 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 106026958 ns | 106709834 ns | 0.99 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 116938875 ns | 116906583.5 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 120020687.5 ns | 127036729 ns | 0.94 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 118466791 ns | 117807000 ns | 1.01 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 2643754 ns | 2657653 ns | 0.99 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 393860625 ns | 392558792 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 366891542 ns | 365774917 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 423298521.5 ns | 431860937.5 ns | 0.98 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 490615500 ns | 483379250 ns | 1.01 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 15188440 ns | 15196086 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 758114312.5 ns | 758564875.5 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 580813833 ns | 761412666 ns | 0.76 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 744661000 ns | 748747542 ns | 0.99 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 956905604 ns | 765232583 ns | 1.25 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 7709 ns | 6625 ns | 1.16 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 7208 ns | 7334 ns | 0.98 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8250 ns | 9041.5 ns | 0.91 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 7667 ns | 8250 ns | 0.93 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 222535 ns | 231038.5 ns | 0.96 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 13750 ns | 14625 ns | 0.94 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 14417 ns | 14750 ns | 0.98 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 14625 ns | 14292 ns | 1.02 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14729.5 ns | 14542 ns | 1.01 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 1001840 ns | 1043294.5 ns | 0.96 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 8208 ns | 5875 ns | 1.40 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6792 ns | 7959 ns | 0.85 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 9417 ns | 9167 ns | 1.03 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 7291 ns | 6333 ns | 1.15 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 221347 ns | 228571 ns | 0.97 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 12250 ns | 12791 ns | 0.96 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 12542 ns | 13167 ns | 0.95 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13292 ns | 13375 ns | 0.99 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 13375 ns | 12333 ns | 1.08 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 753126 ns | 779066.5 ns | 0.97 |
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) | 341666 ns | 347625 ns | 0.98 |
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) | 340292 ns | 342625 ns | 0.99 |
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) | 397312.5 ns | 416812 ns | 0.95 |
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) | 287375 ns | 307083 ns | 0.94 |
batchedmm(2, Bsize=128)/forward/GPU/CUDA | 16899 ns | 17023 ns | 0.99 |
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) | 702916 ns | 710208.5 ns | 0.99 |
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) | 729771 ns | 732125 ns | 1.00 |
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) | 1023187.5 ns | 1032542 ns | 0.99 |
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) | 659396 ns | 653979.5 ns | 1.01 |
batchedmm(2, Bsize=128)/zygote/GPU/CUDA | 198658 ns | 200196.5 ns | 0.99 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 333 ns | 334 ns | 1.00 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 375 ns | 1 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 416 ns | 333 ns | 1.25 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 23239 ns | 23569 ns | 0.99 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6667 ns | 6375 ns | 1.05 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6583 ns | 6584 ns | 1.00 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6666 ns | 6834 ns | 0.98 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6500 ns | 6042 ns | 1.08 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 240107.5 ns | 241926 ns | 0.99 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 5750 ns | 5708 ns | 1.01 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 5750 ns | 5834 ns | 0.99 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 5916 ns | 5875 ns | 1.01 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 5875 ns | 5708 ns | 1.03 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 24262 ns | 24556.5 ns | 0.99 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 21625 ns | 21562.5 ns | 1.00 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 21625 ns | 22000 ns | 0.98 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 23834 ns | 21709 ns | 1.10 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 21750 ns | 21167 ns | 1.03 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 263977 ns | 265433.5 ns | 0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 189250 ns | 144917 ns | 1.31 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 186520.5 ns | 191292 ns | 0.98 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 149208 ns | 149333 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 184499.5 ns | 149250 ns | 1.24 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 166758.5 ns | 167659 ns | 0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1324770.5 ns | 1319292 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1331916 ns | 1331416 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1325208 ns | 1362958 ns | 0.97 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1369687.5 ns | 1326125 ns | 1.03 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1289191 ns | 1343729.5 ns | 0.96 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 24709 ns | 22250 ns | 1.11 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 22292 ns | 23791 ns | 0.94 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 28459 ns | 25875 ns | 1.10 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 22646 ns | 23666.5 ns | 0.96 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 280466 ns | 286115 ns | 0.98 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 176166.5 ns | 146125 ns | 1.21 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 131417 ns | 118500 ns | 1.11 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 177541.5 ns | 129833 ns | 1.37 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 131708 ns | 175792 ns | 0.75 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1404050 ns | 1461317 ns | 0.96 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 375 ns | 292 ns | 1.28 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 292 ns | 375 ns | 0.78 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 375 ns | 292 ns | 1.28 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 23025 ns | 23352 ns | 0.99 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6500 ns | 6334 ns | 1.03 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6417 ns | 6459 ns | 0.99 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6625 ns | 6709 ns | 0.99 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6625 ns | 6125 ns | 1.08 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 256062 ns | 258095.5 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6458 ns | 4625 ns | 1.40 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 4292 ns | 4125 ns | 1.04 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 7541 ns | 7625 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 5000 ns | 4895.5 ns | 1.02 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 245303.5 ns | 256357.5 ns | 0.96 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10542 ns | 9959 ns | 1.06 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 9916 ns | 10125 ns | 0.98 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10458 ns | 10333 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10375 ns | 10333 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 1323518 ns | 1358318.5 ns | 0.97 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1625 ns | 1625 ns | 1 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1625 ns | 1584 ns | 1.03 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1584 ns | 1625 ns | 0.97 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1625 ns | 1583 ns | 1.03 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA | 23190 ns | 23389 ns | 0.99 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 5667 ns | 5667 ns | 1 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 5667 ns | 5875 ns | 0.96 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 5875 ns | 6000 ns | 0.98 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 5917 ns | 5625 ns | 1.05 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA | 276691 ns | 275350.5 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 6810792 ns | 6780125 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 6374000 ns | 6371125 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 6503375 ns | 6531396 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 7605187.5 ns | 7625875 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 215181 ns | 214804 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 24053666.5 ns | 24015354 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 21276917 ns | 21285667 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 21121125 ns | 21085125 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 29932209 ns | 29769250 ns | 1.01 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 2101536 ns | 2112477.5 ns | 0.99 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 37203604.5 ns | 37264541.5 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 34199125 ns | 45538167 ns | 0.75 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 45694750 ns | 45665125 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 49486854 ns | 38235958 ns | 1.29 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6208 ns | 6208 ns | 1 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 6500 ns | 5958.5 ns | 1.09 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 8167 ns | 8750 ns | 0.93 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 6604 ns | 7500 ns | 0.88 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 225712.5 ns | 236550 ns | 0.95 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8084 ns | 8750 ns | 0.92 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8584 ns | 8375 ns | 1.02 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8375 ns | 8500 ns | 0.99 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8500 ns | 8958 ns | 0.95 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 1021276 ns | 1063848.5 ns | 0.96 |
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) | 1561583 ns | 1554084 ns | 1.00 |
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) | 1270125 ns | 1262375 ns | 1.01 |
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) | 1633937.5 ns | 1631958.5 ns | 1.00 |
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) | 2125417 ns | 2152375 ns | 0.99 |
lenet(28, 28, 1, 128)/forward/GPU/CUDA | 271546.5 ns | 277465 ns | 0.98 |
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) | 7898042 ns | 7881667 ns | 1.00 |
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) | 6597312.5 ns | 6612667 ns | 1.00 |
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) | 7213854.5 ns | 7276167 ns | 0.99 |
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) | 10525771 ns | 10468062.5 ns | 1.01 |
lenet(28, 28, 1, 128)/zygote/GPU/CUDA | 1821413.5 ns | 1876576 ns | 0.97 |
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) | 336125 ns | 346375 ns | 0.97 |
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) | 344083 ns | 348937.5 ns | 0.99 |
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) | 411833 ns | 423416.5 ns | 0.97 |
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) | 315250 ns | 336687 ns | 0.94 |
batchedmm(128, Bsize=4)/forward/GPU/CUDA | 46192.5 ns | 46390 ns | 1.00 |
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) | 748292 ns | 735208 ns | 1.02 |
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) | 784375 ns | 782458 ns | 1.00 |
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) | 1075167 ns | 1081666.5 ns | 0.99 |
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) | 770125 ns | 758458.5 ns | 1.02 |
batchedmm(128, Bsize=4)/zygote/GPU/CUDA | 238566 ns | 311011.5 ns | 0.77 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 397417 ns | 397375 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 287875 ns | 288250 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 288084 ns | 212583 ns | 1.36 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 751167 ns | 754104.5 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA | 43477 ns | 44494 ns | 0.98 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 667584 ns | 675959 ns | 0.99 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 532083 ns | 532333 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 532417 ns | 474000 ns | 1.12 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 974750 ns | 973417 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA | 189406.5 ns | 189847 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 657500 ns | 599375 ns | 1.10 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 675000 ns | 650333 ns | 1.04 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 680771 ns | 660375 ns | 1.03 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 645958 ns | 655833.5 ns | 0.98 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 131425.5 ns | 132321 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2458959 ns | 2469395.5 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2458479.5 ns | 2363959 ns | 1.04 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2467396 ns | 2519875.5 ns | 0.98 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2524354.5 ns | 2465916 ns | 1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1178122 ns | 1345989 ns | 0.88 |
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) | 342958 ns | 345583 ns | 0.99 |
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) | 346625 ns | 342834 ns | 1.01 |
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) | 394000 ns | 416375 ns | 0.95 |
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) | 290792 ns | 306979.5 ns | 0.95 |
batchedmm(2, Bsize=32)/forward/GPU/CUDA | 15424 ns | 16330 ns | 0.94 |
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) | 704437.5 ns | 703104 ns | 1.00 |
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) | 724583.5 ns | 729708 ns | 0.99 |
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) | 1022792 ns | 1026937.5 ns | 1.00 |
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) | 653041.5 ns | 645959 ns | 1.01 |
batchedmm(2, Bsize=32)/zygote/GPU/CUDA | 194499 ns | 199885.5 ns | 0.97 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1459417 ns | 1460542 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1497667 ns | 1500583 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1499542 ns | 1491791 ns | 1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1445708 ns | 1441917 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 40504 ns | 41671 ns | 0.97 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5115041.5 ns | 5133500 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5286250 ns | 5293250 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5287729 ns | 5309521 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5025229 ns | 4977042 ns | 1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 195999 ns | 197710 ns | 0.99 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3708 ns | 3708 ns | 1 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3667 ns | 3708 ns | 0.99 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3667 ns | 3709 ns | 0.99 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3708 ns | 3666 ns | 1.01 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA | 34443 ns | 33362 ns | 1.03 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 15167 ns | 15125 ns | 1.00 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 15083 ns | 15500 ns | 0.97 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 15334 ns | 15125 ns | 1.01 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 15083 ns | 15083 ns | 1 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA | 366879.5 ns | 381216.5 ns | 0.96 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 71208 ns | 71375 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 71291 ns | 71208 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 71125 ns | 71583 ns | 0.99 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 71583 ns | 71208 ns | 1.01 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA | 113688 ns | 113946.5 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 317167 ns | 319833 ns | 0.99 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 317666 ns | 319208 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 317583 ns | 327125 ns | 0.97 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 322750 ns | 318375 ns | 1.01 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA | 194647 ns | 195156 ns | 1.00 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 1000 ns | 959 ns | 1.04 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 959 ns | 1042 ns | 0.92 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 1083 ns | 1083 ns | 1 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 1083 ns | 1000 ns | 1.08 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 23307.5 ns | 23764 ns | 0.98 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8166 ns | 8084 ns | 1.01 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8041.5 ns | 8542 ns | 0.94 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8542 ns | 8416 ns | 1.01 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8250 ns | 7833.5 ns | 1.05 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 258638.5 ns | 263039 ns | 0.98 |
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) | 465979.5 ns | 472416 ns | 0.99 |
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) | 463208 ns | 468125 ns | 0.99 |
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) | 554208.5 ns | 549250 ns | 1.01 |
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) | 539083.5 ns | 550333 ns | 0.98 |
batchedmm(128, Bsize=32)/forward/GPU/CUDA | 129729 ns | 128804.5 ns | 1.01 |
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) | 1382208 ns | 1375292 ns | 1.01 |
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) | 1370959 ns | 1372208 ns | 1.00 |
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) | 1642854.5 ns | 1633459 ns | 1.01 |
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) | 1636459 ns | 1580500 ns | 1.04 |
batchedmm(128, Bsize=32)/zygote/GPU/CUDA | 273889 ns | 274739 ns | 1.00 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 375 ns | 416 ns | 0.90 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 416 ns | 0.90 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 375 ns | 292 ns | 1.28 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 31083 ns | 31574 ns | 0.98 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6542 ns | 6458 ns | 1.01 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6458 ns | 6875 ns | 0.94 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6666 ns | 6708 ns | 0.99 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6375 ns | 6000 ns | 1.06 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 260736 ns | 261869 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1729458 ns | 1727625 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1779500 ns | 1783958 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1782166 ns | 1730916 ns | 1.03 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1727271 ns | 1729333 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 168880.5 ns | 168455 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4355833 ns | 4352625 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4361417 ns | 4372937.5 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4372145.5 ns | 4412458 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4411083 ns | 4358042 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1156995.5 ns | 1234725 ns | 0.94 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 7625 ns | 6709 ns | 1.14 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 7083 ns | 6584 ns | 1.08 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 6750 ns | 7417 ns | 0.91 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 7167 ns | 6542 ns | 1.10 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA | 19813 ns | 19619.5 ns | 1.01 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 70708 ns | 51083 ns | 1.38 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 51167 ns | 35625 ns | 1.44 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 51917 ns | 49875 ns | 1.04 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 52937 ns | 70208 ns | 0.75 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA | 208382 ns | 211156 ns | 0.99 |
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) | 347750 ns | 354291 ns | 0.98 |
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) | 345959 ns | 347584 ns | 1.00 |
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) | 409333 ns | 432708 ns | 0.95 |
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) | 300874.5 ns | 319521.5 ns | 0.94 |
batchedmm(2, Bsize=512)/forward/GPU/CUDA | 18451.5 ns | 18053 ns | 1.02 |
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) | 719479.5 ns | 719104 ns | 1.00 |
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) | 731083.5 ns | 735979 ns | 0.99 |
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) | 1027042 ns | 1039063 ns | 0.99 |
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) | 676020.5 ns | 672750 ns | 1.00 |
batchedmm(2, Bsize=512)/zygote/GPU/CUDA | 329987.5 ns | 343671.5 ns | 0.96 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 75334 ns | 75417 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 75354.5 ns | 75333 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 75459 ns | 75708 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 75625 ns | 74709 ns | 1.01 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA | 46972 ns | 46983 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 324916 ns | 324417 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 327208 ns | 327000 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 324041 ns | 334917 ns | 0.97 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 334959 ns | 324083 ns | 1.03 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA | 211223 ns | 207721.5 ns | 1.02 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1485291 ns | 1486334 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1526250 ns | 1527500 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1527875 ns | 1519000 ns | 1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1468708 ns | 1466541 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 51383 ns | 51914 ns | 0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5122729.5 ns | 5119333.5 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5285958 ns | 5300396 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5294688 ns | 5303708 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5003000 ns | 4989375 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 201671.5 ns | 201413 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 28250 ns | 28167 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 28125 ns | 28166 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 28208 ns | 28333 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 28208 ns | 28208 ns | 1 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA | 24621 ns | 24393 ns | 1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 66458 ns | 66542 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 66625 ns | 66292 ns | 1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 66292 ns | 66542 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 66500 ns | 66584 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA | 506316 ns | 530998 ns | 0.95 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) | 1489750 ns | 1493250 ns | 1.00 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) | 1142833 ns | 1120167 ns | 1.02 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) | 1115125 ns | 947625 ns | 1.18 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) | 2175167 ns | 2256500 ns | 0.96 |
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA | 574530.5 ns | 570331 ns | 1.01 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) | 3074667 ns | 3075542 ns | 1.00 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) | 2740583.5 ns | 2732479 ns | 1.00 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) | 2749271.5 ns | 2643125 ns | 1.04 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) | 3881042 ns | 3814770.5 ns | 1.02 |
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA | 1973310 ns | 2010818 ns | 0.98 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) | 8822500 ns | 8738917 ns | 1.01 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) | 8769542 ns | 8777854.5 ns | 1.00 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) | 8776625 ns | 8781417 ns | 1.00 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) | 6375666 ns | 6360687.5 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 84125 ns | 81146 ns | 1.04 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 80084 ns | 81708.5 ns | 0.98 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 86584 ns | 83708 ns | 1.03 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83333 ns | 87687.5 ns | 0.95 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 191975 ns | 192383.5 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2020083 ns | 2016791.5 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2017792 ns | 2012708 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2026459 ns | 2041312 ns | 0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2029041.5 ns | 2015208 ns | 1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 735374 ns | 798885.5 ns | 0.92 |
This comment was automatically generated by workflow using github-action-benchmark.
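
For reading the table above: each Ratio value matches the Current time divided by the Previous time, rounded to two decimals, so a value above 1 means the current commit measured slower. Below is a minimal Python sketch of that arithmetic using four rows copied from the table. The `rows` list, the `ratio` helper, and the 1.25 cutoff are illustrative assumptions and are not taken from github-action-benchmark or from this workflow's configuration.

```python
# Rows copied from the benchmark table: (benchmark, current ns, previous ns).
rows = [
    ("bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s)", 70708, 51083),
    ("bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s)", 51167, 35625),
    ("bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s)", 52937, 70208),
    ("bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA", 208382, 211156),
]


def ratio(current_ns, previous_ns):
    """Current time over previous time; above 1 means the current run was slower."""
    return round(current_ns / previous_ns, 2)


# Hypothetical cutoff for "worth a second look"; not the workflow's alert setting.
THRESHOLD = 1.25

# Keep only the rows whose ratio exceeds the cutoff.
slower = [(name, ratio(cur, prev)) for name, cur, prev in rows
          if ratio(cur, prev) > THRESHOLD]

for name, r in slower:
    print(f"{r:.2f}  {name}")
```

With these rows, only the 2-thread and 4-thread CPU zygote entries exceed the cutoff, while the 1-thread entry moves in the opposite direction, which suggests run-to-run noise may account for at least part of the difference.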