test: try re-enabling enzyme testing on 0.13.14 #1042

Open: avik-pal wants to merge 1 commit into base: main

Conversation

avik-pal (Member) commented Nov 6, 2024

No description provided.

github-actions bot (Contributor) commented Nov 6, 2024

Benchmark Results (ASV)

| Benchmark | main | dbbf69e... | main/dbbf69ea5cd1f8... |
|---|---|---|---|
| basics/overhead | 0.124 ± 0.0013 μs | 0.127 ± 0.0011 μs | 0.97 |
| time_to_load | 1.17 ± 0.02 s | 1.19 ± 0.012 s | 0.976 |

Benchmark Plots

A plot of the benchmark results has been uploaded as an artifact to the workflow run for this PR.
Go to "Actions"->"Benchmark a pull request"->[the most recent run]->"Artifacts" (at the bottom).
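
The last column of the ASV table above looks like the `main` timing divided by the timing for commit `dbbf69e...`. That reading is an assumption based on the column header, not something the bot states. A minimal Python sketch of the arithmetic follows; `parse_measurement` and `ratio` are hypothetical helper names introduced only for illustration. Because the table shows rounded timings, recomputing from them only approximates the ratios the bot reports.

```python
def parse_measurement(text):
    """Split a string like '0.124 ± 0.0013 μs' into (value, uncertainty, unit)."""
    value, rest = text.split("±")
    uncertainty, unit = rest.split()
    return float(value), float(uncertainty), unit


def ratio(main_text, pr_text):
    """Return main / PR for two measurements (assumed meaning of the ratio column)."""
    main_value, _, main_unit = parse_measurement(main_text)
    pr_value, _, pr_unit = parse_measurement(pr_text)
    if main_unit != pr_unit:
        raise ValueError(f"unit mismatch: {main_unit} vs {pr_unit}")
    return main_value / pr_value


# Rounded timings copied from the table above: (main, dbbf69e...).
rows = {
    "basics/overhead": ("0.124 ± 0.0013 μs", "0.127 ± 0.0011 μs"),
    "time_to_load": ("1.17 ± 0.02 s", "1.19 ± 0.012 s"),
}

for name, (main_text, pr_text) in rows.items():
    print(f"{name}: main/PR = {ratio(main_text, pr_text):.3f}")
```

Under this reading, a value below 1 means the PR commit measured slower than `main`. Both differences here are comparable in size to the reported ± uncertainties.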

avik-pal (Member, Author) commented Nov 6, 2024

Some of the tests in LuxLib also need to be re-enabled manually.

github-actions bot (Contributor) left a comment

Lux Benchmarks

Benchmark suite Current: 239783f Previous: 900c21c Ratio
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 4084 ns 4270.5 ns 0.96
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 4542 ns 4000 ns 1.14
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6166.5 ns 5875 ns 1.05
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 4625 ns 4895.5 ns 0.94
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 61042.5 ns 59833 ns 1.02
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 10812.5 ns 10375 ns 1.04
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9958 ns 9958 ns 1
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 10167 ns 10792 ns 0.94
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 10666 ns 10125 ns 1.05
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 426414 ns 422438 ns 1.01
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) 1041 ns 1083 ns 0.96
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) 3125 ns 1000 ns 3.13
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) 1334 ns 1417 ns 0.94
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) 1292 ns 1125 ns 1.15
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA 18161 ns 18109 ns 1.00
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 4125 ns 4166 ns 0.99
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 3958 ns 4125 ns 0.96
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 4209 ns 4187.5 ns 1.01
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 4208 ns 4042 ns 1.04
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA 109971 ns 109209 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57292 ns 57645.5 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46667 ns 47000 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46375 ns 38125 ns 1.22
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83666 ns 82084 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 36679 ns 37455 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2034750 ns 1973687 ns 1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2090625.5 ns 2089416 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1873875 ns 2085625 ns 0.90
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2023041 ns 1985813 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 194726.5 ns 195917 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 149333 ns 146416.5 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 144312.5 ns 147020.5 ns 0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 147584 ns 145667 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 143708 ns 145604.5 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 167008 ns 166391 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1121958 ns 1129209 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1117416.5 ns 1126375 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1106854 ns 1147667 ns 0.96
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1120042 ns 1104209 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 526749 ns 521058.5 ns 1.01
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 3500 ns 3416.5 ns 1.02
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 3834 ns 3333 ns 1.15
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 7083 ns 6333 ns 1.12
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 3750 ns 3250 ns 1.15
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 70801 ns 66594 ns 1.06
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9395.5 ns 8792 ns 1.07
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9375 ns 9291 ns 1.01
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 9500 ns 9250 ns 1.03
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 9959 ns 9292 ns 1.07
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 496293.5 ns 493812 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 15270.5 ns 14750 ns 1.04
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 16125 ns 15458 ns 1.04
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 19042 ns 19167 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 15666.5 ns 16437.5 ns 0.95
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 52891 ns 53833 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 213709 ns 215416.5 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 226958 ns 213208.5 ns 1.06
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 214875 ns 214271 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 214500 ns 227104 ns 0.94
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 269454 ns 271460 ns 0.99
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) 458 ns 542 ns 0.85
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) 583 ns 625 ns 0.93
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) 750 ns 792 ns 0.95
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) 667 ns 583 ns 1.14
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA 17740 ns 17470 ns 1.02
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 1542 ns 1750 ns 0.88
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 1458 ns 1417 ns 1.03
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 1542 ns 1709 ns 0.90
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 1542 ns 1645.5 ns 0.94
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA 101893 ns 101826.5 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7292 ns 7250 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5792 ns 5916 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5792 ns 5292 ns 1.09
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10584 ns 10000 ns 1.06
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 23410 ns 23857.5 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 225104 ns 226895.5 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 232958 ns 230375 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 229417 ns 231584 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 222145.5 ns 258625 ns 0.86
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 167116 ns 167659 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) 3875 ns 3875 ns 1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) 3875 ns 3875 ns 1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) 3875 ns 3916 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) 3916 ns 3833 ns 1.02
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA 23806 ns 23468 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16750 ns 16750 ns 1
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16667 ns 17042 ns 0.98
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16875 ns 17000 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16750 ns 16625 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA 162865 ns 160597 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 577167 ns 572166 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 599291 ns 575000 ns 1.04
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 576875 ns 587458 ns 0.98
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 580250 ns 578334 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA 112788 ns 113397 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 1418584 ns 1421708 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1445541.5 ns 1420125 ns 1.02
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1419458 ns 1430083 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 1427395.5 ns 1413292 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA 213449 ns 209669.5 ns 1.02
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) 1072354 ns 1074458 ns 1.00
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) 961396 ns 958250.5 ns 1.00
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) 1354458 ns 1334396 ns 1.02
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) 1292937.5 ns 1310875 ns 0.99
lenet(28, 28, 1, 64)/forward/GPU/CUDA 272319.5 ns 269120.5 ns 1.01
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) 5988583 ns 5769437 ns 1.04
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) 4525584 ns 4470625 ns 1.01
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) 4903583 ns 4941021 ns 0.99
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) 5547917 ns 5552042 ns 1.00
lenet(28, 28, 1, 64)/zygote/GPU/CUDA 1062959 ns 1066489 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) 500 ns 500 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) 500 ns 542 ns 0.92
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) 542 ns 542 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) 583 ns 500 ns 1.17
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA 23622 ns 23585 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2125 ns 2083 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2125 ns 2167 ns 0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2166 ns 2250 ns 0.96
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2208 ns 2125 ns 1.04
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA 175399.5 ns 169900 ns 1.03
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5833.5 ns 4084 ns 1.43
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 4833.5 ns 6250 ns 0.77
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 7291 ns 7209 ns 1.01
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 4709 ns 6125 ns 0.77
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 65926.5 ns 64199 ns 1.03
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 11875 ns 11083 ns 1.07
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 11792 ns 11625 ns 1.01
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 12084 ns 12000 ns 1.01
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 11417 ns 10917 ns 1.05
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 449188 ns 446167.5 ns 1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7458 ns 6042 ns 1.23
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7000 ns 7042 ns 0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8083 ns 8833 ns 0.92
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6875 ns 7250 ns 0.95
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 50587 ns 51074.5 ns 0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 16500 ns 17292 ns 0.95
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 18291 ns 18334 ns 1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 18167 ns 18083 ns 1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 17709 ns 17229.5 ns 1.03
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 297864.5 ns 299895.5 ns 0.99
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 500 ns 459 ns 1.09
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 500 ns 542 ns 0.92
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 625 ns 542 ns 1.15
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 583 ns 500 ns 1.17
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 32016 ns 32630 ns 0.98
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 8833 ns 8458 ns 1.04
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 8833.5 ns 9041 ns 0.98
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9541 ns 9166 ns 1.04
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 8959 ns 8459 ns 1.06
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 156346 ns 158907 ns 0.98
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) 64625 ns 64625 ns 1
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) 64542 ns 64250 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) 64542 ns 65000 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) 65042 ns 64667 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA 111541.5 ns 111460 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 292625 ns 289667 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 280333 ns 279750 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 277312.5 ns 289625 ns 0.96
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 290709 ns 281250 ns 1.03
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA 186964 ns 184453.5 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) 3370271 ns 3347125 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) 3013792 ns 3015520.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) 3016250 ns 2792979 ns 1.08
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) 4038791.5 ns 4064520.5 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA 573610.5 ns 588037 ns 0.98
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) 7604583 ns 7500166 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) 7463541 ns 7470229.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) 7455125 ns 7393937.5 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) 8289167 ns 8209000 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA 1371280 ns 1331630 ns 1.03
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) 18784458 ns 19529541 ns 0.96
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) 19115209 ns 19142959 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) 19129208 ns 19022708 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) 15938333 ns 15703750 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23481833 ns 23617083 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 34089291.5 ns 33598208 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 37198333 ns 41100666 ns 0.91
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 35465499.5 ns 35022333 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1848724 ns 1855178.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 190170125 ns 189352250 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 164549583.5 ns 163568208 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 152546334 ns 158452896 ns 0.96
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 449539583 ns 438607167 ns 1.02
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 13896040 ns 13925600.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 290337958 ns 287704167 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 337255125 ns 337952937.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 298835500 ns 291466708 ns 1.03
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 410982833 ns 395696000 ns 1.04
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 24167 ns 21334 ns 1.13
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 23208 ns 24375 ns 0.95
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 25917 ns 25771 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 22042 ns 23584 ns 0.93
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 96595 ns 95861 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 104041 ns 103625 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 103625 ns 103708 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 104916 ns 104625 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 104708 ns 103479.5 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 511860.5 ns 510517.5 ns 1.00
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6750 ns 5750 ns 1.17
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6250 ns 7208 ns 0.87
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8083.5 ns 7666.5 ns 1.05
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6291 ns 7166 ns 0.88
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 68925 ns 68604 ns 1.00
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 15125 ns 14708 ns 1.03
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 15750 ns 15916 ns 0.99
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 16208 ns 16666 ns 0.97
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 15458 ns 14667 ns 1.05
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 484262 ns 483804.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3013771 ns 2876500 ns 1.05
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2072375 ns 2063833 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2277479 ns 2288208 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4847333 ns 4870416 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA 587397 ns 587700 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 23555625 ns 23421375 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 17948833 ns 17990750 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 17933229 ns 18312792 ns 0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 35848896 ns 35646292 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 3108253 ns 3104605 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 33244416.5 ns 33240625 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 27673458 ns 27662417 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 27373125 ns 27837459 ns 0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 42417062.5 ns 41788833 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 77250 ns 72083 ns 1.07
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 72541 ns 78729 ns 0.92
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 76417 ns 75729.5 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 73125 ns 72459 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 100621 ns 100762.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 292708 ns 204458 ns 1.43
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 308687.5 ns 219041 ns 1.41
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 207417 ns 320458 ns 0.65
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 215916.5 ns 205312.5 ns 1.05
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 544548.5 ns 541454.5 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 12667 ns 11333 ns 1.12
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 11833 ns 12416 ns 0.95
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 13667 ns 13834 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 12333 ns 13125 ns 0.94
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 71416 ns 69856.5 ns 1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 27083 ns 26520.5 ns 1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 27042 ns 27458 ns 0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 27708 ns 28291 ns 0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 26833 ns 26500 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 479407 ns 473341 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 13020.5 ns 11833 ns 1.10
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 12500 ns 12750 ns 0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 14541 ns 14333 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 12687.5 ns 13375 ns 0.95
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 51771 ns 51587 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 26500 ns 26375 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 25583 ns 26583 ns 0.96
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 26416 ns 26666 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 26417 ns 26417 ns 1
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 301146.5 ns 302777.5 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 180292 ns 178666.5 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 180542 ns 180292 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 185667 ns 184416.5 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 179958 ns 179709 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 55412 ns 55677 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 586395.5 ns 591146.5 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 587271 ns 588583 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 583750 ns 593062 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 583562.5 ns 582708.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 283263 ns 285027 ns 0.99
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7083 ns 5667 ns 1.25
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 5958 ns 7167 ns 0.83
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8187.5 ns 7895.5 ns 1.04
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 5959 ns 7291 ns 0.82
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 70367 ns 69657.5 ns 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14500 ns 14167 ns 1.02
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14459 ns 14958 ns 0.97
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 15417 ns 15854.5 ns 0.97
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14250 ns 14583 ns 0.98
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 466069 ns 460443 ns 1.01
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) 1219208 ns 1194208.5 ns 1.02
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) 1347625 ns 1216792 ns 1.11
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) 1286250 ns 1262604 ns 1.02
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) 1314000 ns 1318166.5 ns 1.00
batchedmm(512, Bsize=4)/forward/GPU/CUDA 301441 ns 301559 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) 4137938 ns 4098416 ns 1.01
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) 4337750 ns 4352937.5 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) 4589792 ns 4631875 ns 0.99
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) 4675291 ns 4436562.5 ns 1.05
batchedmm(512, Bsize=4)/zygote/GPU/CUDA 1048315 ns 1042661.5 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1833 ns 1750 ns 1.05
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1792 ns 1833 ns 0.98
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1833 ns 1834 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1833 ns 1875 ns 0.98
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA 23638.5 ns 23523 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 4958 ns 4792 ns 1.03
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 4834 ns 4875 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 4875 ns 4916 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 4916 ns 4875 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA 189956.5 ns 187370 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6375 ns 5500 ns 1.16
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6166.5 ns 6334 ns 0.97
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 7917 ns 8604 ns 0.92
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 6291.5 ns 7292 ns 0.86
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 53487.5 ns 54466 ns 0.98
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 11292 ns 10958 ns 1.03
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10834 ns 11792 ns 0.92
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 11709 ns 11708.5 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 11458 ns 11166 ns 1.03
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 327436.5 ns 330839 ns 0.99
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) 292 ns 333 ns 0.88
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) 333 ns 333 ns 1
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA 23234.5 ns 22873.5 ns 1.02
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2667 ns 2708 ns 0.98
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2709 ns 2959 ns 0.92
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2959 ns 3042 ns 0.97
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2875 ns 2750 ns 1.05
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA 160479.5 ns 157537.5 ns 1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 14208 ns 10750 ns 1.32
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 11709 ns 13708 ns 0.85
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 15833 ns 14958 ns 1.06
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 11854 ns 14583 ns 0.81
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 55512.5 ns 55574.5 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 25062.5 ns 25209 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 24167 ns 25250 ns 0.96
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 25416 ns 25375 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 24917 ns 24979.5 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 288292 ns 292656 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4166 ns 4208 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4125 ns 4125 ns 1
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4167 ns 4167 ns 1
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4208 ns 4167 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA 24783 ns 24774 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16209 ns 16333 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16250 ns 16125 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16208 ns 16125 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16125 ns 16084 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA 197133 ns 195031.5 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5750 ns 5708 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 5667 ns 5750 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 5750 ns 5750 ns 1
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5750 ns 5709 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 33101 ns 33326 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 20916 ns 21125 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 20750 ns 20875 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 21542 ns 21583 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 20833 ns 21500 ns 0.97
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 175254.5 ns 175195.5 ns 1.00
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) 399125 ns 415708 ns 0.96
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) 370396 ns 376667 ns 0.98
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) 496375 ns 471499.5 ns 1.05
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) 503625 ns 523500 ns 0.96
batchedmm(16, Bsize=512)/forward/GPU/CUDA 66515.5 ns 66680.5 ns 1.00
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) 991583.5 ns 924750.5 ns 1.07
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) 871520.5 ns 849291 ns 1.03
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) 1231125 ns 1217521 ns 1.01
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) 1330834 ns 1302292 ns 1.02
batchedmm(16, Bsize=512)/zygote/GPU/CUDA 189653 ns 189339 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 80500 ns 79792 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 79416 ns 82667 ns 0.96
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 85042 ns 84208 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 80708 ns 82833 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 192887 ns 193132 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1919750 ns 1917625.5 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1915625 ns 1915292 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1925188 ns 1940917 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1929833 ns 1896541 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 390355 ns 395963 ns 0.99
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) 292 ns 333 ns 0.88
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA 21755 ns 21798 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 1792 ns 1792 ns 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 1791 ns 1875 ns 0.96
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 1833 ns 1834 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 1833 ns 1792 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA 167888 ns 167505 ns 1.00
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 7333.5 ns 5834 ns 1.26
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 7125 ns 7500 ns 0.95
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 9292 ns 9958 ns 0.93
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 7292 ns 6875 ns 1.06
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 55708 ns 58244.5 ns 0.96
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 9375 ns 9375 ns 1
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 8833 ns 9333 ns 0.95
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9208 ns 9354.5 ns 0.98
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9500 ns 9625 ns 0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 296941.5 ns 302935 ns 0.98
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 120225916.5 ns 119443416.5 ns 1.01
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 173999042 ns 173896250 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 147946708 ns 155811625 ns 0.95
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 104239875 ns 108054541 ns 0.96
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5472185 ns 5469386 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 615397103.5 ns 616746166.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 556088583 ns 555745625 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 451781708 ns 468855125 ns 0.96
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 773212500 ns 760571396 ns 1.02
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 34986769 ns 34956216 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 652621166 ns 648663875 ns 1.01
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 666689625 ns 664591146 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 587828354 ns 601178041.5 ns 0.98
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 742101875 ns 746069334 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 59458 ns 59458 ns 1
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 47625 ns 47083 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 47875 ns 39166 ns 1.22
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 84208 ns 83208 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 37180 ns 37582 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1716167 ns 1926708 ns 0.89
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1973500 ns 1983042 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1980542 ns 1986937.5 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1912021 ns 1850250 ns 1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 172659 ns 173017.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 267500 ns 265187.5 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 268541 ns 267959 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 285124.5 ns 276771 ns 1.03
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 267208 ns 266917 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 122124.5 ns 128834.5 ns 0.95
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 675208.5 ns 604083 ns 1.12
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 686125 ns 692833.5 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 692625 ns 705709 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 595813 ns 590291.5 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 668007 ns 683429 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 2110520.5 ns 2195333 ns 0.96
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 2200312.5 ns 2225625 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 2218999.5 ns 2230583 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 2164062.5 ns 2183333 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 132950.5 ns 133325.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5484729 ns 5480833 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5495166.5 ns 5508958 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5513833 ns 5585895.5 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5576458 ns 5490125 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 723408 ns 766206 ns 0.94
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 643583 ns 646750 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 660084 ns 660250 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 644334 ns 642917 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 655417 ns 647375 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA 46996 ns 47306 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 1817209 ns 1828875 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1755334 ns 1721042 ns 1.02
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1724416.5 ns 1665209 ns 1.04
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 2099167 ns 2097000 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA 223477 ns 223896.5 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 58167 ns 58667 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46042 ns 47750 ns 0.96
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 45583 ns 38958 ns 1.17
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 85292 ns 82750 ns 1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 28503.5 ns 29191 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1997875 ns 2029083.5 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2084292 ns 2091166 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2093020.5 ns 2107249.5 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2018417 ns 1994854.5 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 188016.5 ns 190986 ns 0.98
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 13344438 ns 13371291 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 12452083 ns 12436583.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 12494604 ns 12675625 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 15015250.5 ns 15146959 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 517943 ns 517535.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 47201916.5 ns 47259416 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 41812125 ns 41746209 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 40768708.5 ns 41384750 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 59301625 ns 58440500 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 3218709 ns 3203835 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 74145812.5 ns 73984667 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 68216667 ns 91223791.5 ns 0.75
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 90449292 ns 90609938 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 99506521 ns 77234000 ns 1.29
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 58750 ns 59000 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 47291 ns 47417 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 47167 ns 38917 ns 1.21
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 84500 ns 81125 ns 1.04
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 46723 ns 47741 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1914208.5 ns 1911646 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1974166 ns 1970541 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1975625 ns 1976417 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1910312 ns 1882083 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 190300.5 ns 195868.5 ns 0.97
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 292 ns 292 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 333 ns 375 ns 0.89
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 416 ns 375 ns 1.11
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 416 ns 333 ns 1.25
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 31677 ns 32615 ns 0.97
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6125 ns 6500 ns 0.94
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6125 ns 6375 ns 0.96
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6625 ns 6750 ns 0.98
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6584 ns 6375 ns 1.03
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 168934.5 ns 176818 ns 0.96
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) 250 ns 292 ns 0.86
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) 250 ns 292 ns 0.86
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) 291 ns 292 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) 292 ns 250 ns 1.17
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA 32058 ns 32102 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 2625 ns 2625 ns 1
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 2625 ns 2875 ns 0.91
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 2834 ns 2916 ns 0.97
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 2791 ns 2625 ns 1.06
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA 159386 ns 164236.5 ns 0.97
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 287071478.5 ns 286096229 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 341626334 ns 339570541 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 313518479.5 ns 321242167 ns 0.98
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 275420708 ns 271493208 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 7107755 ns 7111512 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 997227041 ns 987492667 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 942763125 ns 939040416 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 859332667 ns 868433209 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 1180396417 ns 1162204042 ns 1.02
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 34002052 ns 34040446 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1309634583.5 ns 1310851000.5 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 1365354146 ns 1685402625 ns 0.81
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 1654700958 ns 1648347125 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 1647288250 ns 1310788750 ns 1.26
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1416125 ns 1412625 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1460020.5 ns 1412041.5 ns 1.03
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1415625 ns 1424625 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1423417 ns 1408334 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 127102 ns 128501 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5015083 ns 5028875 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5026958.5 ns 5030104 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5019812.5 ns 5062042 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5047500 ns 5014021 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 494309 ns 597004.5 ns 0.83
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) 168408291 ns 168008834 ns 1.00
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) 132051562.5 ns 130299417 ns 1.01
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) 129591875 ns 148283479 ns 0.87
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) 159716042 ns 161948354 ns 0.99
vgg16(32, 32, 3, 32)/forward/GPU/CUDA 4883020 ns 5052268 ns 0.97
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) 830365834 ns 662817209 ns 1.25
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) 642499125 ns 492884417 ns 1.30
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) 508881375 ns 507367709 ns 1.00
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) 844604875 ns 678320708 ns 1.25
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA 16052527 ns 17294527 ns 0.93
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) 8935625 ns 8884604 ns 1.01
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) 8710770.5 ns 8801959 ns 0.99
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) 7873916.5 ns 8221541.5 ns 0.96
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) 10411083 ns 10127167 ns 1.03
batchedmm(512, Bsize=32)/forward/GPU/CUDA 1593477 ns 1611762 ns 0.99
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) 36657875 ns 36027125 ns 1.02
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) 36688708 ns 36933063 ns 0.99
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) 33442646 ns 34547750 ns 0.97
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) 40180854 ns 38824854 ns 1.03
batchedmm(512, Bsize=32)/zygote/GPU/CUDA 6470804 ns 6452267 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) 47416.5 ns 47375 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) 47500 ns 47250 ns 1.01
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) 47792 ns 47542 ns 1.01
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) 47459 ns 47333 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA 19446 ns 19020 ns 1.02
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) 50291 ns 50312.5 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) 53000 ns 50500 ns 1.05
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) 50500 ns 50958.5 ns 0.99
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) 50500 ns 50333 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA 178483 ns 226580 ns 0.79
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 7375 ns 6542 ns 1.13
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 7416 ns 7187.5 ns 1.03
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 8667 ns 9083 ns 0.95
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 7292 ns 8625 ns 0.85
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 79017 ns 117383.5 ns 0.67
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10292 ns 9625 ns 1.07
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9833 ns 10208 ns 0.96
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10959 ns 10333.5 ns 1.06
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10208 ns 10209 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 488962 ns 723908.5 ns 0.68
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 8250 ns 6083 ns 1.36
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6667 ns 8250 ns 0.81
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 9208 ns 9417 ns 0.98
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 6687.5 ns 8375 ns 0.80
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 84730.5 ns 157024.5 ns 0.54
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 12875 ns 13292 ns 0.97
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 13375 ns 13792 ns 0.97
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13375 ns 13708 ns 0.98
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 13500 ns 12834 ns 1.05
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 422778.5 ns 618769 ns 0.68
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 1042 ns 1042 ns 1
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 1000 ns 1042 ns 0.96
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 1083 ns 1042 ns 1.04
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 1042 ns 1083 ns 0.96
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 32216.5 ns 32863 ns 0.98
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 8000 ns 7875 ns 1.02
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7917 ns 8000 ns 0.99
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8208 ns 8208 ns 1
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 8000 ns 8250 ns 0.97
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 191222.5 ns 246953.5 ns 0.77
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 23395.5 ns 25062.5 ns 0.93
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 23250 ns 23291.5 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 25041 ns 23542 ns 1.06
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 23250 ns 23250 ns 1
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA 18950 ns 18661 ns 1.02
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 52084 ns 52625 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 52958 ns 52833 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 52500 ns 52875 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 52708.5 ns 52333 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA 230217.5 ns 364018 ns 0.63
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1452958 ns 1403750 ns 1.04
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1446334 ns 1451354 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1398770.5 ns 1407542 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1450833 ns 1406458 ns 1.03
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 195688 ns 196760 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5010500 ns 5023250 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5014041 ns 5018687.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4927959 ns 5042125 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5050083 ns 5001750 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 535939 ns 766930 ns 0.70
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3066250 ns 3048708 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2084000 ns 2082646 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2295333 ns 2300125 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4884812.5 ns 4855000 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 585473 ns 583278 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 24408167 ns 24263250 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 18817958 ns 18905459 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 18750208 ns 19193375 ns 0.98
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 37356750 ns 36575416 ns 1.02
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 3182445 ns 3216229 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 33999792 ns 34013563 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 28397375 ns 28342229 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 27978125 ns 28436750 ns 0.98
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 42315042 ns 43339875 ns 0.98
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) 144296292 ns 144288959 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) 142454750 ns 142279583 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) 123902542 ns 126469000.5 ns 0.98
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) 167495375 ns 168866000 ns 0.99
batchedmm(512, Bsize=512)/forward/GPU/CUDA 22537902 ns 22582893 ns 1.00
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) 1415751333 ns 1275599313 ns 1.11
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) 1144590833 ns 1058487228.5 ns 1.08
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) 680940541 ns 712851209 ns 0.96
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) 679303041 ns 668538250 ns 1.02
batchedmm(512, Bsize=512)/zygote/GPU/CUDA 118837359 ns 119108875 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 74729 ns 83125 ns 0.90
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 79104.5 ns 76208 ns 1.04
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 77500 ns 78125 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 85020.5 ns 72729 ns 1.17
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 189527.5 ns 365097 ns 0.52
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 204042 ns 189959 ns 1.07
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 282250 ns 287792 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 291917 ns 268875 ns 1.09
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 293625 ns 189583.5 ns 1.55
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1050472 ns 1559670.5 ns 0.67
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) 35430937.5 ns 35476167 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) 35413625 ns 35447729.5 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) 32256541.5 ns 32304459 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) 41546833.5 ns 40935146 ns 1.01
batchedmm(512, Bsize=128)/forward/GPU/CUDA 5844177 ns 5843273 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) 148391375 ns 147875542 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) 152657770.5 ns 152751312.5 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) 135627417 ns 139824437 ns 0.97
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) 229003479.5 ns 287719375 ns 0.80
batchedmm(512, Bsize=128)/zygote/GPU/CUDA 34891742 ns 34882914 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 120913500 ns 120880395.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 174393833 ns 174358791 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 148088167 ns 155429791 ns 0.95
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 102952000 ns 106966959 ns 0.96
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5489216 ns 5456342 ns 1.01
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 469422083 ns 470623375 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 467661083 ns 466918000 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 440419292 ns 456589562.5 ns 0.96
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 759917916 ns 742113834 ns 1.02
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 32237167 ns 32255425 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 708309833.5 ns 706243291.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 653178271 ns 652697541.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 576002271 ns 591007625 ns 0.97
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 865545375 ns 851805375 ns 1.02
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) 1342979.5 ns 1320583.5 ns 1.02
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) 979124.5 ns 965875 ns 1.01
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) 976458 ns 736687.5 ns 1.33
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) 2106791 ns 1944666.5 ns 1.08
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA 578569 ns 564187.5 ns 1.03
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) 2944667 ns 2971708.5 ns 0.99
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) 2608958 ns 2620334 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) 2626167 ns 2535604 ns 1.04
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) 3768167 ns 3604083.5 ns 1.05
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA 1697045 ns 1878347.5 ns 0.90
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) 6633209 ns 6649958 ns 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) 6526458 ns 6493042 ns 1.01
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) 6498709 ns 6437479.5 ns 1.01
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) 4513208 ns 4435750 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7375 ns 7375 ns 1
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6000 ns 6208 ns 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6125 ns 5375 ns 1.14
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10542 ns 9916 ns 1.06
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 25661 ns 25400 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 212354.5 ns 213645.5 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 220458 ns 221833 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 220875 ns 221250 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 216000 ns 205875 ns 1.05
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 236310 ns 293719.5 ns 0.80
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) 314574854.5 ns 301604437.5 ns 1.04
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) 222487417 ns 221356625 ns 1.01
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) 197721500 ns 223278083.5 ns 0.89
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) 317671875 ns 312163250 ns 1.02
vgg16(32, 32, 3, 64)/forward/GPU/CUDA 7677705 ns 7672763 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) 1081788271 ns 1078062604.5 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) 918557958.5 ns 896268771 ns 1.02
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) 817573542 ns 880668729 ns 0.93
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) 1180088625 ns 1161143188 ns 1.02
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA 26749727.5 ns 26517571 ns 1.01
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 6958 ns 5500 ns 1.27
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 6583.5 ns 5750 ns 1.14
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 8791.5 ns 9437.5 ns 0.93
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 6312.5 ns 5875 ns 1.07
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 136942.5 ns 201555 ns 0.68
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7208 ns 7500 ns 0.96
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7417 ns 7458 ns 0.99
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7375 ns 7750 ns 0.95
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7500 ns 7041.5 ns 1.07
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 559843 ns 699933.5 ns 0.80
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 541 ns 500 ns 1.08
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 541 ns 500 ns 1.08
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 584 ns 583 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 542 ns 500 ns 1.08
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 23894.5 ns 23724.5 ns 1.01
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9125 ns 9208 ns 0.99
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 9417 ns 9625 ns 0.98
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 9250 ns 9604.5 ns 0.96
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 9791 ns 9042 ns 1.08
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 191795.5 ns 234828.5 ns 0.82
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 353916.5 ns 351500 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 352604 ns 350896 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 351500 ns 354624.5 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 353834 ns 351708 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA 21305 ns 20984 ns 1.02
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 826667 ns 775417 ns 1.07
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 805417 ns 824916 ns 0.98
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 827124.5 ns 830958 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 828792 ns 823958 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA 254167.5 ns 306663 ns 0.83
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) 332542 ns 338083 ns 0.98
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) 338541 ns 341500 ns 0.99
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) 451417 ns 443667 ns 1.02
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) 309709 ns 325667 ns 0.95
batchedmm(16, Bsize=32)/forward/GPU/CUDA 18587 ns 17821 ns 1.04
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) 692020.5 ns 696042 ns 0.99
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) 740708.5 ns 739416.5 ns 1.00
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) 1027750 ns 1042874.5 ns 0.99
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) 692021 ns 692645.5 ns 1.00
batchedmm(16, Bsize=32)/zygote/GPU/CUDA 221446.5 ns 273141.5 ns 0.81
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) 347854 ns 358458.5 ns 0.97
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) 350209 ns 349125 ns 1.00
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) 435375 ns 431291.5 ns 1.01
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) 348708 ns 370875 ns 0.94
batchedmm(16, Bsize=128)/forward/GPU/CUDA 22822 ns 22357.5 ns 1.02
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) 757750 ns 756625 ns 1.00
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) 744625 ns 744208.5 ns 1.00
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) 1078291 ns 1073250 ns 1.00
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) 821875 ns 818125.5 ns 1.00
batchedmm(16, Bsize=128)/zygote/GPU/CUDA 214919 ns 221398.5 ns 0.97
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) 3417 ns 3459 ns 0.99
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) 3667 ns 3541 ns 1.04
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) 3584 ns 3792 ns 0.95
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) 3625 ns 3291 ns 1.10
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA 18532 ns 17956 ns 1.03
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) 4333 ns 4208 ns 1.03
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) 4250 ns 4208 ns 1.01
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) 4292 ns 4416 ns 0.97
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) 4500 ns 4125 ns 1.09
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA 212438.5 ns 275839.5 ns 0.77
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 4250 ns 3792 ns 1.12
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 3750 ns 3375 ns 1.11
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6709 ns 6750 ns 0.99
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 4291 ns 6625 ns 0.65
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 163501.5 ns 205448.5 ns 0.80
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8833 ns 8334 ns 1.06
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8166 ns 8459 ns 0.97
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8583 ns 8500 ns 1.01
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8542 ns 8541 ns 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 983714 ns 1183984 ns 0.83
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 203854.5 ns 202625 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 209583 ns 210416 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 210000 ns 209292 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 201667 ns 200000 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 34834 ns 34588 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 648000.5 ns 603792 ns 1.07
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 620292 ns 670625 ns 0.92
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 666375 ns 630958 ns 1.06
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 593417 ns 631187.5 ns 0.94
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 288050.5 ns 352652 ns 0.82
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) 961208 ns 967521 ns 0.99
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) 937667 ns 927063 ns 1.01
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) 959875 ns 964437.5 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) 1295041 ns 1281853.5 ns 1.01
batchedmm(128, Bsize=128)/forward/GPU/CUDA 208012 ns 207244 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) 4450542 ns 4451771 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) 4458333 ns 4482750 ns 0.99
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) 4312625 ns 4474208 ns 0.96
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) 6617292 ns 6201166 ns 1.07
batchedmm(128, Bsize=128)/zygote/GPU/CUDA 982525 ns 945549 ns 1.04
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 3833 ns 3604.5 ns 1.06
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3458 ns 3167 ns 1.09
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 5333.5 ns 6792 ns 0.79
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 3750 ns 3167 ns 1.18
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 164779 ns 233201 ns 0.71
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7583 ns 7500 ns 1.01
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7292 ns 7375 ns 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7167 ns 7291 ns 0.98
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7833 ns 7083 ns 1.11
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 887301 ns 1014881 ns 0.87
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1637583.5 ns 1602833.5 ns 1.02
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1191458.5 ns 1187916 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1370916 ns 1364062 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2374125.5 ns 2343729.5 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA 215947 ns 212955.5 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12337083 ns 12334792 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9576709 ns 9602042 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9305083 ns 9404958 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18247292 ns 17966833 ns 1.02
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1951736 ns 1949853 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17394459 ns 17347084 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14383042 ns 14365000 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14364687.5 ns 14512666 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21105250 ns 21005479.5 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 136000 ns 89791 ns 1.51
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 90333.5 ns 91729.5 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 93458 ns 94291 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 134291.5 ns 117416.5 ns 1.14
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 125598 ns 126285 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2030104.5 ns 2023917 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2023020.5 ns 2013416.5 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2027042 ns 2058875 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2053083 ns 2027875 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 972659 ns 1031286 ns 0.94
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) 343042 ns 346791.5 ns 0.99
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) 349374.5 ns 343583.5 ns 1.02
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) 396791.5 ns 412250 ns 0.96
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) 290500 ns 306166 ns 0.95
batchedmm(2, Bsize=4)/forward/GPU/CUDA 15157 ns 16010 ns 0.95
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) 704250 ns 702291 ns 1.00
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) 724709 ns 728979.5 ns 0.99
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) 1023958.5 ns 1025458 ns 1.00
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) 648417 ns 639875 ns 1.01
batchedmm(2, Bsize=4)/zygote/GPU/CUDA 186915 ns 193209 ns 0.97
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7250 ns 7292 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5917 ns 6083 ns 0.97
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5958 ns 5334 ns 1.12
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10334 ns 10000 ns 1.03
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 33673 ns 33620 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 225000 ns 220479.5 ns 1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 220041 ns 231958 ns 0.95
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 229792 ns 232041 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 220125 ns 220500 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 287431 ns 311751 ns 0.92
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3708 ns 3708 ns 1
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3667 ns 3708 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3667 ns 3709 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3709 ns 3667 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA 22815 ns 22440 ns 1.02
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 14459 ns 14500 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 14417 ns 14417 ns 1
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 14375 ns 14167 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 14208 ns 14291 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA 456593.5 ns 468658 ns 0.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 140625 ns 95166 ns 1.48
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 95270.5 ns 138021 ns 0.69
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 97229 ns 99167 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 136833.5 ns 122458 ns 1.12
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 125090.5 ns 125691 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1923208 ns 1931875 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1928583.5 ns 1954979 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1922854.5 ns 1946854 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1940500.5 ns 1923729.5 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 921068 ns 940251.5 ns 0.98
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) 869000 ns 880500 ns 0.99
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) 818833.5 ns 815125 ns 1.00
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) 1217542 ns 1172292 ns 1.04
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) 938229 ns 960167 ns 0.98
lenet(28, 28, 1, 32)/forward/GPU/CUDA 270679 ns 270704 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) 2817334 ns 2803000 ns 1.01
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) 2503458 ns 2526833 ns 0.99
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) 3351896 ns 3361333 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) 3415250 ns 3405875 ns 1.00
lenet(28, 28, 1, 32)/zygote/GPU/CUDA 1555869 ns 1569154 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 16354 ns 15146 ns 1.08
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 15417 ns 18000 ns 0.86
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 21604 ns 21666 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 15416 ns 18125 ns 0.85
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 140857.5 ns 141811.5 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 225833 ns 217083 ns 1.04
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 217062.5 ns 229375 ns 0.95
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 229688 ns 257396 ns 0.89
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 261833 ns 215833 ns 1.21
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 617305 ns 635765.5 ns 0.97
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 222458 ns 219750 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 220354.5 ns 221500 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 226020.5 ns 226021 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 221146 ns 223937.5 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 266148.5 ns 270450 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 558417 ns 509917 ns 1.10
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 508937.5 ns 557729 ns 0.91
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 563833 ns 549792 ns 1.03
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 510625 ns 555791 ns 0.92
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1284278 ns 1308245 ns 0.98
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) 327750 ns 333479 ns 0.98
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) 333249.5 ns 335541.5 ns 0.99
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) 389229 ns 437333 ns 0.89
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) 297896 ns 319417 ns 0.93
batchedmm(16, Bsize=4)/forward/GPU/CUDA 16594 ns 16583 ns 1.00
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) 714667 ns 715333 ns 1.00
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) 729250 ns 730292 ns 1.00
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) 1018708 ns 1025458.5 ns 0.99
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) 660938 ns 655792 ns 1.01
batchedmm(16, Bsize=4)/zygote/GPU/CUDA 192223 ns 193313 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17458.5 ns 17625 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 17667 ns 17625 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 22542 ns 20437.5 ns 1.10
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 17833 ns 18000 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 143390.5 ns 144711.5 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 224250 ns 216667 ns 1.03
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 212209 ns 224083 ns 0.95
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 251666 ns 226625 ns 1.11
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 218708 ns 223417 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 870740.5 ns 903796 ns 0.96
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6500 ns 4625 ns 1.41
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 4541 ns 6750 ns 0.67
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 7833 ns 7438 ns 1.05
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 4666 ns 6625 ns 0.70
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 181869 ns 174159.5 ns 1.04
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 11000 ns 10437.5 ns 1.05
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 10542 ns 10750 ns 0.98
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 10584 ns 10770.5 ns 0.98
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 11083 ns 10833 ns 1.02
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 1004419 ns 1024421 ns 0.98
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 3395.5 ns 3646 ns 0.93
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3625 ns 3334 ns 1.09
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6792 ns 5625 ns 1.21
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 4125 ns 3500 ns 1.18
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 202818 ns 231660 ns 0.88
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7458 ns 7708 ns 0.97
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7625 ns 7792 ns 0.98
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7833 ns 7625 ns 1.03
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7750 ns 7167 ns 1.08
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 1021526.5 ns 1037611.5 ns 0.98
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23737042 ns 23838833 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 34849729 ns 33990646 ns 1.03
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 37801875 ns 41585708 ns 0.91
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 36279417 ns 34896229 ns 1.04
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1853795 ns 1839186 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 184527250 ns 184662833 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 159658833 ns 159634000 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 146432958.5 ns 151746084 ns 0.96
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 422145604.5 ns 415075875 ns 1.02
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 16509006 ns 16506413 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 427907292 ns 427351833 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 254738437 ns 251624521 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 232050291.5 ns 233926312.5 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 497717146 ns 484091542 ns 1.03
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 182792 ns 181666 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 183000 ns 183416.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 189042 ns 186125 ns 1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 184208 ns 183834 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 166826.5 ns 173529.5 ns 0.96
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 632937.5 ns 587541 ns 1.08
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 631145.5 ns 600458 ns 1.05
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 629416 ns 632375 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 603979 ns 631354 ns 0.96
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 995462.5 ns 1005977 ns 0.99
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) 3817645.5 ns 3816041.5 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) 3624583 ns 3637833 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) 3484416 ns 3539646 ns 0.98
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) 5475291 ns 5351396 ns 1.02
batchedmm(128, Bsize=512)/forward/GPU/CUDA 538100 ns 554127 ns 0.97
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) 17364000 ns 17372333 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) 17178979 ns 17218458.5 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) 16580708.5 ns 16979478.5 ns 0.98
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) 23174500 ns 22177625 ns 1.04
batchedmm(128, Bsize=512)/zygote/GPU/CUDA 2617022.5 ns 2616933 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 500 ns 583 ns 0.86
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 542 ns 542 ns 1
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 625 ns 542 ns 1.15
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 542 ns 459 ns 1.18
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 31794 ns 32036 ns 0.99
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9500 ns 9667 ns 0.98
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 8709 ns 9750 ns 0.89
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9729.5 ns 10125 ns 0.96
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 9916.5 ns 9291 ns 1.07
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 262089 ns 260858 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) 499603584 ns 506491042 ns 0.99
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) 432109771 ns 428949104 ns 1.01
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) 428755896 ns 474815000 ns 0.90
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) 676396458.5 ns 671461979 ns 1.01
vgg16(32, 32, 3, 128)/forward/GPU/CUDA 12478713 ns 12484614.5 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) 2045189437.5 ns 2043435104.5 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) 1633734250 ns 1631358667 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) 1493383584 ns 1546812271 ns 0.97
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) 2223823937.5 ns 2216473375.5 ns 1.00
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA 49038762 ns 49204869.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1639166 ns 1642542 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1203250 ns 1194625 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1395479 ns 1380791 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2461958 ns 2487084 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 219168 ns 215546 ns 1.02
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12694958.5 ns 12711687.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9921500 ns 9927625 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9744938 ns 9788604.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18522333 ns 18464437.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 2024843 ns 1995889.5 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17671042 ns 17669166.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14707250 ns 14709437.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14647166.5 ns 14807645.5 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21550271 ns 21465708 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 26209 ns 26250 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 26458 ns 26250 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 26208 ns 26291 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 26292 ns 26167 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA 24425 ns 23873 ns 1.02
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 66958 ns 66917 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 67166 ns 67333 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 67000 ns 67083 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 67042 ns 66833 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA 390613 ns 382426 ns 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 203958 ns 203834 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 209500 ns 209542 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 209084 ns 209584 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 199708 ns 199584 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 26539.5 ns 26132 ns 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 612750 ns 613833.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 634104 ns 636667 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 675250 ns 671166.5 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 590333 ns 628229.5 ns 0.94
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 314844.5 ns 308600 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 680396 ns 671687.5 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 636917 ns 645937.5 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 650020.5 ns 644791.5 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 634875 ns 676334 ns 0.94
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 132029.5 ns 131667 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2232958 ns 2241875 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2236166.5 ns 2192250 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2233334 ns 2297042 ns 0.97
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2287187 ns 2246249.5 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1125648.5 ns 1114838 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17854 ns 16791 ns 1.06
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 17520.5 ns 17500 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 23334 ns 20958 ns 1.11
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 18583 ns 16770.5 ns 1.11
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 144775 ns 143001 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 229875 ns 230375 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 226666.5 ns 231791.5 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 270583.5 ns 266208 ns 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 263854 ns 260728.5 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 959972 ns 959584 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 500 ns 500 ns 1
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 459 ns 542 ns 0.85
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 583 ns 542 ns 1.08
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 542 ns 500 ns 1.08
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 23699 ns 23163 ns 1.02
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9834 ns 9604.5 ns 1.02
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 9833 ns 10292 ns 0.96
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 10000 ns 10625 ns 0.94
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 10250 ns 9584 ns 1.07
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 260024.5 ns 255611 ns 1.02
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5458.5 ns 5416.5 ns 1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 6291 ns 5750 ns 1.09
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 7875 ns 9458 ns 0.83
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 7041.5 ns 5708 ns 1.23
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 189594.5 ns 219432 ns 0.86
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6834 ns 7833 ns 0.87
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7458 ns 7750 ns 0.96
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7750 ns 7709 ns 1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7750 ns 7000 ns 1.11
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 744104.5 ns 764584 ns 0.97
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 2125 ns 1959 ns 1.08
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 2145.5 ns 2083 ns 1.03
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 2209 ns 2417 ns 0.91
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 2250 ns 2208 ns 1.02
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA 18220 ns 17893 ns 1.02
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 6333 ns 6875 ns 0.92
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 6834 ns 6542 ns 1.04
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 6625 ns 6583 ns 1.01
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 6667 ns 6291 ns 1.06
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA 313257 ns 320459 ns 0.98
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) 748417 ns 747709 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) 746854.5 ns 749833 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) 749958 ns 754999.5 ns 0.99
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) 753104 ns 749375 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA 22145 ns 21357 ns 1.04
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) 793187.5 ns 774854 ns 1.02
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) 777167 ns 792687.5 ns 0.98
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) 818333 ns 817042 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) 814666 ns 811166 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA 296915 ns 295013.5 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7333 ns 7334 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6000 ns 6000 ns 1
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5875 ns 5208.5 ns 1.13
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10583 ns 10166 ns 1.04
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 32972 ns 33519 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 257708 ns 219666 ns 1.17
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 227458 ns 268125 ns 0.85
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 277979.5 ns 252000.5 ns 1.10
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 226000 ns 213562 ns 1.06
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 320926.5 ns 354278 ns 0.91
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 12000 ns 10875 ns 1.10
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 10520.5 ns 11833 ns 0.89
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 13291 ns 12770.5 ns 1.04
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 11084 ns 12000 ns 0.92
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 209902.5 ns 238132.5 ns 0.88
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 24625 ns 24708 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 24708 ns 24584 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 24458 ns 25292 ns 0.97
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 25667 ns 24500 ns 1.05
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 1060096 ns 1094067.5 ns 0.97
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 106026958 ns 106709834 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 116938875 ns 116906583.5 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 120020687.5 ns 127036729 ns 0.94
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 118466791 ns 117807000 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 2643754 ns 2657653 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 393860625 ns 392558792 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 366891542 ns 365774917 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 423298521.5 ns 431860937.5 ns 0.98
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 490615500 ns 483379250 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 15188440 ns 15196086 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 758114312.5 ns 758564875.5 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 580813833 ns 761412666 ns 0.76
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 744661000 ns 748747542 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 956905604 ns 765232583 ns 1.25
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7709 ns 6625 ns 1.16
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7208 ns 7334 ns 0.98
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8250 ns 9041.5 ns 0.91
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 7667 ns 8250 ns 0.93
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 222535 ns 231038.5 ns 0.96
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 13750 ns 14625 ns 0.94
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14417 ns 14750 ns 0.98
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 14625 ns 14292 ns 1.02
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14729.5 ns 14542 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 1001840 ns 1043294.5 ns 0.96
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 8208 ns 5875 ns 1.40
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6792 ns 7959 ns 0.85
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 9417 ns 9167 ns 1.03
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 7291 ns 6333 ns 1.15
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 221347 ns 228571 ns 0.97
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 12250 ns 12791 ns 0.96
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 12542 ns 13167 ns 0.95
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13292 ns 13375 ns 0.99
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 13375 ns 12333 ns 1.08
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 753126 ns 779066.5 ns 0.97
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) 341666 ns 347625 ns 0.98
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) 340292 ns 342625 ns 0.99
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) 397312.5 ns 416812 ns 0.95
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) 287375 ns 307083 ns 0.94
batchedmm(2, Bsize=128)/forward/GPU/CUDA 16899 ns 17023 ns 0.99
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) 702916 ns 710208.5 ns 0.99
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) 729771 ns 732125 ns 1.00
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) 1023187.5 ns 1032542 ns 0.99
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) 659396 ns 653979.5 ns 1.01
batchedmm(2, Bsize=128)/zygote/GPU/CUDA 198658 ns 200196.5 ns 0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 333 ns 334 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 375 ns 375 ns 1
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 416 ns 333 ns 1.25
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 23239 ns 23569 ns 0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6667 ns 6375 ns 1.05
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6583 ns 6584 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6666 ns 6834 ns 0.98
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6500 ns 6042 ns 1.08
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 240107.5 ns 241926 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 5750 ns 5708 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 5750 ns 5834 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 5916 ns 5875 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 5875 ns 5708 ns 1.03
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 24262 ns 24556.5 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 21625 ns 21562.5 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 21625 ns 22000 ns 0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 23834 ns 21709 ns 1.10
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 21750 ns 21167 ns 1.03
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 263977 ns 265433.5 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 189250 ns 144917 ns 1.31
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 186520.5 ns 191292 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 149208 ns 149333 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 184499.5 ns 149250 ns 1.24
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 166758.5 ns 167659 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1324770.5 ns 1319292 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1331916 ns 1331416 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1325208 ns 1362958 ns 0.97
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1369687.5 ns 1326125 ns 1.03
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1289191 ns 1343729.5 ns 0.96
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 24709 ns 22250 ns 1.11
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 22292 ns 23791 ns 0.94
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 28459 ns 25875 ns 1.10
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 22646 ns 23666.5 ns 0.96
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 280466 ns 286115 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 176166.5 ns 146125 ns 1.21
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 131417 ns 118500 ns 1.11
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 177541.5 ns 129833 ns 1.37
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 131708 ns 175792 ns 0.75
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1404050 ns 1461317 ns 0.96
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 375 ns 292 ns 1.28
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 292 ns 375 ns 0.78
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 375 ns 292 ns 1.28
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 23025 ns 23352 ns 0.99
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6500 ns 6334 ns 1.03
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6417 ns 6459 ns 0.99
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6625 ns 6709 ns 0.99
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6625 ns 6125 ns 1.08
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 256062 ns 258095.5 ns 0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6458 ns 4625 ns 1.40
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4292 ns 4125 ns 1.04
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 7541 ns 7625 ns 0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5000 ns 4895.5 ns 1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 245303.5 ns 256357.5 ns 0.96
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10542 ns 9959 ns 1.06
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9916 ns 10125 ns 0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10458 ns 10333 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10375 ns 10333 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 1323518 ns 1358318.5 ns 0.97
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1625 ns 1625 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1625 ns 1584 ns 1.03
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1584 ns 1625 ns 0.97
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1625 ns 1583 ns 1.03
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA 23190 ns 23389 ns 0.99
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 5667 ns 5667 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 5667 ns 5875 ns 0.96
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 5875 ns 6000 ns 0.98
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 5917 ns 5625 ns 1.05
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA 276691 ns 275350.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 6810792 ns 6780125 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 6374000 ns 6371125 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 6503375 ns 6531396 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 7605187.5 ns 7625875 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 215181 ns 214804 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 24053666.5 ns 24015354 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 21276917 ns 21285667 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 21121125 ns 21085125 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 29932209 ns 29769250 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 2101536 ns 2112477.5 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 37203604.5 ns 37264541.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 34199125 ns 45538167 ns 0.75
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 45694750 ns 45665125 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 49486854 ns 38235958 ns 1.29
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6208 ns 6208 ns 1
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6500 ns 5958.5 ns 1.09
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 8167 ns 8750 ns 0.93
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 6604 ns 7500 ns 0.88
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 225712.5 ns 236550 ns 0.95
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8084 ns 8750 ns 0.92
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8584 ns 8375 ns 1.02
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8375 ns 8500 ns 0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8500 ns 8958 ns 0.95
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 1021276 ns 1063848.5 ns 0.96
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) 1561583 ns 1554084 ns 1.00
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) 1270125 ns 1262375 ns 1.01
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) 1633937.5 ns 1631958.5 ns 1.00
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) 2125417 ns 2152375 ns 0.99
lenet(28, 28, 1, 128)/forward/GPU/CUDA 271546.5 ns 277465 ns 0.98
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) 7898042 ns 7881667 ns 1.00
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) 6597312.5 ns 6612667 ns 1.00
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) 7213854.5 ns 7276167 ns 0.99
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) 10525771 ns 10468062.5 ns 1.01
lenet(28, 28, 1, 128)/zygote/GPU/CUDA 1821413.5 ns 1876576 ns 0.97
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) 336125 ns 346375 ns 0.97
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) 344083 ns 348937.5 ns 0.99
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) 411833 ns 423416.5 ns 0.97
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) 315250 ns 336687 ns 0.94
batchedmm(128, Bsize=4)/forward/GPU/CUDA 46192.5 ns 46390 ns 1.00
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) 748292 ns 735208 ns 1.02
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) 784375 ns 782458 ns 1.00
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) 1075167 ns 1081666.5 ns 0.99
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) 770125 ns 758458.5 ns 1.02
batchedmm(128, Bsize=4)/zygote/GPU/CUDA 238566 ns 311011.5 ns 0.77
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) 397417 ns 397375 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) 287875 ns 288250 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) 288084 ns 212583 ns 1.36
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) 751167 ns 754104.5 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA 43477 ns 44494 ns 0.98
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 667584 ns 675959 ns 0.99
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 532083 ns 532333 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 532417 ns 474000 ns 1.12
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 974750 ns 973417 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA 189406.5 ns 189847 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 657500 ns 599375 ns 1.10
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 675000 ns 650333 ns 1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 680771 ns 660375 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 645958 ns 655833.5 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 131425.5 ns 132321 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2458959 ns 2469395.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2458479.5 ns 2363959 ns 1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2467396 ns 2519875.5 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2524354.5 ns 2465916 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1178122 ns 1345989 ns 0.88
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) 342958 ns 345583 ns 0.99
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) 346625 ns 342834 ns 1.01
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) 394000 ns 416375 ns 0.95
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) 290792 ns 306979.5 ns 0.95
batchedmm(2, Bsize=32)/forward/GPU/CUDA 15424 ns 16330 ns 0.94
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) 704437.5 ns 703104 ns 1.00
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) 724583.5 ns 729708 ns 0.99
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) 1022792 ns 1026937.5 ns 1.00
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) 653041.5 ns 645959 ns 1.01
batchedmm(2, Bsize=32)/zygote/GPU/CUDA 194499 ns 199885.5 ns 0.97
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1459417 ns 1460542 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1497667 ns 1500583 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1499542 ns 1491791 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1445708 ns 1441917 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 40504 ns 41671 ns 0.97
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5115041.5 ns 5133500 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5286250 ns 5293250 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5287729 ns 5309521 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5025229 ns 4977042 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 195999 ns 197710 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3708 ns 3708 ns 1
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3667 ns 3708 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3667 ns 3709 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3708 ns 3666 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA 34443 ns 33362 ns 1.03
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 15167 ns 15125 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 15083 ns 15500 ns 0.97
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 15334 ns 15125 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 15083 ns 15083 ns 1
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA 366879.5 ns 381216.5 ns 0.96
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) 71208 ns 71375 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) 71291 ns 71208 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) 71125 ns 71583 ns 0.99
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) 71583 ns 71208 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA 113688 ns 113946.5 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 317167 ns 319833 ns 0.99
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 317666 ns 319208 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 317583 ns 327125 ns 0.97
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 322750 ns 318375 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA 194647 ns 195156 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 1000 ns 959 ns 1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 959 ns 1042 ns 0.92
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 1083 ns 1083 ns 1
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 1083 ns 1000 ns 1.08
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 23307.5 ns 23764 ns 0.98
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8166 ns 8084 ns 1.01
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8041.5 ns 8542 ns 0.94
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8542 ns 8416 ns 1.01
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8250 ns 7833.5 ns 1.05
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 258638.5 ns 263039 ns 0.98
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) 465979.5 ns 472416 ns 0.99
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) 463208 ns 468125 ns 0.99
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) 554208.5 ns 549250 ns 1.01
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) 539083.5 ns 550333 ns 0.98
batchedmm(128, Bsize=32)/forward/GPU/CUDA 129729 ns 128804.5 ns 1.01
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) 1382208 ns 1375292 ns 1.01
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) 1370959 ns 1372208 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) 1642854.5 ns 1633459 ns 1.01
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) 1636459 ns 1580500 ns 1.04
batchedmm(128, Bsize=32)/zygote/GPU/CUDA 273889 ns 274739 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 375 ns 416 ns 0.90
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 375 ns 416 ns 0.90
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 375 ns 292 ns 1.28
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 31083 ns 31574 ns 0.98
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6542 ns 6458 ns 1.01
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6458 ns 6875 ns 0.94
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6666 ns 6708 ns 0.99
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6375 ns 6000 ns 1.06
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 260736 ns 261869 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1729458 ns 1727625 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1779500 ns 1783958 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1782166 ns 1730916 ns 1.03
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1727271 ns 1729333 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 168880.5 ns 168455 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4355833 ns 4352625 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4361417 ns 4372937.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4372145.5 ns 4412458 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4411083 ns 4358042 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1156995.5 ns 1234725 ns 0.94
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) 7625 ns 6709 ns 1.14
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) 7083 ns 6584 ns 1.08
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) 6750 ns 7417 ns 0.91
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) 7167 ns 6542 ns 1.10
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA 19813 ns 19619.5 ns 1.01
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 70708 ns 51083 ns 1.38
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 51167 ns 35625 ns 1.44
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 51917 ns 49875 ns 1.04
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 52937 ns 70208 ns 0.75
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA 208382 ns 211156 ns 0.99
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) 347750 ns 354291 ns 0.98
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) 345959 ns 347584 ns 1.00
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) 409333 ns 432708 ns 0.95
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) 300874.5 ns 319521.5 ns 0.94
batchedmm(2, Bsize=512)/forward/GPU/CUDA 18451.5 ns 18053 ns 1.02
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) 719479.5 ns 719104 ns 1.00
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) 731083.5 ns 735979 ns 0.99
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) 1027042 ns 1039063 ns 0.99
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) 676020.5 ns 672750 ns 1.00
batchedmm(2, Bsize=512)/zygote/GPU/CUDA 329987.5 ns 343671.5 ns 0.96
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) 75334 ns 75417 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) 75354.5 ns 75333 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) 75459 ns 75708 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) 75625 ns 74709 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA 46972 ns 46983 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 324916 ns 324417 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 327208 ns 327000 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 324041 ns 334917 ns 0.97
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 334959 ns 324083 ns 1.03
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA 211223 ns 207721.5 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1485291 ns 1486334 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1526250 ns 1527500 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1527875 ns 1519000 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1468708 ns 1466541 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 51383 ns 51914 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5122729.5 ns 5119333.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5285958 ns 5300396 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5294688 ns 5303708 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5003000 ns 4989375 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 201671.5 ns 201413 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 28250 ns 28167 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 28125 ns 28166 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 28208 ns 28333 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 28208 ns 28208 ns 1
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA 24621 ns 24393 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 66458 ns 66542 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 66625 ns 66292 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 66292 ns 66542 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 66500 ns 66584 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA 506316 ns 530998 ns 0.95
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) 1489750 ns 1493250 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) 1142833 ns 1120167 ns 1.02
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) 1115125 ns 947625 ns 1.18
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) 2175167 ns 2256500 ns 0.96
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA 574530.5 ns 570331 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) 3074667 ns 3075542 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) 2740583.5 ns 2732479 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) 2749271.5 ns 2643125 ns 1.04
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) 3881042 ns 3814770.5 ns 1.02
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA 1973310 ns 2010818 ns 0.98
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) 8822500 ns 8738917 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) 8769542 ns 8777854.5 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) 8776625 ns 8781417 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) 6375666 ns 6360687.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 84125 ns 81146 ns 1.04
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 80084 ns 81708.5 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 86584 ns 83708 ns 1.03
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83333 ns 87687.5 ns 0.95
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 191975 ns 192383.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2020083 ns 2016791.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2017792 ns 2012708 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2026459 ns 2041312 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2029041.5 ns 2015208 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 735374 ns 798885.5 ns 0.92

This comment was automatically generated by workflow using github-action-benchmark.
