OpenCL verbose output and documentation, improved auto-tuning scripts, minor fixes after #419 #425

hfp · 2021-02-04T13:47:45Z

OpenCL-BE/LIBSMM: verbose output and documentation. Improved auto-tuning scripts. Minor fixes after #419.

Fixed Makefile used to build acc_bench_trans/acc_bench_smm with CUDA (accommodate changes from Simplify the CMake ROCm detection #419).
Introduced (runtime-)verbosity level. Print device name (non-zero verbosity).
Fixed issue (Simplify the CMake ROCm detection #419 (comment)).
Renamed ACC_OPENCL_VERBOSE to ACC_OPENCL_DEBUG.
ACC benchmark drivers: inform if no device was found.
Improved documentation and documented ACC_OPENCL_VERBOSE.
Introduced verbose output (time needed for kernel compilation, etc).
tune_multiply.py: option to only rely on primary objective.
tune_multiply.py: catch CTRL-C and save configuration.
tune_multiply.sh: relay result code of failing script.
tune_multiply.sh: continuation with wrapper script.

…ing scripts. Minor fixes after cp2k#419. * Introduced (runtime-)verbosity level. Print device name (non-zero verbosity). * Fixed issue (cp2k#419 (comment)). * Renamed ACC_OPENCL_VERBOSE to ACC_OPENCL_DEBUG. * ACC benchmark drivers: inform if no device was found. * Improved documentation and documented ACC_OPENCL_VERBOSE. * Introduced verbose output (time needed for kernel compilation, etc). * tune_multiply.py: option to only rely on primary objective. * tune_multiply.py: catch CTRL-C and save configuration. * tune_multiply.sh: relay result code of failing script. * tune_multiply.sh: continuation with wrapper script.

…(accommodate changes from cp2k#419).

src/acc/cuda/Makefile

…ze in a parallel region which makes this code ineffective.

hfp · 2021-02-04T14:25:24Z

There will be one more change for this PR, which disables ACC_OPENCL_THREADLOCAL_CONTEXT feature in OpenCL backend.

hfp · 2021-02-04T14:34:15Z

I will also try enabling runtime tests for the OpenCL backend on Daint-CI.

hfp · 2021-02-04T14:45:51Z

@haampie as a carry-forward from #419, it seems like we have some (unwanted) debug output like:

20/23 Test #20: libsmm_acc_unittest_multiply ..........................***Timeout 900.10 sec
Thu Feb  4 14:57:50 2021: [unset]:_pmi_alps_init:alps_get_placement_info returned with error -1
Thu Feb  4 14:57:50 2021: [unset]:_pmi_init:_pmi_alps_init returned -1
# libsmm_acc has 2649 blocksizes for multiplication
OK 4 x 4 x 4
OK 4 x 4 x 5
OK 4 x 4 x 6
[...]

... which may cause tests to time-out. I have not noticed the "OK m x n x k" output before.

alazzaro · 2021-02-04T14:46:06Z

For some reason, Daint-CI cannot run a test in 900s. @hfp could you try to increase the time limit (1200s?) ?

alazzaro · 2021-02-04T14:46:49Z

@haampie as a carry-forward from #419, it seems like we have some (unwanted) debug output like:
20/23 Test #20: libsmm_acc_unittest_multiply ..........................***Timeout 900.10 sec
Thu Feb  4 14:57:50 2021: [unset]:_pmi_alps_init:alps_get_placement_info returned with error -1
Thu Feb  4 14:57:50 2021: [unset]:_pmi_init:_pmi_alps_init returned -1
# libsmm_acc has 2649 blocksizes for multiplication
OK 4 x 4 x 4
OK 4 x 4 x 5
OK 4 x 4 x 6
[...]
... which may cause tests to time-out. I have not noticed the "OK m x n x k" output before.

Ah, good spot!

hfp · 2021-02-04T14:49:06Z

For some reason, Daint-CI cannot run a test in 900s. @hfp could you try to increase the time limit (1200s?) ?

Let me try fixing it as part of this PR.

hfp · 2021-02-04T15:17:50Z

@haampie as a carry-forward from #419, it seems like we have some (unwanted) debug output like:
20/23 Test #20: libsmm_acc_unittest_multiply ..........................***Timeout 900.10 sec
Thu Feb  4 14:57:50 2021: [unset]:_pmi_alps_init:alps_get_placement_info returned with error -1
Thu Feb  4 14:57:50 2021: [unset]:_pmi_init:_pmi_alps_init returned -1
# libsmm_acc has 2649 blocksizes for multiplication
OK 4 x 4 x 4
OK 4 x 4 x 5
OK 4 x 4 x 6
[...]
... which may cause tests to time-out. I have not noticed the "OK m x n x k" output before.
Ah, good spot!

It seems to come from tests/libsmm_acc_unittest_multiply.cpp.template, i.e., subsequent calls contained there like libsmm_acc_benchmark. I will simply increase permitted runtime to 1200s (perhaps we never built/ran these tests that way?).

haampie · 2021-02-04T15:20:29Z

Not sure if relevant, but these particular tests are executed without the srun ... prefix is what I found curious when I looked into the cmake code

alazzaro · 2021-02-04T15:20:58Z

@haampie as a carry-forward from #419, it seems like we have some (unwanted) debug output like:
20/23 Test #20: libsmm_acc_unittest_multiply ..........................***Timeout 900.10 sec
Thu Feb  4 14:57:50 2021: [unset]:_pmi_alps_init:alps_get_placement_info returned with error -1
Thu Feb  4 14:57:50 2021: [unset]:_pmi_init:_pmi_alps_init returned -1
# libsmm_acc has 2649 blocksizes for multiplication
OK 4 x 4 x 4
OK 4 x 4 x 5
OK 4 x 4 x 6
[...]
... which may cause tests to time-out. I have not noticed the "OK m x n x k" output before.
Ah, good spot!
It seems to come from tests/libsmm_acc_unittest_multiply.cpp.template, i.e., subsequent calls contained there like libsmm_acc_benchmark. I will simply increase permitted runtime to 1200s (perhaps we never built/ran these tests that way?).

Well, no idea.. But I would suggest commenting the output (leave it only for Debug) if it makes the test running faster... Daint-CI has a limited budget for us...

alazzaro · 2021-02-04T15:22:55Z

Not sure if relevant, but these particular tests are executed without the srun ... prefix is what I found curious when I looked into the cmake code

Really? This is a GPU test, cannot run on the frontend node... We provide srun via cmake, i.e.:

-DMPIEXEC_EXECUTABLE="$(command -v srun)" \

haampie · 2021-02-04T15:27:10Z

Yeah, I was aware, earlier this morning I learned that sbatch executes the bash script on the compute node in fact, so therefore it works. Which made me think it was intentional to not wrap these tests in [srun command] [test]

dev-zero · 2021-02-04T15:42:50Z

Yeah, I was aware, earlier this morning I learned that sbatch executes the bash script on the compute node in fact, so therefore it works. Which made me think it was intentional to not wrap these tests in [srun command] [test]

Given that the libsmm_acc tests all directly use printf without checking whether they are on the first node, my guess is that they were intended to be run on single-node only.
That should be rectified I guess (both the code and the CMake command).

Maybe add a wrapper for add_test(...) which contains the if (USE_MPI) to keep it DRY?

codecov · 2021-02-04T15:47:43Z

Codecov Report

Merging #425 (ea4a9aa) into develop (dcbc5f6) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           develop    #425   +/-   ##
=======================================
  Coverage     63.1%   63.1%           
=======================================
  Files           86      86           
  Lines        25625   25625           
=======================================
  Hits         16190   16190           
  Misses        9435    9435

Flag	Coverage Δ
unittests	`63.1% <ø> (ø)`
with-blas	`63.1% <ø> (ø)`
with-libxsmm	`63.2% <ø> (ø)`
with-mpi	`63.6% <ø> (ø)`
with-openmp	`62.3% <ø> (ø)`
without-mpi	`59.4% <ø> (ø)`
without-openmp	`62.3% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update dcbc5f6...2e57682. Read the comment docs.

hfp · 2021-02-04T16:04:05Z

Yeah, I was aware, earlier this morning I learned that sbatch executes the bash script on the compute node in fact, so therefore it works. Which made me think it was intentional to not wrap these tests in [srun command] [test]

Given that the libsmm_acc tests all directly use printf without checking whether they are on the first node, my guess is that they were intended to be run on single-node only.
That should be rectified I guess (both the code and the CMake command).

Maybe add a wrapper for add_test(...) which contains the if (USE_MPI) to keep it DRY?

If my "hack" (reducing output) goes through with Daint-CI, I will leave this to another PR.

dev-zero · 2021-02-04T16:18:29Z

sure, I just opened #427 to track it

alazzaro · 2021-02-04T19:35:06Z

I wonder why the test cannot complete in 900s anymore. This is the last week output:

https://object.cscs.ch/v1/AUTH_40b5d92b316940098ceb15cf46fb815e/dbcsr-artifacts/logs/build-679/

The infamous test complete in 629.62s.

hfp · 2021-02-04T19:38:15Z

I wonder why the test cannot complete in 900s anymore. This is the last week output:

https://object.cscs.ch/v1/AUTH_40b5d92b316940098ceb15cf46fb815e/dbcsr-artifacts/logs/build-679/

The infamous test complete in 629.62s.

Perhaps a diff between the build-logs can help if this is caused by compiler flags or such.

alazzaro · 2021-02-04T19:43:34Z

Well the only changes since last week and now are due to the HIP PR...
@haampie could you take a look and make a diff between the two versions?

hfp · 2021-02-04T20:35:20Z

Well the only changes since last week and now are due to the HIP PR...
@haampie could you take a look and make a diff between the two versions?

I just had a look at libsmm_acc_unittest_multiply and how the compile/link line changed from 679 and 685. Compilation flags are the same except -Werror. The main difference is previously (679) everything possible was linked statically. Further, 679 was relying on /opt/nvidia/cudatoolkit10.2/10.2.89_3.28-7.0.2.1_2.17__g52c0314/targets/x86_64-linux/lib/stubs whereas 685 used the non-stubs library directory (probably related to static linkage?). Also, 679 linked against CMakeFiles/libsmm_acc_unittest_multiply.dir/cmake_device_link.o, libcudadevrt, librt, and libdl (which are not present on 685's link-line). However, 685 links against libnvToolsExt, which was not present on 679's link-line.

This reverts commit 9598017.

alazzaro · 2021-02-04T20:40:19Z

Well the only changes since last week and now are due to the HIP PR...
@haampie could you take a look and make a diff between the two versions?

I just had a look at libsmm_acc_unittest_multiply and how the compile/link line changed from 679 and 685. Compilation flags are the same except -Werror. The main difference is previously (679) everything possible was linked statically. Further, 679 was relying on /opt/nvidia/cudatoolkit10.2/10.2.89_3.28-7.0.2.1_2.17__g52c0314/targets/x86_64-linux/lib/stubs whereas 685 used the non-stubs library directory (probably related to static linkage?). Also, 679 linked against CMakeFiles/libsmm_acc_unittest_multiply.dir/cmake_device_link.o, libcudadevrt, librt, and libdl (which are not present on 685's link-line). However, 685 links against libnvToolsExt, which was not present on 679's link-line.

Wait, libnvToolsExt is used? This is the profiler API that can explain the reason for the slowness...
However, the code should be protected with a macro...

Update: the macro __CUDA_PROFILING is not set... it needs a more appropriate analysis...

hfp · 2021-02-05T15:05:33Z

Can you elaborate on "manually tested on Daint (user account)"?

I experimented a bit and maybe CUDA MPS is related. At least srun --ntasks 1 ... (or equivalent salloc) get things working. I ran multiple MPI-ranks on different systems with CUDA devices but in none of the cases MPS was enabled. I will downgrade the runtime tests for the OpenCL backend to single-rank as a first try.

… up with cp2k#428.

hfp · 2021-02-06T07:48:16Z

Only CodeCov "fails", i.e., OpenCL runtime tests are now passing with one rank only (see here for some early reasoning). It seems Daint runs the GPUs in "exclusive" mode which seems to only permit multiple ranks via MPS, i.e., creating a second OpenCL context on the same device (single node!) always fails like "device not available".

( As a side-note, it seems running single-ranks omits some of the tests perhaps of too much CMake magic? )

dev-zero · 2021-02-06T10:43:43Z

wrt CodeCov: the total coverage consists of different runs/uploads (with, w/o mpi; with w/o omp), and the PR entry often happens too early. We could tune it to wait for all uploads before posting or create new posts instead of partially updating the existing one.

wrt single-ranks: there is no extra CMake magic involved when it comes to deciding when to run tests, but the following determines the number of ranks at build time:

    -DTEST_MPI_RANKS=${SLURM_NTASKS} \
    -DTEST_OMP_THREADS=${SLURM_CPUS_PER_TASK} \

As mentioned in #427, the libsmm_acc tests are currently not run through srun (or mpirun) at all.

hfp · 2021-02-06T19:42:15Z

Thank you for following up, Tiziano!

wrt CodeCov: the total coverage consists of different runs/uploads (with, w/o mpi; with w/o omp), and the PR entry often happens too early. We could tune it to wait for all uploads before posting or create new posts instead of partially updating the existing one.

I wonder if we should drop CodeCov? I think it produces mostly noise (nothing actionable). Primarily, we are happy if the project receives a contribution, secondly, we only take PRs and then review/guide the code and format. I believe this process would never pass something unreasonable into the code base. Overall I have not seen anyone of us judging a contribution by the percentage of coverage. With the current code base, it seems a coverage of ~60% is intrinsic to the project or code base. Perhaps it is even possible to just calculate the percentage (but no threshold making things red and sending emails).

wrt single-ranks: there is no extra CMake magic involved when it comes to deciding when to run tests, but the following determines the number of ranks at build time:
    -DTEST_MPI_RANKS=${SLURM_NTASKS} \
    -DTEST_OMP_THREADS=${SLURM_CPUS_PER_TASK} \
As mentioned in #427, the libsmm_acc tests are currently not run through srun (or mpirun) at all.

I just noticed that the OpenCL runtime test not only passes (see above for the root cause), but unfortunately ran less tests (19) compared to other (23) runtime tests. The only difference seems to be the rank count. I understand that some tests are currently not running per srun, but should this not apply to all runtime tests equally no matter what the requested rank-count was?

# Conflicts: # .ci/daint.cscs.ch/ocl.build.sh # .ci/daint.cscs.ch/ocl.test.sh

alazzaro · 2021-02-08T08:19:30Z

@hfp I assume it is ready for review, I will do it today...

I wonder if we should drop CodeCov?

We should keep it, actually I plan to introduce new tests (long project) and add the coverage for the GPU. I like @dev-zero idea...

It seems Daint runs the GPUs in "exclusive" mode which seems to only permit multiple ranks via MPS, i.e., creating a second OpenCL context on the same device (single node!) always fails like "device not available".

Should we open a ticket with CSCS? I can try on some other machines, I wonder what's the rationale behind that... I can assume not so many people are using OpenCL.

unfortunately ran less tests (19) compared to other (23) runtime tests.

Interesting... I actually see the same number of tests...

dev-zero · 2021-02-08T08:20:31Z

I wonder if we should drop CodeCov? I think it produces mostly noise (nothing actionable). Primarily, we are happy if the project receives a contribution, secondly, we only take PRs and then review/guide the code and format. I believe this process would never pass something unreasonable into the code base. Overall I have not seen anyone of us judging a contribution by the percentage of coverage. With the current code base, it seems a coverage of ~60% is intrinsic to the project or code base. Perhaps it is even possible to just calculate the percentage (but no threshold making things red and sending emails).

The "sending emails" part is done automatically by GH when the CodeCov-bot adds the comment.
I agree that it should be tweaked further to generate less noise, but if we only let it report a percentage in the list of tests without further note, nobody is going to look at it anymore at all.

So, I would propose that I change it to wait for all reports before sending it first and changing the relative threshold to something which allows to add some uncovered fixes without error but when adding larger parts they have to be covered, and a lower bound of 60% because that's what we established now.

I just noticed that the OpenCL runtime test not only passes (see above for the root cause), but unfortunately ran less tests (19) compared to other (23) runtime tests. The only difference seems to be the rank count. I understand that some tests are currently not running per srun, but should this not apply to all runtime tests equally no matter what the requested rank-count was?

Are you sure it's not this line here:

dbcsr/tests/CMakeLists.txt

Line 216 in dcbc5f6

if (USE_ACCEL MATCHES "cuda|hip")

... which should probably be just if (USE_ACCEL) again.

hfp · 2021-02-08T08:22:10Z

Regarding multiple ranks on a single card, I guess this can be adjusted with the SMI utility and there is likely no performance regression associated with toggling it from exclusive to shared (and MPS can do whatever it does). However, this is reconfiguring Daint and may not be a realistic request. MPS on the other does not support OpenCL (as per some answer in some Nvidia forum). Another option could be to test multiple ranks using a single rank per node and multiple nodes, which should work.

hfp · 2021-02-08T08:32:21Z

which should probably be just if (USE_ACCEL) again.

Thanks for finding, I overlooked this completely! I guess it should stay since these tests are specific to CUDA/HIP and the implementation auto-tuning for this backend. If that was written using the ACC interface it would have been easier, but on the other hand we can play with the OpenTuner based auto-tuning and perhaps it is fruitful to have something else.

hfp · 2021-02-08T09:19:09Z

If this passes, I probably want to add an error message explaining a failure when creating the OpenCL context or at least hint the MPS/SMI exclusive thing.

alazzaro · 2021-02-08T09:20:31Z

If this passes, I probably want to add an error message explaining a failure when creating the OpenCL context or at least hint the MPS/SMI exclusive thing.

OK, please let me know when it is time to review...

hfp · 2021-02-08T09:57:04Z

@alazzaro the PR is ready for review.

alazzaro · 2021-02-08T09:58:18Z

@hfp For my understanding is the solution proposed here

https://gerrit.gromacs.org/c/gromacs/+/5729/
https://gerrit.gromacs.org/c/gromacs/+/5780/7

I cannot find anything else on the topic (OpenCL with multiprocesses)...

hfp · 2021-02-08T10:11:17Z

@hfp For my understanding is the solution proposed here

https://gerrit.gromacs.org/c/gromacs/+/5729/
https://gerrit.gromacs.org/c/gromacs/+/5780/7

I cannot find anything else on the topic (OpenCL with multiprocesses)...

I found different stuff including to prefer clCreateContextFromType ("due to a bug") over clCreateContext. I tried this and only got a different error code like CL_DEVICE_NOT_AVAILABLE rather than CL_INVALID_DEVICE. Your finding looks more promising and potentially more applicable. I wonder if this would only affect the backend's code base or if this is a more general change? Currently, the BE code does not use MPI or broadcasts/shares anything. From the BE's perspective, all device discovery is process-specific and local.

alazzaro · 2021-02-08T10:15:13Z

OK, then let's keep a single rank per device and open another issue for further investigation, probably employing the intranode broadcast. It seems that this is a common issue, other codes (CENAERO and SPECFEM3D) have similar problems, but Gromas seems to have found a solution...
I will review today.

hfp · 2021-02-08T10:22:26Z

I checked on a Daint/GPU node like srun nvidia-smi -q -d COMPUTE" and it shows:

==============NVSMI LOG==============

Timestamp                           : Mon Feb  8 11:20:52 2021
Driver Version                      : 440.33.01
CUDA Version                        : 10.2

Attached GPUs                       : 1
GPU 00000000:02:00.0
    Compute Mode                    : Exclusive_Process

hfp · 2021-02-08T10:30:05Z

probably employing the intranode broadcast.

They did a broadcast initially and the fix was to discover devices on a per rank basis. We already do the right thing then and it seems to be all about the exclusive mode, but let's investigate over the course of today...

alazzaro · 2021-02-08T13:05:16Z

LGTM, Daint-CI with Intel failed to submit the job. I've restarted the job.

So, I suggest opening an issue for the OpenCL and multiple ranks per device, as we discussed (unless you have a very last-minute solution, I would keep the discussion for a different PR).

I can have some naive questions (apologies in advance):

Can we use OpenCL and CUDA at the same time? Let me explain better what I have in mind: let's assume we are running on nvidia gpu with OpenCL and we ask for a kernel that is not implemented, can we fall-back to the CUDA side? I can assume it will be a mess to organize memory and so on... Another possible solution would be to have OpenCL for single precision (great that you have implemented it) and CUDA for DP (but of course we can go for all with OpenCL...)
Concerning the SP part, we probably have to hack the fortran code to allow the SP part for ACC. I can remember (but maybe I'm wrong) that it is only DP support in several places...
What about the CUBLAS backend, is there anything similar in OpenCL?
Do you have a performance comparison CUDA/OpenCL or HIP/OpenCL?

alazzaro · 2021-02-08T14:12:05Z

New tests on Daint-CI wenf all fine...

hfp · 2021-02-08T14:13:46Z

LGTM, Daint-CI with Intel failed to submit the job. I've restarted the job.

I will merge asap.

So, I suggest opening an issue for the OpenCL and multiple ranks per device, as we discussed (unless you have a very last-minute solution, I would keep the discussion for a different PR).

ACK

Thank you for the good questions! I will try answer them one by one (see below).

Can we use OpenCL and CUDA at the same time? Let me explain better what I have in mind: let's assume we are running on nvidia gpu with OpenCL and we ask for a kernel that is not implemented, can we fall-back to the CUDA side? I can assume it will be a mess to organize memory and so on... Another possible solution would be to have OpenCL for single precision (great that you have implemented it) and CUDA for DP (but of course we can go for all with OpenCL...)

Of course, it's doable, but likely quite some effort since we never accounted for multiple backends.

Concerning the SP part, we probably have to hack the fortran code to allow the SP part for ACC. I can remember (but maybe I'm wrong) that it is only DP support in several places...

Joost quite some time ago (obviously) confirmed SP at least for CPU/LIBXSMM (out of the box), but I cannot remember what/how he tested. Since a lot of code in CP2K is hard-coding DP, it was probably just plain/straight-forward QS test (maybe Water ;-). For SP in general, we probably want to gather workforce and identify if/where this is beneficial and what to enable first. This also touches enabling low/mixed-precision with SP as viable special case. DBCSR stand-alone "workloads" and all (unit-)tests already work with SP. On DBCSR's side, we can think if something else should offered like "internal type" w/ copy-in/out from higher precision, etc. Maybe like the feature about reduced MPI-traffic based on SP.

What about the CUBLAS backend, is there anything similar in OpenCL?

I have implemented SVM support already, i.e., LIBSMM can leverage calls to some other implementation like MKL. This might be similar to the CUDA/HIP/SMM_ACC (except they need this host-side stack; not sure if other GEMM-args are fed from the host or if this also relies on device-pointers like SVM). So far, this might be useful for (very) large kernels, but otherwise (CUDA/HIP) this is inefficient since the stack is processed on the host with all singular SMMs queued on the device-side. I would rather like to call dgemm_batch, however, this does not specify what happens for duplicated C-indexes (beside of taking an array of pointers). I may ask MKL team to revisit such case. Btw, LIBXSMM implements parallelized batches SMMs with C-matrix sync (which is used by benchmark drivers). To somwwhat answer your original question, calling MKL for instance with data already populated on the device will be easy given SVM is enabled (currently disabled in OpenCL backend).

Do you have a performance comparison CUDA/OpenCL or HIP/OpenCL?

At the moment, I believe HIP/OpenCL should deliver the best insight with respect to what this backend is capable of. @haampie wanted to look at some AMD GPUs; perhaps I can help/do this as well (if accessible per my CSCS account). Otherwise (Nvidia devices), the gap using OpenCL vs CUDA is reaching up to 2x with current kernels. I will improve kernels, which are by no means sophisticated yet on the OpenCL side. To get to your question, you can send me an email; I have numbers for P100, V100, A100, and perhaps others.

alazzaro · 2021-02-08T14:19:45Z

Please merge and thanks for the replies!
One point to keep in mind concerning SP: of course DBCSR allows SP and DP (real and complex). However, only DP is available on the GPU. In the Fortran code, we have some flags to check for that... What Joost did was only for the CPU...

hfp added 2 commits February 4, 2021 14:25

Fixed Makefile used to build acc_bench_trans/acc_bench_smm with CUDA …

69a0e85

…(accommodate changes from cp2k#419).

haampie reviewed Feb 4, 2021

View reviewed changes

src/acc/cuda/Makefile Show resolved Hide resolved

Disabled ACC_OPENCL_THREADLOCAL_CONTEXT since DBCSR calls init/finali…

3bf854c

…ze in a parallel region which makes this code ineffective.

hfp added 2 commits February 4, 2021 15:40

Updated LIBXSMM prior to v1.17.

f14eee4

Attempt to runtime-test OpenCL BE/LIBSMM.

9598017

Reduced console output to potentially improve runtime of (CI-)tests.

6721baa

hfp added 2 commits February 4, 2021 19:58

Increased timeout from 15m to 20m.

ee12e07

Fetch all commits before referring to some SHA.

8baf7ae

Revert "Attempt to runtime-test OpenCL BE/LIBSMM."

2b9335f

This reverts commit 9598017.

hfp added 4 commits February 5, 2021 16:21

Try to avoid MPS issue (temporarily) testing with only one rank. Sync…

b5cb129

… up with cp2k#428.

Enabled OpenCL based runtime tests.

367a117

Fixed CI-scripts.

6c9f84c

Fixed another variable which was left unbound (CI-script).

9ff03d9

alazzaro self-requested a review February 8, 2021 08:04

hfp added 2 commits February 8, 2021 09:07

Incorporated cp2k#428.

d17f03f

Merge branch 'develop' of https://github.com/cp2k/dbcsr into oclverbose

1265b26

# Conflicts: # .ci/daint.cscs.ch/ocl.build.sh # .ci/daint.cscs.ch/ocl.test.sh

Warn about potentially exclusive device-mode.

2e57682

hfp merged commit a7c4f3b into cp2k:develop Feb 8, 2021

hfp deleted the oclverbose branch March 1, 2021 14:16

OpenCL verbose output and documentation, improved auto-tuning scripts, minor fixes after #419 #425

OpenCL verbose output and documentation, improved auto-tuning scripts, minor fixes after #419 #425

Conversation

hfp commented Feb 4, 2021 • edited Loading

hfp commented Feb 4, 2021

hfp commented Feb 4, 2021

hfp commented Feb 4, 2021

alazzaro commented Feb 4, 2021

alazzaro commented Feb 4, 2021

hfp commented Feb 4, 2021

hfp commented Feb 4, 2021

haampie commented Feb 4, 2021

alazzaro commented Feb 4, 2021

alazzaro commented Feb 4, 2021

haampie commented Feb 4, 2021

dev-zero commented Feb 4, 2021

codecov bot commented Feb 4, 2021 • edited Loading

Codecov Report

hfp commented Feb 4, 2021

dev-zero commented Feb 4, 2021

alazzaro commented Feb 4, 2021

hfp commented Feb 4, 2021

alazzaro commented Feb 4, 2021

hfp commented Feb 4, 2021

alazzaro commented Feb 4, 2021 • edited Loading

hfp commented Feb 5, 2021

hfp commented Feb 6, 2021 • edited Loading

dev-zero commented Feb 6, 2021

hfp commented Feb 6, 2021

alazzaro commented Feb 8, 2021

dev-zero commented Feb 8, 2021

hfp commented Feb 8, 2021 • edited Loading

hfp commented Feb 8, 2021 • edited Loading

hfp commented Feb 8, 2021

alazzaro commented Feb 8, 2021

hfp commented Feb 8, 2021

alazzaro commented Feb 8, 2021

hfp commented Feb 8, 2021

alazzaro commented Feb 8, 2021

hfp commented Feb 8, 2021

hfp commented Feb 8, 2021

alazzaro commented Feb 8, 2021

alazzaro commented Feb 8, 2021

hfp commented Feb 8, 2021

alazzaro commented Feb 8, 2021

hfp commented Feb 4, 2021 •

edited

Loading

codecov bot commented Feb 4, 2021 •

edited

Loading

alazzaro commented Feb 4, 2021 •

edited

Loading

hfp commented Feb 6, 2021 •

edited

Loading

hfp commented Feb 8, 2021 •

edited

Loading

hfp commented Feb 8, 2021 •

edited

Loading