
OpenCL verbose output and documentation, improved auto-tuning scripts, minor fixes after #419 (#425)

Merged: 23 commits into cp2k:develop on Feb 8, 2021

Conversation

@hfp (Member) commented Feb 4, 2021

OpenCL-BE/LIBSMM: verbose output and documentation. Improved auto-tuning scripts. Minor fixes after #419.

  • Fixed Makefile used to build acc_bench_trans/acc_bench_smm with CUDA (accommodate changes from Simplify the CMake ROCm detection #419).
  • Introduced (runtime-)verbosity level. Print device name (non-zero verbosity).
  • Fixed issue (Simplify the CMake ROCm detection #419 (comment)).
  • Renamed ACC_OPENCL_VERBOSE to ACC_OPENCL_DEBUG.
  • ACC benchmark drivers: inform if no device was found.
  • Improved documentation and documented ACC_OPENCL_VERBOSE.
  • Introduced verbose output (time needed for kernel compilation, etc).
  • tune_multiply.py: option to only rely on primary objective.
  • tune_multiply.py: catch CTRL-C and save configuration.
  • tune_multiply.sh: relay result code of failing script.
  • tune_multiply.sh: continuation with wrapper script.
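
For illustration, here is a minimal sketch of how such a (runtime-)verbosity level could be read from the environment and used to gate the extra output described above. This is not the actual backend code: the helper names and messages are hypothetical; only the ACC_OPENCL_VERBOSE variable name is taken from this PR.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read the verbosity level once from the environment.
 * Only the ACC_OPENCL_VERBOSE name is from this PR; the rest is illustrative. */
static int opencl_verbosity(void) {
  static int level = -1;
  if (level < 0) {
    const char *const env = getenv("ACC_OPENCL_VERBOSE");
    level = (NULL != env ? atoi(env) : 0);
  }
  return level;
}

/* Non-zero verbosity: print the device name; higher levels could add,
 * e.g., the time needed for kernel compilation. */
static void print_device_banner(const char *device_name, double compile_secs) {
  if (0 < opencl_verbosity()) {
    fprintf(stderr, "INFO ACC/OpenCL: device \"%s\"\n", device_name);
    if (1 < opencl_verbosity()) {
      fprintf(stderr, "INFO ACC/OpenCL: kernel compiled in %.1f s\n", compile_secs);
    }
  }
}

int main(void) {
  print_device_banner("Tesla P100-PCIE-16GB", 1.2);
  return 0;
}
```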

@hfp (Member, Author) commented Feb 4, 2021

There will be one more change for this PR, which disables the ACC_OPENCL_THREADLOCAL_CONTEXT feature in the OpenCL backend.

@hfp (Member, Author) commented Feb 4, 2021

I will also try enabling runtime tests for the OpenCL backend on Daint-CI.

@hfp (Member, Author) commented Feb 4, 2021

@haampie as a carry-forward from #419, it seems like we have some (unwanted) debug output like:

20/23 Test #20: libsmm_acc_unittest_multiply ..........................***Timeout 900.10 sec
Thu Feb  4 14:57:50 2021: [unset]:_pmi_alps_init:alps_get_placement_info returned with error -1
Thu Feb  4 14:57:50 2021: [unset]:_pmi_init:_pmi_alps_init returned -1
# libsmm_acc has 2649 blocksizes for multiplication
OK 4 x 4 x 4
OK 4 x 4 x 5
OK 4 x 4 x 6
[...]

... which may cause tests to time-out. I have not noticed the "OK m x n x k" output before.

@alazzaro (Member) commented Feb 4, 2021

For some reason, Daint-CI cannot run a test in 900s. @hfp could you try to increase the time limit (1200s?)?

@alazzaro (Member) commented Feb 4, 2021

> @haampie as a carry-forward from #419, it seems like we have some (unwanted) debug output like:
>
> 20/23 Test #20: libsmm_acc_unittest_multiply ..........................***Timeout 900.10 sec
> Thu Feb  4 14:57:50 2021: [unset]:_pmi_alps_init:alps_get_placement_info returned with error -1
> Thu Feb  4 14:57:50 2021: [unset]:_pmi_init:_pmi_alps_init returned -1
> # libsmm_acc has 2649 blocksizes for multiplication
> OK 4 x 4 x 4
> OK 4 x 4 x 5
> OK 4 x 4 x 6
> [...]
>
> ... which may cause tests to time-out. I have not noticed the "OK m x n x k" output before.

Ah, good spot!

@hfp (Member, Author) commented Feb 4, 2021

> For some reason, Daint-CI cannot run a test in 900s. @hfp could you try to increase the time limit (1200s?)?

Let me try fixing it as part of this PR.

@hfp (Member, Author) commented Feb 4, 2021

> > @haampie as a carry-forward from #419, it seems like we have some (unwanted) debug output like:
> >
> > 20/23 Test #20: libsmm_acc_unittest_multiply ..........................***Timeout 900.10 sec
> > Thu Feb  4 14:57:50 2021: [unset]:_pmi_alps_init:alps_get_placement_info returned with error -1
> > Thu Feb  4 14:57:50 2021: [unset]:_pmi_init:_pmi_alps_init returned -1
> > # libsmm_acc has 2649 blocksizes for multiplication
> > OK 4 x 4 x 4
> > OK 4 x 4 x 5
> > OK 4 x 4 x 6
> > [...]
> >
> > ... which may cause tests to time-out. I have not noticed the "OK m x n x k" output before.
>
> Ah, good spot!

It seems to come from tests/libsmm_acc_unittest_multiply.cpp.template, i.e., the subsequent calls it contains, such as libsmm_acc_benchmark. I will simply increase the permitted runtime to 1200s (perhaps we never built/ran these tests that way?).

@haampie (Contributor) commented Feb 4, 2021

Not sure if it's relevant, but these particular tests are executed without the srun ... prefix, which is what I found curious when I looked into the CMake code.

@alazzaro (Member) commented Feb 4, 2021

> > > @haampie as a carry-forward from #419, it seems like we have some (unwanted) debug output like:
> > >
> > > 20/23 Test #20: libsmm_acc_unittest_multiply ..........................***Timeout 900.10 sec
> > > Thu Feb  4 14:57:50 2021: [unset]:_pmi_alps_init:alps_get_placement_info returned with error -1
> > > Thu Feb  4 14:57:50 2021: [unset]:_pmi_init:_pmi_alps_init returned -1
> > > # libsmm_acc has 2649 blocksizes for multiplication
> > > OK 4 x 4 x 4
> > > OK 4 x 4 x 5
> > > OK 4 x 4 x 6
> > > [...]
> > >
> > > ... which may cause tests to time-out. I have not noticed the "OK m x n x k" output before.
> >
> > Ah, good spot!
>
> It seems to come from tests/libsmm_acc_unittest_multiply.cpp.template, i.e., the subsequent calls it contains, such as libsmm_acc_benchmark. I will simply increase the permitted runtime to 1200s (perhaps we never built/ran these tests that way?).

Well, no idea... But I would suggest commenting out the output (leaving it only for Debug) if it makes the test run faster... Daint-CI has a limited budget for us...
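
For illustration only (the actual test template in tests/libsmm_acc_unittest_multiply.cpp.template may be structured differently), such a guard could keep the per-triplet line out of non-debug runs:

```c
#include <stdio.h>

/* Hypothetical reporting helper: print one "OK m x n x k" line per triplet
 * only in debug builds, and stay quiet otherwise unless a check failed. */
static void report_result(int m, int n, int k, int ok, int *failures) {
#if defined(NDEBUG)
  if (!ok) fprintf(stderr, "FAILED %d x %d x %d\n", m, n, k);
#else
  printf("%s %d x %d x %d\n", ok ? "OK" : "FAILED", m, n, k);
#endif
  if (!ok) ++(*failures);
}

int main(void) {
  int failures = 0;
  report_result(4, 4, 4, 1 /*ok*/, &failures);
  return failures;
}
```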

@alazzaro (Member) commented Feb 4, 2021

> Not sure if it's relevant, but these particular tests are executed without the srun ... prefix, which is what I found curious when I looked into the CMake code.

Really? This is a GPU test; it cannot run on the frontend node... We provide srun via CMake, i.e.:

-DMPIEXEC_EXECUTABLE="$(command -v srun)" \

@haampie (Contributor) commented Feb 4, 2021

Yeah, I was aware. Earlier this morning I learned that sbatch in fact executes the bash script on the compute node, which is why it works. That made me think it was intentional not to wrap these tests in [srun command] [test].

@dev-zero (Contributor) commented Feb 4, 2021

> Yeah, I was aware. Earlier this morning I learned that sbatch in fact executes the bash script on the compute node, which is why it works. That made me think it was intentional not to wrap these tests in [srun command] [test].

Given that the libsmm_acc tests all directly use printf without checking whether they are on the first node, my guess is that they were intended to be run on single-node only.
That should be rectified I guess (both the code and the CMake command).

Maybe add a wrapper for add_test(...) which contains the if (USE_MPI) to keep it DRY?

@codecov (bot) commented Feb 4, 2021

Codecov Report

Merging #425 (ea4a9aa) into develop (dcbc5f6) will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           develop    #425   +/-   ##
=======================================
  Coverage     63.1%   63.1%           
=======================================
  Files           86      86           
  Lines        25625   25625           
=======================================
  Hits         16190   16190           
  Misses        9435    9435           
Flag            Coverage Δ
unittests       63.1% <ø> (ø)
with-blas       63.1% <ø> (ø)
with-libxsmm    63.2% <ø> (ø)
with-mpi        63.6% <ø> (ø)
with-openmp     62.3% <ø> (ø)
without-mpi     59.4% <ø> (ø)
without-openmp  62.3% <ø> (ø)

Flags with carried forward coverage won't be shown.
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update dcbc5f6...2e57682.

@hfp (Member, Author) commented Feb 4, 2021

> > Yeah, I was aware. Earlier this morning I learned that sbatch in fact executes the bash script on the compute node, which is why it works. That made me think it was intentional not to wrap these tests in [srun command] [test].
>
> Given that the libsmm_acc tests all directly use printf without checking whether they are on the first node, my guess is that they were intended to be run on single-node only.
> That should be rectified I guess (both the code and the CMake command).
>
> Maybe add a wrapper for add_test(...) which contains the if (USE_MPI) to keep it DRY?

If my "hack" (reducing output) goes through with Daint-CI, I will leave this to another PR.

@dev-zero (Contributor) commented Feb 4, 2021

Sure, I just opened #427 to track it.

@alazzaro (Member) commented Feb 4, 2021

I wonder why the test cannot complete in 900s anymore. This is last week's output:

https://object.cscs.ch/v1/AUTH_40b5d92b316940098ceb15cf46fb815e/dbcsr-artifacts/logs/build-679/

The infamous test completed in 629.62s.

@hfp (Member, Author) commented Feb 4, 2021

> I wonder why the test cannot complete in 900s anymore. This is last week's output:
>
> https://object.cscs.ch/v1/AUTH_40b5d92b316940098ceb15cf46fb815e/dbcsr-artifacts/logs/build-679/
>
> The infamous test completed in 629.62s.

Perhaps a diff between the build logs can help determine whether this is caused by compiler flags or the like.

@alazzaro (Member) commented Feb 4, 2021

Well, the only changes between last week and now are due to the HIP PR...
@haampie could you take a look and make a diff between the two versions?

@hfp (Member, Author) commented Feb 4, 2021

> Well, the only changes between last week and now are due to the HIP PR...
> @haampie could you take a look and make a diff between the two versions?

I just had a look at libsmm_acc_unittest_multiply and how the compile/link line changed from 679 to 685. The compilation flags are the same except for -Werror. The main difference is that previously (679) everything possible was linked statically. Further, 679 relied on /opt/nvidia/cudatoolkit10.2/10.2.89_3.28-7.0.2.1_2.17__g52c0314/targets/x86_64-linux/lib/stubs, whereas 685 used the non-stubs library directory (probably related to static linkage?). Also, 679 linked against CMakeFiles/libsmm_acc_unittest_multiply.dir/cmake_device_link.o, libcudadevrt, librt, and libdl (which are not present on 685's link line). However, 685 links against libnvToolsExt, which was not present on 679's link line.

@alazzaro (Member) commented Feb 4, 2021

> > Well, the only changes between last week and now are due to the HIP PR...
> > @haampie could you take a look and make a diff between the two versions?
>
> I just had a look at libsmm_acc_unittest_multiply and how the compile/link line changed from 679 to 685. The compilation flags are the same except for -Werror. The main difference is that previously (679) everything possible was linked statically. Further, 679 relied on /opt/nvidia/cudatoolkit10.2/10.2.89_3.28-7.0.2.1_2.17__g52c0314/targets/x86_64-linux/lib/stubs, whereas 685 used the non-stubs library directory (probably related to static linkage?). Also, 679 linked against CMakeFiles/libsmm_acc_unittest_multiply.dir/cmake_device_link.o, libcudadevrt, librt, and libdl (which are not present on 685's link line). However, 685 links against libnvToolsExt, which was not present on 679's link line.

Wait, libnvToolsExt is used? This is the profiler API, which could explain the slowness...
However, the code should be protected by a macro...

Update: the macro __CUDA_PROFILING is not set... this needs a more thorough analysis...

@hfp (Member, Author) commented Feb 5, 2021

> Can you elaborate on "manually tested on Daint (user account)"?

I experimented a bit and maybe CUDA MPS is related. At least srun --ntasks 1 ... (or the equivalent salloc) gets things working. I have run multiple MPI ranks on different systems with CUDA devices, but in none of those cases was MPS enabled. I will downgrade the runtime tests for the OpenCL backend to a single rank as a first try.

@hfp (Member, Author) commented Feb 6, 2021

Only CodeCov "fails", i.e., the OpenCL runtime tests are now passing with one rank only (see here for some early reasoning). It seems Daint runs the GPUs in "exclusive" mode, which appears to permit multiple ranks only via MPS, i.e., creating a second OpenCL context on the same device (single node!) always fails with something like "device not available".

(As a side note, it seems running single-rank omits some of the tests, perhaps because of too much CMake magic?)

@dev-zero (Contributor) commented Feb 6, 2021

Wrt CodeCov: the total coverage consists of different runs/uploads (with/without MPI, with/without OpenMP), and the PR entry often happens too early. We could tune it to wait for all uploads before posting, or create new posts instead of partially updating the existing one.

wrt single-ranks: there is no extra CMake magic involved when it comes to deciding when to run tests, but the following determines the number of ranks at build time:

    -DTEST_MPI_RANKS=${SLURM_NTASKS} \
    -DTEST_OMP_THREADS=${SLURM_CPUS_PER_TASK} \

As mentioned in #427, the libsmm_acc tests are currently not run through srun (or mpirun) at all.

@hfp (Member, Author) commented Feb 6, 2021

Thank you for following up, Tiziano!

> Wrt CodeCov: the total coverage consists of different runs/uploads (with/without MPI, with/without OpenMP), and the PR entry often happens too early. We could tune it to wait for all uploads before posting, or create new posts instead of partially updating the existing one.

I wonder if we should drop CodeCov? I think it produces mostly noise (nothing actionable). Primarily, we are happy if the project receives a contribution; secondly, we only take PRs and then review/guide the code and format. I believe this process would never let anything unreasonable into the code base. Overall, I have not seen any of us judge a contribution by its coverage percentage. With the current code base, a coverage of ~60% seems intrinsic to the project. Perhaps it is even possible to just calculate the percentage (but with no threshold making things red and sending emails).

> wrt single-ranks: there is no extra CMake magic involved when it comes to deciding when to run tests, but the following determines the number of ranks at build time:
>
>     -DTEST_MPI_RANKS=${SLURM_NTASKS} \
>     -DTEST_OMP_THREADS=${SLURM_CPUS_PER_TASK} \
>
> As mentioned in #427, the libsmm_acc tests are currently not run through srun (or mpirun) at all.

I just noticed that the OpenCL runtime test not only passes (see above for the root cause) but unfortunately ran fewer tests (19) compared to the other runtime tests (23). The only difference seems to be the rank count. I understand that some tests are currently not run per srun, but should this not apply to all runtime tests equally, no matter what the requested rank count was?

@alazzaro self-requested a review on February 8, 2021 08:04
(Merge resolved conflicts in .ci/daint.cscs.ch/ocl.build.sh and .ci/daint.cscs.ch/ocl.test.sh)
@alazzaro (Member) commented Feb 8, 2021

@hfp I assume it is ready for review; I will do it today...

> I wonder if we should drop CodeCov?

We should keep it; actually, I plan to introduce new tests (a long-term project) and add coverage for the GPU. I like @dev-zero's idea...

> It seems Daint runs the GPUs in "exclusive" mode, which appears to permit multiple ranks only via MPS, i.e., creating a second OpenCL context on the same device (single node!) always fails with something like "device not available".

Should we open a ticket with CSCS? I can try on some other machines; I wonder what the rationale behind that is... I assume not many people are using OpenCL.

> unfortunately ran fewer tests (19) compared to the other runtime tests (23).

Interesting... I actually see the same number of tests...

@dev-zero (Contributor) commented Feb 8, 2021

> I wonder if we should drop CodeCov? I think it produces mostly noise (nothing actionable). Primarily, we are happy if the project receives a contribution; secondly, we only take PRs and then review/guide the code and format. I believe this process would never let anything unreasonable into the code base. Overall, I have not seen any of us judge a contribution by its coverage percentage. With the current code base, a coverage of ~60% seems intrinsic to the project. Perhaps it is even possible to just calculate the percentage (but with no threshold making things red and sending emails).

The "sending emails" part is done automatically by GH when the CodeCov-bot adds the comment.
I agree that it should be tweaked further to generate less noise, but if we only let it report a percentage in the list of tests without further note, nobody is going to look at it anymore at all.

So, I would propose that I change it to wait for all reports before sending it first and changing the relative threshold to something which allows to add some uncovered fixes without error but when adding larger parts they have to be covered, and a lower bound of 60% because that's what we established now.

> I just noticed that the OpenCL runtime test not only passes (see above for the root cause) but unfortunately ran fewer tests (19) compared to the other runtime tests (23). The only difference seems to be the rank count. I understand that some tests are currently not run per srun, but should this not apply to all runtime tests equally, no matter what the requested rank count was?

Are you sure it's not this line here:

if (USE_ACCEL MATCHES "cuda|hip")

... which should probably be just if (USE_ACCEL) again.

@hfp (Member, Author) commented Feb 8, 2021

Regarding multiple ranks on a single card, I guess this can be adjusted with the SMI utility, and there is likely no performance regression associated with toggling it from exclusive to shared (and MPS can do whatever it does). However, this means reconfiguring Daint and may not be a realistic request. MPS, on the other hand, does not support OpenCL (as per some answer in an Nvidia forum). Another option could be to test multiple ranks using a single rank per node and multiple nodes, which should work.

@hfp (Member, Author) commented Feb 8, 2021

> which should probably be just if (USE_ACCEL) again.

Thanks for finding this, I overlooked it completely! I guess it should stay, since these tests are specific to CUDA/HIP and the auto-tuning implementation for that backend. If it were written using the ACC interface this would have been easier, but on the other hand we can play with the OpenTuner-based auto-tuning, and perhaps it is fruitful to have something else.

@hfp (Member, Author) commented Feb 8, 2021

If this passes, I probably want to add an error message explaining the failure when creating the OpenCL context, or at least hint at the MPS/SMI exclusive-mode issue.
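
A possible shape for such a message, as a sketch only (assuming the usual clCreateContext call; the actual backend code and wording will differ):

```c
#define CL_TARGET_OPENCL_VERSION 120
#include <stdio.h>
#include <CL/cl.h>

/* Try to create a context and, for the error codes seen in this PR, hint at
 * the likely cause (GPU in "Exclusive_Process" compute mode). */
cl_context create_context_with_hint(cl_device_id device) {
  cl_int result = CL_SUCCESS;
  cl_context context = clCreateContext(NULL /*properties*/, 1, &device,
                                       NULL /*pfn_notify*/, NULL /*user_data*/,
                                       &result);
  if (CL_SUCCESS != result) {
    if (CL_DEVICE_NOT_AVAILABLE == result || CL_INVALID_DEVICE == result) {
      fprintf(stderr,
              "ERROR ACC/OpenCL: context creation failed (code %i). Hint: the "
              "device may run in \"Exclusive_Process\" compute mode, which "
              "permits only one process per GPU (nvidia-smi -q -d COMPUTE).\n",
              (int)result);
    }
    return NULL;
  }
  return context;
}
```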

@alazzaro (Member) commented Feb 8, 2021

> If this passes, I probably want to add an error message explaining the failure when creating the OpenCL context, or at least hint at the MPS/SMI exclusive-mode issue.

OK, please let me know when it is time to review...

@hfp (Member, Author) commented Feb 8, 2021

@alazzaro the PR is ready for review.

@alazzaro (Member) commented Feb 8, 2021

@hfp For my understanding, is the solution the one proposed here:

https://gerrit.gromacs.org/c/gromacs/+/5729/
https://gerrit.gromacs.org/c/gromacs/+/5780/7

I cannot find anything else on the topic (OpenCL with multiple processes)...

@hfp (Member, Author) commented Feb 8, 2021

> @hfp For my understanding, is the solution the one proposed here:
>
> https://gerrit.gromacs.org/c/gromacs/+/5729/
> https://gerrit.gromacs.org/c/gromacs/+/5780/7
>
> I cannot find anything else on the topic (OpenCL with multiple processes)...

I found various suggestions, including preferring clCreateContextFromType ("due to a bug") over clCreateContext. I tried this and only got a different error code (CL_DEVICE_NOT_AVAILABLE rather than CL_INVALID_DEVICE). Your finding looks more promising and potentially more applicable. I wonder if this would only affect the backend's code base or if it is a more general change? Currently, the BE code does not use MPI or broadcast/share anything. From the BE's perspective, all device discovery is process-specific and local.
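
For reference, the clCreateContextFromType variant mentioned above looks roughly like this (a sketch, not the backend code); per the report above it merely changed the error code:

```c
#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>

/* Let the runtime pick the GPU devices of a platform instead of naming a
 * device explicitly; with the device in exclusive mode this reportedly fails
 * with CL_DEVICE_NOT_AVAILABLE instead of CL_INVALID_DEVICE. */
cl_context create_context_from_type(cl_platform_id platform, cl_int *result) {
  const cl_context_properties properties[] = {
    CL_CONTEXT_PLATFORM, (cl_context_properties)platform, 0
  };
  return clCreateContextFromType(properties, CL_DEVICE_TYPE_GPU,
                                 NULL /*pfn_notify*/, NULL /*user_data*/,
                                 result);
}
```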

@alazzaro (Member) commented Feb 8, 2021

OK, then let's keep a single rank per device and open another issue for further investigation, probably employing the intranode broadcast. It seems that this is a common issue; other codes (CENAERO and SPECFEM3D) have similar problems, but Gromacs seems to have found a solution...
I will review today.

@hfp (Member, Author) commented Feb 8, 2021

I checked on a Daint GPU node with srun nvidia-smi -q -d COMPUTE and it shows:

==============NVSMI LOG==============

Timestamp                           : Mon Feb  8 11:20:52 2021
Driver Version                      : 440.33.01
CUDA Version                        : 10.2

Attached GPUs                       : 1
GPU 00000000:02:00.0
    Compute Mode                    : Exclusive_Process

@hfp (Member, Author) commented Feb 8, 2021

> probably employing the intranode broadcast.

They did a broadcast initially, and the fix was to discover devices on a per-rank basis. We already do the right thing, then, and it seems to be all about the exclusive mode, but let's investigate over the course of today...

@alazzaro (Member) commented Feb 8, 2021

LGTM, Daint-CI with Intel failed to submit the job. I've restarted the job.

So, I suggest opening an issue for the OpenCL and multiple ranks per device, as we discussed (unless you have a very last-minute solution, I would keep the discussion for a different PR).

I have some naive questions (apologies in advance):

  1. Can we use OpenCL and CUDA at the same time? Let me explain better what I have in mind: let's assume we are running on an Nvidia GPU with OpenCL and we ask for a kernel that is not implemented; can we fall back to the CUDA side? I assume it would be a mess to organize memory and so on... Another possible solution would be to have OpenCL for single precision (great that you have implemented it) and CUDA for DP (but of course we can go all-in with OpenCL...)
  2. Concerning the SP part, we probably have to hack the Fortran code to allow the SP part for ACC. I recall (but maybe I'm wrong) that there is only DP support in several places...
  3. What about the CUBLAS backend, is there anything similar in OpenCL?
  4. Do you have a performance comparison CUDA/OpenCL or HIP/OpenCL?

@alazzaro (Member) commented Feb 8, 2021

New tests on Daint-CI all went fine...

@hfp (Member, Author) commented Feb 8, 2021

> LGTM, Daint-CI with Intel failed to submit the job. I've restarted the job.

I will merge asap.

> So, I suggest opening an issue for the OpenCL and multiple ranks per device, as we discussed (unless you have a very last-minute solution, I would keep the discussion for a different PR).

ACK

Thank you for the good questions! I will try to answer them one by one (see below).

> Can we use OpenCL and CUDA at the same time? Let me explain better what I have in mind: let's assume we are running on an Nvidia GPU with OpenCL and we ask for a kernel that is not implemented; can we fall back to the CUDA side? I assume it would be a mess to organize memory and so on... Another possible solution would be to have OpenCL for single precision (great that you have implemented it) and CUDA for DP (but of course we can go all-in with OpenCL...)

Of course, it's doable, but likely quite some effort since we never accounted for multiple backends.

> Concerning the SP part, we probably have to hack the Fortran code to allow the SP part for ACC. I recall (but maybe I'm wrong) that there is only DP support in several places...

Joost confirmed SP quite some time ago (obviously), at least for CPU/LIBXSMM (out of the box), but I cannot remember what/how he tested. Since a lot of code in CP2K hard-codes DP, it was probably just a plain/straightforward QS test (maybe water ;-). For SP in general, we probably want to gather a workforce and identify if/where this is beneficial and what to enable first. This also touches on enabling low/mixed precision, with SP as a viable special case. DBCSR stand-alone "workloads" and all (unit) tests already work with SP. On DBCSR's side, we can think about whether something else should be offered, like an "internal type" with copy-in/out from higher precision, etc. Maybe similar to the feature for reduced MPI traffic based on SP.

> What about the CUBLAS backend, is there anything similar in OpenCL?

I have implemented SVM support already, i.e., LIBSMM can leverage calls to some other implementation like MKL. This might be similar to the CUDA/HIP SMM_ACC (except they need this host-side stack; I am not sure if the other GEMM args are fed from the host or if this also relies on device pointers like SVM). So far, this might be useful for (very) large kernels, but otherwise (CUDA/HIP) it is inefficient, since the stack is processed on the host with all singular SMMs queued on the device side. I would rather call dgemm_batch; however, it does not specify what happens for duplicated C-indexes (besides taking an array of pointers). I may ask the MKL team to revisit such a case. Btw, LIBXSMM implements parallelized batched SMMs with C-matrix synchronization (which is used by the benchmark drivers). To somewhat answer your original question, calling MKL, for instance, with data already populated on the device will be easy once SVM is enabled (it is currently disabled in the OpenCL backend).
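
To make the SVM remark a bit more concrete, here is a minimal sketch (OpenCL 2.0; function names illustrative, the kernel/BLAS call itself omitted) of allocating shared virtual memory that both device kernels and a host-side batched GEMM could work on:

```c
#define CL_TARGET_OPENCL_VERSION 200
#include <stddef.h>
#include <CL/cl.h>

/* Coarse-grained SVM buffer: the same pointer can be passed to kernels via
 * clSetKernelArgSVMPointer and, after clEnqueueSVMMap, be read/written by
 * host code such as a dgemm_batch-style call. */
double *allocate_svm_buffer(cl_context context, size_t nelements) {
  return (double *)clSVMAlloc(context, CL_MEM_READ_WRITE,
                              sizeof(double) * nelements,
                              0 /*default alignment*/);
}

void free_svm_buffer(cl_context context, double *buffer) {
  clSVMFree(context, buffer);
}
```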

> Do you have a performance comparison CUDA/OpenCL or HIP/OpenCL?

At the moment, I believe HIP/OpenCL should deliver the best insight with respect to what this backend is capable of. @haampie wanted to look at some AMD GPUs; perhaps I can help with or do this as well (if accessible via my CSCS account). Otherwise (Nvidia devices), the gap between OpenCL and CUDA reaches up to 2x with the current kernels. I will improve the kernels, which are by no means sophisticated yet on the OpenCL side. To get to your question, you can send me an email; I have numbers for P100, V100, A100, and perhaps others.

@alazzaro (Member) commented Feb 8, 2021

Please merge and thanks for the replies!
One point to keep in mind concerning SP: of course DBCSR allows SP and DP (real and complex). However, only DP is available on the GPU. In the Fortran code, we have some flags to check for that... What Joost did was only for the CPU...

@hfp merged commit a7c4f3b into cp2k:develop on Feb 8, 2021
@hfp deleted the oclverbose branch on March 1, 2021 14:16