OpenMP offload

To enable OpenMP offload to GPUs in QMCPACK, use the following cmake flag.

-DENABLE_OFFLOAD=ON

Nvidia GPU

To use OpenMP offload in conjunction with CUDA math libraries, add the following cmake flag.

-DENABLE_CUDA=ON # This is not the QMC_CUDA flag for the QMCPACK legacy CUDA implementation.

NVHPC

NVHPC 22.2 has the following issues.

~~failing test_particle due to target nowait bug.~~
~~CPU. Numerics/Quadrature.h quadrature check failing due to bad vectorization~~
~~std::min in an offload region is bad. Use #define MIN(a,b) ((a) <= (b) ? (a) : (b)) instead.~~
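
The std::min workaround above can be sketched as follows. This is a minimal illustration rather than QMCPACK code: the function name `clamp_to_limit` and its arguments are hypothetical, and when the file is compiled without OpenMP offload support the pragma is ignored and the loop simply runs on the host.

```cpp
#include <vector>

// Macro replacement for std::min inside offload regions (from the note above).
// Note that each argument may be evaluated twice, so avoid side effects.
#define MIN(a, b) ((a) <= (b) ? (a) : (b))

// Hypothetical helper: caps every element of data at limit inside a target region.
void clamp_to_limit(std::vector<int>& data, int limit)
{
  int* ptr    = data.data();
  const int n = static_cast<int>(data.size());
  #pragma omp target teams distribute parallel for map(tofrom : ptr[0:n])
  for (int i = 0; i < n; ++i)
    ptr[i] = MIN(ptr[i], limit);
}
```

Calling `clamp_to_limit(v, 4)` on a vector holding 1, 5, 9 leaves it holding 1, 4, 4.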

Here is the list of failing unit tests.

12 - deterministic-unit_test_omptarget_blas (Subprocess aborted)
13 - deterministic-unit_test_particle (Failed)
    bad memory access in kernel
25 - deterministic-unit_test_wavefunction_trialwf (Failed)
29 - deterministic-unit_test_hamiltonian_coulomb (Subprocess aborted)
    depends on 13.
32 - deterministic-unit_test_estimators (Subprocess aborted)
33 - deterministic-unit_test_drivers (Subprocess aborted)
34 - deterministic-unit_test_new_drivers (Subprocess aborted)
    memory error in the OpenMP offload runtime library shipped with the compiler.

XL

XL is no longer supported because the compiler lacks C++17 support.

Clang

cmake -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DQMC_MPI=OFF \
      -DENABLE_OFFLOAD=ON -DOFFLOAD_TARGET=nvptx64-nvidia-cuda -DOFFLOAD_ARCH=sm_70 -DUSE_OBJECT_TARGET=ON ..

List of known issues:

~~Only supports CUDA 10.0 and below. https://bugs.llvm.org/show_bug.cgi?id=44587~~ Need to build libomptarget with Clang 10.
~~cmath/math.h header file conflict affecting x86 but not ppc64le. https://bugs.llvm.org/show_bug.cgi?id=42061, https://bugs.llvm.org/show_bug.cgi?id=42798, https://bugs.llvm.org/show_bug.cgi?id=42799~~ Fix to be released in Clang 11.
Static linking of fat binaries is still broken and causes a runtime error. https://github.com/llvm/llvm-project/issues/41740 and https://github.com/llvm/llvm-project/issues/38051. As a workaround, add -DUSE_OBJECT_TARGET=ON to the cmake command.
~~The offload library is single threaded and uses the default CUDA stream, which constrains performance. http://lists.llvm.org/pipermail/openmp-dev/2019-December/002986.html~~ Some level of multi-stream support is available in libomptarget, to be released in Clang 11.
~~(Only checked with Clang 8, not recently due to issues 1, 2 and 3.) When OpenMP offload and CUDA are both enabled with the Clang compiler, there is a CUDA execution failure on x86_64.~~ Fix to be released in Clang 11.
~~Offloading from multiple host threads causes a data race. https://bugs.llvm.org/show_bug.cgi?id=46257~~ Fix to be released in Clang 11.

To get register and shared memory usage:

Add -Xcuda-ptxas -v to CMAKE_CXX_FLAGS to print usage for each .cpp file
Add -v to CMAKE_EXE_LINKER_FLAGS to print usage at link time

For debugging or profiling:

Add -Xcuda-ptxas --generate-line-info to CMAKE_CXX_FLAGS
Add --cuda-noopt-device-debug to CMAKE_CXX_FLAGS

GCC

List of issues:

OpenMP offload map cannot handle const. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104493
Complex reduction support in offload region. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98862
OpenMP offload linker issue. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104285
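
One generic way to sidestep missing complex reduction support in an offload region is to reduce the real and imaginary parts as two plain scalars. The sketch below only illustrates that idea. It is not taken from QMCPACK, the function name `sum_complex` is hypothetical, and whether it avoids the GCC bug above in a given release is an assumption that needs to be verified. The input is taken by non-const reference so that no const-qualified data is mapped.

```cpp
#include <complex>
#include <vector>

// Hypothetical helper: sums complex values by reducing the real and imaginary
// parts as two doubles, so no complex-typed reduction clause is needed.
std::complex<double> sum_complex(std::vector<std::complex<double>>& values)
{
  // std::complex<double> may be accessed as an array of doubles
  // laid out as (real, imag) pairs.
  double* raw = reinterpret_cast<double*>(values.data());
  const int n = static_cast<int>(values.size());
  double re = 0.0;
  double im = 0.0;
  #pragma omp target teams distribute parallel for \
      map(to : raw[0:2 * n]) map(tofrom : re, im) reduction(+ : re, im)
  for (int i = 0; i < n; ++i)
  {
    re += raw[2 * i];
    im += raw[2 * i + 1];
  }
  return {re, im};
}
```

Without OpenMP offload support the pragma is ignored and the same loop runs on the host, which makes it easy to compare host and device results.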

Cray

The Clang-derived Cray compilers (version 9.0) can compile but cannot link QMCPACK.

cmake -DCMAKE_C_COMPILER=cc -DCMAKE_CXX_COMPILER=CC \
      -DENABLE_OFFLOAD=1 -DOFFLOAD_ARCH=sm_70 ..

AMD GPU

To use OpenMP offload in conjunction with ROCm math libraries, add the following cmake flags.

-DENABLE_CUDA=ON -DQMC_CUDA2HIP=ON

AOMP

Using the AOMP compiler. Verified with the 0.7-6 release on a Radeon VII.

cmake -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DQMC_MPI=OFF \
      -DENABLE_OFFLOAD=ON -DHDF5_PREFER_PARALLEL=OFF -DQMC_CUDA2HIP=ON -DENABLE_CUDA=ON \
      -DQMC_OFFLOAD_ROCM_WORKAROUND_BRANCH_IN_PARALLEL=OFF \
      -DCMAKE_PREFIX_PATH=/home/yeluo/rocm/aomp -DHIP_ROOT_DIR=/home/yeluo/rocm/aomp ..

Due to Clang issues 4 and 5, libomptarget is only safe to use with 1 thread. AOMP supports multiple GPU queues, and the data race in libomptarget causes multi-threaded runs to fail. https://github.com/ROCm-Developer-Tools/aomp/issues/23
(Old issue, status unclear.) Excessive register use reduces performance. https://github.com/ROCm-Developer-Tools/aomp/issues/24
Runtime overhead on H2D and D2H transfers. https://github.com/ROCm-Developer-Tools/aomp/issues/160
Reduce the synchronization needed. https://github.com/ROCm-Developer-Tools/aomp/issues/161

Clang

cmake -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DQMC_MPI=OFF \
      -DENABLE_OFFLOAD=ON -DOFFLOAD_TARGET=amdgcn-amd-amdhsa -DOFFLOAD_ARCH=gfx906 -DUSE_OBJECT_TARGET=ON ..
