Ye Luo edited this page Mar 7, 2022 · 42 revisions

To enable OpenMP offload to GPUs in QMCPACK, use the following cmake flag.

-DENABLE_OFFLOAD=ON

Nvidia GPU

In conjunction with CUDA math libraries, add the following cmake flag.

-DENABLE_CUDA=ON # This is not the QMC_CUDA flag for the QMCPACK legacy CUDA implementation.

NVHPC

NVHPC 22.2 has the following issues.

  1. test_particle fails due to a target nowait bug.
  2. On CPU, the quadrature check in Numerics/Quadrature.h fails due to bad vectorization.
  3. std::min in an offload region is problematic. Use #define MIN(a,b) ((a) <= (b) ? (a) : (b)) instead.

Here is the list of failing unit tests.

12 - deterministic-unit_test_omptarget_blas (Subprocess aborted).
13 - deterministic-unit_test_particle (Failed).
bad memory access in kernel
25 - deterministic-unit_test_wavefunction_trialwf (Failed)
29 - deterministic-unit_test_hamiltonian_coulomb (Subprocess aborted)
depends on 13.
32 - deterministic-unit_test_estimators (Subprocess aborted)
33 - deterministic-unit_test_drivers (Subprocess aborted)
34 - deterministic-unit_test_new_drivers (Subprocess aborted)
memory error in compiler shipped OpenMP offload runtime library.

XL

XL is no longer supported due to missing C++17 compiler support.

Clang

cmake -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DQMC_MPI=OFF \
      -DENABLE_OFFLOAD=ON -DOFFLOAD_TARGET=nvptx64-nvidia-cuda -DOFFLOAD_ARCH=sm_70 -DUSE_OBJECT_TARGET=ON ..

List of known issues:

  1. Only CUDA 10.0 and below are supported. https://bugs.llvm.org/show_bug.cgi?id=44587 libomptarget needs to be built with Clang 10.
  2. A cmath/math.h header file conflict affects x86 but not ppc64le. https://bugs.llvm.org/show_bug.cgi?id=42061, https://bugs.llvm.org/show_bug.cgi?id=42798, https://bugs.llvm.org/show_bug.cgi?id=42799 The fix is to be released in Clang 11.
  3. Static linking of fat binaries is still broken and causes runtime errors. https://github.com/llvm/llvm-project/issues/41740 and https://github.com/llvm/llvm-project/issues/38051. As a workaround, add -DUSE_OBJECT_TARGET=ON in cmake.
  4. The offload library is single-threaded and uses the default CUDA stream, which constrains performance. http://lists.llvm.org/pipermail/openmp-dev/2019-December/002986.html Some level of multi-stream support is available in libomptarget, to be released in Clang 11.
  5. (Only checked with Clang 8, not recently due to issues 1, 2, and 3.) When OpenMP offload and CUDA are both enabled with the Clang compiler, there is a CUDA execution failure on x86_64. The fix is to be released in Clang 11.
  6. Offloading from multiple host threads causes a data race. https://bugs.llvm.org/show_bug.cgi?id=46257 The fix is to be released in Clang 11.

To get register and shared memory (smem) usage:

  1. Add -Xcuda-ptxas -v to CMAKE_CXX_FLAGS to print it for each .cpp file.
  2. Add -v to CMAKE_EXE_LINKER_FLAGS to print it at link time.

For debugging or profiling:

  1. Add -Xcuda-ptxas --generate-line-info to CMAKE_CXX_FLAGS.
  2. Add --cuda-noopt-device-debug to CMAKE_CXX_FLAGS.

GCC

List of issues:

  1. OpenMP offload map cannot handle const. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104493
  2. Complex reduction support in offload region. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98862
  3. OpenMP offload linker issue. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104285

Cray

The Clang-derived Cray compilers (9.0) can compile but cannot link QMCPACK.

cmake -DCMAKE_C_COMPILER=cc -DCMAKE_CXX_COMPILER=CC \
      -DENABLE_OFFLOAD=1 -DOFFLOAD_ARCH=sm_70 ..

AMD GPU

In conjunction with ROCm math libraries, add the following cmake flag.

-DENABLE_CUDA=ON -DQMC_CUDA2HIP=ON

AOMP

Use the AOMP compiler. Verified with the 0.7-6 release on a Radeon VII.

cmake -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DQMC_MPI=OFF \
      -DENABLE_OFFLOAD=ON -DHDF5_PREFER_PARALLEL=OFF \
      -DQMC_CUDA2HIP=ON -DENABLE_CUDA=ON \
      -DQMC_OFFLOAD_ROCM_WORKAROUND_BRANCH_IN_PARALLEL=OFF \
      -DCMAKE_PREFIX_PATH=/home/yeluo/rocm/aomp -DHIP_ROOT_DIR=/home/yeluo/rocm/aomp ..

  1. Due to Clang issues 4 and 5, libomptarget is only safe to use with 1 thread. AOMP supports multiple GPU queues, and the data race in libomptarget causes multi-threaded runs to fail. https://github.com/ROCm-Developer-Tools/aomp/issues/23
  2. (Old issue, status unclear.) Excessive register use reduces performance. https://github.com/ROCm-Developer-Tools/aomp/issues/24
  3. Runtime overhead on H2D and D2H transfers. https://github.com/ROCm-Developer-Tools/aomp/issues/160
  4. Reduce synchronization needed. https://github.com/ROCm-Developer-Tools/aomp/issues/161

Clang

cmake -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DQMC_MPI=OFF \
      -DENABLE_OFFLOAD=ON -DOFFLOAD_TARGET=amdgcn-amd-amdhsa -DOFFLOAD_ARCH=gfx906 -DUSE_OBJECT_TARGET=ON ..