OpenMP offload
Ye Luo edited this page Mar 7, 2022 · 42 revisions
To enable OpenMP offload to GPUs in QMCPACK, use the following cmake flag.
-DENABLE_OFFLOAD=ON
In conjunction with CUDA math libraries, add the following cmake flag.
-DENABLE_CUDA=ON # This is not the QMC_CUDA flag for the QMCPACK legacy CUDA implementation.
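As a sketch, the two flags above can be combined in one configure step. The snippet only prints the command (a dry run). The trailing .. assumes an out-of-source build directory inside the source tree, as in the other commands on this page, and the compiler and target options are left out.

```shell
# Flags taken from the text above; all other options stay at their defaults.
OFFLOAD_FLAGS="-DENABLE_OFFLOAD=ON -DENABLE_CUDA=ON"

# Dry run: build and print the configure command instead of executing it.
# Assumption: it is run from a build directory one level below the source tree.
CONFIGURE_CMD="cmake $OFFLOAD_FLAGS .."
echo "$CONFIGURE_CMD"
```

Once the printed line looks right, it can be pasted into the shell together with the compiler-specific options for the toolchain in use.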
Version 22.2 has the following issues.
- test_particle fails due to a target nowait bug.
- CPU: the Numerics/Quadrature.h quadrature check fails due to bad vectorization.
- std::min in an offload region is bad. Use #define MIN(a,b) ((a) <= (b) ? (a) : (b)) instead.
Here is the list of failing unit tests.
12 - deterministic-unit_test_omptarget_blas (Subprocess aborted).
13 - deterministic-unit_test_particle (Failed).
bad memory access in kernel
25 - deterministic-unit_test_wavefunction_trialwf (Failed)
29 - deterministic-unit_test_hamiltonian_coulomb (Subprocess aborted)
depends on 13.
32 - deterministic-unit_test_estimators (Subprocess aborted)
33 - deterministic-unit_test_drivers (Subprocess aborted)
34 - deterministic-unit_test_new_drivers (Subprocess aborted)
memory error in the compiler-shipped OpenMP offload runtime library.
XL is no longer supported due to missing C++17 compiler support.
cmake -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DQMC_MPI=OFF \
-DENABLE_OFFLOAD=ON -DOFFLOAD_TARGET=nvptx64-nvidia-cuda -DOFFLOAD_ARCH=sm_70 -DUSE_OBJECT_TARGET=ON ..
List of known issues:
1. Only CUDA 10.0 and below are supported. https://bugs.llvm.org/show_bug.cgi?id=44587 Need to build libomptarget with Clang 10.
2. cmath/math.h header file conflict affecting x86 but not ppc64le. https://bugs.llvm.org/show_bug.cgi?id=42061, https://bugs.llvm.org/show_bug.cgi?id=42798, https://bugs.llvm.org/show_bug.cgi?id=42799 Fix to be released in Clang 11.
3. Static linking of fat binaries is still broken and causes a runtime error. https://github.com/llvm/llvm-project/issues/41740 and https://github.com/llvm/llvm-project/issues/38051. We have a workaround: add -DUSE_OBJECT_TARGET=ON in cmake.
4. The offload library is single threaded and uses the default CUDA stream, which constrains performance. http://lists.llvm.org/pipermail/openmp-dev/2019-December/002986.html Some level of multi-stream support is available in libomptarget, to be released in Clang 11.
5. (only checked with Clang 8, not recently due to issues 1, 2, 3) When OpenMP offload and CUDA are both enabled with the Clang compiler, there is some CUDA execution failure on X86_64. Fix to be released in Clang 11.
6. Offloading from multiple host threads causes a data race. https://bugs.llvm.org/show_bug.cgi?id=46257 Fix to be released in Clang 11.
To get register and shared memory (smem) usage:
- Add -Xcuda-ptxas -v to CMAKE_CXX_FLAGS to print usage per .cpp file
- Add -v to CMAKE_EXE_LINKER_FLAGS to print usage at link time
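The two settings above might be passed at configure time roughly as follows. This is a dry-run sketch that only prints the command; the variable names are illustrative, and the compiler and offload options from the Clang command earlier on this page still need to be appended.

```shell
# Flag values from the list above.
PTXAS_FLAGS="-Xcuda-ptxas -v"   # report resource usage for each compiled source file
LINK_FLAGS="-v"                 # verbose output at link time

# Dry run: build and print the configure command instead of executing it.
# The inner quotes matter because CMAKE_CXX_FLAGS holds two space-separated words.
CONFIGURE_CMD=$(printf 'cmake -DCMAKE_CXX_FLAGS="%s" -DCMAKE_EXE_LINKER_FLAGS="%s" ..' \
  "$PTXAS_FLAGS" "$LINK_FLAGS")
echo "$CONFIGURE_CMD"
```

Without the quotes around the CMAKE_CXX_FLAGS value, the shell would hand -v to cmake as a separate argument rather than to the compiler.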
For debugging or profiling:
- Add -Xcuda-ptxas --generate-line-info to CMAKE_CXX_FLAGS
- Add --cuda-noopt-device-debug to CMAKE_CXX_FLAGS
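A minimal sketch of choosing between the two flags at configure time. The MODE switch is hypothetical, and pairing line info with profiling and --cuda-noopt-device-debug with debugging is an assumption about intended use rather than something this page states. The command is only printed, not run.

```shell
# Hypothetical switch, not a QMCPACK option: "profile" or "debug".
MODE=profile

case "$MODE" in
  # Assumed use: source line information for profilers.
  profile) EXTRA_CXX_FLAGS="-Xcuda-ptxas --generate-line-info" ;;
  # Assumed use: device-side debug info for a debugger.
  debug)   EXTRA_CXX_FLAGS="--cuda-noopt-device-debug" ;;
  *)       echo "unknown MODE: $MODE" >&2; exit 1 ;;
esac

# Dry run: build and print the configure command instead of executing it.
CONFIGURE_CMD=$(printf 'cmake -DCMAKE_CXX_FLAGS="%s" ..' "$EXTRA_CXX_FLAGS")
echo "$CONFIGURE_CMD"
```

Both flags go into CMAKE_CXX_FLAGS, so they can also be given together in one quoted value if that suits the workflow.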
List of issues:
- OpenMP offload map cannot handle const. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104493
- Complex reduction support in offload region. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98862
- OpenMP offload linker issue. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104285
Clang-derived Cray compilers 9.0 can compile but cannot link QMCPACK.
cmake -DCMAKE_C_COMPILER=cc -DCMAKE_CXX_COMPILER=CC \
-DENABLE_OFFLOAD=1 -DOFFLOAD_ARCH=sm_70 ..
In conjunction with ROCm math libraries, add the following cmake flag.
-DENABLE_CUDA=ON -DQMC_CUDA2HIP=ON
Using AOMP compiler. Verified with 0.7-6 release and Radeon VII.
cmake -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DQMC_MPI=OFF \
-DENABLE_OFFLOAD=ON -DHDF5_PREFER_PARALLEL=OFF -DQMC_CUDA2HIP=ON -DENABLE_CUDA=ON \
-DQMC_OFFLOAD_ROCM_WORKAROUND_BRANCH_IN_PARALLEL=OFF \
-DCMAKE_PREFIX_PATH=/home/yeluo/rocm/aomp -DHIP_ROOT_DIR=/home/yeluo/rocm/aomp ..
Due to Clang issue45, libomptarget is only safe to use with 1 thread. AOMP supports multiple GPU queues, and the data race in libomptarget causes multi-threaded runs to fail. https://github.com/ROCm-Developer-Tools/aomp/issues/23
- (old issue, status unclear) Excessive register use reduces performance. https://github.com/ROCm-Developer-Tools/aomp/issues/24
- Runtime overhead on H2D and D2H transfers. https://github.com/ROCm-Developer-Tools/aomp/issues/160
- Reduce synchronization needed. https://github.com/ROCm-Developer-Tools/aomp/issues/161
cmake -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DQMC_MPI=OFF \
-DENABLE_OFFLOAD=ON -DOFFLOAD_TARGET=amdgcn-amd-amdhsa -DOFFLOAD_ARCH=gfx906 -DUSE_OBJECT_TARGET=ON ..