Commit

ocl: GCN atomics for Mi-GPUs and other improvements
* GCN: Rely on builtin atomics (Mi-GPUs); TODO: tuned params and check if supported.
* Allow relying on platform name instead of device name (c_dbcsr_acc_opencl_device_vendor).
* Introduced optional kernel-flags (opencl_libsmm_smm_t); needs some followup work later.
* Improved kernel by relying on work_group_broadcast in general.

* Improved documentation for CUDA/OpenCL stand-alone drivers.
* Modernized Shell script (acc_opencl.sh).
* Improved some runtime error messages.
* Some more debug/developer settings.
hfp committed Sep 29, 2023
1 parent 5386da2 commit 86905e7
Showing 8 changed files with 114 additions and 55 deletions.
20 changes: 16 additions & 4 deletions src/acc/README.md
@@ -8,13 +8,25 @@ The code for both the CUDA and the HIP backend is unified, and can be found in t

## Drivers

There are two stand-alone sample codes or drivers exercising the ACC-interface. The driver code (only depending on above mentioned interfaces) can be built locally and in a rather self-contained fashion, i.e., no DBCSR library is needed (except runtime libraries such as CUDA, HIP, OpenCL). For OpenCL, the LIBXSMM library is mandatory.
There are two stand-alone sample codes or drivers exercising the ACC-interface. The driver code (only depending on above mentioned interfaces) can be built locally and in a rather self-contained fashion, i.e., no DBCSR library is needed (except runtime libraries such as CUDA, HIP, OpenCL). For OpenCL, the LIBXSMM library is mandatory and preferred as baseline and for validation in any case. To build LIBXSMM, a folder `libxsmm` in parallel to DBCSR's root directory (`dbcsr`) is expected to be present and prebuilt.

To build the driver code, a folder `libxsmm` in parallel to DBCSR's root directory (`dbcsr`) is expected to be present and prebuilt (`make GNU=1` in LIBXSMM's root directory). To build the driver code, change into the respective backend folder (`cuda` or `opencl`), and invoke `make` (`DBG=0|1|2` is supported among other optional key-value pairs).
```bash
git clone -b main https://github.com/libxsmm/libxsmm.git
cd libxsmm
make GNU=1 -j
```

To build the driver code (`opencl` in below example), change into the respective backend folder (`cuda` or `opencl`), and invoke `make` (`DBG=0|1|2` is supported among other optional key-value pairs).

```bash
git clone https://github.com/cp2k/dbcsr.git
cd dbcsr/src/acc/opencl
make
```

**NOTE**: To activate a certain device, the drivers consider an environment variable called `DEVICE`. For example, `DEVICE=1 ./acc_bench_trans` activates the second device (at least two devices must be discovered).
**NOTE**: To activate a certain device, the drivers consider an environment variable called `DEVICE`. For example, `DEVICE=1 ./acc_bench_trans` activates the second device (at least two devices must be discovered). This environment variable is implemented by the driver code and meant to work across backends, i.e., the OpenCL backend also supports `ACC_OPENCL_DEVICE=1` (see Developer Guide for the OpenCL backend).
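As a rough illustration of the `DEVICE` selection described in the note, the following C sketch maps the variable's value to a zero-based device index. The helper name `select_device` and the fallback to device 0 for missing, malformed, or out-of-range values are assumptions made for this example; the actual driver code may treat these cases differently.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper (not part of DBCSR): map the value of DEVICE
 * to a device index in [0, ndevices). Falls back to device 0 if the
 * value is missing, not a number, or out of range (an assumption). */
static int select_device(const char* env, int ndevices) {
  char* end = NULL;
  long id;
  assert(0 < ndevices);
  if (NULL == env || '\0' == *env) return 0; /* variable not set */
  id = strtol(env, &end, 10);
  if ('\0' != *end || id < 0 || id >= ndevices) return 0; /* invalid value */
  return (int)id;
}
```

With two discovered devices, `select_device(getenv("DEVICE"), 2)` would yield index 1 (the second device) for `DEVICE=1 ./acc_bench_trans`.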

The drivers support a few command line options (_nrepeat_, _stack_size_, _m_, _n_, ...). Command line arguments are positional but allow `0` as placeholder to access the default value (`acc_bench_smm 0 0 5 13 5` performs the default number of repetitions with the default stack size when running the 5x13x5-kernel). For example, running the transpose benchmark may look like:
The drivers support command line options (_nrepeat_, _stack_size_, _m_, _n_, ...). Command line arguments are positional but allow `0` as placeholder to refer to the default value (`acc_bench_smm 0 0 5 13 5` performs the default number of repetitions with the default stack size when running the 5x13x5-kernel). For example, running the transpose benchmark may look like:
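The placeholder convention for positional arguments can be sketched in C as follows. This is a minimal illustration, not the drivers' actual parsing code: the helper name `arg_or_default` is hypothetical, and treating any non-positive or missing value as "use the default" is an assumption.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: return the positional argument at `index`
 * (as in argv, i.e., index 0 is the program name), or `fallback`
 * if the argument is missing or given as the placeholder 0. */
static int arg_or_default(int argc, char* argv[], int index, int fallback) {
  if (index < argc && NULL != argv[index]) {
    const int value = atoi(argv[index]);
    if (0 < value) return value; /* explicit value given */
  }
  return fallback; /* placeholder 0 or argument not present */
}
```

For `acc_bench_smm 0 0 5 13 5`, the first two arguments fall back to whatever defaults the caller passes in, while the remaining three select the 5x13x5-kernel.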

```bash
$ OMP_PROC_BIND=TRUE ./acc_bench_trans 5 30000 23 23
29 changes: 22 additions & 7 deletions src/acc/opencl/acc_opencl.c
@@ -728,12 +728,23 @@ int c_dbcsr_acc_opencl_device_id(cl_device_id device, int* device_id, int* globa
}


int c_dbcsr_acc_opencl_device_vendor(cl_device_id device, const char vendor[]) {
int c_dbcsr_acc_opencl_device_vendor(cl_device_id device, const char vendor[], int use_platform_name) {
char buffer[ACC_OPENCL_BUFFERSIZE];
int result = EXIT_SUCCESS;
assert(NULL != device && NULL != vendor);
ACC_OPENCL_CHECK(
clGetDeviceInfo(device, CL_DEVICE_VENDOR, ACC_OPENCL_BUFFERSIZE, buffer, NULL), "retrieve device vendor", result);
if (0 == use_platform_name) {
ACC_OPENCL_CHECK(
clGetDeviceInfo(device, CL_DEVICE_VENDOR, ACC_OPENCL_BUFFERSIZE, buffer, NULL), "retrieve device vendor", result);
}
else {
cl_platform_id platform_id;
ACC_OPENCL_CHECK(
clGetDeviceInfo(device, CL_DEVICE_PLATFORM, sizeof(cl_platform_id), &platform_id, NULL), "retrieve platform id", result);
if (EXIT_SUCCESS == result) {
ACC_OPENCL_CHECK(
clGetPlatformInfo(platform_id, CL_PLATFORM_NAME, ACC_OPENCL_BUFFERSIZE, buffer, NULL), "retrieve platform name", result);
}
}
if (EXIT_SUCCESS == result) {
result = (NULL != LIBXSMM_STRISTR(buffer, vendor) ? EXIT_SUCCESS : EXIT_FAILURE);
}
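The matching step at the end of the hunk above can be illustrated without an OpenCL runtime. The sketch below assumes that `LIBXSMM_STRISTR` behaves like a case-insensitive substring search (suggested by its name and by the lowercase arguments such as "intel" and "nvidia"); `sketch_stristr` and `vendor_matches` are hypothetical stand-ins, and the sample strings in the usage note are illustrative only.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a case-insensitive substring search.
 * Returns a pointer to the first match of `b` in `a`, or NULL. */
static const char* sketch_stristr(const char* a, const char* b) {
  if (NULL == a || NULL == b) return NULL;
  for (; '\0' != *a; ++a) {
    const char *x = a, *y = b;
    while ('\0' != *y && tolower((unsigned char)*x) == tolower((unsigned char)*y)) {
      ++x;
      ++y;
    }
    if ('\0' == *y) return a; /* all of b matched at this position */
  }
  return ('\0' == *b ? a : NULL); /* empty pattern matches an empty string */
}

/* Mirrors the return convention in the hunk above: EXIT_SUCCESS if `info`
 * (a device-vendor or platform-name string) contains `vendor`. */
static int vendor_matches(const char* info, const char* vendor) {
  return (NULL != sketch_stristr(info, vendor) ? EXIT_SUCCESS : EXIT_FAILURE);
}
```

Under these assumptions, a query string such as "NVIDIA CUDA" matches "nvidia" regardless of case, whether the string came from `CL_DEVICE_VENDOR` or from `CL_PLATFORM_NAME`.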
@@ -744,7 +755,7 @@ int c_dbcsr_acc_opencl_device_vendor(cl_device_id device, const char vendor[]) {
int c_dbcsr_acc_opencl_device_uid(cl_device_id device, const char devname[], unsigned int* uid) {
int result;
if (NULL != uid) {
if (NULL != device && EXIT_SUCCESS == c_dbcsr_acc_opencl_device_vendor(device, "intel")) {
if (NULL != device && EXIT_SUCCESS == c_dbcsr_acc_opencl_device_vendor(device, "intel", 0 /*use_platform_name*/)) {
result = clGetDeviceInfo(device, 0x4251 /*CL_DEVICE_ID_INTEL*/, sizeof(unsigned int), uid, NULL);
}
else result = EXIT_FAILURE;
@@ -931,7 +942,9 @@ int c_dbcsr_acc_opencl_create_context(int thread_id, cl_device_id active_id) {
}
}
}
else if (CL_INVALID_DEVICE == result && EXIT_SUCCESS == c_dbcsr_acc_opencl_device_vendor(active_id, "nvidia")) {
else if (CL_INVALID_DEVICE == result &&
EXIT_SUCCESS == c_dbcsr_acc_opencl_device_vendor(active_id, "nvidia", 0 /*use_platform_name*/))
{
fprintf(stderr, "WARN ACC/OpenCL: if MPI-ranks target the same device in exclusive mode,\n"
" SMI must be used to enable sharing the device.\n");
}
@@ -996,7 +1009,7 @@ int c_dbcsr_acc_opencl_set_active_device(int thread_id, int device_id) {
c_dbcsr_acc_opencl_config.device[thread_id].uid = (cl_uint)-1;
}
c_dbcsr_acc_opencl_config.device[thread_id].intel =
(EXIT_SUCCESS == c_dbcsr_acc_opencl_device_vendor(active_id, "intel") ? CL_TRUE : CL_FALSE);
(EXIT_SUCCESS == c_dbcsr_acc_opencl_device_vendor(active_id, "intel", 0 /*use_platform_name*/) ? CL_TRUE : CL_FALSE);
}
}
}
@@ -1259,7 +1272,9 @@ int c_dbcsr_acc_opencl_kernel(int source_is_file, const char source[], const cha
const int cl_std_len = (int)strlen(cl_std);
nchar = LIBXSMM_SNPRINTF(buffer, sizeof(buffer),
ACC_OPENCL_CPPBIN " -P -C -nostdinc -D__OPENCL_VERSION__=%u %s %s %s %s >%s.cl", 100 * level_major + 10 * level_minor,
EXIT_SUCCESS != c_dbcsr_acc_opencl_device_vendor(active_id, "nvidia") ? "" : "-D__NV_CL_C_VERSION",
EXIT_SUCCESS != c_dbcsr_acc_opencl_device_vendor(active_id, "nvidia", 0 /*use_platform_name*/)
? ""
: "-D__NV_CL_C_VERSION",
NULL != build_params ? build_params : "", buffer_name, sed_pattern, kernel_name);
if (0 < nchar && (int)sizeof(buffer) > nchar &&
(0 == cl_std_len || (3 == write(file_tmp, "/*\n", 3) && cl_std_len == write(file_tmp, cl_std, cl_std_len) &&
18 changes: 11 additions & 7 deletions src/acc/opencl/acc_opencl.h
@@ -135,14 +135,18 @@
# define ACC_OPENCL_OMP_TID() (/*main*/ 0)
#endif

#if LIBXSMM_VERSION4(1, 17, 0, 0) < LIBXSMM_VERSION_NUMBER
# define ACC_OPENCL_EXPECT(EXPR) LIBXSMM_EXPECT(EXPR)
#else
# define ACC_OPENCL_EXPECT(EXPR) \
if (0 == (EXPR)) assert(0);
#if 1
# if LIBXSMM_VERSION4(1, 17, 0, 0) < LIBXSMM_VERSION_NUMBER
# define ACC_OPENCL_EXPECT(EXPR) LIBXSMM_EXPECT(EXPR)
# else
# define ACC_OPENCL_EXPECT(EXPR) \
if (0 == (EXPR)) assert(0);
# endif
#else /* elide */
# define ACC_OPENCL_EXPECT(EXPR) (void)(EXPR)
#endif
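The difference between the checked definition and the `/* elide */` definition of `ACC_OPENCL_EXPECT` can be shown with a small stand-alone sketch. The macro and function names below are illustrative and not taken from DBCSR; the point is only that `(void)(EXPR)` still evaluates the expression and merely drops the check.

```c
#include <assert.h>

/* Illustrative names (not from DBCSR): two ways to "expect" a result. */
#define EXPECT_CHECKED(EXPR) if (0 == (EXPR)) assert(0)
#define EXPECT_ELIDED(EXPR) (void)(EXPR)

static int ncalls = 0;

/* Counts invocations and returns its argument. */
static int counted(int value) {
  ++ncalls;
  return value;
}

/* Both variants evaluate EXPR exactly once; only the checked variant
 * aborts (unless NDEBUG is defined) when EXPR evaluates to zero. */
static int demo_expect(void) {
  ncalls = 0;
  EXPECT_CHECKED(counted(1)); /* passes: non-zero result */
  EXPECT_ELIDED(counted(0));  /* result ignored, call still happens */
  return ncalls;
}
```

In this sketch, side effects of the wrapped call are preserved in both variants, so switching to the elided form changes error detection but not which calls are made.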

#if !defined(NDEBUG)
#if !defined(NDEBUG) && 1
# define ACC_OPENCL_CHECK(EXPR, MSG, RESULT) \
do { \
if (EXIT_SUCCESS == (RESULT)) { \
@@ -301,7 +305,7 @@ int c_dbcsr_acc_opencl_device(int thread_id, cl_device_id* device);
/** Get device-ID for given device, and optionally global device-ID. */
int c_dbcsr_acc_opencl_device_id(cl_device_id device, int* device_id, int* global_id);
/** Confirm the vendor of the given device. */
int c_dbcsr_acc_opencl_device_vendor(cl_device_id device, const char vendor[]);
int c_dbcsr_acc_opencl_device_vendor(cl_device_id device, const char vendor[], int use_platform_name);
/** Capture or calculate UID based on the device-name. */
int c_dbcsr_acc_opencl_device_uid(cl_device_id device, const char devname[], unsigned int* uid);
/** Based on the device-ID, return the device's UID (capture or calculate), device name, and platform name. */
12 changes: 6 additions & 6 deletions src/acc/opencl/acc_opencl.sh
@@ -80,7 +80,7 @@ then
if [ "${CLFILE##*.}" = "cl" ]; then
if [ -e "${CLFILE}" ]; then
BNAME=$(${BASENAME} "${CLFILE}" .cl)
UNAME=$(echo "${BNAME}" | ${TR} '[:lower:]' '[:upper:]')
UNAME=$(${TR} '[:lower:]' '[:upper:]' <<<"${BNAME}")
SNAME=OPENCL_LIBSMM_STRING_${UNAME}
VNAME=opencl_libsmm_source_${BNAME}
MNAME=OPENCL_LIBSMM_SOURCE_${UNAME}
@@ -167,28 +167,28 @@ then
SNAME=OPENCL_LIBSMM_STRING_PARAMS_SMM
VNAME=opencl_libsmm_params_smm
DNAME=opencl_libsmm_devices
MNAME=$(echo "${VNAME}" | ${TR} '[:lower:]' '[:upper:]')
NNAME=$(echo "${DNAME}" | ${TR} '[:lower:]' '[:upper:]')
MNAME=$(${TR} '[:lower:]' '[:upper:]' <<<"${VNAME}")
NNAME=$(${TR} '[:lower:]' '[:upper:]' <<<"${DNAME}")
if [ "${DEVICES}" ]; then
echo >>"${OFILE}"
echo "#define ${MNAME} ${VNAME}" >>"${OFILE}"
echo "#define ${SNAME} \\" >>"${OFILE}"
CSVLINES=$(for CSVFILE in "${CSVFILES[@]}"; do ${SED} "1d;/^[[:space:]]*$/d;s/[\r]*$/\\\n\" \\\/" "${CSVFILE}"; done)
IFS=$'\n'
for LINE in ${CSVLINES}; do
I=0; IDEVICE=$(echo "${LINE}" | ${SED} "${DEVPAT}")
I=0; IDEVICE=$(${SED} "${DEVPAT}" <<<"${LINE}")
for DEVICE in ${DEVICES}; do
if [ "${DEVICE}" = "${IDEVICE}" ]; then break; fi
I=$((I+1));
done
echo "${LINE}" | ${SED} "s/[^${DELIM}]*//;s/^/ \"${I}/" >>"${OFILE}"
${SED} "s/[^${DELIM}]*//;s/^/ \"${I}/" <<<"${LINE}" >>"${OFILE}"
done
echo " \"\"" >>"${OFILE}"
echo "static const char ${VNAME}[] = ${SNAME};" >>"${OFILE}"
echo >>"${OFILE}"
echo "#define ${NNAME} ${DNAME}" >>"${OFILE}"
echo "static const char *const ${DNAME}[] = {" >>"${OFILE}"
I=0; S=","; NDEVICES=$(echo "${DEVICES}" | ${WC} -l)
I=0; S=","; NDEVICES=$(${WC} -l <<<"${DEVICES}")
for DEVICE in ${DEVICES}; do
I=$((I+1)); if [ "0" != "$((NDEVICES==I))" ]; then S=""; fi
echo " \"${DEVICE}\"${S}" >>"${OFILE}"
20 changes: 7 additions & 13 deletions src/acc/opencl/acc_opencl_stream.c
@@ -1,10 +1,10 @@
/*------------------------------------------------------------------------------------------------*/
/* Copyright (C) by the DBCSR developers group - All rights reserved */
/* This file is part of the DBCSR library. */
/* Copyright (C) by the DBCSR developers group - All rights reserved */
/* This file is part of the DBCSR library. */
/* */
/* For information on the license, see the LICENSE file. */
/* For further information please visit https://dbcsr.cp2k.org */
/* SPDX-License-Identifier: GPL-2.0+ */
/* For information on the license, see the LICENSE file. */
/* For further information please visit https://dbcsr.cp2k.org */
/* SPDX-License-Identifier: GPL-2.0+ */
/*------------------------------------------------------------------------------------------------*/
#if defined(__OPENCL)
# include "acc_opencl.h"
@@ -19,22 +19,19 @@
clCreateCommandQueue(CTX, DEV, (cl_command_queue_properties)(NULL != (PROPS) ? ((PROPS)[1]) : 0), RESULT)
# endif


# if defined(__cplusplus)
extern "C" {
# endif

int c_dbcsr_acc_opencl_stream_counter_base;
int c_dbcsr_acc_opencl_stream_counter;


c_dbcsr_acc_opencl_info_stream_t* c_dbcsr_acc_opencl_info_stream(void* stream) {
assert(NULL == stream || sizeof(c_dbcsr_acc_opencl_info_stream_t) <= (uintptr_t)stream);
return (
NULL != stream ? ((c_dbcsr_acc_opencl_info_stream_t*)((uintptr_t)stream - sizeof(c_dbcsr_acc_opencl_info_stream_t))) : NULL);
}


const int* c_dbcsr_acc_opencl_stream_priority(const void* stream) {
const int* result;
# if !defined(ACC_OPENCL_STREAM_PRIORITIES)
@@ -50,7 +47,6 @@ const int* c_dbcsr_acc_opencl_stream_priority(const void* stream) {
return result;
}


int c_dbcsr_acc_stream_create(void** stream_p, const char* name, int priority) {
ACC_OPENCL_STREAM_PROPERTIES_TYPE properties[8] = {
CL_QUEUE_PROPERTIES, 0 /*placeholder*/, 0 /* terminator */
@@ -245,7 +241,6 @@ int c_dbcsr_acc_stream_create(void** stream_p, const char* name, int priority) {
ACC_OPENCL_RETURN_CAUSE(result, name);
}


int c_dbcsr_acc_stream_destroy(void* stream) {
int result = EXIT_SUCCESS;
# if defined(__DBCSR_ACC) && defined(ACC_OPENCL_PROFILE)
@@ -297,7 +292,6 @@ int c_dbcsr_acc_stream_destroy(void* stream) {
ACC_OPENCL_RETURN(result);
}


int c_dbcsr_acc_stream_priority_range(int* least, int* greatest) {
int result = ((NULL != least || NULL != greatest) ? EXIT_SUCCESS : EXIT_FAILURE);
int priohi = -1, priolo = -1;
@@ -321,7 +315,8 @@ int c_dbcsr_acc_stream_priority_range(int* least, int* greatest) {
ACC_OPENCL_CHECK(clGetPlatformInfo(platform, CL_PLATFORM_EXTENSIONS, ACC_OPENCL_BUFFERSIZE, buffer, NULL),
"retrieve platform extensions", result);
if (EXIT_SUCCESS == result) {
if (NULL != strstr(buffer, "cl_khr_priority_hints") || EXIT_SUCCESS == c_dbcsr_acc_opencl_device_vendor(active_id, "nvidia"))
if (NULL != strstr(buffer, "cl_khr_priority_hints") ||
EXIT_SUCCESS == c_dbcsr_acc_opencl_device_vendor(active_id, "nvidia", 0 /*use_platform_name*/))
{
priohi = CL_QUEUE_PRIORITY_HIGH_KHR;
priolo = CL_QUEUE_PRIORITY_LOW_KHR;
@@ -337,7 +332,6 @@ int c_dbcsr_acc_stream_priority_range(int* least, int* greatest) {
ACC_OPENCL_RETURN(result);
}


int c_dbcsr_acc_stream_sync(void* stream) {
int result = EXIT_SUCCESS;
# if defined(ACC_OPENCL_STREAM_PRIORITIES)
11 changes: 8 additions & 3 deletions src/acc/opencl/smm/kernels/multiply.cl
@@ -545,14 +545,19 @@ FN(global T* restrict cdata, GLOBAL const T* restrict adata, GLOBAL const T* res
# if defined(BARRIER) && (MAX(1, SGS) < SWG) && defined(SLM_A)
BARRIER(CLK_LOCAL_MEM_FENCE);
# endif
# if (WRK == SM) && (SGS >= SM) && !defined(SLM_A) && !defined(REG_A)
# if (WRK == SM) && (SM <= SGS || SM <= SWG) && !defined(SLM_A) && !defined(REG_A)
const T a = AMK(idx, k);
# endif
UNROLL_FORCE(SM)
for (short m = 0; m < SM; ++m) {
# if (WRK == SM) && (SGS >= SM) && !defined(SLM_A) && !defined(REG_A)
# if (200 /*CL_VERSION_2_0*/ <= __OPENCL_VERSION__) && !defined(SLM_A) && !defined(REG_A) && (WRK == SM) && \
(SM <= SGS || SM <= SWG) /* size of subgroup or size of workgroup is sufficient */
# if (SM <= SGS)
CNM(idx, m) = MAD(sub_group_broadcast(a, m), b, CNM(idx, m));
# else
# else
CNM(idx, m) = MAD(work_group_broadcast(a, m), b, CNM(idx, m));
# endif
# else /* fallback */
CNM(idx, m) = MAD(AMK(m, k), b, CNM(idx, m));
# endif
}
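The idea behind the broadcast path in the loop above can be emulated in plain C. This is a sketch rather than OpenCL code: it assumes that each of the `SM` work-items privately holds one element of the current column of A (as `a = AMK(idx, k)` suggests when `WRK == SM`), and that a broadcast from work-item `m` therefore delivers the same value the fallback would load with `AMK(m, k)`. The helper names and the value of `SM` are hypothetical.

```c
#include <assert.h>

#define SM 4 /* number of rows and of work-items (illustrative) */

/* Emulates a broadcast: every work-item receives the value held by
 * work-item `lane`. `values[i]` is the private value of work-item i. */
static double broadcast(const double values[SM], int lane) { return values[lane]; }

/* One k-step of the update c[m] += A(m,k) * b for a single work-item,
 * once via the emulated broadcast and once via direct loads (fallback).
 * Returns 1 if both paths produce identical results. */
static int broadcast_matches_fallback(const double a_col[SM], double b) {
  double c_bcast[SM] = {0}, c_fallback[SM] = {0};
  int m, same = 1;
  for (m = 0; m < SM; ++m) {
    c_bcast[m] += broadcast(a_col, m) * b; /* value shared by work-item m */
    c_fallback[m] += a_col[m] * b;         /* value loaded directly */
  }
  for (m = 0; m < SM; ++m) same &= (c_bcast[m] == c_fallback[m]);
  return same;
}
```

Both paths compute the same update; in the kernel, the broadcast variant avoids each work-item reloading elements of A that another work-item already holds.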