OpenCL verbose output and documentation, improved auto-tuning scripts, minor fixes after #419 (#425)

* OpenCL-BE/LIBSMM: verbose output and documentation. Improved auto-tuning scripts. Minor fixes after #419.

* Fixed Makefile used to build acc_bench_trans/acc_bench_smm with CUDA (accommodate changes from #419).
* Fixed issue (#419 (comment)).
* More prefixes (global variables, etc) in follow-up of #419 (c_dbcsr_).
* Introduced (runtime-)verbosity level. Print device name (non-zero verbosity).
* Renamed ACC_OPENCL_VERBOSE to ACC_OPENCL_DEBUG.
* Improved documentation and documented ACC_OPENCL_VERBOSE.
* Introduced verbose output (time needed for kernel compilation, etc).
* ACC benchmark drivers: inform if no device was found.
* Warn about potentially exclusive device-mode.
* tune_multiply.py: option to only rely on primary objective.
* tune_multiply.py: catch CTRL-C and save configuration.
* tune_multiply.sh: relay result code of failing script.
* tune_multiply.sh: continuation with wrapper script.
* Enabled runtime-test OpenCL BE/LIBSMM.
* Unrelated: removed tabs from source file.
hfp authored Feb 8, 2021
1 parent 65be89e commit a7c4f3b
Showing 18 changed files with 305 additions and 174 deletions.
10 changes: 5 additions & 5 deletions .ci/daint.cscs.ch/Jenkinsfile
@@ -66,11 +66,11 @@ pipeline {
run_batch("0:15:00", "ocl", "build")
}
}
// stage('test') {
// steps {
// run_batch("1:00:00", "ocl", "test")
// }
// }
stage('test') {
steps {
run_batch("1:00:00", "ocl", "test")
}
}
}
}
stage("Intel") {
7 changes: 4 additions & 3 deletions .ci/daint.cscs.ch/ocl.build.sh
@@ -4,8 +4,8 @@
#SBATCH --constraint="mc"
#SBATCH --partition="cscsci"
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=3
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=12
#SBATCH --hint=nomultithread

set -o errexit
@@ -23,7 +23,8 @@ if [ ! -d "${HOME}/libxsmm" ]; then
git clone https://github.com/hfp/libxsmm.git
fi
cd "${HOME}/libxsmm"
git checkout 02d6ab213a35d5fc2f6454c3b465598b0c086c17
git fetch
git checkout 05cab50ec6f11a86c15c0ed511c5a9066c613dfb
make -j
cd ..

7 changes: 3 additions & 4 deletions .ci/daint.cscs.ch/ocl.test.sh
@@ -4,8 +4,7 @@
#SBATCH --constraint="gpu"
#SBATCH --partition="cscsci"
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=3
#SBATCH --ntasks-per-node=1
#SBATCH --hint=nomultithread

set -o errexit
@@ -21,10 +20,10 @@ set -o xtrace # do not set earlier to avoid noise from module

umask 0002 # make sure group members can access the data

mkdir --mode=0775 -p "${SCRATCH}/${BUILD_TAG}.ocl"
mkdir -p "${SCRATCH}/${BUILD_TAG}.ocl"
chmod 0775 "${SCRATCH}/${BUILD_TAG}.ocl"
cd "${SCRATCH}/${BUILD_TAG}.ocl"

export CRAY_CUDA_MPS=1 # enable the CUDA proxy for MPI+CUDA
export OMP_PROC_BIND=TRUE # set thread affinity
# OMP_NUM_THREADS is set by cmake

16 changes: 11 additions & 5 deletions src/acc/acc_bench_smm.c
@@ -106,15 +106,18 @@ int main(int argc, char* argv[])
printf("%s%s%i %i %i %i %i %i %i %i\n", 0 < argc ? argv[0] : "", 0 < argc ? " " : "",
nrepeat, stack_size, m, n, k, nc, na, nb);
CHECK(c_dbcsr_acc_init(), &result);
/* note: libsmm_acc_init() may imply acc_init() */
CHECK(libsmm_acc_init(), &result);
CHECK(c_dbcsr_acc_get_ndevices(&ndevices), &result);
if (0 < ndevices) {
#if defined(_DEBUG)
fprintf(stderr, "number of devices found: %i\n", ndevices);
#endif
}
else {
#if defined(_DEBUG)
fprintf(stderr, "Error: no device found!\n");
fprintf(stderr, "No ACC-device found!\n");
#if !defined(__CUDA)
CHECK(libsmm_acc_finalize(), NULL);
#endif
CHECK(c_dbcsr_acc_finalize(), NULL);
return result;
@@ -165,14 +168,14 @@ int main(int argc, char* argv[])
CHECK(libsmm_acc_transpose(trans_dev, 0/*offset*/, nb, bmat_dev,
DBCSR_TYPE(ELEM_TYPE), n, k, MAX_KERNEL_DIM, stream), &result);
}
#if defined(USE_LIBXSMM)
# if defined(USE_LIBXSMM)
CHECK(c_dbcsr_acc_stream_sync(stream), &result);
start = libxsmm_timer_tick();
#endif
# endif
/* to perform NN-SMMs on the device, all B-matrices are transposed upfront (SMM-kernel is limited to NT) */
CHECK(libsmm_acc_transpose(trans_dev, 0/*offset*/, nb, bmat_dev,
DBCSR_TYPE(ELEM_TYPE), k, n, MAX_KERNEL_DIM, stream), &result);
#if defined(USE_LIBXSMM)
# if defined(USE_LIBXSMM)
CHECK(c_dbcsr_acc_stream_sync(stream), &result);
transpose = libxsmm_timer_duration(start, libxsmm_timer_tick());
# endif
@@ -282,6 +285,9 @@ int main(int argc, char* argv[])
CHECK(c_dbcsr_acc_dev_mem_deallocate(bmat_dev), NULL);
CHECK(c_dbcsr_acc_dev_mem_deallocate(cmat_dev), NULL);
CHECK(c_dbcsr_acc_stream_destroy(stream), NULL);
#if !defined(__CUDA)
CHECK(libsmm_acc_finalize(), NULL);
#endif
CHECK(c_dbcsr_acc_finalize(), NULL);
if (EXIT_SUCCESS != result) {
fprintf(stderr, "FAILED\n");
10 changes: 8 additions & 2 deletions src/acc/acc_bench_trans.c
@@ -91,15 +91,18 @@ int main(int argc, char* argv[])
assert(m <= (mn / n) && 0 == (mn % n));
printf("%s%s%i %i %i %i\n", 0 < argc ? argv[0] : "", 0 < argc ? " " : "", nrepeat, stack_size, m, n);
CHECK(c_dbcsr_acc_init(), &result);
/* note: libsmm_acc_init() may imply acc_init() */
CHECK(libsmm_acc_init(), &result);
CHECK(c_dbcsr_acc_get_ndevices(&ndevices), &result);
if (0 < ndevices) {
#if defined(_DEBUG)
fprintf(stderr, "number of devices found: %i\n", ndevices);
#endif
}
else {
#if defined(_DEBUG)
fprintf(stderr, "Error: no device found!\n");
fprintf(stderr, "No ACC-device found!\n");
#if !defined(__CUDA)
CHECK(libsmm_acc_finalize(), NULL);
#endif
CHECK(c_dbcsr_acc_finalize(), NULL);
return result;
@@ -210,6 +213,9 @@ int main(int argc, char* argv[])
CHECK(c_dbcsr_acc_dev_mem_deallocate(stack_dev), NULL);
CHECK(c_dbcsr_acc_dev_mem_deallocate(mat_dev), NULL);
CHECK(c_dbcsr_acc_stream_destroy(stream), NULL);
#if !defined(__CUDA)
CHECK(libsmm_acc_finalize(), NULL);
#endif
CHECK(c_dbcsr_acc_finalize(), NULL);
if (EXIT_SUCCESS != result) {
fprintf(stderr, "FAILED\n");
9 changes: 3 additions & 6 deletions src/acc/cuda/Makefile
@@ -1,6 +1,6 @@
INCACC := $(wildcard *.h*) ../acc.h
SRCACC := $(wildcard *.cpp)
OBJACC := $(SRCACC:.cpp=.o) acc_cublas.o
OBJACC := $(SRCACC:.cpp=.o)

GPUSMM := $(wildcard ../libsmm_acc/kernels/*.h*)
INCSMM := $(wildcard ../libsmm_acc/*.h*) ../acc_libsmm.h \
@@ -130,10 +130,7 @@ test: ../dbcsr_acc_test
../libsmm_acc/smm_acc_kernels.h: $(GPUSMM) Makefile ../libsmm_acc/generate_kernels.py ../libsmm_acc/parameters/parameters_$(WITH_GPU).json
@cd ../libsmm_acc && $(PYTHON) ../libsmm_acc/generate_kernels.py ../libsmm_acc/kernels

acc_cublas.o: acc_cublas.cu Makefile
$(NVCC) $(addprefix -Xcompiler $(NULL),$(CXXFLAGS)) -c $< -o $@

../dbcsr_acc.a: $(OBJACC) acc_cublas.o ../libsmm_acc/libsmm_acc_init.o
../dbcsr_acc.a: $(OBJACC) ../libsmm_acc/libsmm_acc_init.o
$(AR) -rs $@ $^

../dbcsr_acc_smm.a: $(OBJSMM)
@@ -153,7 +150,7 @@ acc_bench_trans.o: ../acc_bench_trans.c Makefile
$(CXX) $^ $(LDFLAGS) -o $@

dbcsr_acc_test.o: ../../../tests/dbcsr_acc_test.c Makefile
$(CC) $(CFLAGS) -c $< -o $@
$(CC) $(CFLAGS) -I../.. -c $< -o $@
../dbcsr_acc_test: dbcsr_acc_test.o ../dbcsr_acc_smm.a ../dbcsr_acc.a
$(CXX) $^ $(LDFLAGS) -o $@

8 changes: 4 additions & 4 deletions src/acc/cuda/acc_cublas.cpp
@@ -49,10 +49,10 @@ int acc_blas_dgemm(ACC_BLAS(Handle_t) *handle, char transa, char transb,
ACC_BLAS_CALL(SetStream, (*handle, *stream));

ACC_BLAS_CALL(Dgemm, (*handle, cTransa, cTransb,
m, n, k,
&alpha, &a_data[a_offset], lda,
&b_data[ b_offset], ldb,
&beta, &c_data[ c_offset], lda));
m, n, k,
&alpha, &a_data[a_offset], lda,
&b_data[ b_offset], ldb,
&beta, &c_data[ c_offset], lda));

return(0);
}
11 changes: 8 additions & 3 deletions src/acc/libsmm_acc/libsmm_acc_benchmark.cpp
@@ -350,9 +350,12 @@ int libsmm_acc_benchmark(libsmm_acc_benchmark_t* h,
best_gflops = gflops;
best_kernel = ikern;
}
} else {
}
#if !defined(NDEBUG)
else {
printf("%sOK %s\n", msg_prefix, descr);
}
#endif
}

if(h->mode == tune){
@@ -427,10 +430,12 @@ int libsmm_acc_benchmark_transpose_(int n_stack, int* stack, int* d_stack,
if(sumGPU != sumCPU){
printf("%sERROR %s checksum_diff: %g\n", msg_prefix, descr, sumGPU-sumCPU);
error_counter++;
} else {
}
#if !defined(NDEBUG)
else {
printf("%sOK %s\n", msg_prefix, descr);
}

#endif
return error_counter;

}
5 changes: 4 additions & 1 deletion src/acc/opencl/README.md
@@ -8,7 +8,7 @@ The OpenCL backend implements the [ACC interface](https://github.com/cp2k/dbcsr/

### Compile-time Settings

Compile-time settings are (implicitly) documented and can be adjusted by editing [acc_opencl.h](https://github.com/cp2k/dbcsr/blob/develop/src/acc/opencl/acc_opencl.h) (adjusting the build-line as per `-D` is possible as well but less convenient). For example, `ACC_OPENCL_STREAM_PRIORITIES` is enabled by default (and further confirmed at runtime/build-time) but can be disabled, or `ACC_OPENCL_VERBOSE` (which is disabled by default) can be enabled for debug purpose. More sensitive/private compile-time settings may be available within particular translation units like in `acc_opencl_mem.c`.
Compile-time settings are (implicitly) documented and can be adjusted by editing [acc_opencl.h](https://github.com/cp2k/dbcsr/blob/develop/src/acc/opencl/acc_opencl.h) (adjusting the build-line as per `-D` is possible as well but less convenient). For example, `ACC_OPENCL_STREAM_PRIORITIES` is enabled by default (and further confirmed at runtime/build-time) but can be disabled, or `ACC_OPENCL_DEBUG` (which is disabled by default) can be enabled for debug purpose. More sensitive/private compile-time settings may be available within particular translation units like in `acc_opencl_mem.c`.

An application of compile-time settings (and perhaps a valuable contribution) might be to call a GPU library in OpenCL-based LIBSMM. In such case, Shared Virtual Memory support (SVM) in OpenCL comes handy and can be enabled per `ACC_OPENCL_SVM`. The latter allows then to simply take the raw pointer out of an `cl_mem` object, and pass it into such library/function (which in turn can work across language borders, etc.).

@@ -19,6 +19,9 @@ Runtime settings are made by the means of environment variables (implemented in
* `ACC_OPENCL_VENDOR`: character string matching the vendor of the OpenCL device in an case-insensitive fashion, e.g., "intel".
* `ACC_OPENCL_DEVTYPE`: character string matching the device-kind like "cpu", "gpu", or another kind if neither CPU or GPU.
* `ACC_OPENCL_DEVICE`: non-negative integer number to select a device from the (internally enumerated) list of devices.
* `ACC_OPENCL_VERBOSE`: verbosity level (integer).
    * `ACC_OPENCL_VERBOSE=1`: outputs (stderr) the number of devices found and the name of the selected device.
    * `ACC_OPENCL_VERBOSE=2`: outputs (stderr) the duration needed to generate a requested kernel.
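
As a rough illustration of how an integer verbosity level like the one listed above could gate diagnostic output, the following C sketch reads a level from the environment and prints to stderr accordingly. The helper names `verbosity_from_env` and `report` are hypothetical and not taken from the backend, the message format is made up, and treating a higher level as including the output of lower levels is an assumption.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper (not part of the backend): read an integer level
 * from an environment variable; an unset or empty variable yields 0. */
static int verbosity_from_env(const char* name) {
  const char* value;
  assert(NULL != name);
  value = getenv(name);
  return (NULL != value && '\0' != *value) ? atoi(value) : 0;
}

/* Hypothetical reporting routine; returns the number of lines printed.
 * Assumption: level 2 also prints what level 1 prints. */
static int report(int verbosity, int ndevices, const char* device_name,
                  double build_seconds) {
  int nlines = 0;
  if (1 <= verbosity) { /* device count and name of the selected device */
    fprintf(stderr, "ndevices=%i device=\"%s\"\n", ndevices, device_name);
    ++nlines;
  }
  if (2 <= verbosity) { /* time needed to generate a kernel */
    fprintf(stderr, "kernel generated in %.3f s\n", build_seconds);
    ++nlines;
  }
  return nlines;
}
```

A driver could call `report(verbosity_from_env("ACC_OPENCL_VERBOSE"), ...)` once a device has been selected; with the variable unset, nothing is printed.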

The OpenCL backend enumerates and orders devices primarily by device-kind (GPU, CPU, and others in that order) and by memory capacity (secondary criterion). Device IDs are zero-based as per ACC interface (and less than what is permitted/returned by `acc_get_ndevices`).
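
The ordering described above can be sketched with a `qsort` comparator. The `device_info_t` structure and the function names are hypothetical and only serve the example; preferring the larger memory capacity within the same device-kind is an assumption, since the text only names memory capacity as the secondary criterion.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical device description; smaller kind value sorts first. */
typedef enum { KIND_GPU = 0, KIND_CPU = 1, KIND_OTHER = 2 } device_kind_t;
typedef struct {
  device_kind_t kind;
  unsigned long long mem_bytes;
  const char* name;
} device_info_t;

/* Primary criterion: device-kind (GPU, CPU, others).
 * Secondary criterion: memory capacity (assumed: larger first). */
static int compare_devices(const void* a, const void* b) {
  const device_info_t* da = (const device_info_t*)a;
  const device_info_t* db = (const device_info_t*)b;
  if (da->kind != db->kind) return (da->kind < db->kind) ? -1 : 1;
  if (da->mem_bytes != db->mem_bytes) {
    return (da->mem_bytes > db->mem_bytes) ? -1 : 1;
  }
  return 0;
}

/* Sort in place; afterwards the array index is the zero-based device ID. */
static void order_devices(device_info_t* devices, size_t ndevices) {
  assert(NULL != devices || 0 == ndevices);
  if (1 < ndevices) qsort(devices, ndevices, sizeof(*devices), compare_devices);
}
```

After sorting, index 0 refers to the preferred device, and any valid device ID is smaller than the number of enumerated devices.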
