Omp offload #165

Open · wants to merge 24 commits into develop
Conversation

@vlkale commented Nov 13, 2021

This pull request contains the following proposed contributions to AutoDock-GPU:

(1) OpenMP target-offload parallelization of AutoDock-GPU's GPU work, as an alternative to the existing CUDA parallelization, together with experimentation on Summit using LLVM 14's OpenMP implementation (a minimal offload sketch follows this list).

(2) OpenMP parallelization of AutoDock-GPU's work across the multiple GPUs of a node, through (i) having multiple CPU threads launch target regions on a GPU specified via the device clause on the target construct (chosen either at compile time or at runtime), and (ii) a task-to-device scheduling library that lets a thread dynamically select at runtime which device to run on, based on the state of the GPUs (we focus on occupancy and load); a device-selection sketch also follows this list. We note that the task-to-GPU scheduling strategies are based on those being developed in the SOLLVE project for a variety of applications.

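As a rough illustration of (1), here is a minimal sketch of the kind of OpenMP target offload that replaces a CUDA kernel launch. The names here (`score_poses`, `poses`, `energies`) are hypothetical placeholders, not AutoDock-GPU's actual kernels:

```cpp
// Hypothetical stand-in for one of AutoDock-GPU's scoring kernels.
// A CUDA version would launch this as a __global__ kernel; here the
// same loop is offloaded with an OpenMP target construct instead.
void score_poses(const float *poses, float *energies, int num_poses)
{
    #pragma omp target teams distribute parallel for \
            map(to: poses[0:num_poses]) map(from: energies[0:num_poses])
    for (int i = 0; i < num_poses; ++i) {
        // placeholder for the real per-pose energy evaluation
        energies[i] = poses[i] * poses[i];
    }
}
```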

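For (2), a minimal sketch of multiple host threads launching target regions on different GPUs via the device clause. The `pick_device()` policy shown (fewest in-flight tasks, tracked with atomics) is a simplified stand-in for the task-to-device scheduling library, not its actual API:

```cpp
#include <omp.h>
#include <atomic>

constexpr int kMaxDevices = 16;
// Per-device count of in-flight tasks; the real scheduler also
// considers GPU occupancy and load.
static std::atomic<int> g_inflight[kMaxDevices];

static int pick_device(int num_devs)
{
    int best = 0;
    for (int d = 1; d < num_devs; ++d)
        if (g_inflight[d].load() < g_inflight[best].load()) best = d;
    return best;
}

void run_all_docking(float *work, int num_runs, int chunk)
{
    const int num_devs = omp_get_num_devices();

    #pragma omp parallel for schedule(dynamic)
    for (int r = 0; r < num_runs; ++r) {
        const int dev = pick_device(num_devs);  // runtime device selection
        g_inflight[dev]++;
        #pragma omp target device(dev) \
                map(tofrom: work[r * chunk : chunk])
        {
            for (int i = 0; i < chunk; ++i)     // placeholder for one
                work[r * chunk + i] += 1.0f;    // docking run's work
        }
        g_inflight[dev]--;
    }
}
```

Using schedule(dynamic) on the host loop lets whichever thread finishes first grab the next run, which is what enables load balancing across devices.
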
On Summit with 3 GPUs: our original OpenMP version of AutoDock-GPU with round-robin scheduling produced results matching the CUDA version. Through performance optimizations over the original OpenMP version in the last 3 weeks, and using a task-to-GPU scheduling strategy that picks a random GPU, we have improved the OpenMP version by 11x. It is still 4x slower than the CUDA version.

Using nvprof on both versions, we see that most (~99%) of the application's execution time is spent in kernel3, `gpu_perform_LS_kernel(float*, float*)`. We believe our OpenMP version is slower primarily because of a manually optimized SIMD reduction in the CUDA version, which we are so far having trouble translating efficiently to OpenMP (we currently just use a reduction clause on the target directive; see the sketch below). We are looking into invoking the CUDA reduction code from within the OpenMP target region, if that is possible, and are also considering invoking all of the CUDA code within the OpenMP target regions to see what performance that yields, among other optimizations to the OpenMP version.
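
For reference, a minimal sketch of how we currently express the reduction, i.e., a plain reduction clause on the combined target construct; the hand-tuned, shuffle-based SIMD reduction in the CUDA version is what we have not yet matched (names again hypothetical):

```cpp
// Current approach: let the compiler generate the reduction.
// The CUDA version instead uses a hand-optimized intra-warp
// (shuffle-based) reduction, which we have not yet reproduced
// efficiently in OpenMP.
float sum_energies(const float *energies, int n)
{
    float total = 0.0f;
    #pragma omp target teams distribute parallel for \
            reduction(+: total) map(to: energies[0:n])
    for (int i = 0; i < n; ++i)
        total += energies[i];
    return total;
}
```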

Nevertheless, we believe these changes hold significant promise for (a) the use of OpenMP in AutoDock-GPU and (b) multi-device scheduling for load balancing, in particular through OpenMP and its tasking capabilities.

Consider this an initial pull request to inform you of our development. We expect to push more updates to this pull request in the coming weeks.

We have been in contact with @atillack and Stefano Forli from Scripps over the last couple of months. Please reach out to Mathialakan (@mathialakan) or me (@vlkale) with questions about the code.

@atillack (Member)

@vlkale @mathialakan Thank you. I look forward to testing it out and working on it.

@vlkale (Author) commented Feb 4, 2022

@atillack Please let us know which input test sets you are using; we are curious about that.

@diogomart (Member)

@vlkale I think you may be referring to the E50 plots, which are described in J. Chem. Theory Comput. 2021, 17, 2, 1060–1073. See for example #139 (comment).

@vlkale (Author) commented Feb 26, 2022

> @vlkale I think you may be referring to the E50 plots, which are described in J. Chem. Theory Comput. 2021, 17, 2, 1060–1073. See for example #139 (comment).

Thanks for this, and sorry I didn't follow up and reply here sooner. @mathialakan and I looked at it, and it is helpful to us. We wanted a representative, reasonably sized input data set that we can use for further testing of this fork. I have spoken with @mathialakan and he may have a few more things to say here.
