Omp offload #165

Open · wants to merge 24 commits into develop
Conversation

@vlkale commented Nov 13, 2021

This pull request contains the following proposed contributions to AutoDock-GPU:

(1) OpenMP target-offload parallelization of AutoDock-GPU's GPU work, as an alternative to the existing CUDA parallelization, together with experimentation on Summit using LLVM 14's OpenMP implementation (a minimal offload sketch follows this list).

(2) OpenMP parallelization of AutoDock-GPU's work across the multiple GPUs of a node, through (i) having multiple CPU threads launch target regions on a GPU specified via the device clause on the target construct (chosen either at compile time or at runtime), and (ii) a task-to-device scheduling library that lets a thread dynamically select at runtime which device to run on, based on the state of the GPUs (we focus on occupancy and load); a device-selection sketch also follows this list. We note that the task-to-GPU scheduling strategies are based on those being developed in the SOLLVE project for a variety of applications.

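As a rough illustration of (1), here is a minimal sketch of the kind of OpenMP target offload that replaces a CUDA kernel launch. The names here (`score_poses`, `poses`, `energies`) are hypothetical placeholders, not AutoDock-GPU's actual kernels:

```cpp
// Hypothetical stand-in for one of AutoDock-GPU's scoring kernels.
// A CUDA version would launch this as a __global__ kernel; here the
// same loop is offloaded with an OpenMP target construct instead.
void score_poses(const float *poses, float *energies, int num_poses)
{
    #pragma omp target teams distribute parallel for \
            map(to: poses[0:num_poses]) map(from: energies[0:num_poses])
    for (int i = 0; i < num_poses; ++i) {
        // placeholder for the real per-pose energy evaluation
        energies[i] = poses[i] * poses[i];
    }
}
```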

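For (2), a minimal sketch of multiple host threads launching target regions on different GPUs via the device clause. The `pick_device()` policy shown (fewest in-flight tasks, tracked with atomics) is a simplified stand-in for the task-to-device scheduling library, not its actual API:

```cpp
#include <omp.h>
#include <atomic>

constexpr int kMaxDevices = 16;
// Per-device count of in-flight tasks; the real scheduler also
// considers GPU occupancy and load.
static std::atomic<int> g_inflight[kMaxDevices];

static int pick_device(int num_devs)
{
    int best = 0;
    for (int d = 1; d < num_devs; ++d)
        if (g_inflight[d].load() < g_inflight[best].load()) best = d;
    return best;
}

void run_all_docking(float *work, int num_runs, int chunk)
{
    const int num_devs = omp_get_num_devices();

    #pragma omp parallel for schedule(dynamic)
    for (int r = 0; r < num_runs; ++r) {
        const int dev = pick_device(num_devs);  // runtime device selection
        g_inflight[dev]++;
        #pragma omp target device(dev) \
                map(tofrom: work[r * chunk : chunk])
        {
            for (int i = 0; i < chunk; ++i)     // placeholder for one
                work[r * chunk + i] += 1.0f;    // docking run's work
        }
        g_inflight[dev]--;
    }
}
```

Using schedule(dynamic) on the host loop lets whichever thread finishes first grab the next run, which is what enables load balancing across devices.
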
On Summit with 3 GPUs: our original OpenMP version of AutoDock-GPU with round-robin scheduling produced results matching the CUDA version. Through performance optimizations over the original OpenMP version in the last 3 weeks, and using a task-to-GPU scheduling strategy that picks a random GPU, we have improved the OpenMP version by 11x. It is still 4x slower than the CUDA version.

Using nvprof on both versions, we see that most (~99%) of the application's execution time is spent in kernel3, `gpu_perform_LS_kernel(float*, float*)`. We believe our OpenMP version is slower primarily because of a manually optimized SIMD reduction in the CUDA version, which we are so far having trouble translating efficiently to OpenMP (we currently just use a reduction clause on the target directive; see the sketch below). We are looking into invoking the CUDA reduction code from within the OpenMP target region, if that is possible, and are also considering invoking all of the CUDA code within the OpenMP target regions to see what performance that yields, among other optimizations to the OpenMP version.
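
For reference, a minimal sketch of how we currently express the reduction, i.e., a plain reduction clause on the combined target construct; the hand-tuned, shuffle-based SIMD reduction in the CUDA version is what we have not yet matched (names again hypothetical):

```cpp
// Current approach: let the compiler generate the reduction.
// The CUDA version instead uses a hand-optimized intra-warp
// (shuffle-based) reduction, which we have not yet reproduced
// efficiently in OpenMP.
float sum_energies(const float *energies, int n)
{
    float total = 0.0f;
    #pragma omp target teams distribute parallel for \
            reduction(+: total) map(to: energies[0:n])
    for (int i = 0; i < n; ++i)
        total += energies[i];
    return total;
}
```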

Nevertheless, we believe these changes hold significant promise for (a) the use of OpenMP in AutoDock-GPU and (b) multi-device scheduling for load balancing, in particular through OpenMP and its tasking capabilities.

Consider this an initial pull request to inform you of our development. We expect to push more updates to this pull request in the coming weeks.

We have been in contact with @atillack and Stefano Forli from Scripps over the last couple of months. Please reach out to Mathialakan (@mathialakan) or me (@vlkale) with questions about the code.

@atillack (Member)

@vlkale @mathialakan Thank you. I look forward to testing it out and working on it.

@vlkale (Author) commented Feb 4, 2022

@atillack Please let us know which input test sets you are using; we are curious about that.

@diogomart (Member)

@vlkale I think you may be referring to the E50 plots, which are described in J. Chem. Theory Comput. 2021, 17, 2, 1060–1073. See for example #139 (comment).

@vlkale (Author) commented Feb 26, 2022

> @vlkale I think you may be referring to the E50 plots, which are described in J. Chem. Theory Comput. 2021, 17, 2, 1060–1073. See for example #139 (comment).

Thanks for this, and sorry I didn't follow up and reply here sooner. @mathialakan and I looked at it, and it is helpful to us. We wanted a representative, reasonably sized input data set that we can use for further testing of this fork. I have spoken with @mathialakan and he may have a few more things to say here.
