GPU Assignment (#3382)
When there are multiple GPU devices visible to an MPI process, we need
to decide which device to use. In the current development branch, we
create a local sub-communicator using something like MPI_COMM_TYPE_SHARED
and assign the devices in a round-robin way based on the local rank.
This, however, does not always work. In a WarpX issue
(ECP-WarpX/WarpX#3967), it was reported that all processes in a node
used the same GPU because the sub-communicator contained only one
process. In this PR, we use MPI_Get_processor_name as an alternative way
of obtaining the ranks within an actual node and use that for GPU
assignment. If both approaches fail, we fall back to using the rank in
the global communicator modulo the number of devices.

In this PR, we have also removed AMREX_GPUS_PER_SOCKET and
AMREX_GPUS_PER_NODE. They have not been used for some time now.
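For reference, here is a minimal C++ sketch of the selection order described above. It is illustrative only (the helper name pick_device_id is hypothetical); the actual logic is in Src/Base/AMReX_GpuDevice.cpp below.

// Illustrative sketch of the selection order; the real code is in
// Src/Base/AMReX_GpuDevice.cpp and uses ParallelDescriptor helpers.
int pick_device_id (int nprocs_per_node,      // size of the MPI_COMM_TYPE_SHARED group
                    int rank_in_node,         // rank within that group
                    int nprocs_per_processor, // ranks sharing this MPI_Get_processor_name
                    int rank_in_processor,    // rank within that group
                    int global_rank,          // rank in the global communicator
                    int gpu_device_count)     // number of visible devices
{
    if (nprocs_per_node == gpu_device_count) {
        return rank_in_node;                   // one rank per device on this node
    } else if (nprocs_per_processor == gpu_device_count) {
        return rank_in_processor;              // processor-name grouping matches
    } else {
        return global_rank % gpu_device_count; // last-resort round robin
    }
}

Note that the first two branches fire only when the number of local ranks exactly matches the number of visible devices; otherwise the global-rank modulo fallback is used.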
WeiqunZhang authored Jul 27, 2023
1 parent 34c0ae3 commit f5bf0e4
Showing 13 changed files with 78 additions and 153 deletions.
9 changes: 1 addition & 8 deletions Docs/sphinx_documentation/source/GPU.rst
@@ -1685,14 +1685,7 @@ AMReX for GPUs:
AMReX will attempt to do the best job it can assigning MPI ranks to GPUs by
doing round robin assignment. This may be suboptimal because this assignment
scheme would not be aware of locality benefits that come from having an MPI
rank be on the same socket as the GPU it is managing. If you know the hardware
layout of the system you're running on, specifically the number of GPUs per
socket (`M`) and number of GPUs per node (`N`), you can set the preprocessor
defines `-DAMREX_GPUS_PER_SOCKET=M` and `-DAMREX_GPUS_PER_NODE=N`, which are
exposed in the GNU Make system through the variables `GPUS_PER_SOCKET` and
`GPUS_PER_NODE` respectively (see an example in `Tools/GNUMake/sites/Make.olcf`).
Then AMReX can ensure that each MPI rank selects a GPU on the same socket as
that rank (assuming your MPI implementation supports MPI 3.)
rank be on the same socket as the GPU it is managing.


.. ===================================================================
96 changes: 11 additions & 85 deletions Src/Base/AMReX_GpuDevice.cpp
@@ -181,93 +181,16 @@ Device::Initialize ()
device_id = 0;
}
else {

// ifdef the following against MPI so it compiles, but note
// that we can only get here if using more than one processor,
// which requires MPI.

#ifdef BL_USE_MPI

// Create a communicator out of only the ranks sharing GPUs.
// The default assumption is that this is all the ranks on the
// same node, and to get that we'll use the MPI-3.0 split that
// looks for shared memory communicators (and we'll error out
// if that standard is unsupported).

#if MPI_VERSION < 3
amrex::Abort("When using GPUs with MPI, if multiple devices are visible to each rank, MPI-3.0 must be supported.");
#endif

// However, it's possible that the ranks sharing GPUs will be
// confined to a single socket rather than a full node. Indeed,
// this is often the optimal configuration; for example, on Summit,
// a good configuration using jsrun is one resource set per
// socket (two per node), with three GPUs per resource set.
// To deal with this where we can, we'll take advantage of OpenMPI's
// specialized split by socket. However, we only want to do this
// if in fact our resource set is confined to the socket.
// To make this determination we need to have system information,
// which is provided by the build system for the systems
// we know about. The simple heuristic we'll use to determine
// this is if the number of visible devices is smaller than
// the known number of GPUs per socket.

#if defined(AMREX_USE_CUDA)
#if (!defined(AMREX_GPUS_PER_SOCKET) && !defined(AMREX_GPUS_PER_NODE))
if (amrex::Verbose()) {
amrex::Warning("Multiple GPUs are visible to each MPI rank, but the number of GPUs per socket or node has not been provided.\n"
"This may lead to incorrect or suboptimal rank-to-GPU mapping.");
if (amrex::Verbose() && ParallelDescriptor::IOProcessor()) {
amrex::Warning("Multiple GPUs are visible to each MPI rank, This may lead to incorrect or suboptimal rank-to-GPU mapping.");
}
#endif
#endif

MPI_Comm local_comm;

int split_type;

#if (defined(OPEN_MPI) && defined(AMREX_GPUS_PER_SOCKET))
if (gpu_device_count <= AMREX_GPUS_PER_SOCKET)
split_type = OMPI_COMM_TYPE_SOCKET;
else
split_type = OMPI_COMM_TYPE_NODE;
#else
split_type = MPI_COMM_TYPE_SHARED;
#endif

// We have no preference on how ranks get ordered within this communicator.
int key = 0;

MPI_Comm_split_type(ParallelDescriptor::Communicator(), split_type, key, MPI_INFO_NULL, &local_comm);

// Get rank within the local communicator, and number of ranks.
MPI_Comm_size(local_comm, &n_local_procs);

int my_rank;
MPI_Comm_rank(local_comm, &my_rank);

// Free the local communicator.
MPI_Comm_free(&local_comm);

// For each rank that shares a GPU, use round-robin assignment
// to assign MPI ranks to GPUs. We will arbitrarily assign
// ranks to GPUs, assuming that socket awareness has already
// been handled.

device_id = my_rank % gpu_device_count;

// If we detect more ranks than visible GPUs, warn the user
// that this will fail in the case where the devices are
// set to exclusive process mode and MPS is not enabled.

if (n_local_procs > gpu_device_count && amrex::Verbose()) {
amrex::Print() << "Mapping more than one rank per GPU. This will fail if the GPUs are in exclusive process mode\n"
<< "and MPS is not enabled. In that case you will see an error such as: 'all CUDA-capable devices are\n"
<< "busy'. To resolve that issue, set the GPUs to the default compute mode, or enable MPS. If you are\n"
<< "on a cluster, please consult the system user guide for how to launch your job in this configuration.\n";
if (ParallelDescriptor::NProcsPerNode() == gpu_device_count) {
device_id = ParallelDescriptor::MyRankInNode();
} else if (ParallelDescriptor::NProcsPerProcessor() == gpu_device_count) {
device_id = ParallelDescriptor::MyRankInProcessor();
} else {
device_id = ParallelDescriptor::MyProc() % gpu_device_count;
}

#endif // BL_USE_MPI

}

AMREX_HIP_OR_CUDA(AMREX_HIP_SAFE_CALL (hipSetDevice(device_id));,
@@ -376,6 +299,9 @@ Device::Initialize ()
<< " initialized with " << num_devices_used
<< ((num_devices_used == 1) ? " device.\n"
: " devices.\n");
if (num_devices_used < ParallelDescriptor::NProcs() && ParallelDescriptor::IOProcessor()) {
amrex::Warning("There are more MPI processes than the number of GPUs.");
}
}

#if defined(AMREX_USE_CUDA) && (defined(AMREX_PROFILING) || defined(AMREX_TINY_PROFILING))
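After initialization, application code can check which device each rank ended up with. A small usage sketch, assuming the accessor amrex::Gpu::Device::deviceId() is available in this version of AMReX (it is not part of this diff):

#include <AMReX.H>
#include <AMReX_GpuDevice.H>
#include <AMReX_ParallelDescriptor.H>
#include <AMReX_Print.H>

int main (int argc, char* argv[])
{
    amrex::Initialize(argc, argv);
    {
        // Every rank reports the device it was assigned during initialization.
        // deviceId() is assumed to return the device_id chosen above.
        amrex::AllPrint() << "Rank " << amrex::ParallelDescriptor::MyProc()
                          << " uses GPU " << amrex::Gpu::Device::deviceId() << "\n";
    }
    amrex::Finalize();
}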
23 changes: 21 additions & 2 deletions Src/Base/AMReX_ParallelDescriptor.H
@@ -210,10 +210,29 @@ while ( false )
inline MPI_Comm Communicator () noexcept { return m_comm; }

extern AMREX_EXPORT int m_nprocs_per_node;
//! Return the number of MPI ranks per node. This may not be correct if
//! MPI_COMM_TYPE_SHARED groups processes across nodes.
//! Return the number of MPI ranks per node as defined by
//! MPI_COMM_TYPE_SHARED. This might be the same or different from
//! NProcsPerProcessor based on MPI_Get_processor_name.
inline int NProcsPerNode () noexcept { return m_nprocs_per_node; }

extern AMREX_EXPORT int m_rank_in_node;
//! Return the rank in a node defined by MPI_COMM_TYPE_SHARED. This
//! might be the same or different from MyRankInProcessor based on
//! MPI_Get_processor_name.
inline int MyRankInNode () noexcept { return m_rank_in_node; }

extern AMREX_EXPORT int m_nprocs_per_processor;
//! Return the number of MPI ranks per node as defined by
//! MPI_Get_processor_name. This might be the same or different from
//! NProcsPerNode based on MPI_COMM_TYPE_SHARED.
inline int NProcsPerProcessor () noexcept { return m_nprocs_per_processor; }

extern AMREX_EXPORT int m_rank_in_processor;
//! Return the rank in a node defined by MPI_Get_processor_name. This
//! might be the same or different from MyRankInNode based on
//! MPI_COMM_TYPE_SHARED.
inline int MyRankInProcessor () noexcept { return m_rank_in_processor; }

#ifdef AMREX_USE_MPI
extern Vector<MPI_Datatype*> m_mpi_types;
extern Vector<MPI_Op*> m_mpi_ops;
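A brief usage sketch of the new queries declared above (the helper report_node_grouping is hypothetical and assumes amrex::Initialize has already been called):

#include <AMReX_ParallelDescriptor.H>
#include <AMReX_Print.H>

// Hypothetical helper: compare the two node groupings on the calling rank.
void report_node_grouping ()
{
    namespace PD = amrex::ParallelDescriptor;
    amrex::AllPrint() << "Rank " << PD::MyProc()
                      << ": MPI_COMM_TYPE_SHARED group: " << PD::NProcsPerNode()
                      << " ranks (local rank " << PD::MyRankInNode() << "); "
                      << "MPI_Get_processor_name group: " << PD::NProcsPerProcessor()
                      << " ranks (local rank " << PD::MyRankInProcessor() << ")\n";
}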
49 changes: 43 additions & 6 deletions Src/Base/AMReX_ParallelDescriptor.cpp
@@ -28,6 +28,7 @@
#include <cstdio>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <iostream>
#include <fstream>
#include <sstream>
@@ -66,6 +67,10 @@ namespace amrex::ParallelDescriptor {
MPI_Comm m_comm = MPI_COMM_NULL; // communicator for all ranks, probably MPI_COMM_WORLD

int m_nprocs_per_node = 1;
int m_rank_in_node = 0;

int m_nprocs_per_processor = 1;
int m_rank_in_processor = 0;

#ifdef AMREX_USE_MPI
Vector<MPI_Datatype*> m_mpi_types;
@@ -326,15 +331,47 @@ StartParallel (int* argc, char*** argv, MPI_Comm a_mpi_comm)

ParallelContext::push(m_comm);

if (ParallelDescriptor::NProcs() > 1)
{
#if defined(OPEN_MPI)
int split_type = OMPI_COMM_TYPE_NODE;
int split_type = OMPI_COMM_TYPE_NODE;
#else
int split_type = MPI_COMM_TYPE_SHARED;
int split_type = MPI_COMM_TYPE_SHARED;
#endif
MPI_Comm node_comm;
MPI_Comm_split_type(m_comm, split_type, 0, MPI_INFO_NULL, &node_comm);
MPI_Comm_size(node_comm, &m_nprocs_per_node);
MPI_Comm_free(&node_comm);
MPI_Comm node_comm;
MPI_Comm_split_type(m_comm, split_type, 0, MPI_INFO_NULL, &node_comm);
MPI_Comm_size(node_comm, &m_nprocs_per_node);
MPI_Comm_rank(node_comm, &m_rank_in_node);
MPI_Comm_free(&node_comm);

char procname[MPI_MAX_PROCESSOR_NAME];
int lenname;
BL_MPI_REQUIRE(MPI_Get_processor_name(procname, &lenname));
procname[lenname++] = '\0';
const int nranks = ParallelDescriptor::NProcs();
Vector<int> lenvec(nranks);
MPI_Allgather(&lenname, 1, MPI_INT, lenvec.data(), 1, MPI_INT, m_comm);
Vector<int> offset(nranks,0);
Long len_tot = lenvec[0];
for (int i = 1; i < nranks; ++i) {
offset[i] = offset[i-1] + lenvec[i-1];
len_tot += lenvec[i];
}
AMREX_ALWAYS_ASSERT(len_tot <= static_cast<Long>(std::numeric_limits<int>::max()));
Vector<char> recv_buffer(len_tot);
MPI_Allgatherv(procname, lenname, MPI_CHAR,
recv_buffer.data(), lenvec.data(), offset.data(), MPI_CHAR, m_comm);
m_nprocs_per_processor = 0;
for (int i = 0; i < nranks; ++i) {
if (lenname == lenvec[i] && std::strcmp(procname, recv_buffer.data()+offset[i]) == 0) {
if (i == ParallelDescriptor::MyProc()) {
m_rank_in_processor = m_nprocs_per_processor;
}
++m_nprocs_per_processor;
}
}
AMREX_ASSERT(m_nprocs_per_processor > 0);
}

// Create these types outside OMP parallel region
auto t1 = Mpi_typemap<IntVect>::type(); // NOLINT
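For readers unfamiliar with the technique used in StartParallel above, the following standalone plain-MPI sketch (not AMReX code; the helper name node_rank_from_processor_name is illustrative) counts how many ranks report the same MPI_Get_processor_name string and computes this rank's index within that group:

#include <mpi.h>
#include <cstring>
#include <vector>

// Hypothetical standalone helper: how many ranks in 'comm' share this rank's
// processor name, and what is this rank's index within that group.
void node_rank_from_processor_name (MPI_Comm comm, int* nprocs_on_node, int* rank_on_node)
{
    int nranks, myrank;
    MPI_Comm_size(comm, &nranks);
    MPI_Comm_rank(comm, &myrank);

    char name[MPI_MAX_PROCESSOR_NAME];
    int len;
    MPI_Get_processor_name(name, &len);
    name[len++] = '\0'; // include the terminator so full strings are exchanged

    // Exchange every rank's name length, then the names themselves.
    std::vector<int> lens(nranks), offsets(nranks, 0);
    MPI_Allgather(&len, 1, MPI_INT, lens.data(), 1, MPI_INT, comm);
    for (int i = 1; i < nranks; ++i) { offsets[i] = offsets[i-1] + lens[i-1]; }
    std::vector<char> names(offsets[nranks-1] + lens[nranks-1]);
    MPI_Allgatherv(name, len, MPI_CHAR,
                   names.data(), lens.data(), offsets.data(), MPI_CHAR, comm);

    // Count ranks whose name matches ours; record our position among them.
    *nprocs_on_node = 0;
    *rank_on_node = 0;
    for (int i = 0; i < nranks; ++i) {
        if (lens[i] == len && std::strcmp(name, names.data()+offsets[i]) == 0) {
            if (i == myrank) { *rank_on_node = *nprocs_on_node; }
            ++(*nprocs_on_node);
        }
    }
}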
8 changes: 0 additions & 8 deletions Tools/CMake/AMReXOptions.cmake
@@ -230,14 +230,6 @@ if (AMReX_HIP)
endif ()
endif ()

if (AMReX_CUDA OR AMReX_HIP)
set(GPUS_PER_SOCKET "IGNORE" CACHE STRING "Number of GPUs per socket" )
print_option(GPUS_PER_SOCKET)

set(GPUS_PER_NODE "IGNORE" CACHE STRING "Number of GPUs per node" )
print_option(GPUS_PER_NODE)
endif ()

#
# GPU RDC support
#
6 changes: 0 additions & 6 deletions Tools/CMake/AMReXSetDefines.cmake
@@ -157,12 +157,6 @@ if (NOT AMReX_GPU_BACKEND STREQUAL NONE)
endif()

if (AMReX_CUDA OR AMReX_HIP)
add_amrex_define( AMREX_GPUS_PER_SOCKET=${GPUS_PER_SOCKET}
NO_LEGACY IF GPUS_PER_SOCKET)

add_amrex_define( AMREX_GPUS_PER_NODE=${GPUS_PER_NODE}
NO_LEGACY IF GPUS_PER_NODE)

add_amrex_define( AMREX_USE_GPU_RDC NO_LEGACY IF AMReX_GPU_RDC )
endif ()

2 changes: 0 additions & 2 deletions Tools/CMake/AMReX_Config_ND.H.in
@@ -53,8 +53,6 @@
#cmakedefine AMREX_USE_ACC
#cmakedefine AMREX_USE_GPU
#cmakedefine BL_COALESCE_FABS
#cmakedefine AMREX_GPUS_PER_SOCKET @AMREX_GPUS_PER_SOCKET@
#cmakedefine AMREX_GPUS_PER_NODE @AMREX_GPUS_PER_NODE@
#cmakedefine AMREX_USE_GPU_RDC
#cmakedefine AMREX_PARTICLES
#cmakedefine AMREX_USE_HDF5
10 changes: 0 additions & 10 deletions Tools/GNUMake/Make.defs
@@ -1150,16 +1150,6 @@ else ifeq ($(USE_CUDA),TRUE)

endif

# Provide system configuration, if available.

ifdef GPUS_PER_SOCKET
DEFINES += -DAMREX_GPUS_PER_SOCKET=$(GPUS_PER_SOCKET)
endif

ifdef GPUS_PER_NODE
DEFINES += -DAMREX_GPUS_PER_NODE=$(GPUS_PER_NODE)
endif

ifneq ($(LINK_WITH_FORTRAN_COMPILER),TRUE)
LINKFLAGS = $(NVCC_FLAGS) $(CXXFLAGS_FROM_HOST)
AMREX_LINKER = nvcc
8 changes: 1 addition & 7 deletions Tools/GNUMake/sites/Make.alcf
@@ -74,12 +74,6 @@ ifeq ($(which_computer),$(filter $(which_computer),polaris))
else
$(error No CUDA_ROOT nor CUDA_HOME nor CUDA_PATH found. Please load a cuda module.)
endif

# Provide system configuration information.

GPUS_PER_NODE=4
GPUS_PER_SOCKET=4

endif

endif
endif
10 changes: 0 additions & 10 deletions Tools/GNUMake/sites/Make.llnl
@@ -46,11 +46,6 @@ ifeq ($(which_computer),$(filter $(which_computer),ray rzmanta))
CUDA_ARCH = 60
COMPILE_CUDA_PATH = $(CUDA_HOME)

# Provide system configuration information.

GPUS_PER_NODE=4
GPUS_PER_SOCKET=2

endif

ifeq ($(lowercase_comp),gnu)
@@ -101,11 +96,6 @@ ifeq ($(which_computer),$(filter $(which_computer),sierra butte rzansel lassen))
CUDA_ARCH = 70
COMPILE_CUDA_PATH = $(CUDA_HOME)

# Provide system configuration information.

GPUS_PER_NODE=4
GPUS_PER_SOCKET=2

endif

ifeq ($(lowercase_comp),gnu)
3 changes: 1 addition & 2 deletions Tools/GNUMake/sites/Make.nersc
@@ -165,8 +165,7 @@ ifeq ($(which_computer),$(filter $(which_computer),cgpu))
else
CUDA_ARCH = 70
endif
GPUS_PER_NODE = 8
GPUS_PER_SOCKET = 4

endif

ifeq ($(USE_SENSEI_INSITU),TRUE)
2 changes: 0 additions & 2 deletions Tools/GNUMake/sites/Make.nrel
@@ -27,8 +27,6 @@ ifeq ($(which_computer), eagle)
COMPILE_CUDA_PATH := $(CUDA_HOME)
endif
CUDA_ARCH = 70
GPUS_PER_NODE = 2
GPUS_PER_SOCKET = 1
endif
else ifeq ($(which_computer), rhodes)
# Rhodes is dedicated single node machine for testing
5 changes: 0 additions & 5 deletions Tools/GNUMake/sites/Make.olcf
@@ -38,11 +38,6 @@ ifeq ($(which_computer),$(filter $(which_computer),summit ascent))
CUDA_ARCH = 70
COMPILE_CUDA_PATH = $(OLCF_CUDA_ROOT)

# Provide system configuration information.

GPUS_PER_NODE=6
GPUS_PER_SOCKET=3

endif

ifeq ($(which_computer),spock)
Expand Down
