GPU Assignment (#3382)
When there are multiple GPU devices visible to an MPI process, we need
to decide which device to use. In the current development branch, we
create a local sub-communicator using something like MPI_COMM_TYPE_SHARED
and assign the devices in a round-robin way based on the local rank.
This, however, does not always work. In a WarpX issue
(ECP-WarpX/WarpX#3967), it was reported that all processes in a node
used the same GPU because the sub-communicator contained only one
process. In this PR, we use MPI_Get_processor_name as an alternative way
of obtaining the ranks within an actual node and use that for GPU
assignment. If both approaches fail, we fall back to using the rank in
the global communicator modulo the number of devices.

In this PR, we have also removed AMREX_GPUS_PER_SOCKET and
AMREX_GPUS_PER_NODE. They have not been used for some time now.
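For reference, here is a minimal C++ sketch of the selection order described above. It is illustrative only (the helper name pick_device_id is hypothetical); the actual logic is in Src/Base/AMReX_GpuDevice.cpp below.

// Illustrative sketch of the selection order; the real code is in
// Src/Base/AMReX_GpuDevice.cpp and uses ParallelDescriptor helpers.
int pick_device_id (int nprocs_per_node,      // size of the MPI_COMM_TYPE_SHARED group
                    int rank_in_node,         // rank within that group
                    int nprocs_per_processor, // ranks sharing this MPI_Get_processor_name
                    int rank_in_processor,    // rank within that group
                    int global_rank,          // rank in the global communicator
                    int gpu_device_count)     // number of visible devices
{
    if (nprocs_per_node == gpu_device_count) {
        return rank_in_node;                   // one rank per device on this node
    } else if (nprocs_per_processor == gpu_device_count) {
        return rank_in_processor;              // processor-name grouping matches
    } else {
        return global_rank % gpu_device_count; // last-resort round robin
    }
}

Note that the first two branches fire only when the number of local ranks exactly matches the number of visible devices; otherwise the global-rank modulo fallback is used.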
WeiqunZhang authored Jul 27, 2023
1 parent 34c0ae3 commit f5bf0e4
Showing 13 changed files with 78 additions and 153 deletions.
9 changes: 1 addition & 8 deletions Docs/sphinx_documentation/source/GPU.rst
@@ -1685,14 +1685,7 @@ AMReX for GPUs:
AMReX will attempt to do the best job it can assigning MPI ranks to GPUs by
doing round robin assignment. This may be suboptimal because this assignment
scheme would not be aware of locality benefits that come from having an MPI
rank be on the same socket as the GPU it is managing. If you know the hardware
layout of the system you're running on, specifically the number of GPUs per
socket (`M`) and number of GPUs per node (`N`), you can set the preprocessor
defines `-DAMREX_GPUS_PER_SOCKET=M` and `-DAMREX_GPUS_PER_NODE=N`, which are
exposed in the GNU Make system through the variables `GPUS_PER_SOCKET` and
`GPUS_PER_NODE` respectively (see an example in `Tools/GNUMake/sites/Make.olcf`).
Then AMReX can ensure that each MPI rank selects a GPU on the same socket as
that rank (assuming your MPI implementation supports MPI 3.)
rank be on the same socket as the GPU it is managing.


.. ===================================================================
96 changes: 11 additions & 85 deletions Src/Base/AMReX_GpuDevice.cpp
@@ -181,93 +181,16 @@ Device::Initialize ()
device_id = 0;
}
else {

// ifdef the following against MPI so it compiles, but note
// that we can only get here if using more than one processor,
// which requires MPI.

#ifdef BL_USE_MPI

// Create a communicator out of only the ranks sharing GPUs.
// The default assumption is that this is all the ranks on the
// same node, and to get that we'll use the MPI-3.0 split that
// looks for shared memory communicators (and we'll error out
// if that standard is unsupported).

#if MPI_VERSION < 3
amrex::Abort("When using GPUs with MPI, if multiple devices are visible to each rank, MPI-3.0 must be supported.");
#endif

// However, it's possible that the ranks sharing GPUs will be
// confined to a single socket rather than a full node. Indeed,
// this is often the optimal configuration; for example, on Summit,
// a good configuration using jsrun is one resource set per
// socket (two per node), with three GPUs per resource set.
// To deal with this where we can, we'll take advantage of OpenMPI's
// specialized split by socket. However, we only want to do this
// if in fact our resource set is confined to the socket.
// To make this determination we need to have system information,
// which is provided by the build system for the systems
// we know about. The simple heuristic we'll use to determine
// this is if the number of visible devices is smaller than
// the known number of GPUs per socket.

#if defined(AMREX_USE_CUDA)
#if (!defined(AMREX_GPUS_PER_SOCKET) && !defined(AMREX_GPUS_PER_NODE))
if (amrex::Verbose()) {
amrex::Warning("Multiple GPUs are visible to each MPI rank, but the number of GPUs per socket or node has not been provided.\n"
"This may lead to incorrect or suboptimal rank-to-GPU mapping.");
if (amrex::Verbose() && ParallelDescriptor::IOProcessor()) {
amrex::Warning("Multiple GPUs are visible to each MPI rank, This may lead to incorrect or suboptimal rank-to-GPU mapping.");
}
#endif
#endif

MPI_Comm local_comm;

int split_type;

#if (defined(OPEN_MPI) && defined(AMREX_GPUS_PER_SOCKET))
if (gpu_device_count <= AMREX_GPUS_PER_SOCKET)
split_type = OMPI_COMM_TYPE_SOCKET;
else
split_type = OMPI_COMM_TYPE_NODE;
#else
split_type = MPI_COMM_TYPE_SHARED;
#endif

// We have no preference on how ranks get ordered within this communicator.
int key = 0;

MPI_Comm_split_type(ParallelDescriptor::Communicator(), split_type, key, MPI_INFO_NULL, &local_comm);

// Get rank within the local communicator, and number of ranks.
MPI_Comm_size(local_comm, &n_local_procs);

int my_rank;
MPI_Comm_rank(local_comm, &my_rank);

// Free the local communicator.
MPI_Comm_free(&local_comm);

// For each rank that shares a GPU, use round-robin assignment
// to assign MPI ranks to GPUs. We will arbitrarily assign
// ranks to GPUs, assuming that socket awareness has already
// been handled.

device_id = my_rank % gpu_device_count;

// If we detect more ranks than visible GPUs, warn the user
// that this will fail in the case where the devices are
// set to exclusive process mode and MPS is not enabled.

if (n_local_procs > gpu_device_count && amrex::Verbose()) {
amrex::Print() << "Mapping more than one rank per GPU. This will fail if the GPUs are in exclusive process mode\n"
<< "and MPS is not enabled. In that case you will see an error such as: 'all CUDA-capable devices are\n"
<< "busy'. To resolve that issue, set the GPUs to the default compute mode, or enable MPS. If you are\n"
<< "on a cluster, please consult the system user guide for how to launch your job in this configuration.\n";
if (ParallelDescriptor::NProcsPerNode() == gpu_device_count) {
device_id = ParallelDescriptor::MyRankInNode();
} else if (ParallelDescriptor::NProcsPerProcessor() == gpu_device_count) {
device_id = ParallelDescriptor::MyRankInProcessor();
} else {
device_id = ParallelDescriptor::MyProc() % gpu_device_count;
}

#endif // BL_USE_MPI

}

AMREX_HIP_OR_CUDA(AMREX_HIP_SAFE_CALL (hipSetDevice(device_id));,
@@ -376,6 +299,9 @@ Device::Initialize ()
<< " initialized with " << num_devices_used
<< ((num_devices_used == 1) ? " device.\n"
: " devices.\n");
if (num_devices_used < ParallelDescriptor::NProcs() && ParallelDescriptor::IOProcessor()) {
amrex::Warning("There are more MPI processes than the number of GPUs.");
}
}

#if defined(AMREX_USE_CUDA) && (defined(AMREX_PROFILING) || defined(AMREX_TINY_PROFILING))
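After initialization, application code can check which device each rank ended up with. A small usage sketch, assuming the accessor amrex::Gpu::Device::deviceId() is available in this version of AMReX (it is not part of this diff):

#include <AMReX.H>
#include <AMReX_GpuDevice.H>
#include <AMReX_ParallelDescriptor.H>
#include <AMReX_Print.H>

int main (int argc, char* argv[])
{
    amrex::Initialize(argc, argv);
    {
        // Every rank reports the device it was assigned during initialization.
        // deviceId() is assumed to return the device_id chosen above.
        amrex::AllPrint() << "Rank " << amrex::ParallelDescriptor::MyProc()
                          << " uses GPU " << amrex::Gpu::Device::deviceId() << "\n";
    }
    amrex::Finalize();
}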
23 changes: 21 additions & 2 deletions Src/Base/AMReX_ParallelDescriptor.H
@@ -210,10 +210,29 @@ while ( false )
inline MPI_Comm Communicator () noexcept { return m_comm; }

extern AMREX_EXPORT int m_nprocs_per_node;
//! Return the number of MPI ranks per node. This may not be correct if
//! MPI_COMM_TYPE_SHARED groups processes across nodes.
//! Return the number of MPI ranks per node as defined by
//! MPI_COMM_TYPE_SHARED. This might be the same or different from
//! NProcsPerProcessor based on MPI_Get_processor_name.
inline int NProcsPerNode () noexcept { return m_nprocs_per_node; }

extern AMREX_EXPORT int m_rank_in_node;
//! Return the rank in a node defined by MPI_COMM_TYPE_SHARED. This
//! might be the same or different from MyRankInProcessor based on
//! MPI_Get_processor_name.
inline int MyRankInNode () noexcept { return m_rank_in_node; }

extern AMREX_EXPORT int m_nprocs_per_processor;
//! Return the number of MPI ranks per node as defined by
//! MPI_Get_processor_name. This might be the same or different from
//! NProcsPerNode based on MPI_COMM_TYPE_SHARED.
inline int NProcsPerProcessor () noexcept { return m_nprocs_per_processor; }

extern AMREX_EXPORT int m_rank_in_processor;
//! Return the rank in a node defined by MPI_Get_processor_name. This
//! might be the same or different from MyRankInNode based on
//! MPI_COMM_TYPE_SHARED.
inline int MyRankInProcessor () noexcept { return m_rank_in_processor; }

#ifdef AMREX_USE_MPI
extern Vector<MPI_Datatype*> m_mpi_types;
extern Vector<MPI_Op*> m_mpi_ops;
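A brief usage sketch of the new queries declared above (the helper report_node_grouping is hypothetical and assumes amrex::Initialize has already been called):

#include <AMReX_ParallelDescriptor.H>
#include <AMReX_Print.H>

// Hypothetical helper: compare the two node groupings on the calling rank.
void report_node_grouping ()
{
    namespace PD = amrex::ParallelDescriptor;
    amrex::AllPrint() << "Rank " << PD::MyProc()
                      << ": MPI_COMM_TYPE_SHARED group: " << PD::NProcsPerNode()
                      << " ranks (local rank " << PD::MyRankInNode() << "); "
                      << "MPI_Get_processor_name group: " << PD::NProcsPerProcessor()
                      << " ranks (local rank " << PD::MyRankInProcessor() << ")\n";
}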
49 changes: 43 additions & 6 deletions Src/Base/AMReX_ParallelDescriptor.cpp
@@ -28,6 +28,7 @@
#include <cstdio>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <iostream>
#include <fstream>
#include <sstream>
@@ -66,6 +67,10 @@ namespace amrex::ParallelDescriptor {
MPI_Comm m_comm = MPI_COMM_NULL; // communicator for all ranks, probably MPI_COMM_WORLD

int m_nprocs_per_node = 1;
int m_rank_in_node = 0;

int m_nprocs_per_processor = 1;
int m_rank_in_processor = 0;

#ifdef AMREX_USE_MPI
Vector<MPI_Datatype*> m_mpi_types;
@@ -326,15 +331,47 @@ StartParallel (int* argc, char*** argv, MPI_Comm a_mpi_comm)

ParallelContext::push(m_comm);

if (ParallelDescriptor::NProcs() > 1)
{
#if defined(OPEN_MPI)
int split_type = OMPI_COMM_TYPE_NODE;
int split_type = OMPI_COMM_TYPE_NODE;
#else
int split_type = MPI_COMM_TYPE_SHARED;
int split_type = MPI_COMM_TYPE_SHARED;
#endif
MPI_Comm node_comm;
MPI_Comm_split_type(m_comm, split_type, 0, MPI_INFO_NULL, &node_comm);
MPI_Comm_size(node_comm, &m_nprocs_per_node);
MPI_Comm_free(&node_comm);
MPI_Comm node_comm;
MPI_Comm_split_type(m_comm, split_type, 0, MPI_INFO_NULL, &node_comm);
MPI_Comm_size(node_comm, &m_nprocs_per_node);
MPI_Comm_rank(node_comm, &m_rank_in_node);
MPI_Comm_free(&node_comm);

char procname[MPI_MAX_PROCESSOR_NAME];
int lenname;
BL_MPI_REQUIRE(MPI_Get_processor_name(procname, &lenname));
procname[lenname++] = '\0';
const int nranks = ParallelDescriptor::NProcs();
Vector<int> lenvec(nranks);
MPI_Allgather(&lenname, 1, MPI_INT, lenvec.data(), 1, MPI_INT, m_comm);
Vector<int> offset(nranks,0);
Long len_tot = lenvec[0];
for (int i = 1; i < nranks; ++i) {
offset[i] = offset[i-1] + lenvec[i-1];
len_tot += lenvec[i];
}
AMREX_ALWAYS_ASSERT(len_tot <= static_cast<Long>(std::numeric_limits<int>::max()));
Vector<char> recv_buffer(len_tot);
MPI_Allgatherv(procname, lenname, MPI_CHAR,
recv_buffer.data(), lenvec.data(), offset.data(), MPI_CHAR, m_comm);
m_nprocs_per_processor = 0;
for (int i = 0; i < nranks; ++i) {
if (lenname == lenvec[i] && std::strcmp(procname, recv_buffer.data()+offset[i]) == 0) {
if (i == ParallelDescriptor::MyProc()) {
m_rank_in_processor = m_nprocs_per_processor;
}
++m_nprocs_per_processor;
}
}
AMREX_ASSERT(m_nprocs_per_processor > 0);
}

// Create these types outside OMP parallel region
auto t1 = Mpi_typemap<IntVect>::type(); // NOLINT
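For readers unfamiliar with the technique used in StartParallel above, the following standalone plain-MPI sketch (not AMReX code; the helper name node_rank_from_processor_name is illustrative) counts how many ranks report the same MPI_Get_processor_name string and computes this rank's index within that group:

#include <mpi.h>
#include <cstring>
#include <vector>

// Hypothetical standalone helper: how many ranks in 'comm' share this rank's
// processor name, and what is this rank's index within that group.
void node_rank_from_processor_name (MPI_Comm comm, int* nprocs_on_node, int* rank_on_node)
{
    int nranks, myrank;
    MPI_Comm_size(comm, &nranks);
    MPI_Comm_rank(comm, &myrank);

    char name[MPI_MAX_PROCESSOR_NAME];
    int len;
    MPI_Get_processor_name(name, &len);
    name[len++] = '\0'; // include the terminator so full strings are exchanged

    // Exchange every rank's name length, then the names themselves.
    std::vector<int> lens(nranks), offsets(nranks, 0);
    MPI_Allgather(&len, 1, MPI_INT, lens.data(), 1, MPI_INT, comm);
    for (int i = 1; i < nranks; ++i) { offsets[i] = offsets[i-1] + lens[i-1]; }
    std::vector<char> names(offsets[nranks-1] + lens[nranks-1]);
    MPI_Allgatherv(name, len, MPI_CHAR,
                   names.data(), lens.data(), offsets.data(), MPI_CHAR, comm);

    // Count ranks whose name matches ours; record our position among them.
    *nprocs_on_node = 0;
    *rank_on_node = 0;
    for (int i = 0; i < nranks; ++i) {
        if (lens[i] == len && std::strcmp(name, names.data()+offsets[i]) == 0) {
            if (i == myrank) { *rank_on_node = *nprocs_on_node; }
            ++(*nprocs_on_node);
        }
    }
}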
8 changes: 0 additions & 8 deletions Tools/CMake/AMReXOptions.cmake
@@ -230,14 +230,6 @@ if (AMReX_HIP)
endif ()
endif ()

if (AMReX_CUDA OR AMReX_HIP)
set(GPUS_PER_SOCKET "IGNORE" CACHE STRING "Number of GPUs per socket" )
print_option(GPUS_PER_SOCKET)

set(GPUS_PER_NODE "IGNORE" CACHE STRING "Number of GPUs per node" )
print_option(GPUS_PER_NODE)
endif ()

#
# GPU RDC support
#
6 changes: 0 additions & 6 deletions Tools/CMake/AMReXSetDefines.cmake
@@ -157,12 +157,6 @@ if (NOT AMReX_GPU_BACKEND STREQUAL NONE)
endif()

if (AMReX_CUDA OR AMReX_HIP)
add_amrex_define( AMREX_GPUS_PER_SOCKET=${GPUS_PER_SOCKET}
NO_LEGACY IF GPUS_PER_SOCKET)

add_amrex_define( AMREX_GPUS_PER_NODE=${GPUS_PER_NODE}
NO_LEGACY IF GPUS_PER_NODE)

add_amrex_define( AMREX_USE_GPU_RDC NO_LEGACY IF AMReX_GPU_RDC )
endif ()

2 changes: 0 additions & 2 deletions Tools/CMake/AMReX_Config_ND.H.in
@@ -53,8 +53,6 @@
#cmakedefine AMREX_USE_ACC
#cmakedefine AMREX_USE_GPU
#cmakedefine BL_COALESCE_FABS
#cmakedefine AMREX_GPUS_PER_SOCKET @AMREX_GPUS_PER_SOCKET@
#cmakedefine AMREX_GPUS_PER_NODE @AMREX_GPUS_PER_NODE@
#cmakedefine AMREX_USE_GPU_RDC
#cmakedefine AMREX_PARTICLES
#cmakedefine AMREX_USE_HDF5
10 changes: 0 additions & 10 deletions Tools/GNUMake/Make.defs
@@ -1150,16 +1150,6 @@ else ifeq ($(USE_CUDA),TRUE)

endif

# Provide system configuration, if available.

ifdef GPUS_PER_SOCKET
DEFINES += -DAMREX_GPUS_PER_SOCKET=$(GPUS_PER_SOCKET)
endif

ifdef GPUS_PER_NODE
DEFINES += -DAMREX_GPUS_PER_NODE=$(GPUS_PER_NODE)
endif

ifneq ($(LINK_WITH_FORTRAN_COMPILER),TRUE)
LINKFLAGS = $(NVCC_FLAGS) $(CXXFLAGS_FROM_HOST)
AMREX_LINKER = nvcc
8 changes: 1 addition & 7 deletions Tools/GNUMake/sites/Make.alcf
@@ -74,12 +74,6 @@ ifeq ($(which_computer),$(filter $(which_computer),polaris))
else
$(error No CUDA_ROOT nor CUDA_HOME nor CUDA_PATH found. Please load a cuda module.)
endif

# Provide system configuration information.

GPUS_PER_NODE=4
GPUS_PER_SOCKET=4

endif

endif
endif
10 changes: 0 additions & 10 deletions Tools/GNUMake/sites/Make.llnl
@@ -46,11 +46,6 @@ ifeq ($(which_computer),$(filter $(which_computer),ray rzmanta))
CUDA_ARCH = 60
COMPILE_CUDA_PATH = $(CUDA_HOME)

# Provide system configuration information.

GPUS_PER_NODE=4
GPUS_PER_SOCKET=2

endif

ifeq ($(lowercase_comp),gnu)
@@ -101,11 +96,6 @@ ifeq ($(which_computer),$(filter $(which_computer),sierra butte rzansel lassen))
CUDA_ARCH = 70
COMPILE_CUDA_PATH = $(CUDA_HOME)

# Provide system configuration information.

GPUS_PER_NODE=4
GPUS_PER_SOCKET=2

endif

ifeq ($(lowercase_comp),gnu)
3 changes: 1 addition & 2 deletions Tools/GNUMake/sites/Make.nersc
@@ -165,8 +165,7 @@ ifeq ($(which_computer),$(filter $(which_computer),cgpu))
else
CUDA_ARCH = 70
endif
GPUS_PER_NODE = 8
GPUS_PER_SOCKET = 4

endif

ifeq ($(USE_SENSEI_INSITU),TRUE)
2 changes: 0 additions & 2 deletions Tools/GNUMake/sites/Make.nrel
@@ -27,8 +27,6 @@ ifeq ($(which_computer), eagle)
COMPILE_CUDA_PATH := $(CUDA_HOME)
endif
CUDA_ARCH = 70
GPUS_PER_NODE = 2
GPUS_PER_SOCKET = 1
endif
else ifeq ($(which_computer), rhodes)
# Rhodes is dedicated single node machine for testing
5 changes: 0 additions & 5 deletions Tools/GNUMake/sites/Make.olcf
@@ -38,11 +38,6 @@ ifeq ($(which_computer),$(filter $(which_computer),summit ascent))
CUDA_ARCH = 70
COMPILE_CUDA_PATH = $(OLCF_CUDA_ROOT)

# Provide system configuration information.

GPUS_PER_NODE=6
GPUS_PER_SOCKET=3

endif

ifeq ($(which_computer),spock)
Expand Down
