
MPI support #698

Open · wants to merge 10 commits into master
Conversation

MilesCranmer (Owner)

So far, the distributed support in PySR has relied on ClusterManagers.jl. This PR adds MPIClusterManagers.jl (and MPI.jl), which should make PySR more compatible across clusters, since MPI is standardized.

@wkharold I'd be interested to hear if this works for your cluster. You can use it with:

model = PySRRegressor(multithreading=False, procs=num_nodes*num_cores, cluster_manager="mpi")

Note that this runs mpirun internally, so you only need to launch the job on the head node of a Slurm allocation, and it will "spread out" over the job.
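
For reference, a minimal end-to-end sketch of how this would be called from the head node of an allocation (the data, node/core counts, and niterations value are placeholders, not recommendations):

    import numpy as np
    from pysr import PySRRegressor

    # Placeholder data; substitute your own problem.
    X = np.random.randn(200, 5)
    y = 2.5 * np.cos(X[:, 3]) + X[:, 0] ** 2

    num_nodes, num_cores = 4, 32  # match your allocation

    model = PySRRegressor(
        multithreading=False,          # use processes rather than threads
        procs=num_nodes * num_cores,   # one worker per core across the allocation
        cluster_manager="mpi",         # the new MPIClusterManagers.jl backend
        niterations=40,
    )
    model.fit(X, y)  # run once on the head node; workers spread over the job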

@coveralls

coveralls commented Aug 13, 2024

Pull Request Test Coverage Report for Build 10395158160

Details

  • 26 of 30 (86.67%) changed or added relevant lines in 3 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.02%) to 93.719%

Files with missing coverage (covered / changed or added lines):
  • pysr/julia_extensions.py: 16 of 18 (88.89%)
  • pysr/julia_helpers.py: 7 of 9 (77.78%)

Totals:
  • Change from base Build 10346074402: -0.02%
  • Covered Lines: 1149
  • Relevant Lines: 1226

💛 - Coveralls

@MilesCranmer (Owner, Author)

We probably also want to allow specifying MPI options, like the hosts to run on.

@wkharold (Contributor)

I'll give this a try today. Looks interesting. I assume PySRRegressor also supports cluster_manager="slurm". Both of these approaches are interesting/valuable. Note that MPI will be quite sensitive to the network topology.

@wkharold (Contributor)

No MPI joy. I built the branch on a Cluster Toolkit-deployed Slurm cluster. Those clusters use Open MPI and/or Intel MPI (which offers better stability/performance). It looks like PySR is using MPICH. The error is here

@wkharold (Contributor)

Just doing Slurm did not succeed either. I got this error after doing a 0.19.3 install via pip.

@MilesCranmer (Owner, Author)

MilesCranmer commented Aug 15, 2024

I'm not sure about the MPI one, but I think the Slurm one should* work, as I've usually been able to get it working on my cluster.

Things to keep in mind: PySR will run srun for you, so you only need to call the script a single time on the head node from within a Slurm allocation. It will internally dispatch using srun and set up the network of workers. That is, it's a bit different from MPI, where you would launch the workers yourself.
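
To make the launch pattern concrete, a minimal sketch (the placeholder data, node/core counts, and the way the allocation is obtained are assumptions for illustration):

    # run_pysr.py -- call exactly once, on the head node, from inside a Slurm
    # allocation (e.g. one obtained with salloc or an sbatch script); PySR then
    # calls srun itself to dispatch the Julia workers across the allocation.
    import numpy as np
    from pysr import PySRRegressor

    X = np.random.randn(100, 3)     # placeholder data
    y = X[:, 0] ** 2 - X[:, 1]

    model = PySRRegressor(
        multithreading=False,
        procs=4 * 32,               # total cores in the allocation (assumed)
        cluster_manager="slurm",    # PySR runs srun internally
    )
    model.fit(X, y)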

Then, the error message you are seeing:

ArgumentError: Package SymbolicRegression not found in current path.
- Run `import Pkg; Pkg.add("SymbolicRegression")` to install the SymbolicRegression package.

This is strange because it means the workers are not activating the right environment. Do you know if the workers all have access to the same folder, across nodes? Or is it a different file system?

@wkharold (Contributor)

> I think the Slurm one should* work, as I've usually been able to get it working on my cluster.

The trick is to set JULIA_PROJECT appropriately; otherwise, as you mentioned, the right environment is not activated.
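
For anyone hitting the same error, a minimal sketch of that workaround (the project path and procs value are assumptions; point JULIA_PROJECT at the environment that actually contains SymbolicRegression, on a filesystem visible to every worker node):

    import os

    # Assumed path -- use the Julia environment PySR installed into, visible
    # from all nodes; alternatively, export JULIA_PROJECT in the job
    # submission script before starting Python.
    os.environ["JULIA_PROJECT"] = "/shared/home/me/.julia/environments/pysr"

    from pysr import PySRRegressor  # import only after the variable is set

    model = PySRRegressor(multithreading=False, procs=128, cluster_manager="slurm")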

There are a few more things that maybe I should create issues for:

  • the Google Cluster Toolkit can create partitions whose nodes are spun up on demand; the 60-second timeout is too short in that case
  • it would be nice to be able to add additional srun switches, e.g., --ntasks-per-node, --cpus-per-task, etc. (see the sketch after this list)
  • when running pysr from a container on a cluster (with the appropriate --bind mounts), srun needs to invoke julia via the container rather than directly from the file system (since it won't be there)
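
Regarding the extra srun switches, a purely hypothetical sketch of the kind of interface being requested; cluster_manager_options is not an existing PySR parameter:

    from pysr import PySRRegressor

    # HYPOTHETICAL: "cluster_manager_options" does not exist in PySR today;
    # it only illustrates a possible pass-through for extra srun switches.
    model = PySRRegressor(
        multithreading=False,
        procs=128,
        cluster_manager="slurm",
        cluster_manager_options="--ntasks-per-node=32 --cpus-per-task=1",
    )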

@wkharold (Contributor)

> when running pysr from a container on a cluster (with the appropriate --bind mounts), srun needs to invoke julia via the container rather than directly from the file system (since it won't be there)

According to the docs for Distributed.addprocs, the exename keyword argument specifies the name of the Julia executable. To be able to fully containerize PySR:

  • the container's default action should be to run Julia, e.g., in Apptainer:

        %runscript
            /opt/julia-1.10.4/bin/julia "$@"   # forward any arguments to julia

  • the PySR interface should allow users to specify the exename keyword argument
