
MPI support #698

Open · wants to merge 10 commits into master
Conversation

MilesCranmer (Owner)

So far, the distributed support in PySR has relied on ClusterManagers.jl. This PR adds MPIClusterManagers.jl (and MPI.jl), which should make PySR more compatible across clusters, since MPI is standardized.

@wkharold I'd be interested to hear if this works for your cluster. You can use it with:

model = PySRRegressor(multithreading=False, procs=num_nodes*num_cores, cluster_manager="mpi")

Note that this runs mpirun internally, so you only need to launch the job on the head node of a Slurm allocation, and it will "spread out" over the job.
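
For reference, a minimal end-to-end sketch of how this would be called from the head node of an allocation (the data, node/core counts, and niterations value are placeholders, not recommendations):

    import numpy as np
    from pysr import PySRRegressor

    # Placeholder data; substitute your own problem.
    X = np.random.randn(200, 5)
    y = 2.5 * np.cos(X[:, 3]) + X[:, 0] ** 2

    num_nodes, num_cores = 4, 32  # match your allocation

    model = PySRRegressor(
        multithreading=False,          # use processes rather than threads
        procs=num_nodes * num_cores,   # one worker per core across the allocation
        cluster_manager="mpi",         # the new MPIClusterManagers.jl backend
        niterations=40,
    )
    model.fit(X, y)  # run once on the head node; workers spread over the job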

@coveralls

coveralls commented Aug 13, 2024

Pull Request Test Coverage Report for Build 10395158160

Details

  • 26 of 30 (86.67%) changed or added relevant lines in 3 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.02%) to 93.719%

Files with missing coverage (covered / changed or added lines):
  • pysr/julia_extensions.py: 16 of 18 (88.89%)
  • pysr/julia_helpers.py: 7 of 9 (77.78%)

Totals:
  • Change from base Build 10346074402: -0.02%
  • Covered Lines: 1149
  • Relevant Lines: 1226

💛 - Coveralls

@MilesCranmer (Owner, Author)

We probably also want to allow specifying MPI options, like the hosts to run on.

@wkharold (Contributor)

I'll give this a try today. Looks interesting. I assume PySRRegressor also supports cluster_manager="slurm". Both of these approaches are interesting/valuable. Note that MPI will be quite sensitive to the network topology.

@wkharold (Contributor)

No MPI joy. I built the branch on a Cluster Toolkit-deployed Slurm cluster. Those clusters use Open MPI and/or Intel MPI (which offers better stability/performance). It looks like PySR is using MPICH. The error is here

@wkharold (Contributor)

Just doing Slurm did not succeed either. I got this error after doing a 0.19.3 install via pip.

@MilesCranmer (Owner, Author)

MilesCranmer commented Aug 15, 2024

I'm not sure about the MPI one, but I think the Slurm one should* work, as I've usually been able to get it working on my cluster.

Things to keep in mind: PySR will run srun for you, so you only need to call the script a single time on the head node from within a Slurm allocation. It will internally dispatch using srun and set up the network of workers. That is, it's a bit different from MPI, where you would launch the workers yourself.
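
To make the launch pattern concrete, a minimal sketch (the placeholder data, node/core counts, and the way the allocation is obtained are assumptions for illustration):

    # run_pysr.py -- call exactly once, on the head node, from inside a Slurm
    # allocation (e.g. one obtained with salloc or an sbatch script); PySR then
    # calls srun itself to dispatch the Julia workers across the allocation.
    import numpy as np
    from pysr import PySRRegressor

    X = np.random.randn(100, 3)     # placeholder data
    y = X[:, 0] ** 2 - X[:, 1]

    model = PySRRegressor(
        multithreading=False,
        procs=4 * 32,               # total cores in the allocation (assumed)
        cluster_manager="slurm",    # PySR runs srun internally
    )
    model.fit(X, y)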

Then, the error message you are seeing:

ArgumentError: Package SymbolicRegression not found in current path.
- Run `import Pkg; Pkg.add("SymbolicRegression")` to install the SymbolicRegression package.

This is strange because it means the workers are not activating the right environment. Do you know if the workers all have access to the same folder, across nodes? Or is it a different file system?

@wkharold (Contributor)

> I think the Slurm one should* work, as I've usually been able to get it working on my cluster.

The trick is to set JULIA_PROJECT appropriately; otherwise, as you mentioned, the right environment is not activated.
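
For anyone hitting the same error, a minimal sketch of that workaround (the project path and procs value are assumptions; point JULIA_PROJECT at the environment that actually contains SymbolicRegression, on a filesystem visible to every worker node):

    import os

    # Assumed path -- use the Julia environment PySR installed into, visible
    # from all nodes; alternatively, export JULIA_PROJECT in the job
    # submission script before starting Python.
    os.environ["JULIA_PROJECT"] = "/shared/home/me/.julia/environments/pysr"

    from pysr import PySRRegressor  # import only after the variable is set

    model = PySRRegressor(multithreading=False, procs=128, cluster_manager="slurm")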

There are a few more things that maybe I should create issues for:

  • the Google Cluster Toolkit can create partitions whose nodes are spun up on demand; the 60-second timeout is too short in that case
  • it would be nice to be able to add additional srun switches, e.g., --ntasks-per-node, --cpus-per-task, etc. (see the sketch after this list)
  • when running pysr from a container on a cluster (with the appropriate --bind mounts), srun needs to invoke julia via the container rather than directly from the file system (since it won't be there)
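
Regarding the extra srun switches, a purely hypothetical sketch of the kind of interface being requested; cluster_manager_options is not an existing PySR parameter:

    from pysr import PySRRegressor

    # HYPOTHETICAL: "cluster_manager_options" does not exist in PySR today;
    # it only illustrates a possible pass-through for extra srun switches.
    model = PySRRegressor(
        multithreading=False,
        procs=128,
        cluster_manager="slurm",
        cluster_manager_options="--ntasks-per-node=32 --cpus-per-task=1",
    )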

@wkharold (Contributor)

> when running pysr from a container on a cluster (with the appropriate --bind mounts), srun needs to invoke julia via the container rather than directly from the file system (since it won't be there)

According to the docs for Distributed.addprocs, the exename keyword argument specifies the name of the Julia executable. To be able to fully containerize PySR:

  • the container's default action should be to run Julia, e.g., in Apptainer:

        %runscript
            /opt/julia-1.10.4/bin/julia "$@"   # forward any arguments to julia

  • the PySR interface should allow users to specify the exename keyword argument
