MPI support #698
base: master
Conversation
Pull Request Test Coverage Report for Build 10395158160
💛 - Coveralls
Probably also want to allow specifying MPI options, like the hosts to run on.
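For illustration, host selection is usually passed to the MPI launcher itself (e.g. Open MPI's `--host`/`--hostfile` flags); below is a hypothetical sketch of how PySR could expose that. The `mpi_flags` keyword is invented for this sketch and is not part of this PR.

```python
# Hypothetical sketch only: `mpi_flags` is NOT a real PySRRegressor
# parameter; it illustrates what "specifying MPI options" could mean.
from pysr import PySRRegressor

model = PySRRegressor(
    procs=8,
    cluster_manager="mpi",  # value assumed from this PR
    # e.g. forwarded to the launcher as: mpirun --hostfile hosts.txt ...
    # mpi_flags=["--hostfile", "hosts.txt"],
)
```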
I'll give this a try today. Looks interesting. I assume PySRRegressor also supports
No MPI joy. I built the branch on a Cluster Toolkit-deployed Slurm cluster. Those clusters use Open MPI and/or Intel MPI (which offers better stability/performance). It looks like PySR is using MPICH. The error is here
Just doing Slurm did not succeed either. I got this error after doing a 0.19.3 install via
The MPI one I'm not sure about, but I think the Slurm one should* work, as I've usually been able to get it working on my cluster. Things to keep in mind: PySR will run

Then, the error message you are seeing:
This is strange, because it means the workers are not activating the right environment. Do you know if the workers all have access to the same folder across nodes? Or is it a different file system?
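One quick way to check, as a sketch: run a one-task-per-node `srun` inside the allocation and confirm every node sees the same environment folder (the path below is illustrative):

```python
# Sketch: verify each node in the Slurm allocation sees the same
# environment folder. The path is illustrative; substitute your own.
import os
import subprocess

env_path = os.path.expanduser("~/.julia/environments")  # illustrative path
result = subprocess.run(
    ["srun", "--ntasks-per-node=1", "ls", "-d", env_path],
    capture_output=True,
    text=True,
)
print(result.stdout)  # one line per node if the folder is visible everywhere
print(result.stderr)  # errors here point at nodes missing the shared path
```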
The trick is to set

There are a few more things that maybe I should create issues for:
According to the docs for
So far, PySR's distributed support has relied on ClusterManagers.jl. This PR adds MPIClusterManagers.jl (and MPI.jl), which should make PySR more compatible across clusters, since MPI is a standardized interface.
@wkharold I'd be interested to hear if this works for your cluster. You can use it with:
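The code example that originally followed was elided; here is a minimal sketch, assuming the new backend is selected through the existing `cluster_manager` parameter (the `"mpi"` value is that assumption):

```python
# Minimal sketch, assuming this PR exposes the MPI backend through the
# existing `cluster_manager` parameter; the "mpi" value is an assumption.
import numpy as np
from pysr import PySRRegressor

X = np.random.randn(100, 5)
y = 2.5 * np.cos(X[:, 3]) + X[:, 0] ** 2

model = PySRRegressor(
    procs=16,                # number of distributed worker processes
    cluster_manager="mpi",   # assumed value added by this PR
    niterations=40,
    binary_operators=["+", "*"],
    unary_operators=["cos"],
)
model.fit(X, y)
```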
Note the command runs `mpirun` internally, so you only need to launch the job on the head node of a Slurm allocation, and it will "spread out" over the job.
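In practice, that would mean requesting an allocation with `salloc` (or inside an `sbatch` script) and then running your training script once on the head node, e.g. `python train.py`; the script name and allocation flags are only illustrative, and the exact invocation will depend on your site's Slurm configuration.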