Introducing MPIEvaluator: Run on multi-node HPC systems using mpi4py #299
Conversation
Just a few minor things at the code level
Thanks for the initial review! I updated the loglevel in 60f1f9c and added example logs (on INFO and DEBUG level) to the PR. Tomorrow I will be benchmarking and working on the documentation. If you have any specific requests (models or scenarios to test, or specific documentation to write), please let me know!
It would be great to have a simple lake model example that can be run on an HPC. I guess this would need to take the form of a notebook, because it needs to cover both the Python code and the batch script. Also, not essential for this PR, but for my understanding of what functionality is available: have you run any tests with a FileModel or any of its subclasses (e.g., NetLogoModel)?
It's good that I decided to do extensive performance testing, because I ran into a nasty bug that I likely wouldn't have noticed otherwise. Somehow, every time, the second initialization of an MPIEvaluator broke down, with errors like:

Which indicated no workers available. My first hypothesis was that the MPI pool wasn't properly shut down, so I tried about 138 different ways to shut it down. Apparently, that wasn't the problem. So I decided to break it down to its simplest form, try to get it working, and then bisect between our then-current, broken implementation and the working one. The minimal working example I initially found was this:

```python
import time
from mpi4py.futures import MPIPoolExecutor


def simple_task(x):
    time.sleep(0.2)
    return x * x


def main():
    # First use of the MPI Pool
    with MPIPoolExecutor() as pool:
        results = list(pool.map(simple_task, range(10)))
    print("First pool completed")

    # Explicitly try to shut it down (though it should be shut down by the context manager)
    pool.shutdown(wait=True)

    # Second use of the MPI Pool
    with MPIPoolExecutor() as pool:
        results = list(pool.map(simple_task, range(20)))
    print("Second pool completed")


if __name__ == "__main__":
    main()
```

After 6 iterations I settled on the maximal working example, using an MPIEvaluator class:

```python
import time
from mpi4py.futures import MPIPoolExecutor


def simple_task(x):
    time.sleep(0.2)
    return x * x


class MPIEvaluator:
    def __init__(self, n_workers=None):
        self._pool = None
        self.n_workers = n_workers

    def __enter__(self):
        self._pool = MPIPoolExecutor(max_workers=self.n_workers)
        print(f"MPI pool started with {self._pool._max_workers} workers")
        return self._pool

    def __exit__(self, exc_type, exc_val, exc_tb):
        if self._pool:
            self._pool.shutdown(wait=True)
            print("MPI pool has been shut down")
            self._pool = None


def main():
    # First use of the MPIEvaluator
    with MPIEvaluator(n_workers=4) as pool:
        results1 = list(pool.map(simple_task, range(10)))
    print("First pool completed")

    # Second use of the MPIEvaluator
    with MPIEvaluator(n_workers=4) as pool:
        results2 = list(pool.map(simple_task, range(20)))
    print("Second pool completed")


if __name__ == "__main__":
    main()
```

And the minimal breaking one:

```python
import time
from mpi4py.futures import MPIPoolExecutor


def simple_task(x):
    time.sleep(0.2)
    return x * x


def common_initializer():
    # A basic initializer, doing almost nothing for now.
    pass


class MPIEvaluator:
    def __init__(self, n_workers=None):
        self._pool = None
        self.n_workers = n_workers

    def __enter__(self):
        self._pool = MPIPoolExecutor(max_workers=self.n_workers, initializer=common_initializer)
        print(f"MPI pool started with {self._pool._max_workers} workers")
        return self._pool

    def __exit__(self, exc_type, exc_val, exc_tb):
        if self._pool:
            self._pool.shutdown(wait=True)
            print("MPI pool has been shut down")
            self._pool = None


def main():
    # First use of the MPIEvaluator
    with MPIEvaluator(n_workers=4) as pool:
        results1 = list(pool.map(simple_task, range(10)))
    print("First pool completed")

    # Second use of the MPIEvaluator
    with MPIEvaluator(n_workers=4) as pool:
        results2 = list(pool.map(simple_task, range(20)))
    print("Second pool completed")


if __name__ == "__main__":
    main()
```

The problem seemed to be the common, global initializer. So I refactored the MPIEvaluator in the workbench to not need one, and that solved it instantly. With this effort, the logger configuration has changed and now shows slightly different behaviour. I have to determine whether that's desired behaviour or not. The mocked tests still pass, so that's great, I guess.

TL;DR: You couldn't use the MPIEvaluator multiple times successively in a single script. Now you can. And I know what a Singleton Pattern is. It was a fun day.
In e24cd08 I fixed multiple sequential calls (and probably proper shutdown in general), but broke logging. Logging was difficult, and not being allowed a global initializer made it even more challenging. In the end I found a somewhat elegant solution, implemented in e2ff35d. I also added some performance scaling tests. Those required multiple runs in a machine-scalable fashion, so in the process they also made setting up an environment on DelftBlue more robust (and improved my bash skills). Performance graphs are in the main post. As expected, more complex models (flu) scale better than less complex/faster ones (lake model). All tested models have diminishing returns at some point. All the code, figures and data for this is available on the MPIEvaluator-benchmarks branch.
A tutorial for the MPIEvaluator will be added in #308.
@quaquel I know you're quite busy, but would you have time to review this sooner rather than later? The code and all the ideas behind it are still relatively fresh in my head, so I can make changes quickly without much overhead.
This can be merged as long as we make clear it is experimental and somewhat WIP
Would UserWarning: "The MPIEvaluator is experimental and may change without notice." suffice?
yes, in combination with my feedback on the tutorial in #308
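For reference, a minimal sketch of raising such a warning (the exact placement inside the evaluator may differ):

```python
import warnings

# Emitted when the MPIEvaluator is instantiated, using the message quoted above.
warnings.warn(
    "The MPIEvaluator is experimental and may change without notice.",
    UserWarning,
    stacklevel=2,
)
```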
Adds a new MPIEvaluator to the EMAworkbench, enabling experiments to be executed on multi-node High-Performance Computing (HPC) systems leveraging the mpi4py library. This evaluator optimizes performance for distributed computing environments by parallelizing experiments across multiple nodes and processors.

Changes include:
- Definition of the MPIEvaluator class.
- An initialization function to set up the global ExperimentRunner for worker processes.
- Proper handling to pack and unpack experiments for efficient data transfer between nodes.

Note: This addition requires the mpi4py package only when the MPIEvaluator is explicitly used, preventing unnecessary dependencies for users not requiring this feature.
Introduced detailed logging capabilities for the MPIEvaluator to facilitate debugging and performance tracking in distributed environments.

Key changes include:
- Configured a logger specifically for the MPIEvaluator.
- Passed the logger's level to each worker process to ensure consistent logging verbosity across all nodes.
- Added specific log messages to track the progress of experiments on individual MPI ranks.
- Improved the log format to display the MPI process name alongside the log level, making it easier to identify logs from different nodes.
- Modified `log_to_stderr` in `ema_logging` to adjust log levels for the root logger based on an optional flag.

With this enhancement, users can now get clearer insight into the functioning and performance of the MPIEvaluator on HPC systems, helping in both development and operational phases.
Add mocked tests to the MPIEvaluator and include these in a single CI run.

1. Integrated the MPIEvaluator into the test suite. This involves adding unit tests that ensure the new evaluator behaves as expected, with mocks simulating its interaction with `mpi4py`.
2. Enhanced the CI pipeline (in `.github/workflows/ci.yml`) to include MPI testing. This includes:
   - Adjustments to the matrix build, adding a configuration for testing with MPI on Ubuntu with Python 3.10.
   - Steps to install the necessary MPI libraries and the `mpi4py` package.

The MPI tests are designed to be skipped when run on non-Linux platforms or when `mpi4py` isn't available, ensuring compatibility with various testing environments. The use of mocking ensures that the MPIEvaluator logic is tested in isolation, focusing solely on its behavior and interaction with its dependencies, without the overhead or side effects of real MPI operations. This provides faster test execution and better control over the testing environment. mpi4py 4.0 will be released at some point; if anything breaking changes, these mocked tests might help catch that.

Please note: these tests don't cover actual (internal) MPI functionality and its integrations.
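For illustration, the mocking approach looks roughly like this (a standalone sketch, not the actual test code; the `evaluate` helper is a stand-in for the evaluator logic, and the patch target assumes `mpi4py` is importable, matching the skip conditions described above):

```python
import unittest
from unittest import mock


def evaluate(tasks, n_workers=None):
    """Stand-in for the evaluator logic under test."""
    from mpi4py.futures import MPIPoolExecutor

    with MPIPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(lambda x: x * x, tasks))


class TestMPIEvaluatorMocked(unittest.TestCase):
    @mock.patch("mpi4py.futures.MPIPoolExecutor")
    def test_pool_is_used(self, mock_executor):
        # The context manager yields a pool whose map() returns canned results,
        # so no real MPI processes are ever started.
        mock_pool = mock_executor.return_value.__enter__.return_value
        mock_pool.map.return_value = [1, 4, 9]

        results = evaluate([1, 2, 3], n_workers=2)

        self.assertEqual(results, [1, 4, 9])
        mock_executor.assert_called_once_with(max_workers=2)


if __name__ == "__main__":
    unittest.main()
```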
Addressed an issue where initializing the MPIEvaluator pool multiple times with a common initializer was causing a 'BrokenExecutor' error.

Details:
- Observed that using the MPIPoolExecutor twice in a row with an initializer function would lead to a 'BrokenExecutor: cannot run initializer' error on the second run.
- Reproduced the issue with simplified examples to confirm that the problem was due to the initializer function in conjunction with MPIPoolExecutor.
- Decided to remove the common initializer function from the MPIEvaluator to prevent this error.

Changes:
- Removed the global `experiment_runner` and the `mpi_initializer` function.
- Modified the MPIEvaluator's `initialize` method to not use the initializer arguments.
- Updated the `run_experiment_mpi` function to create the `ExperimentRunner` directly, ensuring each experiment execution has its own fresh instance.

Examples:

Before:
```python
with MPIEvaluator(model) as evaluator:
    results = evaluator.perform_experiments(scenarios=24)

with MPIEvaluator(model) as evaluator:
    results2 = evaluator.perform_experiments(scenarios=48)
```
This would fail on the second invocation with a 'BrokenExecutor' error.

After:
```python
with MPIEvaluator(model) as evaluator:
    results = evaluator.perform_experiments(scenarios=24)

with MPIEvaluator(model) as evaluator:
    results2 = evaluator.perform_experiments(scenarios=48)
```
Now, both invocations run successfully without errors.

TL;DR: By removing the common initializer, we have resolved the issue with re-initializing the MPIEvaluator pool. Users can now confidently use the MPIEvaluator multiple times in their workflows without encountering the 'BrokenExecutor' error.
Add a warning to the MPIEvaluator that it's still experimental and its interface and functionality might change in future releases. Feedback is welcome at: #311
Merged! It was quite a journey; happy that it's in. @quaquel, thanks for all the help along the way! The SEN1211 students will be developing quite heavy (geospatial) ABM models, primarily in Mesa. Since Mesa is pure Python, it should work with this implementation, so it could be an interesting test case for the MPIEvaluator! I made a new discussion for feedback and future development: #311.
This PR adds a new experiment evaluator to the EMAworkbench, the `MPIEvaluator`. This evaluator allows experiments to be conducted on multi-node systems, including High-Performance Computers (HPC) such as DelftBlue. Internally, it uses the `MPIPoolExecutor` from `mpi4py.futures`.

Additionally, logging has been integrated to facilitate debugging and performance tracking in distributed setups. As a robustness measure, mocked tests have been added to ensure consistent behavior, and they have been incorporated into the CI pipeline. This might help catch future breaking changes in mpi4py, such as with the upcoming 4.0 release (mpi4py/mpi4py#386).
This PR follows from the discussions in #266 and succeeds the development PR #292.
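To give a feel for the end-user workflow, here is a minimal usage sketch (the toy model is illustrative, and the top-level import of `MPIEvaluator` assumes the package-`__init__` change described later in this PR):

```python
from ema_workbench import Model, RealParameter, ScalarOutcome, ema_logging
from ema_workbench import MPIEvaluator  # assumed export, per this PR


def some_model(x=None):
    # A stand-in for a real simulation model.
    return {"y": x * x}


if __name__ == "__main__":
    ema_logging.log_to_stderr(ema_logging.INFO)

    model = Model("simpleModel", function=some_model)
    model.uncertainties = [RealParameter("x", 0, 10)]
    model.outcomes = [ScalarOutcome("y")]

    # Usage mirrors the existing evaluators (e.g., MultiprocessingEvaluator).
    with MPIEvaluator(model) as evaluator:
        results = evaluator.perform_experiments(scenarios=24)
```

On a cluster, such a script is typically launched through mpi4py's futures runner, e.g. `mpiexec -n 16 python -m mpi4py.futures my_script.py` (the exact invocation depends on the scheduler and batch script).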
Conceptual design
1. MPIEvaluator Class
The `MPIEvaluator` class is the main component of this design. Its primary role is to initiate a pool of workers across multiple nodes, evaluate experiments in parallel, and finalize resources when done.

- Initialization: imports `mpi4py` only when instantiated, preventing unnecessary dependencies for users who do not use the MPIEvaluator (see the sketch after this list).
- Evaluation: distributes experiments over the worker pool with `MPIPoolExecutor.map()`.
- Finalization: shuts down the worker pool and releases its resources.
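A minimal sketch of the deferred-import idea (illustrative only; the workbench's actual class derives from its base evaluator and does more bookkeeping):

```python
class MPIEvaluator:
    """Sketch: only import mpi4py when the evaluator is actually used."""

    def __init__(self, n_processes=None):
        self.n_processes = n_processes
        self._pool = None

    def initialize(self):
        # Importing here (not at module level) keeps mpi4py an optional
        # dependency for users who never touch the MPIEvaluator.
        from mpi4py.futures import MPIPoolExecutor

        self._pool = MPIPoolExecutor(max_workers=self.n_processes)
        return self._pool

    def finalize(self):
        # Shut down the worker pool and release its resources.
        if self._pool is not None:
            self._pool.shutdown(wait=True)
            self._pool = None
```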
2. run_experiment_mpi Function
This helper function is designed to unpack experiment data, set up the necessary logging configurations, run the experiment on the designated MPI rank (node), and return the results. This is the worker function that runs on each of the MPI ranks.
- Logging: each worker configures its logging based on the level passed from the main process, ensuring uniform verbosity across all ranks (a schematic sketch follows below).
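Schematically, the worker function can be pictured like this (a sketch only; the packing format, import path, and `ExperimentRunner` construction in the real `evaluators.py` differ):

```python
import logging

from mpi4py import MPI

from ema_workbench.em_framework.experiment_runner import ExperimentRunner  # assumed path


def run_experiment_mpi(packed):
    # Unpack the experiment, the models, and the log level sent from the
    # main process (shown schematically).
    experiment, models, log_level = packed

    # Match this rank's logging verbosity to the main process.
    logging.basicConfig(level=log_level)
    rank = MPI.COMM_WORLD.Get_rank()
    logging.debug("running experiment %s on rank %s", experiment.experiment_id, rank)

    # Build a fresh ExperimentRunner per call (no global initializer),
    # run the experiment, and ship the outcomes back to the main process.
    runner = ExperimentRunner(models)
    outcomes = runner.run_experiment(experiment)
    return experiment, outcomes
```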
3. Logging Enhancements
A dedicated logger for the `MPIEvaluator` was introduced to provide clarity during debugging and performance tracking. Several measures were taken to ensure uniform logging verbosity across nodes and improve log readability. A `pass_root_logger_level` argument has been added to `ema_logging.log_to_stderr`. This ensures that the root logger level is passed to all modules, so that they all log at identical levels.
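Example (a minimal sketch; `pass_root_logger_level` is the new argument described above, and the exact keyword usage may differ slightly in the final API):

```python
from ema_workbench import ema_logging

# Pass the root logger's level on to all module loggers, so that the main
# process and every MPI rank log at the same verbosity.
ema_logging.log_to_stderr(ema_logging.INFO, pass_root_logger_level=True)
```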
PR structure
This PR is structured in five commits:
MPIEvaluator for HPC Systems (0bc9e15)
- Introduces the `MPIEvaluator` class.
- Adds an initialization function to set up the global `ExperimentRunner` for worker processes.
- While the evaluator builds on the `mpi4py` library, it's necessary to note that the dependency on `mpi4py` only applies when the `MPIEvaluator` is utilized, thus not imposing unnecessary packages on other users.

Enhanced Logging for MPIEvaluator (59a2b7a)

- Introduces detailed logging capabilities for the `MPIEvaluator`.

Integration of MPIEvaluator Tests into CI (f51c29f)

- Ensures the robustness of the `MPIEvaluator` through continuous integration testing.
- Integrates the `MPIEvaluator` into the test suite using mock tests simulating its interaction with `mpi4py`.
- Enhances the CI pipeline (`.github/workflows/ci.yml`) to encompass MPI testing, specifically for Ubuntu with Python 3.10, including installation of the MPI libraries and `mpi4py`.
- With the upcoming `mpi4py 4.0` release, potential breaking changes can be caught early through these mocked tests.
- The mocked tests cover only the `MPIEvaluator` logic and its interactions, and do not delve into the actual internal workings of MPI.

However, a global initializer had issues with re-initializing the MPIEvaluator pool, where the second attempt would consistently throw a 'BrokenExecutor: cannot run initializer' error. This behavior was particularly evident when invoking the MPIEvaluator consecutively in sequence.
After reproducing the issue with simplified examples and confirming its origin, the most robust approach to address this was to eliminate the common initializer function from the MPIEvaluator. This is done in dff46cd. Since the initializer also contained the logger configuration, that part was restored in f67b194. These commits are kept separate to provide insight into the development process and design considerations of the MPIEvaluator.
Refinement of MPIEvaluator Initialization and Experiment Handling (dff46cd)
- Refines the initialization of the `MPIEvaluator`.
- Removes the global `ExperimentRunner` and associated initializer function.
- Updates the `MPIEvaluator` constructor to optionally accept the number of processes (`n_processes`).
- Moves the `ExperimentRunner` instantiation into the `run_experiment_mpi` function to handle experiments.

Logging Configuration Enhancement (f67b194)

- Updates the `run_experiment_mpi` function to configure logging based on the passed level, ensuring uniformity across all worker processes.

Technical changes per file
CI Configuration (`ci.yml`):
- Installs the necessary MPI libraries and the `mpi4py` package.

EMAworkbench Initialization Files:
- Expose the new `MPIEvaluator`.

Evaluator Logic Enhancements (`evaluators.py`):
- Set up the `ExperimentRunner` for worker processes.
- Add the `MPIEvaluator` class with its initialization, finalization, and experiment evaluation logic.

Logging Improvements (`ema_logging.py`):
- Add the `pass_root_logger_level` argument to `log_to_stderr`.

Test Enhancements (`test_evaluators.py`):
- Add mocked tests for the `MPIEvaluator`, conditional on the availability of `mpi4py` and a Linux environment.

Logging
Logging is a big part of this PR. Being able to debug failures and errors on HPC systems effectively is important, because the iteration speed on these systems is often low (since you have to queue jobs).
This is how the current logs work for a simple model example run:
INFO level (20)
DEBUG level
Performance
Overhead
Divided by 10 cores:
Undivided:
Scaling
The first experiment used shared nodes, which shows inconsistent performance:
The second two experiments claimed full nodes exclusively. While normally this isn't best practice, it was useful for getting insight into performance scaling.
Both models are relatively simple with large communication overhead. Testing the performance on a more compute-intensive model would be interesting for future research. The MPIEvaluator-benchmarks branch can be used for this.
Documentation
A tutorial for the MPIEvaluator will be added in PR #308.
Limitations & future enhancements
There are two main limitations currently:
Some other future improvements could be:
Review
This PR is ready for review.
When merging, the preferred method would be a fast-forward merge, to keep the individual commit messages and be able to revert one if necessary.