model execution and case run error #5697
-
Try -res ne4_ne4. That's the resolution we test that compset with.
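For reference, a minimal sketch of recreating the case at that resolution, assuming E3SM 2.1's standard CIME scripts and the "keeling" machine entry from the attached configs (the case name and paths here are hypothetical):

    # Hypothetical case name and paths; adjust to your local E3SM checkout and scratch space.
    cd E3SM/cime/scripts
    ./create_newcase --case ~/e3sm_cases/FAQP.ne4_ne4.keeling \
        --compset FAQP --res ne4_ne4 --machine keeling
    cd ~/e3sm_cases/FAQP.ne4_ne4.keeling
    ./case.setup
    ./case.build
    ./case.submit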
-
What kind of network is used to connect the nodes in your local cluster?
-
FAQP is an "aqua planet" case and is basically an atmosphere-only model. To get a rough estimate of expected performance, you can look at the atmosphere component performance in this paper: https://agupubs.onlinelibrary.wiley.com/doi/full/10.1029/2022MS003156. Those results are for the "ne30pg2" atmosphere grid, which is our standard low-resolution model. "ne4" is a very coarse resolution for testing; it should be well over 10x faster, but it can only make efficient use of up to 96 MPI tasks. On a modern Intel Xeon or AMD EPYC cluster with a low-latency interconnect, I'd think you could get close to 100 simulated years per day on 96 cores. The standard 5-day run writes a lot of data for the restart file, which will skew the timings if your I/O is slow.
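As a rough sketch of how you might set up such a benchmark run, assuming a standard CIME case directory (NTASKS, STOP_OPTION/STOP_N, and REST_OPTION are the usual CIME XML variables; the specific values are assumptions to adapt to your cluster):

    # From the case directory: use 96 MPI tasks for the ne4 atmosphere
    # (assumption: this fits your node count; adjust as needed).
    ./xmlchange NTASKS=96

    # Run longer than the default 5 days and skip restart writes so the
    # timing is dominated by compute rather than I/O.
    ./xmlchange STOP_OPTION=nmonths,STOP_N=1,REST_OPTION=never

    # Redo the setup after changing the PE layout, then rebuild and resubmit
    # (a clean build may be needed after a PE-layout change).
    ./case.setup --reset
    ./case.build
    ./case.submit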
-
Your machine file is set up to allow up to 48 tasks per node but only 1 MPI task per node; both should be changed to 48. If your machine has only 24 cores (and the 48 is counting hyperthreading), set them both to 24, and only worry about using hyperthreading after getting good performance without it. Another important thing for performance is core binding: an option to mpirun that binds each MPI task to a single core. You'll need to check locally what that option should be on keeling.
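A minimal sketch of the kind of change meant here, assuming the case uses CIME's MAX_TASKS_PER_NODE / MAX_MPITASKS_PER_NODE settings and an Open MPI-style launcher (the exact binding flag depends on your MPI stack, and on keeling it would normally go into the <mpirun> arguments in config_machines.xml rather than being typed by hand):

    # From the case directory: allow 48 MPI tasks per node
    # (use 24 instead if the 48 is counting hyperthreads), then redo the PE setup.
    ./xmlchange MAX_TASKS_PER_NODE=48,MAX_MPITASKS_PER_NODE=48
    ./case.setup --reset

    # Core binding with an Open MPI-style launcher (assumption: keeling uses Open MPI;
    # Intel MPI, MVAPICH, etc. use different flags):
    mpirun -np 48 --bind-to core ./e3sm.exe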
-
From that paper, I think this resolution on 5 nodes (AMD EPYC, 64 hardware cores per node) was 5 SYPD. You're getting 0.6 SYPD on 3 nodes, about 6x slower than a relatively new AMD EPYC system. That's a little slow, but it could be explained by an old machine, or by skewed benchmarks from a 5-day run (which is dominated by I/O costs instead of compute). At this resolution the model should scale well up to 5400 MPI tasks, so the fact that you get no improvement going from 3 to 6 nodes suggests something is still wrong with the configuration or mpirun command. You should also be able to get a 1.5x speedup by switching to "ne30pg2" for the atmosphere grid, matching the grid used in that paper.
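To check the SYPD numbers being compared here, one place to look is the coupler timing summary, sketched below assuming a standard E3SM/CIME case layout (file names include the case name and job id, so the pattern is an assumption):

    # After a completed run, a timing summary is written under the case directory.
    ls timing/
    # The throughput line reports simulated years per day, e.g.
    #   Model Throughput: ... simulated_years/day
    grep simulated_years timing/e3sm_timing.*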
-
I am a graduate student in UIUC Atmospheric Sciences, trying to port E3SM 2.1 to our local computing cluster. I have successfully created, built, and run cases for the A and X compsets on our machine. However, I am now facing an error while running an F compset case.
I used resolution ne11_ne11 and compset FAQP to create a new case.
MODEL BUILD HAS FINISHED SUCCESSFULLY
E3SM log
e3sm.log.1672085.230517-161207.txt
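For pinpointing the run-time error in a log like the one attached, a quick sketch assuming the default CIME layout (xmlquery and the run-directory logs are standard; the exact log suffixes differ per job):

    # Find the run directory for this case.
    ./xmlquery RUNDIR
    # Then check the end of the most recent logs there, e.g.
    # e3sm.log.*, cpl.log.*, atm.log.* (paths below are placeholders):
    tail -n 50 /path/to/rundir/e3sm.log.*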
Machine Configuration: --mach keeling
keeling.cmake.txt
config_machines.xml.txt
config_batch.xml.txt
build_environment.txt