rt.sh gnu pdlib jobs hanging #2356
-
I am testing new MOM6 code on Hera, but the two gnu pdlib jobs are hanging (see my err file at /scratch1/NCEPDEV/stmp2/Jiande.Wang/FV3_RT/rt_3359678/cpld_debug_pdlib_p8_gnu/err). Has anyone encountered this situation before? Is it a system or library issue?
Replies: 10 comments 1 reply
-
The issue could be related to my code, as it did not appear when I tried the develop branch.
-
@jiandewang On Hera, I got a similar error for this test using the current develop of MOM6 + CMEPS PR, so it is not your MOM6 changes. When I re-ran in the same run directory, it ran fine. You might want to try just re-running in the same failed directory.
-
@DeniseWorthen Thanks for the information. I only tried the develop branch once, and that did not give me an error. For the MOM6 PR code I am testing, the issue is reproducible: I tried at least 6 times at different times on Hera and it hung each time, but the issue does not exist on Hercules. This is the error message from the log file: Your Open MPI job may now hang or fail. Local host: h11c27 For the MOM6 PR, if I remove one of the "KPP commits", which contains some new 3D variable initialization, then the problem goes away. So the problem is related to insufficient memory.
-
@junwang-noaa @DeniseWorthen @sanAkel
-
@jiandewang So 28PE resources for OCN in this test reliably work, right?
-
@DeniseWorthen I tried 20 times with 28PE and all runs were fine.
-
And how did you implement the change? Did you just "bump" the OCN resources in the test?
-
I added OCN_tasks="$((OCN_tasks_cpl_unstr + 8))" in test/tests/cpld_debug_pdlib_p8
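For readers unfamiliar with the syntax, here is a minimal sketch of how that line evaluates in the shell. The baseline value of 20 for OCN_tasks_cpl_unstr is an assumption, inferred only from the 28PE figure mentioned in this thread; the real value comes from the regression-test defaults and may differ.

```shell
# Hypothetical baseline: assumed to be 20 so that the result matches the
# 28PE discussed in this thread. The real default is set elsewhere.
OCN_tasks_cpl_unstr=20

# The override placed in test/tests/cpld_debug_pdlib_p8: shell arithmetic
# expansion adds 8 tasks on top of the baseline for this one test only.
OCN_tasks="$((OCN_tasks_cpl_unstr + 8))"

echo "OCN_tasks=${OCN_tasks}"
```

Because the override is expressed relative to OCN_tasks_cpl_unstr, a later change to the baseline would carry over to this test without editing the test file again.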
-
OK, this is also what we did for the pdlib wave tasks in the debug test. That isolates the extra resources to just that test, so I think this is fine to use.
-
I had a look at these variables, and although they appear in a few OpenMP blocks, they are all declared as shared(). So I am unsure how they could have used OMP stack memory.
Initializing them could well increase the memory usage; not every compiler is going to store them in BSS, particularly a GNU debug build. It is the OpenMP stack memory that is confusing me. But maybe, as suggested by others, this is just a symptom of high memory usage and too many things bumping up against each other.
It might take a serious memory inspection to figure out what's going on.