Setting DOUT_S=true in run_template.sh script resulted in parallel submission of jobs with RESUBMIT>0. #5903
Closed
jacob-stu-allen started this conversation in E3SM model help
Replies: 2 comments
-
I also saw this recently and opened issue #5901. For now I would recommend doing short-term archiving "manually".
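One way to read "manually" here, offered as a sketch rather than the maintainers' prescribed procedure: turn off automatic archiving before submitting, then run the archiver by hand once the resubmission chain has finished. `./xmlchange` and `./case.st_archive` are the standard CIME case scripts; the helper function names, the `DRY_RUN` switch, and the example path are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch of "manual" short-term archiving for a CIME/E3SM case.
# Assumptions: the function names, DRY_RUN, and the example path are illustrative only.
# With DRY_RUN=1 (the default) commands are printed instead of executed.

run_cmd() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Step 1 (before case.submit): stop the run job from launching st_archive itself.
disable_auto_archive() (
    run_cmd cd "$1" || exit 1
    run_cmd ./xmlchange DOUT_S=FALSE
)

# Step 2 (after all resubmissions have finished): archive by hand.
archive_by_hand() (
    run_cmd cd "$1" || exit 1
    run_cmd ./case.st_archive
)

# Example (dry run): prints the commands that would be executed.
disable_auto_archive /path/to/case_scripts
archive_by_hand /path/to/case_scripts
```

Set `DRY_RUN=0` to actually run the commands from a real case_scripts directory. Each function body runs in a subshell, so the caller's working directory is left unchanged.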
-
Yes, please follow up on the issue above. This is a bug.
-
Hello,
I've been using a slightly modified (mostly namelist/mpas changes) version of the run_e3sm.template.sh, and recently encountered a very undesirable effect from setting DOUT_S=true on a hybrid, WCYCL1850S case branching from a 500-year piControl run (v2_LR).
As expected, st_archive moves the ${RUNDIR}/timing/checkpoints/* files, as well as some of the log files, into a new archive directory. However, the simulation then fails, apparently because the model can't find these timing/log files to write to. This happens in the POST_RUN_CHECK, after the model appears to have successfully integrated.
What's strange (I haven't figured out why this part happened) is that my case, which had RESUBMIT=12 and STOP_N=8, was then initialized 12 times over the course of an hour on Perlmutter, without any individual run coming to completion. Data was output for 14 years in the ${RUNDIR}/run directory, and about 6 years of data was archived into ${RUNDIR}/archive/. In total, only 14 years were successfully output, yet each simulation used the ~550 node hours that it takes for 8 simulated years in this configuration.
run_dir_logs.zip
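As a rough tally of the figures above (a sketch only: it assumes all 12 launches were each charged the full ~550 node hours, as described, and the variable names are illustrative):

```shell
#!/usr/bin/env bash
# Rough tally using only the numbers quoted in the post; variable names are illustrative.
launches=12               # times the model was initialized
node_hours_per_run=550    # approximate cost of one 8-year segment
stop_n=8                  # simulated years per segment (STOP_N)
years_output=14           # years actually written

# Total charged if every launch cost a full segment.
node_hours_used=$(( launches * node_hours_per_run ))
# Nominal cost of the years that were produced (integer division, rounds down).
node_hours_nominal=$(( years_output * node_hours_per_run / stop_n ))

echo "charged: ~${node_hours_used} node hours"
echo "nominal: ~${node_hours_nominal} node hours for ${years_output} years"
```

The gap between the two numbers is the allocation spent on runs that overlapped instead of extending the simulation.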
The archived e3sm.log files have no errors, but the e3sm.log file in the ${RUNDIR}/run directory references a "file_not_found" for timing/logs. The CaseStatus file in the case_scripts directory shows the best summary of what happened.
case_scripts_dir_logs.zip
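For anyone wanting to spot this pattern early, a minimal sketch of a CaseStatus check follows. It assumes that CaseStatus records entries containing the phrases "case.run starting" and "case.run success"; those patterns and the function names are assumptions to adapt to your own CIME version, not a documented interface.

```shell
#!/usr/bin/env bash
# Sketch: count run starts versus successes recorded in a CaseStatus file.
# Assumption: entries contain the phrases "case.run starting" and "case.run success";
# adjust the patterns if your CIME version words them differently.

count_matches() {
    # Print the number of lines in file $2 containing the fixed string $1.
    # grep -c prints 0 when nothing matches, so only its exit status is suppressed.
    grep -c -F -- "$1" "$2" || true
}

check_case_status() {
    status_file="$1"
    [ -r "$status_file" ] || return 2
    starts=$(count_matches "case.run starting" "$status_file")
    successes=$(count_matches "case.run success" "$status_file")
    echo "starts=${starts} successes=${successes}"
    # More starts than successes means at least one run has not finished cleanly.
    [ "$starts" -le "$successes" ]
}

# Usage: check_case_status ./CaseStatus || echo "unfinished runs; hold off on resubmitting"
```

A run that is still in progress will also show one more start than successes, so a difference of one is only a warning sign when nothing is currently running; a larger difference suggests repeated launches like those described above.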
This ate up a large chunk of our allocation without outputting useable data, so I'm posting here to see if this issue has been brought up before and to hopefully prevent someone else from doing the same thing.
I've attached the log files if they're helpful and I'm happy to provide anything else.
Cheers,
Jacob