Model crash at a specific forecast time #2494

Open
benjamin-cash opened this issue Nov 8, 2024 · 8 comments

Labels
bug Something isn't working

@benjamin-cash

I am running a 6-month, 10-member C192mx025 ensemble, and so far 4 of the 10 members have reached a specific n_atmsteps value and crashed with the error below (with slight variations between runs).

zeroing coupling accumulated fields at kdt=        12527
 zeroing coupling accumulated fields at kdt=        12527
PASS: fcstRUN phase 2, n_atmsteps =            12526 time is         0.326497
  (zap_snow_temperature)zap_snow_temperature: temperature out of bounds!
  (zap_snow_temperature)k:           1
  (zap_snow_temperature)zTsn:  4.636445953738447E-003
  (zap_snow_temperature)Tmin:  -100.000000000000
  (zap_snow_temperature)Tmax:  9.797644457066566E-006
  (zap_snow_temperature)zqsn:  -110216777.762791
 ncells=           5
 nlives=          12
 nthresh=   18.0000000000000

Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2263: comm->shm_numa_layout[my_numa_node].base_addr
/opt/spack-stack/spack-stack-1.8.0/envs/unified-env/install/intel/2021.10.0/intel-oneapi-mpi-2021.12.1-pvycn2u/mpi/2021.12/lib/libmpi.so.12(MPL_backtrace_show+0x24) [0x2b50027c0bf4]
/opt/spack-stack/spack-stack-1.8.0/envs/unified-env/install/intel/2021.10.0/intel-oneapi-mpi-2021.12.1-pvycn2u/mpi/2021.12/lib/libmpi.so.12(MPIR_Assert_fail+0x21) [0x2b5002367ba1]
/opt/spack-stack/spack-stack-1.8.0/envs/unified-env/install/intel/2021.10.0/intel-oneapi-mpi-2021.12.1-pvycn2u/mpi/2021.12/lib/libmpi.so.12(+0x368e7e) [0x2b5002168e7e]
/opt/spack-stack/spack-stack-1.8.0/envs/unified-env/install/intel/2021.10.0/intel-oneapi-mpi-2021.12.1-pvycn2u/mpi/2021.12/lib/libmpi.so.12(+0x262d66) [0x2b5002062d66]
/opt/spack-stack/spack-stack-1.8.0/envs/unified-env/install/intel/2021.10.0/intel-oneapi-mpi-2021.12.1-pvycn2u/mpi/2021.12/lib/libmpi.so.12(+0x2ae65c) [0x2b50020ae65c]
/opt/spack-stack/spack-stack-1.8.0/envs/unified-env/install/intel/2021.10.0/intel-oneapi-mpi-2021.12.1-pvycn2u/mpi/2021.12/lib/libmpi.so.12(+0x26a45a) [0x2b500206a45a]
/opt/spack-stack/spack-stack-1.8.0/envs/unified-env/install/intel/2021.10.0/intel-oneapi-mpi-2021.12.1-pvycn2u/mpi/2021.12/lib/libmpi.so.12(+0x25f7c9) [0x2b500205f7c9]
/opt/spack-stack/spack-stack-1.8.0/envs/unified-env/install/intel/2021.10.0/intel-oneapi-mpi-2021.12.1-pvycn2u/mpi/2021.12/lib/libmpi.so.12(+0x379451) [0x2b5002179451]
/opt/spack-stack/spack-stack-1.8.0/envs/unified-env/install/intel/2021.10.0/intel-oneapi-mpi-2021.12.1-pvycn2u/mpi/2021.12/lib/libmpi.so.12(MPI_Bcast+0x22c) [0x2b5001f53d8c]
/opt/spack-stack/spack-stack-1.8.0/envs/unified-env/install/intel/2021.10.0/parallelio-2.6.2-d7wj357/lib/libpioc.so(PIOc_createfile_int+0x336) [0x2b5000bfb186]
/opt/spack-stack/spack-stack-1.8.0/envs/unified-env/install/intel/2021.10.0/parallelio-2.6.2-d7wj357/lib/libpioc.so(PIOc_createfile+0x41) [0x2b5000bf70c1]
/opt/spack-stack/spack-stack-1.8.0/envs/unified-env/install/intel/2021.10.0/parallelio-2.6.2-d7wj357/lib/libpiof.so(piolib_mod_mp_createfile_+0x1ae) [0x2b5000c5642e]

Looking at the history files, in each case the model was partway through writing out gefs.ocean.t00z.24hr_avg.f2064.nc when it crashed. All of the crashes happened at different wall-clock times, so it wasn't something simple like a transient disk issue causing a temporary error while writing files. Is there some kind of internal limit that I've encountered?
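As a rough cross-check of where in the forecast the failure sits, here is a minimal timeline sketch. The 600 s atmosphere time step is an assumption for C192, not taken from this run's model_configure:

# Hedged timeline check: convert the crashing atmosphere step to a forecast hour.
# DT_ATMOS_S = 600 is an assumed C192 time step; use the value from model_configure.
DT_ATMOS_S = 600
KDT_AT_CRASH = 12527          # "zeroing coupling accumulated fields at kdt=12527"

crash_hour = KDT_AT_CRASH * DT_ATMOS_S / 3600.0
print(f"kdt={KDT_AT_CRASH} -> forecast hour ~{crash_hour:.1f}")
# ~2087.8 h under that assumption, i.e. between the f2064 ocean average being
# written and the f2088 ice average whose creation appears in the traceback.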

benjamin-cash added the bug (Something isn't working) label on Nov 8, 2024
@LarissaReames-NOAA

How often are you writing ice history/restart files?

@benjamin-cash
Author

Every 24 hours. Rahul thinks I may have hit a limit on the number of CICE outputs - it sounds like you are thinking along the same lines.

@DeniseWorthen
Collaborator

At one file per day for CICE, even the full 6 months would only be ~180 files. That is well below the approximate "600 file" limit.
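(As a quick sanity check of that estimate against what is actually on disk; the COMROT path in the sketch below is a hypothetical placeholder, not a path from this issue:)

# Sketch: count the CICE history files produced so far and compare with the
# approximate 600-file limit. The output directory is a placeholder.
from pathlib import Path

comrot = Path("/path/to/COMROT")                       # hypothetical path
ice_files = sorted(comrot.glob("**/gefs.ice.t00z.24hr_avg.f????.nc"))
print(f"{len(ice_files)} ice history files (expect ~180 for 6 months at 1/day)")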

@LarissaReames-NOAA

Yeah, I agree with Denise's assessment. Can you tell which process it's failing on?

@benjamin-cash
Author

I don't seem to have any more debug information than what is in the original message (no PET files). The biggest clue I have so far is that in all four runs that crashed, the crash occurred partway through writing gefs.ocean.t00z.24hr_avg.f2064.nc. It also looks like they all created gefs.ice.t00z.24hr_avg.f2088.nc, but that file has zero size.

Edit: Given that the traceback ends in PIOc_createfile, it seems like the failure may actually be in attempting to create that gefs.ice.t00z.24hr_avg.f2088.nc file?
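One hedged way to confirm that is to scan the output for zero-byte history files; the directory in the sketch below is a hypothetical placeholder for this experiment's output location:

# Sketch: list zero-size history files, which would mark the file whose
# PIOc_createfile call started but whose data were never written.
from pathlib import Path

outdir = Path("/path/to/COMROT/history")               # hypothetical path
for f in sorted(outdir.glob("gefs.*.t00z.24hr_avg.f????.nc")):
    if f.stat().st_size == 0:
        print("zero-size file (create interrupted):", f.name)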

@benjamin-cash
Author

Update - I am running a new case with forecast breakpoints set every 1472 hours in the global workflow, and the run has now progressed past this point (currently at about hour 2500), so it certainly appears that my earlier runs encountered an internal limit of some kind.
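For reference, a small sketch of the arithmetic behind that workaround (segment length and failure hour are taken from this thread; the "per-invocation limit" reading is only a hypothesis):

# With 1472 h segments the executable restarts at hour 1472, so at the
# previous failure point (~hour 2064-2088) the current invocation has only
# been running for ~600 forecast hours rather than ~2088.
SEGMENT_HOURS = 1472
FAILURE_HOUR = 2088            # hour of the zero-size ice file in the crashed runs

hours_into_segment = FAILURE_HOUR % SEGMENT_HOURS
print(f"hours into current segment at the old failure point: {hours_into_segment}")
# 616 -- consistent with some internal, per-invocation limit no longer being reached.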

@NickSzapiro-NOAA
Collaborator

@benjamin-cash Do you continue to see this zap_snow_temperature error in the log files (even if it doesn't crash)?

@benjamin-cash
Author

@NickSzapiro-NOAA - Yes, those errors appear steadily throughout the simulation.
