Model crash at a specific forecast time #2494

Open
benjamin-cash opened this issue Nov 8, 2024 · 8 comments

Labels
bug Something isn't working

@benjamin-cash

I am running a 6-month, 10-member C192mx025 ensemble, and so far 4 of the 10 members have reached a specific n_atmsteps value and crashed with the error below (with slight variations between runs).

zeroing coupling accumulated fields at kdt=        12527
 zeroing coupling accumulated fields at kdt=        12527
PASS: fcstRUN phase 2, n_atmsteps =            12526 time is         0.326497
  (zap_snow_temperature)zap_snow_temperature: temperature out of bounds!
  (zap_snow_temperature)k:           1
  (zap_snow_temperature)zTsn:  4.636445953738447E-003
  (zap_snow_temperature)Tmin:  -100.000000000000
  (zap_snow_temperature)Tmax:  9.797644457066566E-006
  (zap_snow_temperature)zqsn:  -110216777.762791
 ncells=           5
 nlives=          12
 nthresh=   18.0000000000000

Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2263: comm->shm_numa_layout[my_numa_node].base_addr
/opt/spack-stack/spack-stack-1.8.0/envs/unified-env/install/intel/2021.10.0/intel-oneapi-mpi-2021.12.1-pvycn2u/mpi/2021.12/lib/libmpi.so.12(MPL_backtrace_show+0x24) [0x2b50027c0bf4]
/opt/spack-stack/spack-stack-1.8.0/envs/unified-env/install/intel/2021.10.0/intel-oneapi-mpi-2021.12.1-pvycn2u/mpi/2021.12/lib/libmpi.so.12(MPIR_Assert_fail+0x21) [0x2b5002367ba1]
/opt/spack-stack/spack-stack-1.8.0/envs/unified-env/install/intel/2021.10.0/intel-oneapi-mpi-2021.12.1-pvycn2u/mpi/2021.12/lib/libmpi.so.12(+0x368e7e) [0x2b5002168e7e]
/opt/spack-stack/spack-stack-1.8.0/envs/unified-env/install/intel/2021.10.0/intel-oneapi-mpi-2021.12.1-pvycn2u/mpi/2021.12/lib/libmpi.so.12(+0x262d66) [0x2b5002062d66]
/opt/spack-stack/spack-stack-1.8.0/envs/unified-env/install/intel/2021.10.0/intel-oneapi-mpi-2021.12.1-pvycn2u/mpi/2021.12/lib/libmpi.so.12(+0x2ae65c) [0x2b50020ae65c]
/opt/spack-stack/spack-stack-1.8.0/envs/unified-env/install/intel/2021.10.0/intel-oneapi-mpi-2021.12.1-pvycn2u/mpi/2021.12/lib/libmpi.so.12(+0x26a45a) [0x2b500206a45a]
/opt/spack-stack/spack-stack-1.8.0/envs/unified-env/install/intel/2021.10.0/intel-oneapi-mpi-2021.12.1-pvycn2u/mpi/2021.12/lib/libmpi.so.12(+0x25f7c9) [0x2b500205f7c9]
/opt/spack-stack/spack-stack-1.8.0/envs/unified-env/install/intel/2021.10.0/intel-oneapi-mpi-2021.12.1-pvycn2u/mpi/2021.12/lib/libmpi.so.12(+0x379451) [0x2b5002179451]
/opt/spack-stack/spack-stack-1.8.0/envs/unified-env/install/intel/2021.10.0/intel-oneapi-mpi-2021.12.1-pvycn2u/mpi/2021.12/lib/libmpi.so.12(MPI_Bcast+0x22c) [0x2b5001f53d8c]
/opt/spack-stack/spack-stack-1.8.0/envs/unified-env/install/intel/2021.10.0/parallelio-2.6.2-d7wj357/lib/libpioc.so(PIOc_createfile_int+0x336) [0x2b5000bfb186]
/opt/spack-stack/spack-stack-1.8.0/envs/unified-env/install/intel/2021.10.0/parallelio-2.6.2-d7wj357/lib/libpioc.so(PIOc_createfile+0x41) [0x2b5000bf70c1]
/opt/spack-stack/spack-stack-1.8.0/envs/unified-env/install/intel/2021.10.0/parallelio-2.6.2-d7wj357/lib/libpiof.so(piolib_mod_mp_createfile_+0x1ae) [0x2b5000c5642e]

Looking at the history files, in each case the model was partway through writing out gefs.ocean.t00z.24hr_avg.f2064.nc when it crashed. All of the crashes happened at different wall-clock times, so it wasn't something simple like a transient disk issue causing a temporary error while writing files. Is there some kind of internal limit that I've encountered?
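As a rough cross-check of where in the forecast the failure sits, here is a minimal timeline sketch. The 600 s atmosphere time step is an assumption for C192, not taken from this run's model_configure:

# Hedged timeline check: convert the crashing atmosphere step to a forecast hour.
# DT_ATMOS_S = 600 is an assumed C192 time step; use the value from model_configure.
DT_ATMOS_S = 600
KDT_AT_CRASH = 12527          # "zeroing coupling accumulated fields at kdt=12527"

crash_hour = KDT_AT_CRASH * DT_ATMOS_S / 3600.0
print(f"kdt={KDT_AT_CRASH} -> forecast hour ~{crash_hour:.1f}")
# ~2087.8 h under that assumption, i.e. between the f2064 ocean average being
# written and the f2088 ice average whose creation appears in the traceback.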

benjamin-cash added the bug (Something isn't working) label on Nov 8, 2024
@LarissaReames-NOAA

How often are you writing ice history/restart files?

@benjamin-cash
Author

Every 24 hours. Rahul thinks I may have hit a limit on the number of CICE outputs - it sounds like you are thinking along the same lines.

@DeniseWorthen
Collaborator

At one file per day for CICE, even the full 6 months would only be ~180 files. That is well below the approximate "600 file" limit.
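(As a quick sanity check of that estimate against what is actually on disk; the COMROT path in the sketch below is a hypothetical placeholder, not a path from this issue:)

# Sketch: count the CICE history files produced so far and compare with the
# approximate 600-file limit. The output directory is a placeholder.
from pathlib import Path

comrot = Path("/path/to/COMROT")                       # hypothetical path
ice_files = sorted(comrot.glob("**/gefs.ice.t00z.24hr_avg.f????.nc"))
print(f"{len(ice_files)} ice history files (expect ~180 for 6 months at 1/day)")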

@LarissaReames-NOAA

Yeah, I agree with Denise's assessment. Can you tell which process it's failing on?

@benjamin-cash
Author

I don't seem to have any more debug information than what is in the original message (no PET files). The biggest clue I have so far is that in all four runs that crashed, the crash occurred partway through writing gefs.ocean.t00z.24hr_avg.f2064.nc. It also looks like they all created gefs.ice.t00z.24hr_avg.f2088.nc, but that file has zero size.

Edit: Given that the traceback ends in PIOc_createfile, it seems like the failure may actually be in attempting to create that gefs.ice.t00z.24hr_avg.f2088.nc file?
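One hedged way to confirm that is to scan the output for zero-byte history files; the directory in the sketch below is a hypothetical placeholder for this experiment's output location:

# Sketch: list zero-size history files, which would mark the file whose
# PIOc_createfile call started but whose data were never written.
from pathlib import Path

outdir = Path("/path/to/COMROT/history")               # hypothetical path
for f in sorted(outdir.glob("gefs.*.t00z.24hr_avg.f????.nc")):
    if f.stat().st_size == 0:
        print("zero-size file (create interrupted):", f.name)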

@benjamin-cash
Author

Update - I am running a new case with forecast breakpoints set every 1472 hours in the global workflow, and the run has now progressed past this point (currently at about hour 2500), so it certainly appears that my earlier runs encountered an internal limit of some kind.
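For reference, a small sketch of the arithmetic behind that workaround (segment length and failure hour are taken from this thread; the "per-invocation limit" reading is only a hypothesis):

# With 1472 h segments the executable restarts at hour 1472, so at the
# previous failure point (~hour 2064-2088) the current invocation has only
# been running for ~600 forecast hours rather than ~2088.
SEGMENT_HOURS = 1472
FAILURE_HOUR = 2088            # hour of the zero-size ice file in the crashed runs

hours_into_segment = FAILURE_HOUR % SEGMENT_HOURS
print(f"hours into current segment at the old failure point: {hours_into_segment}")
# 616 -- consistent with some internal, per-invocation limit no longer being reached.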

@NickSzapiro-NOAA
Collaborator

@benjamin-cash Do you continue to see this zap_snow_temperature error in the log files (even if it doesn't crash)?

@benjamin-cash
Author

@NickSzapiro-NOAA - Yes, those errors appear steadily throughout the simulation.
