
Issue 694: Upgrade/refactoring for U and V write-out sub for FV3REG GSI failure … #698

Conversation

@TingLei-daprediction (Contributor) commented Feb 14, 2024

DUE DATE for merger of this PR into develop is 3/27/2024 (six weeks after PR creation).

Resolves #693 (thanks to @edwardhartnett's suggestions)
Resolves #694 (this PR is not able to provide a stable solution; more details will be given on the issue page)
Resolves #697: Even with larger requested memory for each MPI task, differences in the analysis files between the loproc and hiproc control runs still appeared at times on Hercules. Whether integrating this with the refactored IO part will provide a stable solution remains to be seen.

This PR resolves the newly emerged issue with IO of netCDF files in contiguous storage, with upgraded FV3REG IO for the cold start options (co-author Ming Hu @hu5970).
This PR was worked on in collaboration with Pete Johnson through the RDHPCS help desk, @RussTreadon-NOAA, and @DavidHuber-NOAA, with thanks to help from Raghu Reddy through the RDHPCS help desk.

@RussTreadon-NOAA (Contributor)

@TingLei-NOAA , please update TingLei-daprediction:feature/fv3reg_parallel_io_upgrade to the current head of NOAA-EMC/GSI develop. PR #684 is not in your branch.

@TingLei-NOAA (Contributor)

@RussTreadon-NOAA Sure. I am verifying TingLei-daprediction:feature/fv3reg_parallel_io_upgrade with the current EMC GSI.

@RussTreadon-NOAA (Contributor)

PR #684 was merged into NOAA-EMC/GSI develop yesterday afternoon (1:07 pm EST, 2/14/2024). Your branch is one commit behind the current EMC GSI develop.

@hu5970 (Collaborator) commented Feb 16, 2024

@TingLei-NOAA The RRFS does not use the "fv3_io_layout_y > 1" option anymore.

@TingLei-NOAA (Contributor) commented Feb 20, 2024

An update: thanks to discussions/collaborations with Peter Johnson at the Hercules help desk, @RussTreadon-NOAA, @DavidHuber-NOAA, @edwardhartnett, and other colleagues, the following has been found: 1) I agree with Peter Johnson's speculation that the culprit is the MPI library on Hercules; further experiments, such as using a different MPI library, could be run to confirm this if needed. 2) The current PR #698 appears to be a solution/work-around: it has run successfully in 400-plus of my runs. It should be noted that without the nf90_collective mode added as Peter Johnson proposed, GSI would fail about once in 50-plus runs. The issue was found and resolved for the warm-restart case, but a similar change has also been added to the write-out subroutine for the cold start cases, gsi_fv3ncdf_writeuv_v1. The verification of that part will be reported later.
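For readers not familiar with the change being described, below is a minimal sketch of the general pattern: switching a variable to collective parallel access before each MPI task writes its portion of the field. This is not the actual GSI code; the program, file name ('fv3_dynvars.nc'), variable name ('u'), and one-dimensional decomposition are all illustrative assumptions.

```fortran
! Minimal sketch (not the GSI implementation): each MPI task writes its slab of a
! field through netCDF-4 parallel I/O with collective access for the variable.
program collective_write_sketch
  use mpi
  use netcdf
  implicit none
  integer :: ierr, rank, nprocs, ncid, varid
  integer :: start_idx(1), count_idx(1)
  real    :: u_sub(10)                   ! this task's portion of the field (illustrative)

  call mpi_init(ierr)
  call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
  call mpi_comm_size(MPI_COMM_WORLD, nprocs, ierr)
  u_sub = real(rank)

  ! Open an existing netCDF-4 file for parallel writing (names are hypothetical)
  ierr = nf90_open('fv3_dynvars.nc', ior(NF90_WRITE, NF90_MPIIO), ncid, &
                   comm=MPI_COMM_WORLD, info=MPI_INFO_NULL)
  ierr = nf90_inq_varid(ncid, 'u', varid)

  ! The change described above: request collective (rather than independent)
  ! access for this variable before the distributed write.
  ierr = nf90_var_par_access(ncid, varid, NF90_COLLECTIVE)

  ! All tasks must reach this call; with collective access the write is a
  ! coordinated operation across the communicator.
  start_idx = (/ rank*size(u_sub) + 1 /)
  count_idx = (/ size(u_sub) /)
  ierr = nf90_put_var(ncid, varid, u_sub, start=start_idx, count=count_idx)

  ierr = nf90_close(ncid)
  call mpi_finalize(ierr)
end program collective_write_sketch
```

Building such a sketch requires an MPI-enabled netCDF-4/HDF5 installation; the nf90_var_par_access call with NF90_COLLECTIVE is what the comment above refers to as the nf90_collective mode.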

@RussTreadon-NOAA (Contributor)

Hercules hafs_3denvar_hybens

Build develop at 74ac594 and TingLei-daprediction:feature/fv3reg_parallel_io_upgrade at e0dc8d0 on Hercules.

The hafs_3denvar_hybens ctest fails due to non-reproducible fv3_tracer files. The hiproc_contrl, loproc_contrl, and loproc_updat fv3_tracer are identical. The hiproc_updat differs for variable sphum

xaxis_1 min/max 1=1.0,720.0 min/max 2=1.0,720.0 max abs diff=0.0000000000
yaxis_1 min/max 1=1.0,540.0 min/max 2=1.0,540.0 max abs diff=0.0000000000
zaxis_1 min/max 1=1.0,65.0 min/max 2=1.0,65.0 max abs diff=0.0000000000
Time min/max 1=1.0,1.0 min/max 2=1.0,1.0 max abs diff=0.0000000000
sphum min/max 1=0.0,0.022951916 min/max 2=0.0,0.022951916 max abs diff=0.0000935391
liq_wat min/max 1=-1.27048754e-20,0.001674175 min/max 2=-1.27048754e-20,0.001674175 max abs diff=0.0000000000
rainwat min/max 1=-6.4782625e-19,0.0052170595 min/max 2=-6.4782625e-19,0.0052170595 max abs diff=0.0000000000
ice_wat min/max 1=-9.211089e-21,0.0016910682 min/max 2=-9.211089e-21,0.0016910682 max abs diff=0.0000000000
snowwat min/max 1=-4.735845e-20,0.0022513575 min/max 2=-4.735845e-20,0.0022513575 max abs diff=0.0000000000
graupel min/max 1=-8.9380465e-20,0.01118745 min/max 2=-8.9380465e-20,0.01118745 max abs diff=0.0000000000
o3mr min/max 1=1.4650413e-08,1.6894655e-05 min/max 2=1.4650413e-08,1.6894655e-05 max abs diff=0.0000000000
sgs_tke min/max 1=-1.38765404e-17,37.100296 min/max 2=-1.38765404e-17,37.100296 max abs diff=0.0000000000
cld_amt min/max 1=0.0,1.0 min/max 2=0.0,1.0 max abs diff=0.0000000000

@TingLei-NOAA (Contributor)

@RussTreadon-NOAA Yes, this PR by itself could not resolve issue #697.

@RussTreadon-NOAA (Contributor)

Oh, I see. I read your comment

> the current PR #698 seems to be a solution/work-around when it has been running successfully in my runs of 400 plus.

and assumed you had a fix.

@TingLei-NOAA (Contributor)

@RussTreadon-NOAA I meant this PR is the fix/work-around for issue 694 (Parallel netCDF I/O failures on Hercules with I_MPI_EXTRA_FILESYSTEM=1).

@RussTreadon-NOAA (Contributor)

Thanks for the clarification. ush/sub_hercules in TingLei-daprediction:feature/fv3reg_parallel_io_upgrade still includes

echo "module load gsi_hercules.intel" >> $cfile
#TODO reenable I_MPI_EXTRA_FILESYSTEM once regional ctests can properly handle parallel I/O on Hercules
echo "unset I_MPI_EXTRA_FILESYSTEM"         >> $cfile
echo ""                                                    >> $cfile

Bottom line: we still have non-reproducible hafs_3denvar_hybens results on Hercules.

@TingLei-NOAA (Contributor)

@RussTreadon-NOAA Thanks. I will clean up that part.

@TingLei-NOAA (Contributor)

As mentioned earlier, similar refactoring and changes have been made to the write-out of winds for the cold start option. @ShunLiu-NOAA and @hu5970 recently found that, with the newer netCDF library, GSI becomes idle when the cold start input files are in contiguous storage. Further investigation confirmed that it becomes idle because the reading MPI processes become idle. Hence, it is believed that changes like those in #571 are needed in the reading subroutine for the cold start files; they will be included in #698, which corresponds to this issue.
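For illustration only, here is a sketch of the analogous change on the reading side, assuming the fix mirrors the collective-access approach used for the writes. This is not the #571 or #698 code; the file name, variable name, and decomposition are hypothetical.

```fortran
! Minimal sketch (not the actual change): open the cold-start file for parallel
! reading and use collective access so every task participates in the read,
! rather than some tasks idling while others perform independent I/O.
program collective_read_sketch
  use mpi
  use netcdf
  implicit none
  integer :: ierr, rank, ncid, varid
  integer :: start_idx(1), count_idx(1)
  real    :: u_sub(10)                   ! this task's portion (illustrative)

  call mpi_init(ierr)
  call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)

  ! File and variable names are hypothetical placeholders
  ierr = nf90_open('gfs_data.nc', ior(NF90_NOWRITE, NF90_MPIIO), ncid, &
                   comm=MPI_COMM_WORLD, info=MPI_INFO_NULL)
  ierr = nf90_inq_varid(ncid, 'u_w', varid)

  ! Collective access on the read path, analogous to the write-side fix
  ierr = nf90_var_par_access(ncid, varid, NF90_COLLECTIVE)

  start_idx = (/ rank*size(u_sub) + 1 /)
  count_idx = (/ size(u_sub) /)
  ierr = nf90_get_var(ncid, varid, u_sub, start=start_idx, count=count_idx)

  ierr = nf90_close(ncid)
  call mpi_finalize(ierr)
end program collective_read_sketch
```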

@RussTreadon-NOAA (Contributor)

Hercules test

Install TingLei-daprediction:feature/fv3reg_parallel_io_upgrade at e0dc8d0. Run hafs_3denvar_hybens and hafs_4denvar_glbens. Test hafs_3denvar_hybens passed. Test hafs_4denvar_glbens failed.

hafs_4denvar_glbens failed because output file fv3_tracer is not identical between the loproc_updat and hiproc_updat runs. Differences are found in the sphum field

xaxis_1 min/max 1=1.0,720.0 min/max 2=1.0,720.0 max abs diff=0.0000000000
yaxis_1 min/max 1=1.0,540.0 min/max 2=1.0,540.0 max abs diff=0.0000000000
zaxis_1 min/max 1=1.0,65.0 min/max 2=1.0,65.0 max abs diff=0.0000000000
Time min/max 1=1.0,1.0 min/max 2=1.0,1.0 max abs diff=0.0000000000
sphum min/max 1=0.0,0.022629073 min/max 2=0.0,0.022629073 max abs diff=0.0000017433
liq_wat min/max 1=-1.27048754e-20,0.001674175 min/max 2=-1.27048754e-20,0.001674175 max abs diff=0.0000000000
rainwat min/max 1=-6.4782625e-19,0.0052170595 min/max 2=-6.4782625e-19,0.0052170595 max abs diff=0.0000000000
ice_wat min/max 1=-9.211089e-21,0.0016910682 min/max 2=-9.211089e-21,0.0016910682 max abs diff=0.0000000000
snowwat min/max 1=-4.735845e-20,0.0022513575 min/max 2=-4.735845e-20,0.0022513575 max abs diff=0.0000000000
graupel min/max 1=-8.9380465e-20,0.01118745 min/max 2=-8.9380465e-20,0.01118745 max abs diff=0.0000000000
o3mr min/max 1=1.4650413e-08,1.6894655e-05 min/max 2=1.4650413e-08,1.6894655e-05 max abs diff=0.0000000000
sgs_tke min/max 1=-1.38765404e-17,37.100296 min/max 2=-1.38765404e-17,37.100296 max abs diff=0.0000000000
cld_amt min/max 1=0.0,1.0 min/max 2=0.0,1.0 max abs diff=0.0000000000

The above ctests were run in the /work fileset.

The hafs tests were also run in /work2. hafs_3denvar_hybens generated identical output files for loproc_updat and hiproc_updat. Output file fv3_tracer differs between the loproc and hiproc runs of hafs_4denvar_glbens. The difference is in o3mr

xaxis_1 min/max 1=1.0,720.0 min/max 2=1.0,720.0 max abs diff=0.0000000000
yaxis_1 min/max 1=1.0,540.0 min/max 2=1.0,540.0 max abs diff=0.0000000000
zaxis_1 min/max 1=1.0,65.0 min/max 2=1.0,65.0 max abs diff=0.0000000000
Time min/max 1=1.0,1.0 min/max 2=1.0,1.0 max abs diff=0.0000000000
sphum min/max 1=0.0,0.022629073 min/max 2=0.0,0.022629073 max abs diff=0.0000000000
liq_wat min/max 1=-1.27048754e-20,0.001674175 min/max 2=-1.27048754e-20,0.001674175 max abs diff=0.0000000000
rainwat min/max 1=-6.4782625e-19,0.0052170595 min/max 2=-6.4782625e-19,0.0052170595 max abs diff=0.0000000000
ice_wat min/max 1=-9.211089e-21,0.0016910682 min/max 2=-9.211089e-21,0.0016910682 max abs diff=0.0000000000
snowwat min/max 1=-4.735845e-20,0.0022513575 min/max 2=-4.735845e-20,0.0022513575 max abs diff=0.0000000000
graupel min/max 1=-8.9380465e-20,0.01118745 min/max 2=-8.9380465e-20,0.01118745 max abs diff=0.0000000000
o3mr min/max 1=1.4650413e-08,1.6894655e-05 min/max 2=1.4650413e-08,1.6894655e-05 max abs diff=0.0000000318
sgs_tke min/max 1=-1.38765404e-17,37.100296 min/max 2=-1.38765404e-17,37.100296 max abs diff=0.0000000000
cld_amt min/max 1=0.0,1.0 min/max 2=0.0,1.0 max abs diff=0.0000000000

How does program execution differ between hafs_3denvar_hybens and hafs_4denvar_glbens with regards to the writing of output file fv3_tracer?

@TingLei-NOAA (Contributor)

@RussTreadon-NOAA Thanks. For the hafs issue on the differences between loproc and hiproc, @yonghuiweng had some updates on #697.

@RussTreadon-NOAA (Contributor)

OK, so what's the path forward to get the hafs ctests consistently passing on Hercules? Will updates be committed to TingLei-daprediction:feature/fv3reg_parallel_io_upgrade in the near future?

@TingLei-NOAA (Contributor)

@RussTreadon-NOAA There will be a Google meeting on that issue among Yonghui, Bin, and me to clarify the current findings and see how to proceed from our point of view. You will definitely be updated on it, and all of us will see how to proceed. Let me know if you have any suggestions for now.

@RussTreadon-NOAA (Contributor)

Sounds good. We want to get to the bottom of this sooner rather than later.

This is an odd problem. The hafs tests pass on WCOSS2, Hera, and Orion. Orion and Hercules use the same filesets. My concern is that the Hercules issue is somehow related to Rocky-8, module versions, and/or the installation of supporting libraries on Hercules. Will we see similar problems when Hera and Orion are updated to Rocky-8 in April?

@edwardhartnett

All, for future reference, there is an internal logging capability in netcdf-c which might help with these kinds of problems. For parallel programs, the netcdf-c library generates a log file for each processor, with detailed information about what netCDF functions are called. Logging needs to be turned on in netcdf-c for this to work, but we can arrange that.

It's documented here: https://github.com/Unidata/netcdf-c/blob/main/docs/logging.md

@TingLei-NOAA (Contributor)

All regression tests passed on Orion, with the "Failure time-thresh" result for rrfs_3denvar_glbens ignored.
On Hercules, only hafs_4denvar_glbens failed, because the control runs (both loproc and hiproc) produced different fv3_dynvars, as investigated in #697. So I would suggest that this test failure be disregarded for the purposes of this PR.

@DavidHuber-NOAA (Collaborator)

@TingLei-daprediction Thanks for updating the job CPU/node counts. Can you sync your branch with develop? Once that is done, I will restart testing on Jet.

@TingLei-NOAA (Contributor)

@DavidHuber-NOAA Thanks! The sync has been done.

@DavidHuber-NOAA (Collaborator) left a review comment

Changes look good and regression tests pass on Jet. Approve.

@ShunLiu-NOAA (Contributor)

@JingCheng-NOAA and @XuLu-NOAA If you are available, please review Ting's PR again. Thank you.

@JingCheng-NOAA (Contributor)

I've tested Ting's latest update again on Hercules. hafs_4denvar_glbens failed as expected, which should not affect this PR. But global_4denvar also failed, due to an access issue with the prepbufr file /work/noaa/da/rtreadon/CASES/regtest/gfs/prod/gdas.20240223/00/obs/gdas.t00z.prepbufr used by this case.
All other test cases passed.

@RussTreadon-NOAA (Contributor)

gdas.t00z.prepbufr is a restricted access file. Users must belong to the rstprod group to access this file. The following files in /work/noaa/da/rtreadon/CASES/regtest/gfs/prod/gdas.20240223/00/obs/ are rstprod files

-rw-r----- 1 rtreadon rstprod   18764472 Feb 23 05:51 gdas.t00z.adpsfc.tm00.bufr_d
-rw-r----- 1 rtreadon rstprod   19164416 Feb 23 05:50 gdas.t00z.aircar.tm00.bufr_d
-rw-r----- 1 rtreadon rstprod    3667912 Feb 23 05:50 gdas.t00z.aircft.tm00.bufr_d
-rw-r----- 1 rtreadon rstprod     287840 Feb 23 05:50 gdas.t00z.gpsipw.tm00.bufr_d
-rw-r----- 1 rtreadon rstprod   44055880 Feb 23 05:52 gdas.t00z.gpsro.tm00.bufr_d
-rw-r----- 1 rtreadon rstprod   11669280 Feb 23 06:02 gdas.t00z.nsstbufr
-rw-r----- 1 rtreadon rstprod  101586864 Feb 23 06:02 gdas.t00z.prepbufr
-rw-r----- 1 rtreadon rstprod   16949456 Feb 23 06:02 gdas.t00z.prepbufr.acft_profiles
-rw-r----- 1 rtreadon rstprod          0 Feb 23 05:53 gdas.t00z.saphir.tm00.bufr_d
-rw-r----- 1 rtreadon rstprod    9692304 Feb 23 05:50 gdas.t00z.sfcshp.tm00.bufr_d

@JingCheng-NOAA (Contributor)

> @JingCheng-NOAA and @guoqing-noaa Could you please let me know whether you are available to review this PR?

I've been doing the regression test on Hercules since Friday. But so far, it seems like something is not set up correctly. I am double-checking to see which part went wrong.

Test project /work2/noaa/hwrf/save/jcheng/GSI/GSItest/build
    Start 1: global_4denvar
1/7 Test #1: global_4denvar ...................   Passed  1326.62 sec
    Start 2: rtma
2/7 Test #2: rtma .............................   Passed   965.50 sec
    Start 3: rrfs_3denvar_glbens
3/7 Test #3: rrfs_3denvar_glbens ..............***Failed   425.44 sec
    Start 4: netcdf_fv3_regional
4/7 Test #4: netcdf_fv3_regional ..............***Timeout 86400.06 sec
    Start 5: hafs_4denvar_glbens
5/7 Test #5: hafs_4denvar_glbens ..............***Timeout 86400.04 sec
    Start 6: hafs_3denvar_hybens

Then, has the link to the prepbufr data in the regression test changed? Two weeks ago, the global_4denvar case passed in my tests on Hercules.
Other than this issue, I have no more questions about this PR.

@RussTreadon-NOAA (Contributor)

The case for the global ctests was recently updated to bring in GMI data. The previous global case did NOT properly restrict rstprod data.

@ShunLiu-NOAA (Contributor)

@JingCheng-NOAA Thank you, Jing, for the test. Since Ting has already completed the regression tests on WCOSS2 and Orion, and @DavidHuber-NOAA has completed the tests on Jet, we may consider merging this PR to develop.

@TingLei-NOAA (Contributor) commented Mar 20, 2024

@DavidHuber-NOAA @XuLu-NOAA @JingCheng-NOAA, thanks for your help as reviewers.

@ShunLiu-NOAA merged commit 2167bc9 into NOAA-EMC:develop on Mar 20, 2024
4 checks passed
@TingLei-daprediction (Contributor, Author)

@ShunLiu-NOAA Thanks!

Development

Successfully merging this pull request may close these issues.

A fix needed for mismatch dimensions of parameters in netcdf IO for fv3reg GSI