-
Notifications
You must be signed in to change notification settings - Fork 119
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[develop] Improvements for WE2E tests: script features, additional tests, remove unsupported domains #871
[develop] Improvements for WE2E tests: script features, additional tests, remove unsupported domains #871
Conversation
…etatasks that do not exist or have been removed by other checks.
- Remove extraneous empty metatask dependencies from MET_verification_only_vx test now that these are handled in setup.py - Add ability to specify a subdirectory of test_configs to run all tests in it - Move custom grid yamls to separate directory
non-config files are skipped in 'all' and directory tests
…is most expensive at ~60 core hours, rest are 30 or less.
…t can take the values "python", "cron", or "none" - Change names in argparse section to shorten lines
- Update custom grids for more appropriate physics suites - Speed up Peru 12km case with more nodes and shorter forecast time - Distribute new custom domains to comprehensive and coverage tests
symlink - Unless running in debug mode, call "set_template" with "quiet" flag
for more verbose output - Also add debug flag for set_FV3nml_sfc_climo_filenames.py - Convert most print messages to debug-only for cleaner output - Do not read VERBOSE flag for generate script; this flag should only affect experiments, not initial scripts
…oes not have HPSS access)
…that will only run rocotorun once for each experiment rather than continuously monitoring
20fdb32
to
2b9ac58
Compare
567c5ca
to
7567188
Compare
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot | ||
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16 | ||
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_HRRR | ||
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta | ||
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_HRRR_suite_HRRR | ||
#nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_GFS_v16 | ||
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_GFS_v16 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mkavulich Should we remove the "nco_" prefix from test names now that tests can be run in either NCO or community mode without having to specify the mode in the experiment config file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think including "nco" in these test names is still necessary, since they are explicitly run in NCO mode unless "community" is specified on the command line (or if it's a a community-mode coverage test). Now, whether or not these tests need to still exist is another question; it doesn't look like any of them are unique, so we could eliminate them altogether so long as all those capabilities are tested as part of the Hera NCO coverage suite. I think it's a good idea to remove the "nco_" tests entirely, so if you agree then I'll go ahead and do that. @MichaelLueken should probably approve of that plan as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I take back one minor point: there is one thing that is only tested in the "nco_" tests, and that is using pre-staged grid/orog/sfc_climo data. I can't remember, is that something that should be specific to NCO tasks, or can we roll that into another test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about it this morning, I think we should defer this topic to a different PR, especially since @MichaelLueken is on leave this week. I opened an issue #895 to discuss this, feel free to chime in there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mkavulich A couple of things:
- I see 4 tests that start with "nco_" under
ufs-srweather-app/tests/WE2E/test_configs/grids_extrn_mdls_suites_nco
. I started checking if there are counterparts of these undergrids_extrn_mdls_suites_community
to see if the "nco_" tests can be eliminated but decided to not finish because it's a pretty time consuming task (at least a couple of hours). I made an issue at Remove any WE2E tests for NCO mode (names starting with "nco_") that test features that are already tested in community mode #896. Feel free to comment there if I didn't capture the issue correctly. - Use of pre-staged grid/orog/sfc_climo data is not specific to NCO mode, but I'd say it is more important for NCO mode because (at least as of a couple of years ago) EMC (who always seem to run in NCO mode) did not want to have the
make_[grid|orog|sfc_climo]
tasks as part of their workflow, i.e. they always used pre-staged data. That's why NCO mode always uses pre-staged files (or at least used to; not sure now). But these are also important in community mode because once community-mode users get their grid/orog/sfc_climo files set up, they tend to want to skip those tasks in later experiments.
help='Explicitly set EXPT_BASEDIR for all experiments') | ||
ap.add_argument('--exec_subdir', type=str, | ||
help='Explicitly set EXEC_SUBDIR for all experiments') | ||
ap.add_argument('--use_cron_to_relaunch', action='store_true', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mkavulich Seeing the cron option reminded me of a bug that I've run into with the run_WE2E_tests.py
script, which is that when the user's cron table contains a commented line (job) that is exactly identical to the line that this script would otherwise insert into the table, the script doesn't insert the line (so the experiment doesn't get (re)launched). I guess it's because the script doesn't check that the existing line is commented out and so thinks the line is already in the table. Can you run a quick test to see if that is still happening? Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mkavulich Just realized the bug (if it still exists) might be in the script get_crontab_contents.py
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gsketefian Just included a fix for that bug as well, feel free to test it out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mkavulich Ok, thanks. I'll trust you that it works!
@@ -0,0 +1,26 @@ | |||
metadata: | |||
description: |- | |||
This test checks the capability of the workflow to retrieve from NOAA |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mkavulich Please add in the description that this test is also to check tha the weather model can perform a relatively long forecast.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
…if the command already exists but is commented, we still want to add it
@mkavulich - The Jenkins tests failed on August 26 due to an issue with Orion and the
I'm fairly certain that a rerun would allow the test to successfully run without issue. Before rerunning, I've been told that the EPIC role account now has rstprod permission. Would you be okay with uncommenting the @mkavulich and @gsketefian - I agree with @mkavulich, it is certainly a good idea to trim/remove the NCO WE2E tests. However, I don't feel that this should be done in this PR. |
@mkavulich - It also sounds like the various RDHPCS machines are all using the EPIC role accounts. Would you be able to change the |
@MichaelLueken I have made those suggested changes, but they are not appearing here. According to https://www.githubstatus.com/, PRs are currently "degraded" so it seems like there's a problem on Github's end that's preventing the PR from being updated. Hopefully it's resolved soon and you can re-run the tests. |
@mkavulich - The changes look good and I have resubmitted the Jenkins tests. Once the retests are complete, this work should be ready to be merged. Thanks! |
@mkavulich - The reruns of the Hera Intel tests are still seeing failures in the new The good news is that the tests successfully passed on Gaea and Orion. Additionally, outside of the tests noted above, all other tests have successfully passed. |
@MichaelLueken That is so strange, I've run the |
@mkavulich - Unfortunately, I'm not aware of the reason for the last failures of the test (though, it is likely due once again to the weird I've also cancelled the Jet testing, since HPSS is down for maintenance as well and the |
@MichaelLueken Ahh, I somehow forgot that the Intel tests are the ones run in NCO mode. So then this is another fleeting symptom of #652? |
@mkavulich - Yes, I suspect that this is a continuation of what was being encountered with Issue #652. Now, however, instead of mislinking COM directories, it is now leading to these weird NetCDF failures. |
I was able to successfully run the Hera Intel tests manually this morning:
I had to use rocotorewind/boot on the I'm now waiting to see if the |
The
The error is: Please comment out the two verification tests until this issue can be sorted. Thanks! |
@mkavulich - I have sent a request to the RDHPCS HPSS Helpdesk to ask that the EPIC role account be granted rstprod permissions on HPSS. Hopefully they are able to add this permission quickly, then I can submit the Hera GNU tests to make sure that the tests that require HPSS data are able to successfully pull the data from HPSS. |
@mkavulich - Good news! Skylar has added the EPIC role account to the rstprod group on HPSS. Once the current Jet tests finish, then the Hera GNU tests will run. If the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the file name has .com extension?.. Was the intent to have *.nco file with the list of tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@natalie-perlin - The .com extension is to ensure that all tests in the file run in the community environment. Within this file, there is an nco test, nco_grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
. When this test is launched, the nco environment is overwritten so that it will run in community mode.
This is similar to the coverage.hera.intel.nco
file. All of the tests in this file are normally run in community mode, but the .nco extension ensures that the tests use the nco environment instead.
There are no plans to include either a coverage.hera.intel.com
or coverage.hera.gnu.nco
file.
The Jenkins automated
Now moving forward with merging this work. |
DESCRIPTION OF CHANGES:
This represents the final round of WE2E test improvements related to Issue #587, including new tests, additional features, script improvements, and removal of unsupported domains.
Script improvements
run_WE2E_tests.py
--use_cron_to_relaunch
with a--launch
argument that can take the values "python", "cron", or "none" (the last of which will create the experiments but not run them)monitor_jobs.py
--mode
flag, which if specified as "advance" will only runrocotorun
once for each experiment, then quit.generate_FV3LAM_wflow.py
andset_FV3nml_sfc_climo_filenames.py
--debug
flag that will give more verbose output. No longer read "VERBOSE" variable from config.yaml for this script.get_crontab_contents.py
New tests
custom_grids
directory. The new custom domains were chosen to span a variety of locations, terrain types, and dates. Aside fromcustom_ESGgrid_Great_Lakes_snow_8km
, these are basic tests that don't use many non-default settings, and so are good candidates for testing new SRW capabilities in the future. See the "Documentation" section for more details.Additional changes
custom_ESGgrid_Great_Lakes_snow_8km
(introduced in [develop] Consolidate verification tasks using retrieve_data.py #864) moved from verification to custom_tests, with a symbolic link (test_configs/verification/config.MET_verification_winter_wx.yaml
) left behind. Also the ICs/LBCs are now from RAP output, retrieved from HPSS (since the observation data requires HPSS access anyway this is fine)Type of change
--use_cron_to_relaunch
argument torun_WE2E_tests.py
(gives user a verbose message about replacement flag)TESTS CONDUCTED:
The new script behavior was tested for each of the new flags, ensured that they behaved as expected.
DEPENDENCIES:
This branch currently contains changes from #859 and #864, and so should not be merged until those PRs are approved.Merged!DOCUMENTATION:
Updated command-line arguments were documented within the scripts. Additional documentation for the updates was provided as part of PR #880.
Here are some details/sample plots for the new custom domains:
custom_ESGgrid_Central_Asia_3km
: A 6-hour forecast starting 2019070100 over central Asia, including much of the Tibetan Plateau and Tarim Basin; this area was chosen for its extreme topography and climate range, with no ocean points. Uses staged FV3GFS nemsio data.custom_ESGgrid_Great_Lakes_snow_8km
: A 6-hour forecast starting 2023021700 over the North American Great Lakes region. Uses RAP netCDF data retrieved from HPSS, and tests a non-standard resolution (8 km). Tests deterministic verification for snow accumulation (as well as other variables), retrieving observations from HPSS (therefore can only be run on Jet and Hera).custom_ESGgrid_IndianOcean_6km
: A 12-hour forecast starting 2019061518 over the central Indian Ocean; this area was chosen for its lack of land points and as a southern hemisphere test. Uses staged FV3GFS grib2 data, and tests a non-standard resolution (6 km).custom_ESGgrid_NewZealand_3km
: A 6-hour forecast starting 2019061518 over New Zealand; this area was chosen for its extreme topography and mix of land/ocean points, as well as spanning the international date line. Uses staged FV3GFS grib2 data.custom_ESGgrid_Peru_12km
: A 12-hour forecast starting 2019061500 over western South America, centered over Peru; this area was chosen for its extreme topography and mix of land/ocean/lake points, as well as spanning the equator. Uses staged FV3GFS grib2 data, and tests a non-standard resolution (12 km).custom_ESGgrid_SF_1p1km
: A 6-hour forecast starting 2019061500 over central coastal California, centered over the San Francisco Bay area. Uses staged FV3GFS grib2 data, and tests a non-standard resolution (1.1 km).ISSUE:
CHECKLIST
CONTRIBUTORS:
Thanks to @gsketefian for providing some custom grids for new tests