[develop] Update wrapper scripts #883

Conversation

EdwardSnyder-NOAA
Collaborator

@EdwardSnyder-NOAA EdwardSnyder-NOAA commented Aug 14, 2023

DESCRIPTION OF CHANGES:

The wrapper scripts run the tasks of an SRW experiment by hand. They are meant for situations where Rocoto is not installed or not available. The scripts had become outdated and no longer worked in their current state. This PR was created to address this issue.

These are the steps I used to test the wrapper scripts:

  • Copy the config.default.yaml to config.yaml. Update the machine and account variables and add the changes noted by (**) below:
task_get_extrn_ics:
  EXTRN_MDL_NAME_ICS: FV3GFS
  FV3GFS_FILE_FMT_ICS: grib2
  **USE_USER_STAGED_EXTRN_FILES: true
  **EXTRN_MDL_SOURCE_BASEDIR_ICS: /path/to/UFS_SRW_data/develop/input_model_data/FV3GFS/grib2/2019061518
task_get_extrn_lbcs:
  EXTRN_MDL_NAME_LBCS: FV3GFS
  LBC_SPEC_INTVL_HRS: 6
  FV3GFS_FILE_FMT_LBCS: grib2
  **USE_USER_STAGED_EXTRN_FILES: true
  **EXTRN_MDL_SOURCE_BASEDIR_LBCS: /path/to/UFS_SRW_data/develop/input_model_data/FV3GFS/grib2/2019061518
  • Create the experiment by running the generate script
  • salloc or qsub a compute node
  • cd into the experiment and run: export EXPTDIR=$PWD
  • On Cheyenne, Gaea, and NOAA Cloud, export nprocs. Use the nprocs value for the run_fcst task, which can be found in FV3LAM_wflow.xml in the experiment directory. On the cloud, the command used was: export nprocs=24
  • module use /path/to/ufs-srweather-app/modulefiles
  • module load build and wflow files
  • Run conda activate workflow_tools (note: on Hera, Python wasn't set up properly, so I had to run conda deactivate and then rerun the activate command)
  • Run scripts by hand:
    run_make_grid
    run_get_ics
    run_get_lbcs
    run_make_orog
    run_make_sfc_climo
    run_make_ics
    run_make_lbcs
    run_fcst
    run_post
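
The manual sequence above could be scripted roughly as follows. This is only a sketch: the `run_wrappers` function, its directory argument, and the `.sh` suffix on each wrapper are assumptions made for illustration and are not confirmed by this PR. Only the task order is taken from the list above.

```shell
# Task order copied from the list in the PR description.
WRAPPER_TASKS="run_make_grid run_get_ics run_get_lbcs run_make_orog run_make_sfc_climo run_make_ics run_make_lbcs run_fcst run_post"

# run_wrappers DIR
# Runs each wrapper found in DIR in order and stops at the first failure.
# Assumption (hypothetical): each wrapper is an executable named <task>.sh.
run_wrappers() {
  wrapper_dir="$1"
  for task in ${WRAPPER_TASKS}; do
    echo "running ${task}"
    if ! "${wrapper_dir}/${task}.sh"; then
      echo "${task} failed; stopping" >&2
      return 1
    fi
  done
}
```

It would be called from the compute node, after EXPTDIR has been exported, as `run_wrappers /path/to/wrappers`. Stopping at the first failure matters here because each later task depends on the output of the earlier ones.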

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

TESTS CONDUCTED:

  • hera.intel path to expt dir: /scratch1/NCEPDEV/nems/Edward.Snyder/wrapper-test/expt_dirs/test_community
  • orion.intel path to expt dir: /work/noaa/epic/esnyder/wrapper-test/expt_dirs/test_community
  • cheyenne.intel
  • cheyenne.gnu
  • gaea.intel
  • jet.intel
  • wcoss2.intel
  • NOAA Cloud (indicate which platform) (gcp) path to expt dir: /contrib/Edward.Snyder/wrapper-test/expt_dirs/test_community
  • Jenkins
  • fundamental test suite
  • comprehensive tests (specify which if a subset was used)

Ran the ci/cd script on Hera and all the NOAA Cloud platforms.

DEPENDENCIES:

Removed a step from the wrapper script workflow documentation as these variables are included in the wrapper scripts now.

ISSUE:

This PR addresses issue #878.

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • I have commented my code, particularly in hard-to-understand areas
  • My changes need updates to the documentation. I have made corresponding changes to the documentation
  • My changes do not require updates to the documentation (explain).
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

LABELS (optional):

A Code Manager needs to add the following labels to this PR:

  • Work In Progress
  • bug
  • enhancement
  • documentation
  • release
  • high priority
  • run_ci
  • run_we2e_fundamental_tests
  • run_we2e_comprehensive_tests
  • Needs Cheyenne test
  • Needs Jet test
  • Needs Hera test
  • Needs Orion test
  • help wanted

CONTRIBUTORS (optional):

@EdwardSnyder-NOAA EdwardSnyder-NOAA marked this pull request as ready for review August 16, 2023 21:59
@MichaelLueken MichaelLueken added the bug (Something isn't working) and run_we2e_coverage_tests (Run the coverage set of SRW end-to-end tests) labels Aug 17, 2023
@MichaelLueken MichaelLueken changed the title Update wrapper scripts [develop] Update wrapper scripts Aug 17, 2023
@MichaelLueken
Collaborator

The srw_ftest.sh (Functional Workflow Task Tests) stage failed on all platforms. The tests failed in run_make_sfc_climo. I'm seeing the following error message on Hera Intel and Hera GNU:

fs-srweather-app_pipeline_PR-883/scripts/exregional_make_sfc_climo.sh: line 65: ulimit: stack size: cannot modify limit: Operation not permitted

The error message on Gaea is:

srun: error: eio_message_socket_accept: slurm_receive_msg[192.188.178.202:46054]: Connection reset by peer
srun: error: _accept_msg_connection[192.188.178.202:53232]: Connection reset by peer
srun: error: Unable to allocate resources: Connection reset by peer

It is unclear to me what is happening here. The error message on Jet is:

srun: error: required parameter -A <account> not specified
srun: error: Unable to allocate resources: Invalid account or account/partition combination specified

The account is being set as:
ACCOUNT=no_account
in the srw_ftest.sh script. This will be problematic for tasks that need an account to run. It looks like:
[[ -n ${ACCOUNT} ]] || ACCOUNT="no_account"
will need to be replaced with:
[[ -n ${SRW_PROJECT} ]] || SRW_PROJECT="no_account"
in order to correct this. The Orion test failed with the following error message:

srun: Force Terminated job 14696707
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 14696707.0 ON Orion-03-61 CANCELLED AT 2023-08-17T09:08:55 DUE TO TIME LIMIT ***

It looks like additional resources will be required on Orion.
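
The account default suggested above for Jet can be sketched as follows. The function names `resolve_account` and `require_real_account` are hypothetical, and the early check against the placeholder is an added assumption, not something this PR is known to implement. Only `SRW_PROJECT` and the `no_account` placeholder come from the comment above.

```shell
# resolve_account
# Prints SRW_PROJECT when it is set and non-empty, otherwise the placeholder.
resolve_account() {
  echo "${SRW_PROJECT:-no_account}"
}

# require_real_account ACCOUNT
# Hypothetical guard: fails early when only the placeholder is available,
# instead of letting srun reject the job later.
require_real_account() {
  if [ "$1" = "no_account" ]; then
    echo "no scheduler account set; export SRW_PROJECT first" >&2
    return 1
  fi
}
```

A script could then run `ACCOUNT=$(resolve_account)` followed by `require_real_account "${ACCOUNT}" || exit 1` before submitting any task that needs an account.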

@MichaelLueken
Collaborator

@EdwardSnyder-NOAA - I'll go ahead and relaunch the Jenkins test now and see if the tests progress following this latest update.

@MichaelLueken
Collaborator

@EdwardSnyder-NOAA - The tests have failed on Hera. Looking at /scratch1/NCEPDEV/stmp2/role.epic/jenkins/workspace/fs-srweather-app_pipeline_PR-883/scripts/exregional_make_sfc_climo.sh, line 65, I see the culprit. This script will need to replace:

ulimit -s unlimited

with:

eval ${PRE_TASK_CMDS}

Without this change, the scripts will never work on Hera.
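
A minimal sketch of the idea behind that replacement: rather than hard-coding a command that some machines forbid, the script evaluates whatever the platform configuration supplies. The `run_pre_task_cmds` function name is hypothetical, and the empty-variable guard is an assumption added for illustration. Only `PRE_TASK_CMDS` and the `eval` call come from the comment above.

```shell
# run_pre_task_cmds
# Evaluates the commands stored in PRE_TASK_CMDS, if any.
# An unset or empty variable means "nothing to do" (assumed behavior).
run_pre_task_cmds() {
  if [ -n "${PRE_TASK_CMDS:-}" ]; then
    eval "${PRE_TASK_CMDS}"
  fi
}
```

On a machine that permits raising the stack limit, `PRE_TASK_CMDS` might hold `ulimit -s unlimited`. On a machine that does not, it can be left empty so the task script never attempts the change.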

@MichaelLueken
Collaborator

@EdwardSnyder-NOAA - The functional workflow task tests are still failing on all machines.

The current failure and probable fix for Hera has been documented here.

The issue on Gaea appears to be that the make_sfc_climo task is being run in interactive QOS on the es cluster. The test should use the windfall QOS on the c4 cluster. It isn't clear to me why this is happening on Gaea, but I suspect that this is the cause of the current failure on the machine.

Jet is continuing to claim that the account or account/partition isn't correct:

srun: error: required parameter -A <account> not specified
srun: error: Unable to allocate resources: Invalid account or account/partition combination specified

This also doesn't make sense, as the ACCOUNT is being set to epic (i.e., ACCOUNT=epic is in the logfiles, ACCOUNT: epic is in config.yaml and rocoto_vars.yaml, and ACCOUNT "epic" is in the XML file).

Orion is still failing due to hitting the walltime limit in make_sfc_climo.

@EdwardSnyder-NOAA
Collaborator Author

Thank you for looking into these failures, @MichaelLueken! I'm pretty sure all these failures are related to the fact that srw_ftest.sh runs on a head node and not a compute node. When running this script by hand, I was on a compute node. I'll be adding a wrapper pbs/sbatch job card so we can run this script on a compute node. I'll let you know when I'm done testing it out.
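
One way a wrapper could choose between a PBS and a Slurm job card is sketched below. The `submit_cmd_for` function and the lowercase platform names are hypothetical, and the grouping of platforms by scheduler is an assumption for illustration, not taken from wrapper_srw_ftest.sh.

```shell
# submit_cmd_for PLATFORM
# Prints the batch submission command assumed for PLATFORM.
# Platform-to-scheduler grouping is an illustrative assumption.
submit_cmd_for() {
  case "$1" in
    cheyenne|derecho)
      echo "qsub" ;;     # PBS
    hera|jet|orion|gaea)
      echo "sbatch" ;;   # Slurm
    *)
      echo "unknown platform: $1" >&2
      return 1 ;;
  esac
}
```

Failing on an unrecognized platform is deliberate in this sketch: silently falling back to the login node is the behavior the job card is meant to avoid.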

@MichaelLueken
Collaborator

@EdwardSnyder-NOAA - The Functional Workflow Task Tests failed in the make_sfc_climo task on all machines.

On Gaea, the test is still attempting to launch an interactive job on the login node, rather than the compute node.

On Hera, the exregional_make_sfc_climo.sh script is still setting ulimit -s unlimited, which is causing the failures there.

On Jet, the account is still an issue:

srun: error: required parameter -A <account> not specified
srun: error: Unable to allocate resources: Invalid account or account/partition combination specified

On Orion, the walltime limit is still being hit:

slurmstepd: error: *** STEP 14714277.0 ON Orion-08-48 CANCELLED AT 2023-08-18T16:59:37 DUE TO TIME LIMIT ***

@MichaelLueken
Collaborator

@EdwardSnyder-NOAA - I think I see the issue. While you have introduced the new .cicd/scripts/wrapper_srw_ftest.sh script, the .cicd/Jenkinsfile is still pointing to .cicd/scripts/srw_ftest.sh, so it is still only attempting to run on the login node, not the compute nodes. If you replace line 154 of the Jenkinsfile with:

sh 'bash --login "${WORKSPACE}/.cicd/scripts/wrapper_srw_ftest.sh"'

the Jenkins tests should properly use compute nodes.

@EdwardSnyder-NOAA
Collaborator Author

@MichaelLueken - thanks for catching that! I got the wrapper_srw_ftest script to pass on all platforms except for Cheyenne/Derecho. I added those changes, so it should pass on all platforms except for Cheyenne/Derecho.

@MichaelLueken
Collaborator

@EdwardSnyder-NOAA -

It looks like wrapper_srw_ftest.sh will need to have some safeguards added to it. This new script is launched successfully in the Jenkins pipeline, but the pipeline automatically moves on to the Test phase without waiting for the Functional Workflow Task Tests to finish. It appears that this is causing issues with the Test phase, since the Functional Workflow Task Tests phase is still attempting to run. Due to this interaction, the Test phase is failing for all machines. Please see:

https://jenkins.epic.oarcloud.noaa.gov/blue/organizations/jenkins/ufs-srweather-app%2Fpipeline/detail/PR-883/4/pipeline

for the latest test run.
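
The kind of safeguard described above could look like the polling loop below, which blocks until a submitted job is no longer active. This is a sketch under stated assumptions: `wait_until_done` is a hypothetical helper, the check command passed to it is supplied by the caller, and nothing here is taken from the actual wrapper_srw_ftest.sh.

```shell
# wait_until_done MAX_TRIES INTERVAL CMD [ARGS...]
# Runs CMD repeatedly while it succeeds (meaning "job still active").
# Returns 0 once CMD fails, or 1 after MAX_TRIES checks that all succeeded.
wait_until_done() {
  max_tries="$1"
  interval="$2"
  shift 2
  tries=0
  while "$@"; do
    tries=$((tries + 1))
    if [ "${tries}" -ge "${max_tries}" ]; then
      echo "gave up after ${tries} checks" >&2
      return 1
    fi
    sleep "${interval}"
  done
  return 0
}
```

The check command would be a small function that queries the scheduler for the submitted job ID and returns success while the job is queued or running. The retry cap keeps the pipeline stage from hanging indefinitely if the job never leaves the queue.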

@MichaelLueken
Collaborator

@EdwardSnyder-NOAA - I have resubmitted the Jenkins tests. Unfortunately, Orion is down for maintenance, but the updated wrapper_srw_ftest.sh should pass for the rest of the machines and the Tests stage should run without issue. When Orion comes back up tomorrow, I'll be able to submit the test there. I'll let you know if I see any failures (other than Orion, which had to be manually killed).

@EdwardSnyder-NOAA
Collaborator Author

Thanks for the update, @MichaelLueken! It looks like the srw_ftest script passed on all platforms. The Test phase is still running on Jet, though. My most recent changes allow this script to run on Derecho.

@MichaelLueken MichaelLueken linked an issue Aug 24, 2023 that may be closed by this pull request
Collaborator

@MichaelLueken MichaelLueken left a comment

@EdwardSnyder-NOAA - I submitted the Jenkins tests on Orion this morning and the Functional Workflow Task Tests successfully passed. Approving this work now.

@MichaelLueken
Collaborator

The WE2E coverage tests were manually run on Jet and all tests successfully passed:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
community                                                          COMPLETE              15.93
custom_ESGgrid                                                     COMPLETE              19.18
custom_GFDLgrid                                                    COMPLETE              13.75
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2021032018         COMPLETE              19.60
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_netcdf_2022060112_48h     COMPLETE              55.99
get_from_HPSS_ics_RAP_lbcs_RAP                                     COMPLETE              19.78
grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR                 COMPLETE             246.77
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot     COMPLETE              46.54
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2        COMPLETE               9.24
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta       COMPLETE             579.88
nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR       COMPLETE               8.86
process_obs                                                        COMPLETE               0.54
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            1036.06

Moving forward with the merge now.

Labels
bug Something isn't working run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Update wrapper scripts to work on all supported systems
3 participants