GSI changes to read and assimilate IASI-NG #805

Open: 27 commits to merge into base branch develop

wx20jjung
Contributor

Description

Metop-SG-A1 will be launched next year. There are several new instruments on this satellite. This pull request addresses some of the changes necessary to read and use IASI-NG data in the GSI and, ultimately, the global-workflow.

This code adds the read routine for IASI-NG (read_iasing.f90) and adds logic throughout the GSI to assimilate these data. The code is set up to use data from the standard operational feed and both direct broadcast feeds. The cloud and aerosol detection software (CADS) is also set up and can be turned on or off with a flag, similar to the current IASI and CrIS instruments.

No dependencies should be needed to incorporate these changes. Using the IASI-NG data, however, has several prerequisites: connecting to the data, providing the various CRTM coefficient files, and choosing whether to use CADS.

This pull request addresses issue #804.

Type of change

  • [x] New feature (non-breaking change which adds functionality)

How Has This Been Tested?
I have tested these changes on about 6 hours of proxy IASI-NG data using a modified version of the global-workflow, including the CRTM with IASI-NG coefficient files. I also modified the satinfo and scaninfo files to monitor up to 10,000 channels as well as a potential 500 channel subset. This was conducted on S4 and Hera. These changes do not affect the analysis when the IASI-NG data are not available.

The ctests were conducted on Hera with reproducible results.

The Hera tests:
The global_enkf failed due to time limits on the first try. It passed on the second try.

[Jim.Jung@hfe10 build]$ ctest -j 6
Test project /scratch1/NCEPDEV/jcsda/Jim.Jung/save/ctests/update/build
Start 1: global_4denvar
Start 2: rtma
Start 3: rrfs_3denvar_rdasens
Start 4: hafs_4denvar_glbens
Start 5: hafs_3denvar_hybens
Start 6: global_enkf
1/6 Test #3: rrfs_3denvar_rdasens ............. Passed 1526.62 sec
2/6 Test #5: hafs_3denvar_hybens .............. Passed 2256.26 sec
3/6 Test #4: hafs_4denvar_glbens .............. Passed 2323.86 sec
4/6 Test #2: rtma ............................. Passed 2414.80 sec
5/6 Test #6: global_enkf ......................***Failed 2446.27 sec
6/6 Test #1: global_4denvar ................... Passed 2985.02 sec

83% tests passed, 1 tests failed out of 6

Total Test time (real) = 2985.07 sec

The following tests FAILED:
6 - global_enkf (Failed)
Errors while running CTest
Output from these tests are in: /scratch1/NCEPDEV/jcsda/Jim.Jung/save/ctests/update/build/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.
[Jim.Jung@hfe10 build]$ ctest -R global_enkf
Test project /scratch1/NCEPDEV/jcsda/Jim.Jung/save/ctests/update/build
Start 6: global_enkf
1/1 Test #6: global_enkf ...................... Passed 1234.16 sec

100% tests passed, 0 tests failed out of 1

Total Test time (real) = 1234.18 sec
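
For readers who want to pull the failed tests out of a ctest log like the ones above, here is a minimal sketch. It assumes only the result-line layout visible in this PR ("N/M Test #K: name ..... Passed|***Failed T sec"); the function names `parse_ctest_results` and `failed_tests` are hypothetical helpers, not part of GSI or CTest.

```python
import re

# Matches ctest result lines as they appear in this PR, e.g.
# "5/6 Test #6: global_enkf ......................***Failed 2446.27 sec"
RESULT_RE = re.compile(
    r"^\s*\d+/\d+\s+Test\s+#(\d+):\s+(\S+)\s+\.+\s*"
    r"(\*\*\*Failed|Passed)\s+([\d.]+)\s+sec"
)


def parse_ctest_results(text):
    """Return (test_number, name, passed, seconds) for each result line."""
    results = []
    for line in text.splitlines():
        match = RESULT_RE.match(line)
        if match:
            number, name, status, seconds = match.groups()
            results.append((int(number), name, status == "Passed", float(seconds)))
    return results


def failed_tests(text):
    """Return the names of tests that did not pass, in log order."""
    return [name for _, name, passed, _ in parse_ctest_results(text) if not passed]
```

Each returned name can then be rerun on its own with `ctest -R <name>`, as was done for global_enkf above.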

My Jet time is limited. I will try again when it is returned from maint.

Running the standard ctests should reproduce results.

None

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

wx20jjung and others added 17 commits June 12, 2024 12:19
…on. In this case it is METImage. Specifically, the CRTM coefficient files are needed by CADS. The CRTM coefficient files used are determined by the satinfo file. The METImage entry in the satinfo file generates error messages in various parts of the GSI. These changes silence the error messages. At this time, there is no METImage assimilation code in the GSI.
Conflicts:
	src/gsi/gsimod.F90
	src/gsi/mrmsmod.f90
	src/gsi/read_cris.f90
	src/gsi/read_iasi.f90
	src/gsi/read_obs.F90
	src/gsi/statsrad.f90
Merge branch 'IASI-NG' of https://github.com/wx20jjung/GSI into IASI-NG
Merge remote-tracking branch 'emc/develop' into IASI-NG
@wx20jjung
Contributor Author

@ADCollard , @DavidHuber-NOAA , @InnocentSouopgui-NOAA would you be willing to review these changes?

Resolved review threads:

  • src/gsi/crtm_interface.f90 (outdated)
  • src/gsi/gsimod.F90 (outdated)
  • src/gsi/gsimod.F90 (outdated)
  • src/gsi/qcmod.f90
  • src/gsi/qcmod.f90 (outdated)
  • src/gsi/read_iasing.f90 (outdated)
  • src/gsi/read_iasing.f90
  • src/gsi/read_iasing.f90 (outdated)

@RussTreadon-NOAA
Contributor

Thank you @DavidHuber-NOAA for reviewing this PR.

@wx20jjung , who is the second peer reviewer we should assign to this PR?

Co-authored-by: David Huber <[email protected]>
@wx20jjung
Contributor Author

wx20jjung commented Nov 5, 2024 via email

@wx20jjung
Contributor Author

I have pushed @DavidHuber-NOAA's changes to GitHub. I am not able to test them until Hera is returned to users tonight.

@TingLei-NOAA
Contributor

@RussTreadon-NOAA I think Jet had the same issue we saw on other machines when a small number of MPI tasks is used. Would you mind changing the parameters on lines 104 and 105 of regression_param.sh to values similar or identical to those used for Orion or Hera, to see whether that resolves the issue?

Thank you @TingLei-NOAA for the suggestion. Why might changing the task count allow the test to pass? Are we dealing with a memory issue?

@RussTreadon-NOAA This is because of GSI issue #766, for which the latest suggestion from the helpdesk expert is to try a different compiler. For the time being, I'd prefer the current workaround.

@RussTreadon-NOAA
Contributor

@TingLei-NOAA , I find it disturbing that our solution is to increase the task count when we encounter this problem on different machines. It seems this failure is occurring on an increasing number of machines (e.g., see this comment in GSI PR #800). At some point we need someone to invest time in a deep dive into this issue.

@TingLei-NOAA
Contributor

TingLei-NOAA commented Nov 6, 2024 via email

@RussTreadon-NOAA
Contributor

@RussTreadon-NOAA This seems to me to be a general issue with netCDF parallelization using HDF5, not one specific to particular machines. I agree it should be clearly resolved. However, as shown in that issue, my investigation with help from the helpdesk expert hasn't succeeded yet.

I disagree, but we can live with the workaround for the time being. Please open a PR to adjust the rrfs_3denvar_rdasens task count on affected machines.

@TingLei-NOAA
Contributor

@RussTreadon-NOAA Could the changes you made as a workaround for the rdas ctest simply be included in this PR?

@RussTreadon-NOAA
Contributor

@TingLei-NOAA, please discuss and coordinate with @wx20jjung . You should also check with @DavidBurrows-NCO since rrfs_3denvar_rdasens fails on Gaea.

If increasing the task count on three machines allows rrfs_3denvar_rdasens to run, I'm more suspicious of the GSI code than the compilers and modules on three machines.

Contributor

@RussTreadon-NOAA RussTreadon-NOAA left a comment

Approve

@RussTreadon-NOAA
Contributor

@TingLei-NOAA informed me that he will be on leave and is unable to update the task count for rrfs_3denvar_rdasens on various machines.

@wx20jjung , please commit the modified regression/regression_param.sh you used to get a successful rrfs_3denvar_rdasens run on Jet. Thank you!

@wx20jjung
Contributor Author

@RussTreadon-NOAA These are the changes I made:
diff regression_param.sh regression_param.sh_orig
104,105c104,105
< topts[1]="0:15:00" ; popts[1]="40/3/" ; ropts[1]="/1"
< topts[2]="0:15:00" ; popts[2]="40/5/" ; ropts[2]="/1"
---
>        topts[1]="0:15:00" ; popts[1]="5/4/"  ; ropts[1]="/1"
>        topts[2]="0:15:00" ; popts[2]="10/4/"  ; ropts[2]="/1"

I will commit and push them momentarily.

@wx20jjung
Contributor Author

@RussTreadon-NOAA The Jet changes were pushed to GitHub and have passed the internal checks.

Contributor

@RussTreadon-NOAA RussTreadon-NOAA left a comment

Approve. Thank you @wx20jjung for updating the Jet job configuration for the rrfs ctest.

@RussTreadon-NOAA
Contributor

This PR is awaiting the return of WCOSS2 to developers so WCOSS2 ctests can be run. Assuming acceptable WCOSS2 results, this PR can be merged into develop.

@RussTreadon-NOAA
Contributor

WCOSS2 ctests
Install wx20jjung:IASI-NG at 7518ed8 and develop at b0e3cba on Cactus. Run ctests with the following results

Test project /lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr805/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_rdasens
    Start 4: hafs_4denvar_glbens
    Start 5: hafs_3denvar_hybens
    Start 6: global_enkf
1/6 Test #3: rrfs_3denvar_rdasens .............***Failed  730.06 sec
2/6 Test #6: global_enkf ......................   Passed  855.88 sec
3/6 Test #2: rtma .............................   Passed  974.56 sec
4/6 Test #4: hafs_4denvar_glbens ..............***Failed  1220.35 sec
5/6 Test #5: hafs_3denvar_hybens ..............***Failed  1220.53 sec
6/6 Test #1: global_4denvar ...................***Failed  1743.61 sec

33% tests passed, 4 tests failed out of 6

Total Test time (real) = 1743.70 sec

The following tests FAILED:
          1 - global_4denvar (Failed)
          3 - rrfs_3denvar_rdasens (Failed)
          4 - hafs_4denvar_glbens (Failed)
          5 - hafs_3denvar_hybens (Failed)

Unfortunately each of the above failures is due to the fact that the updat and contrl analyses differ. The initial penalties are identical. Differences arise in the minimization. It's odd that this behavior is only observed on WCOSS2. ctests pass on other platforms. The WCOSS2 GSI build uses intel/19. The builds on other machines are intel/20+.
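
A minimal sketch of how one might locate where two runs first part ways. It assumes the per-iteration values (for example penalties or step sizes) have already been extracted from the updat and contrl output into plain lists; it does not assume any particular GSI log format, and `first_divergence` is a hypothetical helper name.

```python
def first_divergence(contrl, updat, rel_tol=0.0):
    """Return the index of the first differing entry, or None if none differ.

    With rel_tol=0.0 any difference at all counts, which matches a
    reproducibility check; a positive rel_tol tolerates small relative
    differences instead.
    """
    for index, (a, b) in enumerate(zip(contrl, updat)):
        scale = max(abs(a), abs(b))
        if abs(a - b) > rel_tol * scale:
            return index
    # One run produced more iterations than the other.
    if len(contrl) != len(updat):
        return min(len(contrl), len(updat))
    return None
```

An index of 0 would mean the runs already differ at the initial value; a later index is consistent with identical starting penalties followed by differences during the minimization.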

@RussTreadon-NOAA
Contributor

Create stand-alone script to run global_4denvar on Cactus. Simplify configuration to 3dvar with no FGAT & no constraints. Only assimilate CrIS and IASI. Non-reproducible results between contrl (develop) and updat (wx20jjung:IASI-NG) remain. The reason(s) for the non-reproducible results must be understood and, hopefully, resolved before this PR can move forward.

@wx20jjung , I see that you have a WCOSS2 account. Can you log onto Cactus and investigate? The stand-alone script I have is /lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/scripts/run_global_4denvar.sh

@RussTreadon-NOAA
Contributor

global_4denvar debugging on Cactus

Recompile develop and wx20jjung:IASI-NG in debug mode. This is done by setting BUILD_TYPE=Debug in ush/build.sh. Submit run_global_4denvar.sh for each executable. Analysis results are identical.

Does optimization alter the instruction order between the two source codes - develop -vs- wx20jjung:IASI-NG? Note - all global_4denvar debug runs set OMP_NUM_THREADS=1.

@RussTreadon-NOAA
Contributor

Build develop and wx20jjung:IASI-NG on Cactus using -O2 optimization. Analysis results differ. Change to -O1 optimization. Analysis results are identical. The debug build uses -O0 optimization.
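
One general way that higher optimization levels can change results is by reordering floating-point operations, because floating-point addition is not associative. Whether that is the cause here is not established in this thread; the sketch below only illustrates the arithmetic, and the values and the helper name `sum_left_to_right` are made up for illustration.

```python
def sum_left_to_right(values):
    """Accumulate in the given order, as a simple scalar loop would."""
    total = 0.0
    for value in values:
        total += value
    return total


values = [0.1, 0.2, 0.3]

# Same numbers, two accumulation orders.
forward = sum_left_to_right(values)             # (0.1 + 0.2) + 0.3
backward = sum_left_to_right(reversed(values))  # (0.3 + 0.2) + 0.1

# The two sums differ in the last bit even though the inputs are identical.
print(forward == backward, forward - backward)
```

A last-bit difference like this is harmless on its own, but an iterative minimization can amplify it into visibly different analyses.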

@RussTreadon-NOAA
Contributor

Rebuild develop and wx20jjung:IASI-NG on Cactus using -O3 optimization (default value). Set CRIS_CADS and IASI_CADS to .false.; analysis results differ.

@wx20jjung
Contributor Author

I do not have access to restricted data on WCOSS2 yet. Removing only the restricted data from both develop and update generated the following results:
jim.jung@dlogin03:/lfs/h2/emc/da/noscrub/jim.jung/ctests/update/build> ctest -j 6
Test project /lfs/h2/emc/da/noscrub/jim.jung/ctests/update/build
Start 1: global_4denvar
Start 6: global_enkf
Start 2: rtma
Start 3: rrfs_3denvar_rdasens
Start 4: hafs_4denvar_glbens
Start 5: hafs_3denvar_hybens
1/6 Test #6: global_enkf ...................... Passed 251.05 sec
2/6 Test #3: rrfs_3denvar_rdasens ............. Passed 727.29 sec
3/6 Test #2: rtma ............................. Passed 969.02 sec
4/6 Test #5: hafs_3denvar_hybens .............. Passed 1154.36 sec
5/6 Test #4: hafs_4denvar_glbens .............. Passed 1213.25 sec
6/6 Test #1: global_4denvar ................... Passed 1443.14 sec

100% tests passed, 0 tests failed out of 6

Total Test time (real) = 1443.27 sec

I can't do further tests until I am added to the rstprod group.

@RussTreadon-NOAA
Contributor

global_4denvar has the following rstprod dump files

+##$nln $datobs/${prefix_obs}.prepbufr                ./prepbufr
+##$nln $datobs/${prefix_obs}.prepbufr.acft_profiles  ./prepbufr_profl
+##$nln $datobs/${prefix_obs}.nsstbufr                ./nsstbufr
+##$nln $datobs/${prefix_obs}.gpsro.${suffix}         ./gpsrobufr
+##$nln $datobs/${prefix_obs}.saphir.${suffix}        ./saphirbufr

As shown above, comment out these dump files from global_4denvar.sh. Run global_4denvar ctest with following result

Test project /lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr805/build
    Start 1: global_4denvar
1/1 Test #1: global_4denvar ...................***Failed  1568.71 sec

0% tests passed, 1 tests failed out of 1

Total Test time (real) = 1569.30 sec

The following tests FAILED:
          1 - global_4denvar (Failed)

The initial step size on the first iteration of the first loop differs between the updat and contrl runs. The same behavior is observed when all dump files are processed.

@RussTreadon-NOAA
Contributor

@wx20jjung , your passing case does not assimilate microwave radiances. I repeated the no_rstprod test with the links for the microwave radiance dump files also commented out in global_4denvar.sh. Unfortunately, my updat and contrl results still differ.

7 participants