
fsurdat: PCT_SAND, PCT_CLAY, ORGANIC differ with different PE layouts on derecho #2502

Closed
slevis-lmwg opened this issue Apr 30, 2024 · 14 comments · Fixed by #2500
Labels: bug (something is working incorrectly), science (Enhancement to or bug impacting science)


slevis-lmwg commented Apr 30, 2024

Brief summary of bug

I ran mksurfdata_esmf on derecho to generate fsurdat/landuse files for the VR grids ne0np4CONUS, ne0np4.ARCTIC, and ne0np4.ARCTICGRIS (PR #2490, issue #2487). By accident, I tried two PE layouts:

  • I see no diffs in the landuse files.
  • I see diffs in the fsurdat files. The fsurdat metadata records the different numbers of tasks used:
<               :Host = "derecho7" ;
<               :Number-of-tasks = 256 ;
---
>               :Host = "derecho6" ;
>               :Number-of-tasks = 1152 ;

Possibly related to issue #2430.

General bug information

CTSM version you are using: ctsm5.2.001

Does this bug cause significantly incorrect results in the model's science? Maybe

Configurations affected: All ctsm5.2.0 and newer, as well as hacked simulations that use 5.2 fsurdat files

Details of bug

I used /glade/campaign/cesm/cesmdata/cseg/tools/cime/tools/cprnc/cprnc -m <file1> <file2>
to get info like this:

 PCT_SAND   (gridcell,nlevsoi)
        281  1523900  ( 38260,     7) ( 77809,     1) ( 30259,     5) ( 30260,     8)
             1523900   9.500000000000000E+01   8.000000000000000E+00 4.5E+01  8.800000000000000E+01 4.2E-05  1.500000000000000E+01
             1523900   9.500000000000000E+01   8.000000000000000E+00          4.300000000000000E+01          4.300000000000000E+01
             1523900  ( 38260,     7) ( 77809,     1)
          avg abs field values:    4.507733154296875E+01    rms diff: 2.2E-01   avg rel diff(npos):  4.2E-05
                                   4.507754516601562E+01                        avg decimal digits(ndif):  0.8 worst:  0.2
 RMS PCT_SAND                         2.1765E-01            NORMALIZED  4.8284E-03

 PCT_CLAY   (gridcell,nlevsoi)
        269  1523900  ( 30936,     8) ( 38260,     1) ( 30260,     8) ( 42656,     6)
             1523900   7.400000000000000E+01   2.000000000000000E+00 4.6E+01  6.400000000000000E+01 6.5E-05  3.400000000000000E+01
             1523900   7.400000000000000E+01   2.000000000000000E+00          1.800000000000000E+01          6.000000000000000E+00
             1523900  ( 30936,     8) ( 38260,     1)
          avg abs field values:    1.737113952636719E+01    rms diff: 1.9E-01   avg rel diff(npos):  6.5E-05
                                   1.737069702148438E+01                        avg decimal digits(ndif):  0.6 worst:  0.1
 RMS PCT_CLAY                         1.9311E-01            NORMALIZED  1.1117E-02

 ORGANIC   (gridcell,nlevsoi)
        290  1523900  ( 36565,     5) (     1,     1) ( 42634,     8) ( 30207,     1)
             1523900   2.974772033691406E+02   0.000000000000000E+00 1.7E+02  1.733897705078125E+02 1.6E-04  4.729569244384766E+01
             1523900   2.974772033691406E+02   0.000000000000000E+00          0.000000000000000E+00          0.000000000000000E+00
             1523900  ( 36565,     5) (     1,     1)
          avg abs field values:    1.125364875793457E+01    rms diff: 8.5E-01   avg rel diff(npos):  1.6E-04
                                   1.124512195587158E+01                        avg decimal digits(ndif):  0.1 worst:  0.0
 RMS ORGANIC                          8.4824E-01            NORMALIZED  7.5403E-02

@ekluzek proposed this follow-up:
Perform testing with f09 to make the results easier to visualize (the VR grids are unstructured and difficult to view).

@slevis-lmwg slevis-lmwg self-assigned this Apr 30, 2024
@slevis-lmwg slevis-lmwg added priority: high High priority to fix/merge soon, e.g., because it is a problem in important configurations investigation Needs to be verified and more investigation into what's going on. tag: support tools only labels Apr 30, 2024
@slevis-lmwg slevis-lmwg added the next this should get some attention in the next week or two. Normally each Thursday SE meeting. label May 2, 2024
@slevis-lmwg slevis-lmwg added this to the ctsm5.3.0 milestone May 2, 2024
@ekluzek ekluzek removed the next this should get some attention in the next week or two. Normally each Thursday SE meeting. label May 2, 2024
@slevis-lmwg slevis-lmwg moved this from Todo to In Progress in LMWG: Near Term Priorities May 6, 2024
@slevis-lmwg (Contributor, Author) commented:

I have submitted a 4-node job and an 8-node job:

qsub mksurfdata_jobscript_single
qsub mksurfdata_jobscript_single_8nodes.sh

in /glade/work/slevis/git/latest_master/tools/mksurfdata_esmf
git describe: ctsm5.2.003
The jobs point to

surfdata_0.9x1.25_hist_2000_78pfts_c240506.namelist
surfdata_0.9x1.25_hist_2000_78pfts_c240506b.namelist

@slevis-lmwg (Contributor, Author) commented:

First I compare two files that I expect (hope) to be identical because derecho generated them on the same number of nodes. I'm relieved to find that they are indeed identical:

surfdata_0.9x1.25_hist_2000_78pfts_c240506
surfdata_0.9x1.25_hist_2000_78pfts_c240216

Next I compare the two files that I generated today:

surfdata_0.9x1.25_hist_2000_78pfts_c240506b
surfdata_0.9x1.25_hist_2000_78pfts_c240506

and find diffs as shown in the following sample ncview images.

@slevis-lmwg (Contributor, Author) commented:

[Attached ncview images:]
surfdata_f09_2000_78pfts_c240506b-a_pctsand_nlevsoi0
surfdata_f09_2000_78pfts_c240506b-a_pctclay_nlevsoi0
surfdata_f09_2000_78pfts_c240506b-a_ORGANIC_nlevsoi0

@slevis-lmwg (Contributor, Author) commented:

[Attached ncview image, landmask for reference:]
surfdata_f09_2000_78pfts_c240506_pctocn

@slevis-lmwg (Contributor, Author) commented:

My assessment of this visual examination:
A very small number of grid cells show differences at f09, but differences in those locations can be large.
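The pattern described above (few cells differ, but by a lot) can be quantified directly rather than judged by eye. Below is a minimal sketch using synthetic numpy arrays as stand-ins for a field like PCT_SAND from the two fsurdat files; the array shape, perturbed-cell count, and magnitudes are illustrative assumptions, not values from the actual files (with real data you would read the variables via netCDF4 or xarray):

```python
import numpy as np

# Synthetic f09-like lat x lon field; the second copy differs in only a
# handful of cells, but by a large amount -- mimicking the observed diffs.
rng = np.random.default_rng(0)
a = rng.uniform(0.0, 100.0, size=(192, 288))
b = a.copy()
idx = rng.choice(a.size, size=12, replace=False)   # perturb only 12 cells
b.ravel()[idx] += rng.uniform(20.0, 50.0, size=12)  # ravel() is a view here

diff = b - a
ndiff = int(np.count_nonzero(diff))
print(f"{ndiff} of {a.size} cells differ ({100 * ndiff / a.size:.4f}%), "
      f"max |diff| = {np.abs(diff).max():.1f}")
```

The same two-line summary (count of nonzero cells, max absolute difference) applied to a cprnc or ncdiff output file gives a quick sense of whether a difference is widespread noise or a few large outliers.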

@slevis-lmwg slevis-lmwg moved this from In Progress to Todo in LMWG: Near Term Priorities May 7, 2024
@samsrabin samsrabin added bug something is working incorrectly science Enhancement to or bug impacting science and removed bug - impacts science labels Aug 8, 2024
@ekluzek ekluzek removed the priority: high High priority to fix/merge soon, e.g., because it is a problem in important configurations label Aug 21, 2024

ekluzek commented Aug 21, 2024

This seems unlikely to be worked on and resolved by ctsm5.3.0. Since the number of gridcells affected is small, that might be OK, but the fact that the differences are large is concerning.

@ekluzek ekluzek added the next this should get some attention in the next week or two. Normally each Thursday SE meeting. label Aug 21, 2024
@samsrabin (Collaborator) commented:

I wonder if it's related to my "ambiguous nearest neighbors" issue: ESMF issue #276: For nearest-neighbor remapping, ensure results are independent of processor count if there are equidistant source points

You can test by shifting the input datasets by a tiny amount (I used 1e-6°).
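To illustrate the hypothesis (this is not the actual ESMF code), here is a minimal numpy sketch of the ambiguous nearest-neighbor tie and how a tiny coordinate shift breaks it. The `nearest_source` helper and the coordinate values are hypothetical, for demonstration only:

```python
import numpy as np

def nearest_source(dest, src_lons):
    """Index of the source point nearest to dest (1-D longitudes, degrees)."""
    d = np.abs(src_lons - dest)
    return int(np.argmin(d))  # argmin returns the first index on exact ties

src = np.array([10.0, 20.0])  # two source points
dest = 15.0                   # exactly equidistant from both

# With an exact tie, which source point "wins" depends on traversal order;
# in a parallel remap that order can change with the PE layout.
d = np.abs(src - dest)
print(d[0] == d[1])  # True: the nearest neighbor is ambiguous

# A tiny shift of the source coordinates (as in the tweaked mesh file)
# breaks the tie, so the choice no longer depends on ordering.
src_tweaked = src + 1e-6
print(nearest_source(dest, src_tweaked))  # 0: now unambiguously nearest
```

In the real workflow the analogous perturbation is applied to the lat/lon coordinates stored in the source mesh file before regridding.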


wwieder commented Aug 29, 2024

This would be nice to fix, but it is likely related to the ESMF nearest-neighbor issue @samsrabin noted, where results differ with different PE counts. It falls in the quality-of-life category (for now) but should be addressed by the CESM3 release.

If this is a quick fix, it would let us create more accurate 5.3 surface data. Let's not spend more than half a day (roughly) of active time testing this to see if it works and then implementing it.

@slevis-lmwg (Contributor, Author) commented:

I likely deleted earlier samples of this problem, so I have generated new ones in
/glade/work/slevis/git/mksurfdata_toolchain/tools/mksurfdata_esmf
File generated with 512 tasks: surfdata_0.9x1.25_hist_1850_78pfts_c240826.nc
File generated with 256 tasks: surfdata_0.9x1.25_hist_1850_78pfts_c240826b.nc
In the same directory I placed the difference between these two files: b-a.nc

@slevis-lmwg (Contributor, Author) commented:

My latest test still fails unfortunately. I generated an fsurdat file four times as follows:

/glade/derecho/scratch/slevis/temp_work/new_rawdata/tools/mksurfdata_esmf/ptvg_oldmesh1
/glade/derecho/scratch/slevis/temp_work/new_rawdata/tools/mksurfdata_esmf/ptvg_oldmesh2
/glade/derecho/scratch/slevis/temp_work/new_rawdata/tools/mksurfdata_esmf/ptvg_newmesh1
/glade/derecho/scratch/slevis/temp_work/new_rawdata/tools/mksurfdata_esmf/ptvg_newmesh2

where suffix 1 used 512 tasks and suffix 2 used 256 tasks, and
where the old (default) mesh and the new (tweaked) mesh are, respectively:

11c11
<   mksrf_fsoitex_mesh = '/glade/campaign/cesm/cesmdata/inputdata/lnd/clm2/mappingdata/grids/UNSTRUCTgrid_5x5min_nomask_cdf5_c200129.nc'
---
>   mksrf_fsoitex_mesh = '/glade/work/samrabin/5x5_meshfile_tweaked/UNSTRUCTgrid_5x5min_nomask_cdf5_c200129.tweaked_latlons.nc'

I used ncdiff and found that the tweaked files differ similarly to the way that the default files differ.

@samsrabin thank you for the time that you put into trying out your hypothesis. I don't know whether this result rules out your hypothesis or whether there is more experimentation that could be done. What are your thoughts? Either way, we will probably need to follow up post ctsm5.3.

@samsrabin (Collaborator) commented:

Thanks for checking, @slevis-lmwg. Let's plan to do another test once the ESMF bug is fixed; I think your latest test shows that's not the issue, but it may be worth a shot.

@billsacks (Member) commented:

As @wwieder pointed out in #2744 (comment), the fix in slevis-lmwg#9 is likely to resolve this issue.


ekluzek commented Sep 6, 2024

Thanks @billsacks, I was hoping that might be the case. I'll redo the testing that @slevis-lmwg did and see if that's correct.


ekluzek commented Sep 7, 2024

Hurray! I tried slevis-lmwg#9 out for f09-1850 with 256 and 128 processors and am getting identical results between the two now. So this is really good news!

@ekluzek ekluzek removed investigation Needs to be verified and more investigation into what's going on. next this should get some attention in the next week or two. Normally each Thursday SE meeting. labels Sep 9, 2024
@slevis-lmwg slevis-lmwg moved this from Todo to Done in LMWG: Near Term Priorities Sep 9, 2024