Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using gcchem_internal_checkpoint for grid-stretching simulation but fails with 'Factories not equal' error. #404

Open
yuz-wx opened this issue Apr 16, 2024 · 18 comments · Fixed by geoschem/MAPL#34
Assignees
Labels
category: Bug Something isn't working stale No recent activity on this issue topic: Stretched Grid Specific to stretched grid simulation

Comments

@yuz-wx
Copy link

yuz-wx commented Apr 16, 2024

Name and Institution (Required)

Name: YU, Zheng
Institution: The Chinese University of Hong Kong, EASC

Confirm you have reviewed the following documentation

Description of your issue or question

Please provide as much detail as possible. Always include the GCHP version number and any relevant configuration and log files.

I use GCPY to create a restart file called new_restart_file.nc to run a grid-stretching case by GCHP14.2.2, it can run successfully. After that, I link the Restarts/gcchem_internal_checkpoint as gchp_restart.nc4 to run a new case from a new start time. Then it has such an error (Factories not equal) when running a new case. I've checked the global attributes of gcchem_internal_checkpoint and new_restart_file, they are the same.

settings in GCHP.rc:
image

new_restart_file:
image

gcchem_internal_checkpoint:
image

So I print out some lines in base/MAPL_CubedSphereGridFactory.F90, and I find the a%target_lat (from the case settings) and this%target_lat (from the gcchem_internal_checkpoint) are not equal, even if they are both calculated by 37.0 x pi / 180.d0. Since the values of a%target_lat and this%target_lat are 0.6457718 and 0.6457717. It's like a casual float trap caused this problem. So I'm not sure if GCHP has some compiling options to void such an issue? Or can such a comparation be more simple in the future, like just compare the '37.0'.

The cause of 'Factories not equal'
image
image

The calculation of target_lat:
image

The calculation for a%target_lat:
image

The calculation for this%target_lat:
image

@yuz-wx yuz-wx changed the title Using GCHP generated restart file Using GCHP generated restart file (gcchem_internal_checkpoint) for grid-stretching simulation but fails with 'Factories not equal' error. Apr 16, 2024
@yuz-wx yuz-wx changed the title Using GCHP generated restart file (gcchem_internal_checkpoint) for grid-stretching simulation but fails with 'Factories not equal' error. Using gcchem_internal_checkpoint for grid-stretching simulation but fails with 'Factories not equal' error. Apr 16, 2024
@yantosca yantosca added category: Bug Something isn't working topic: Stretched Grid Specific to stretched grid simulation topic: Offline Regridding Related to regridding files before or after run labels Apr 16, 2024
@yantosca
Copy link
Contributor

Thanks for writing @yuz-wx. There was a similar issue #318 that was solved by using longitude in 0-360 range when specifying the target longitude. But I'm not sure that applies to your issue.

However, we just received a fix in GCPy (see geoschem/gcpy#311) for cubed sphere regridding. We will bring this into the dev branch soon. In the meantime you could pull the fix into your GCPy clone and then see if that results in a better-regridded restart file.

@yuz-wx
Copy link
Author

yuz-wx commented Apr 16, 2024

Dear yantosca, the restart file made by GCPY can be used successfully, but the GCHP generated restart file has such an issue. I think the cause of the issue is the inversion from double64 to real32. A double64 number is directly given to a real32 variable, which carries a truncation error because it is not the rounding number. I revised related lines in 'base/MAPL_CubedSphereGridFactory.F90' by an inelegant way to solve this problem temporarily and let the factories equal (it turns 0.645771756658875 to 0.6457718, instead of 0.6457717). But I think it might be better to compare the target_lat written in restart file and GCHP.rc directly (compare 37.0 with 37.0), instead of comparing the calculated one (compare 0.6457718 with 0.6457718). Hope it can be updated in the future. Thank you very much!

Here's my revised lines:
image

The result of my revise:
image

@yantosca
Copy link
Contributor

Thanks @yuz-wx. Tagging @lizziel and @yidant

@yantosca yantosca added topic: Online Regridding Related to MAPL regridding during run-time and removed topic: Offline Regridding Related to regridding files before or after run labels Apr 16, 2024
@lizziel
Copy link
Contributor

lizziel commented Apr 16, 2024

Hi @yuz-wx, thanks for raising this issue! We have not had a report before of this problem but I can see how it might show up. Doing a check that the stretched grid attributes match up between restart and config is a relatively recent addition to MAPL. I will create an issue with NASA GMAO and also try to put a work-around into our MAPL fork in time for 14.4.0. I will link to the PR here once I have one.

@lizziel lizziel self-assigned this Apr 16, 2024
@lizziel lizziel removed the topic: Online Regridding Related to MAPL regridding during run-time label Apr 16, 2024
@yuz-wx
Copy link
Author

yuz-wx commented Apr 17, 2024

Hi @lizziel, thank you for your comment! Look forward to the following update!

Here's my conclusion for advice:

  1. Just compare the value written in restart and config might be better, instead of the calculated values.
  2. If it must compare the calculated values, please focus on the precision of different values.
  3. I think using rounding value is not a good idea because when the results are 0.645771756658875 and 0.645771746658875, it will compare 0.6457718 and 0.6457717 and fails.

Here's the related lines in code:

  1. MAPL_Generic.F90: *"Factories not equal"
  2. MAPL_CubedSphereGridFactory.F90: function physical_params_are_equal
  3. MAPL_CubedSphereGridFactory.F90: subroutine check_and_fill_consistency

@lizziel
Copy link
Contributor

lizziel commented Apr 18, 2024

HI @yuz-wx, here is a quick fix for the stretched grid restart file issue: geoschem/MAPL#34

@kilicomu
Copy link

Hi @yantosca , @lizziel - I think I've found another way that this problem can sneak in!

@amy1916 was seeing the same issue as described by @yuz-wx so we read through this thread and applied the fix that Lizzie made to the MAPL submodule. We were still seeing the same Factories not equal error, and after some investigation we discovered that the number stored in the gcchem_internal_checkpoint file for the TARGET_LON attribute was not quite correct.

To demo this, let's take a stretched grid restart file that was created using GCPy regridding tools (FILE_1) and a checkpoint file produced by GCHP (FILE_2). If we ncdump the files and look at the attributes, we see:

FILE_1:

// global attributes:
                :STRETCH_FACTOR = 2.f ;
                :TARGET_LAT = 40.f ;
                :TARGET_LON = 285.f ;

FILE_2:

// global attributes:
                :STRETCH_FACTOR = 2.f ;
                :TARGET_LAT = 40.f ;
                :TARGET_LON = 285.f ;

All looks good, right?

Except, if we open both using xarray and take a look at the values of the attributes, we get:

In [6]: DATA.attrs['TARGET_LON'] == DATA2.attrs['TARGET_LON']
Out[6]: False

In [7]: DATA.attrs['TARGET_LON']
Out[7]: 285.0

In [8]: DATA2.attrs['TARGET_LON']
Out[8]: 284.99997

Just to sanity check against some kind of printing problem I modified initialize_from_file_metadata in MAPL_CubedSphereGridFactory to check the value of the TARGET_LON that is read in against the expected value and it confirmed that the value read in from the file is incorrect - I don't think any amount of precision changing down the line can guard that!

I haven't had a chance to dig into the checkpoint writing code to figure out why this number was getting into the checkpoint with an error, but something I did notice is that there are type differences between variables in FILE_1 and FILE_2, for example:

In [4]: DATA
Out[4]: 
<xarray.Dataset>
Dimensions:              (lat: 1200, lon: 200, lev: 72, time: 1)
Coordinates:
  * lev                  (lev) float64 1.0 2.0 3.0 4.0 ... 69.0 70.0 71.0 72.0
  * time                 (time) datetime64[ns] 2022-05-01
  * lat                  (lat) float64 0.0 1.0 2.0 ... 1.198e+03 1.199e+03
  * lon                  (lon) float64 0.0 1.0 2.0 3.0 ... 197.0 198.0 199.0
Data variables: (12/354)
    ARCHV_DRY_TOTN       (lat, lon) float32 ...
    ARCHV_WET_TOTN       (lat, lon) float32 ...
    AREA                 (lat, lon) float32 ...
    AeroH2O_SNA          (lev, lat, lon) float32 ...
    BXHEIGHT             (lev, lat, lon) float32 ...
    DELP_DRY             (lev, lat, lon) float32 ...
    ...                   ...
    SPC_pFe              (lev, lat, lon) float32 ...
    STATE_PSC            (lev, lat, lon) float32 ...
    T_DAVG               (lat, lon) float32 ...
    T_PREVDAY            (lat, lon) float32 ...
    TropLev              (lat, lon) float32 ...
    WetDepNitrogen       (lat, lon) float32 ...
Attributes:
    STRETCH_FACTOR:  2.0
    TARGET_LAT:      40.0
    TARGET_LON:      285.0

In [5]: DATA2
Out[5]: 
<xarray.Dataset>
Dimensions:         (lon: 200, lat: 1200, lev: 72, time: 1)
Coordinates:
  * lon             (lon) float64 1.0 2.0 3.0 4.0 ... 197.0 198.0 199.0 200.0
  * lat             (lat) float64 1.0 2.0 3.0 ... 1.198e+03 1.199e+03 1.2e+03
  * lev             (lev) float64 1.0 2.0 3.0 4.0 5.0 ... 69.0 70.0 71.0 72.0
  * time            (time) datetime64[ns] 2022-05-14
Data variables: (12/335)
    ARCHV_DRY_TOTN  (lat, lon) float32 ...
    ARCHV_WET_TOTN  (lat, lon) float32 ...
    AREA            (lat, lon) float64 ...
    AeroH2O_SNA     (lev, lat, lon) float64 ...
    BXHEIGHT        (lev, lat, lon) float64 ...
    DELP_DRY        (lev, lat, lon) float64 ...
    ...              ...
    SPC_pFe         (lev, lat, lon) float64 ...
    STATE_PSC       (lev, lat, lon) float32 ...
    T_DAVG          (lat, lon) float32 ...
    T_PREVDAY       (lat, lon) float32 ...
    TropLev         (lat, lon) float64 ...
    WetDepNitrogen  (lat, lon) float64 ...
Attributes:
    STRETCH_FACTOR:  2.0
    TARGET_LAT:      40.0
    TARGET_LON:      284.99997

which hints to me that somewhere the type of variables going into the checkpoint files is not being made explicit.

If I can find some time I'll get into the checkpoint writing code, but my GEOS-Chem time is quite thin at the moment!

Hope that's helpful, and if you think this would be more appropriate as a separate bug report I'll happily move it.

@lizziel lizziel reopened this May 22, 2024
@lizziel
Copy link
Contributor

lizziel commented May 22, 2024

Thanks for reporting this still isn't working. I'll take a look at a more comprehensive fix. Indeed looks like we need to better track/preserve the type. This may already be fixed in a newer version of MAPL. I'll see what I can come up with.

@lizziel
Copy link
Contributor

lizziel commented Jun 7, 2024

I still have not been able to reproduce the issue. So far I have been able to submit sequential stretched grid runs without hitting the error. I'll try your exact stretch params and see if that makes a difference.

@lizziel
Copy link
Contributor

lizziel commented Jun 12, 2024

I created a stretched grid restart file using your parameters but still cannot reproduce the issue. Besides running for three consecutive simulations I opened each checkpoint and printed the stretch attributes.

import xarray as xr
ds0=xr.open_dataset('GEOSChem.Restart.20190101_0000z.c24.nc4')
ds1=xr.open_dataset('GEOSChem.Restart.20190101_0100z.c24.nc4')
ds2=xr.open_dataset('GEOSChem.Restart.20190101_0200z.c24.nc4')
ds3=xr.open_dataset('GEOSChem.Restart.20190101_0300z.c24.nc4')
ds0.attrs['TARGET_LON']
103.0
ds1.attrs['TARGET_LON']
103.0
ds2.attrs['TARGET_LON']
103.0
ds3.attrs['TARGET_LON']
103.0
ds0.attrs['TARGET_LAT']
37.0
ds1.attrs['TARGET_LAT']
37.0
ds2.attrs['TARGET_LAT']
37.0
ds3.attrs['TARGET_LAT']
37.0

Have you tried using the latest MAPL version used in 14.4? I wonder if there was a mistake if you manually applied the fix.

@kilicomu
Copy link

Hi @lizziel - thanks for trying to reproduce the error. I'm now very confused as to how the stretch parameters have become wrong in the restart file!

I'll chat with @amy1916 and see if I can reliably reproduce the problem on our end.

@lizziel
Copy link
Contributor

lizziel commented Jun 27, 2024

@kilicomu, I just remembered that when this issue came up a long time ago I was not able to reproduce because I was using transport tracer simulation which has no issues. I will try again with fullchem.

@lizziel
Copy link
Contributor

lizziel commented Jul 9, 2024

Following up on this, I'm going to implement a fix from GMAO: GEOS-ESM/MAPL#1979.

@kilicomu
Copy link

kilicomu commented Jul 9, 2024

@lizziel Cool - let me know if / when you want me to test it my end!

@lizziel
Copy link
Contributor

lizziel commented Jul 17, 2024

@kilicomu, I expected to reproduce the problem with a fullchem run but actually I don't see it. Could you try an out-of-the box run with 14.4.1? You can try your own restart as well as the one at http://geoschemdata.wustl.edu/ExtData/GEOSCHEM_RESTARTS/GC_14.3.0/.

@kilicomu
Copy link

Hi @lizziel - sorry it's taking a while to get back to this, pushed for time at the moment. I'll let you know when I can get a test going.

@lizziel
Copy link
Contributor

lizziel commented Jul 29, 2024

@kilicomu, no worries! I am putting this issue on the back-burner until we get confirmation it is still a problem in the latest version. If you still get the error then I will try testing it on other systems since it does not seem to be a problem on the Harvard cluster.

Copy link

This issue has been automatically marked as stale because it has not had recent activity. If there are no updates within 7 days it will be closed. You can add the "never stale" tag to prevent the issue from closing this issue.

@github-actions github-actions bot added the stale No recent activity on this issue label Nov 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
category: Bug Something isn't working stale No recent activity on this issue topic: Stretched Grid Specific to stretched grid simulation
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants