
poor performance on slingshot11 #72

Open
jedwards4b opened this issue Sep 13, 2023 · 6 comments

@jedwards4b (Contributor) commented Sep 13, 2023

Using the CESM model in a coupler test configuration,
PFS.ne120_t12.2000_XATM_XLND_XICE_XOCN_XROF_SGLC_SWAV.derecho_intel,
we are observing very poor performance of mct_rearrange_rearr on perlmutter (NERSC) and derecho (NCAR); both machines use a Slingshot 11 network and AMD processors.
Using 512 tasks on derecho with GPTL timing we see:
"mct_rearrange_rearr" - 512 512 4.426752e+06 1.391128e+05 277.198 ( 268 0) 263.345 ( 505 0)

Compared to the NCAR cheyenne system:
"mct_rearrange_rearr" - 512 512 4.426752e+06 3.399975e+04 73.911 ( 414 0) 60.767 ( 384 0)

That is, for the same call count, the maximum wall time on derecho (277.2 s) is roughly 3.7x that on cheyenne (73.9 s).

@rljacob (Contributor) commented Sep 18, 2023

Noting that a similar performance difference is seen between perlmutter and chrysalis (an AMD machine with InfiniBand) for E3SM cases. (We haven't tried the exact case above yet.)

@rljacob added the mctsrc label Sep 18, 2023
@jedwards4b (Contributor, Author) commented

I just tried the X case on derecho with the Cray compiler and I am not seeing the poor performance:
rearrange_rearr (cray compiler 15.0.1):    max 46.8     min 40.4
rearrange_rearr (intel compiler 2023.0.0): max 642.257  min 445.713

@rljacob (Contributor) commented Sep 19, 2023

Is the MPI library different?

@jedwards4b (Contributor, Author) commented

It's the same MPI library, cray-mpich/8.1.25; however, I note that there is a separate build of this library for each compiler flavor.
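
A quick sanity check (a generic technique, not something from this thread) is to print the MPI library version string at runtime under each compiler environment, which identifies the exact cray-mpich build being linked. With mpi4py that looks like:

```python
# Print the MPI library version string to confirm exactly which
# cray-mpich build a given compiler environment resolves to.
from mpi4py import MPI

if MPI.COMM_WORLD.Get_rank() == 0:
    print(MPI.Get_library_version())
```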

@rljacob (Contributor) commented Oct 25, 2023

Updating this issue: some hardware updates at NERSC made a lot of the observed behavior go away. @ndkeen can say more.

@ndkeen commented Oct 25, 2023

During the Sep 28th maintenance there were some updates (BIOS, network, software), and indeed I see improvements in several places -- mostly in communication at higher node counts on pm-cpu.

[attached plot: cpl_720]

In the plot, c1 refers to the normal/default PSTRID of 1, and c8 is the workaround we were using of CPL_PSTRID=8.
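
For context, PSTRID is the component processor stride in the CESM/E3SM PE layout: it controls how a component's MPI tasks are spread across the available global ranks. A minimal sketch of that placement rule (illustrative only, not the actual coupler code) is:

```python
# Illustrative sketch of processor-stride placement (not the actual
# CESM/E3SM implementation): component task i is placed on global rank
# root + i * pstrid.
def component_ranks(ntasks: int, root: int = 0, pstrid: int = 1) -> list[int]:
    return [root + i * pstrid for i in range(ntasks)]

print(component_ranks(8, pstrid=1))  # [0, 1, 2, 3, 4, 5, 6, 7]
print(component_ranks(8, pstrid=8))  # [0, 8, 16, 24, 32, 40, 48, 56]
```

With stride 1 the coupler packs onto consecutive ranks (and hence fewer nodes); with CPL_PSTRID=8 it spreads across eight times the rank range, which can reduce per-node network contention during the rearranger exchange -- hence its use as a workaround here.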
