E3SM-IO failed on 1-process run #37

Open
wkliao opened this issue Aug 30, 2023 · 18 comments
wkliao commented Aug 30, 2023

I am using the develop branch of vol-async (commit 73a870d) to test the E3SM-IO benchmark.
One of the tests failed. The failing command runs on 1 MPI process, while
the same command runs fine with 16 processes.

Below are the relevant environment variables.

HDF5_PLUGIN_PATH=$HOME/ASYNC_VOL/lib
HDF5_VOL_CONNECTOR=async under_vol=0;under_info={}
LD_LIBRARY_PATH=$HOME/ASYNC_VOL/lib:$HOME/Argobots/1.1/lib:$HOME/HDF5/1.14.1-2-thread/lib
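
One pitfall worth noting with these settings (not necessarily the cause here): when exported from a shell rather than listed by `env`, the connector string must be quoted, since it contains a semicolon. A minimal sketch using the same paths as above:

```shell
# Sketch of setting the same variables from a shell script.
# The quotes around HDF5_VOL_CONNECTOR are required: the value contains
# ';', which the shell would otherwise treat as a command separator.
export HDF5_PLUGIN_PATH="$HOME/ASYNC_VOL/lib"
export HDF5_VOL_CONNECTOR="async under_vol=0;under_info={}"
export LD_LIBRARY_PATH="$HOME/ASYNC_VOL/lib:$HOME/Argobots/1.1/lib:$HOME/HDF5/1.14.1-2-thread/lib"
echo "$HDF5_VOL_CONNECTOR"
```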

Here is the run command.

e3sm_io -k -r 2 -y 2 datasets/map_f_case_16p.h5 -o blob_f_out.h5 -a hdf5 -x blob

Part of GDB trace is given below.

#26 0x00007f717436f218 in H5D__write (count=count@entry=1, dset_info=dset_info@entry=0x7f71565fff00)
    at ../../hdf5-1.14.1-2/src/H5Dio.c:745
#27 0x00007f71745b1f61 in H5VL__native_dataset_write (count=1, obj=<optimized out>, 
    mem_type_id=<optimized out>, mem_space_id=0x1922630, file_space_id=0x191b230, dxpl_id=<optimized out>, 
    buf=0x191c130, req=0x0) at ../../hdf5-1.14.1-2/src/H5VLnative_dataset.c:407
#28 0x00007f717459db47 in H5VL__dataset_write (cls=<optimized out>, req=0x0, buf=0x191c130, 
    dxpl_id=792633534417207497, file_space_id=0x191b230, mem_space_id=0x1922630, mem_type_id=0x191a430, 
    obj=0x1915350, count=1) at ../../hdf5-1.14.1-2/src/H5VLcallback.c:2236
#29 H5VLdataset_write (count=1, obj=0x1915350, connector_id=648518346341351424, mem_type_id=0x191a430, 
    mem_space_id=0x1922630, file_space_id=0x191b230, dxpl_id=792633534417207497, buf=0x191c130, req=0x0)
    at ../../hdf5-1.14.1-2/src/H5VLcallback.c:2396
#30 0x00007f71725a8ef0 in async_dataset_write_fn (foo=0x1a335a0)
    at /homes/wkliao/ASYNC_VOL/vol-async/src/h5_async_vol.c:9712
#31 0x00007f717238104a in ABTD_ythread_func_wrapper (p_arg=0x7f71566001e0)
    at ../../argobots-1.1/src/arch/abtd_ythread.c:21

houjun commented Sep 8, 2023

Hi @wkliao, I just tried running e3sm_io with your command on Perlmutter, with the latest vol-async (85c37d4) and the HDF5 1.14.2 release, and everything seems to be fine. Can you try again with these versions?


wkliao commented Sep 9, 2023

The latest 85c37d4 appears to fix the problem. Thanks for the fix.

FYI: Async VOL is tested continuously whenever E3SM-IO or Log VOL has new commits pushed to its repository.

Any plan to make a new release?


houjun commented Sep 11, 2023

Yes, I'll do some more testing and release a new version today.


houjun commented Sep 11, 2023

@wkliao, I just released v1.8. Please let me know if you find any issues.


wkliao commented Sep 12, 2023

I am getting a test-program hang when running the Cache and Async VOLs together, without Log VOL. The test program is group.cpp, which simply creates 2 HDF5 group objects; the GitHub Actions output can be found here. All environment variables used in the test can also be found there.

There is no error message; the test was terminated when it ran out of time.


wkliao commented Sep 12, 2023

Just realized that the failing test program was not using Log VOL;
it uses only the Cache and Async VOLs. I have revised my previous post accordingly.


houjun commented Sep 12, 2023

Hi @wkliao, I tried the Log VOL group test and the other basic tests on Perlmutter, and they all ran successfully with Cache and Async VOL, so I'm not sure what went wrong there. Can you try running the test again? Is there a verbose mode that can print out where it got stuck?


wkliao commented Sep 12, 2023

As this failure happened on GitHub Actions, I suggest creating a new workflow in Async VOL that tests group.cpp only. Please use the following software versions.

   MPICH_VERSION: 4.1.2
   HDF5_VERSION: 1.14.2
   ARGOBOTS_VERSION: 1.1
   ASYNC_VOL_VERSION: 1.8
   Cache VOL: master branch

You can reuse part of the YAML file. Note that testing group.cpp does not require Log VOL.


wkliao commented Sep 13, 2023

I reran the same GitHub workflow and it failed (hung) at a different test program.
https://github.com/DataLib-ECP/vol-log-based/actions/runs/6159901278/job/16737743501

The test uses the following environment variables. Could you please check whether they are OK?

  export ABT_DIR=${GITHUB_WORKSPACE}/Argobots
  export ASYNC_DIR=${GITHUB_WORKSPACE}/Async
  export CACHE_DIR=${GITHUB_WORKSPACE}/Cache
  export HDF5_DIR=${GITHUB_WORKSPACE}/HDF5
  export HDF5_ROOT=${HDF5_DIR}
  export HDF5_PLUGIN_PATH=${CACHE_DIR}/lib:${ASYNC_DIR}/lib
  export LD_LIBRARY_PATH=${CACHE_DIR}/lib:${ASYNC_DIR}/lib:${ABT_DIR}/lib:${HDF5_DIR}/lib:${LD_LIBRARY_PATH}
  export HDF5_VOL_CONNECTOR="cache_ext config=${GITHUB_WORKSPACE}/cache.cfg;under_vol=512;under_info={under_vol=0;under_info={}}"
  export MPICH_MAX_THREAD_SAFETY=multiple
  export HDF5_USE_FILE_LOCKING=FALSE
  export HDF5_ASYNC_DISABLE_DSET_GET=0
  # Start async execution at file close time
  export HDF5_ASYNC_EXE_FCLOSE=1
  # Start async execution at group close time
  export HDF5_ASYNC_EXE_GCLOSE=1
  # Start async execution at dataset close time
  export HDF5_ASYNC_EXE_DCLOSE=1
  export TEST_NATIVE_VOL_ONLY=1
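
For context on the HDF5_VOL_CONNECTOR string above (my reading, not from the log): `cache_ext` selects Cache VOL, `under_vol=512` selects the connector registered with value 512 (assumed here to be vol-async's registered connector value), and the innermost `under_vol=0` is the native VOL, so the stack is Cache over Async over native. A sketch of just that variable, with a hypothetical config path in place of the workspace path:

```shell
# Hypothetical sketch: Cache VOL stacked over Async VOL over the native VOL.
# 512 is assumed to be vol-async's registered connector value; 0 is the
# native VOL. The quotes are required because the value contains ';'.
export HDF5_VOL_CONNECTOR="cache_ext config=cache.cfg;under_vol=512;under_info={under_vol=0;under_info={}}"
echo "$HDF5_VOL_CONNECTOR"
```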


houjun commented Sep 13, 2023

The HDF5_ASYNC_EXE_* ones are not necessary, but they should be harmless. I'll try setting up an environment the same as the GitHub Actions runner and find out what is causing the hang.


houjun commented Sep 15, 2023

I have a new vol-async 1.8.1 release that seems to fix the hang issue. However, there are new errors with
"Test stacking Log VOL on top of Cache VOL only - make check".
Based on the name, it doesn't seem to use Async VOL, so I am not sure what went wrong.


wkliao commented Sep 16, 2023

The error message says Cache VOL requires the test programs to call MPI_Init_thread() instead of MPI_Init(). Is this true for Cache VOL?

 [CACHE_VOL] ERROR: cache VOL requires MPI to be initialized with MPI_THREAD_MULTIPLE. Please use MPI_Init_thread
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0


wkliao commented Sep 20, 2023

The hanging problem re-appeared in E3SM-IO.
It happened when using Async VOL 1.8.1 + Cache VOL, without Log VOL.
I ran it twice: the first hang occurred at the G case and the second at the I case.
See the GitHub Actions log at
https://github.com/Parallel-NetCDF/E3SM-IO/actions/runs/6242093110


houjun commented Sep 20, 2023

I think Huihuo has been updating Cache VOL actively, so it is probably better for the E3SM-IO tests to use a release version.
@zhenghh04, do you see the hanging problem with your tests? Is this related to the group and file close issue we talked about yesterday?


wkliao commented Sep 20, 2023

Currently, there are no release versions of Cache VOL. I have made a request; see HDFGroup/vol-cache#22.

@zhenghh04

@wkliao if you like, you can try the previous v1.2 release: https://github.com/hpc-io/vol-cache/releases/tag/v1.2.

I'll push a new release soon.

@zhenghh04

> I think Huihuo has been updating Cache VOL actively, probably better for the E3SM-IO tests to use the release version. @zhenghh04, do you see the hanging problem with your tests? Is this related to the group and file close issue we talked about yesterday?

I see the hang issue with the F case. Basically, it stops at the H5VLfile_close call.


wkliao commented Sep 20, 2023

Hi @zhenghh04,
I can see 3 tags and 3 pre-releases in Cache VOL.
You could make 1.2 an official release before making release 1.3.
I suggest also making tags 1.0 and 1.1 official releases, which will make the release history look more formal.
