Driver-portable walker logging #5019

Merged
merged 57 commits into from
Jun 13, 2024
Commits
e36d329
first working impl of TraceManager copy
jtkrogel May 20, 2024
abf1634
remove unused functions
jtkrogel May 21, 2024
8a885a1
remove combined trace use
jtkrogel May 21, 2024
ca70bc8
remove combined traces entirely
jtkrogel May 21, 2024
5b8072b
start trace collector
jtkrogel May 21, 2024
d21b2ce
hot swap trace collector for trace manager
jtkrogel May 21, 2024
1545230
remove unused code
jtkrogel May 21, 2024
875ee62
remove buffers from master
jtkrogel May 21, 2024
e0fcd9b
remove unused data members
jtkrogel May 21, 2024
6e1ad00
fix dmc walker traces
jtkrogel May 21, 2024
d4edf72
consolidate state
jtkrogel May 21, 2024
6aeb13c
operable buffers
jtkrogel May 24, 2024
b0b9175
wtrace buffers work
jtkrogel May 24, 2024
a1b48c2
roll tests over to walker traces
jtkrogel May 28, 2024
586c7db
remove old scaffolding
jtkrogel May 28, 2024
6fdcae6
cleaning
jtkrogel May 28, 2024
77bfcbd
update trace collector names
jtkrogel May 28, 2024
c50cbfc
walker trace manager names
jtkrogel May 28, 2024
4c881ef
walker trace manager names, continued
jtkrogel May 28, 2024
dceab3a
revoke prejudice
jtkrogel May 28, 2024
e7544be
migrate to cpp, cleanup includes
jtkrogel May 29, 2024
75f2aa1
implementation of min/max/median energy walker data buffering
jtkrogel May 29, 2024
ecfdca8
fix min/max/med buffering, works now
jtkrogel May 30, 2024
6c9c7f4
cleanup api
jtkrogel May 30, 2024
196a4b4
enable selective data write
jtkrogel May 30, 2024
00986dc
banish put()
jtkrogel May 30, 2024
4bb5275
cleanup
jtkrogel May 30, 2024
3b92dc8
fix header
jtkrogel May 30, 2024
9d7128a
cleanup api
jtkrogel May 30, 2024
61b0f3c
port to batched drivers
jtkrogel May 30, 2024
f5c2224
add batched tests
jtkrogel May 30, 2024
94a8d15
formatting
jtkrogel May 31, 2024
af77214
batched driver member naming convenction
jtkrogel May 31, 2024
23e1c4a
move trace collection to crowd scope
jtkrogel May 31, 2024
89fe362
const in collect
jtkrogel May 31, 2024
085c7b3
input at driver level
jtkrogel Jun 3, 2024
b042ff9
WalkerTraceBuffer as a class
jtkrogel Jun 3, 2024
fe47abd
camelize
jtkrogel Jun 3, 2024
921411f
break into multiple files
jtkrogel Jun 3, 2024
ea9a8e6
vacuous type class
jtkrogel Jun 3, 2024
53b43c6
add in-code comments
jtkrogel Jun 3, 2024
dc01863
unit test for WalkerTraceCollector
jtkrogel Jun 3, 2024
8eb6e06
add manual entry
jtkrogel Jun 4, 2024
9491933
better guards
jtkrogel Jun 4, 2024
d8745a2
fix unit test for complex
jtkrogel Jun 4, 2024
57cdaae
fix index error
jtkrogel Jun 5, 2024
8426fa1
leave no trace
jtkrogel Jun 6, 2024
22ea904
one more
jtkrogel Jun 6, 2024
ba7450d
Merge remote-tracking branch 'origin/develop' into pr/origin-5019
ye-luo Jun 11, 2024
157dda3
Correct CMake and include.
ye-luo Jun 11, 2024
c792aff
rename step counter
jtkrogel Jun 11, 2024
4a7d770
longer name for crowd.collect()
jtkrogel Jun 11, 2024
35871ac
hide data members behind more member functions
jtkrogel Jun 11, 2024
740cf6b
simplify WalkerLogState
jtkrogel Jun 11, 2024
be01ba4
Relocate wlog_collector_.startBlock
ye-luo Jun 11, 2024
a832b7c
Make wlog_collector_ private in Crowd.
ye-luo Jun 11, 2024
9408943
Merge branch 'develop' into walker_traces
ye-luo Jun 13, 2024
247 changes: 247 additions & 0 deletions docs/methods.rst
Original file line number Diff line number Diff line change
Expand Up @@ -1904,4 +1904,251 @@ declaration to ensure correct sampling:
a new all-electron configuration, at which point the action is
computed and the move is either accepted or rejected.



.. _walker_logging:

Walker Data Logging
===================

Detailed per-walker information can be written to HDF5 files for VMC or DMC by
including the <walkerlogs/> XML element. The logged data include the LocalEnergy and
its components for each walker at each MC step. By default, more detailed
particle-level information (e.g., electron coordinates) is also written for the
lowest, highest, and median energy walkers at each MC step, which keeps disk usage modest.
Optionally, particle-level information can be written for all walkers,
which can require a very large amount of disk space.
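
As a rough illustration of the min/max/median selection described above, the following sketch picks the three walkers from a list of local energies. The names ``QuantileWalkers`` and ``selectQuantileWalkers`` are hypothetical and are not taken from QMCPACK, and the choice of the upper-middle element as the median for an even walker count is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper type for this sketch; not part of QMCPACK.
struct QuantileWalkers
{
  std::size_t min_index;
  std::size_t median_index;
  std::size_t max_index;
};

// Return the positions of the lowest, median, and highest energy walkers.
// Assumption: for an even walker count the "median" is the upper-middle
// element of the sorted order; the production code may choose differently.
QuantileWalkers selectQuantileWalkers(const std::vector<double>& local_energies)
{
  assert(!local_energies.empty());
  // Sort walker indices by energy instead of sorting the energies themselves,
  // so the result still refers to positions in the original list.
  std::vector<std::size_t> order(local_energies.size());
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::sort(order.begin(), order.end(),
            [&](std::size_t a, std::size_t b) { return local_energies[a] < local_energies[b]; });
  return {order.front(), order[order.size() / 2], order.back()};
}
```

Sorting an index list costs O(N log N) per step; a partial selection such as ``std::nth_element`` would be cheaper if only these three walkers are needed.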

**Input specification**

The default walker data logging functionality is enabled by including the
<walkerlogs/> XML element (once) just before the QMC driver sections,
for example:

::

<walkerlogs/>
<qmc method="vmc" move="pbyp">
<parameter name="walkers_per_rank"> 256 </parameter>
<parameter name="warmupSteps"> 100 </parameter>
<parameter name="blocks"> 200 </parameter>
<parameter name="steps"> 10 </parameter>
<parameter name="substeps"> 3 </parameter>
<parameter name="timestep"> 0.3 </parameter>
<parameter name="usedrift"> yes </parameter>
</qmc>
<qmc method="dmc" move="pbyp" target="e">
<parameter name="walkers_per_rank"> 256 </parameter>
<parameter name="warmupsteps"> 40 </parameter>
<parameter name="blocks"> 800 </parameter>
<parameter name="steps"> 20 </parameter>
<parameter name="timestep"> 0.01 </parameter>
</qmc>



Optional XML attributes enable finer control over the behavior:

.. table::

+------------------+--------------+--------------+-------------+----------------------------------------------------+
| **Name** | **Datatype** | **Values** | **Default** | **Description** |
+==================+==============+==============+=============+====================================================+
| ``step_period`` | integer | :math:`> 0` | 1 | Collect walker data every step_period MC steps |
+------------------+--------------+--------------+-------------+----------------------------------------------------+
| ``particle`` | text | yes,no | no | Write particle data for all walkers |
+------------------+--------------+--------------+-------------+----------------------------------------------------+
| ``quantiles`` | text | yes,no | yes | Write full data for min/max/median energy walkers |
+------------------+--------------+--------------+-------------+----------------------------------------------------+
| ``min`` | text | yes,no | yes | Enable/disable write for min energy walker data |
+------------------+--------------+--------------+-------------+----------------------------------------------------+
| ``max`` | text | yes,no | yes | Enable/disable write for max energy walker data |
+------------------+--------------+--------------+-------------+----------------------------------------------------+
| ``median`` | text | yes,no | yes | Enable/disable write for median energy walker data |
+------------------+--------------+--------------+-------------+----------------------------------------------------+
| ``verbose`` | text | yes,no | no | Write more log file information |
+------------------+--------------+--------------+-------------+----------------------------------------------------+


Additional information:

- ``step_period``: By default, data for each walker is collected every MC
step, corresponding to step_period=1. A sub-sampling of the walker
data may be obtained instead by setting step_period>1. For example,
with step_period=5, walker data is collected every 5th MC step.

- ``particle``: This controls whether per-particle data is written to
the walker log HDF5 files along with scalar walker properties. These data
comprise: electron coordinates, spin coordinates (spinor runs only),
per-particle wavefunction gradients, and per-particle wavefunction
Laplacian values.

- ``quantiles``: Write out full (scalar and per-particle) data for walkers
at specific quantiles of the local energy distribution. Currently,
these quantiles are the minimum, maximum, and median.

- ``min``: Selectively disable writing data for the minimum energy
walkers. Active only if quantiles=yes.

- ``max``: Selectively disable writing data for the maximum energy
walkers. Active only if quantiles=yes.

- ``median``: Selectively disable writing data for the median energy
walkers. Active only if quantiles=yes.

- ``verbose``: If "yes", write function-call information related to
the walker logging functionality. This option is mainly intended
for developers, as it is of little use in practical runs.
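
The interaction between these attributes can be sketched as follows. This is a minimal illustration, not the QMCPACK implementation: ``WalkerLogOptions`` and the helper functions are hypothetical names, the defaults are copied from the table above, and the convention that steps are counted from zero and logged when divisible by ``step_period`` is an assumption.

```cpp
#include <cassert>

// Hypothetical mirror of the <walkerlogs/> attributes, using the defaults
// from the table above. The names are illustrative, not QMCPACK's.
struct WalkerLogOptions
{
  int step_period   = 1;
  bool particle     = false;
  bool quantiles    = true;
  bool write_min    = true;
  bool write_max    = true;
  bool write_median = true;
  bool verbose      = false;
};

// Assumption: steps are counted from 0 and a step is logged when it is a
// multiple of step_period, so step_period=5 logs steps 0, 5, 10, ...
bool collectsAtStep(const WalkerLogOptions& opts, int step)
{
  assert(opts.step_period > 0);
  return step % opts.step_period == 0;
}

// The min/max/median switches only take effect when quantiles is enabled.
bool writesMin(const WalkerLogOptions& opts) { return opts.quantiles && opts.write_min; }
bool writesMax(const WalkerLogOptions& opts) { return opts.quantiles && opts.write_max; }
bool writesMedian(const WalkerLogOptions& opts) { return opts.quantiles && opts.write_median; }
```

With the defaults every step is collected and all three quantile walkers are written; setting ``quantiles`` to no switches all three off regardless of the individual flags.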


**Output files**

The HDF5 files created by the walker logging functionality have the extension \*.wlogs.h5.
For each VMC or DMC section, one of these files is written for every MPI rank in the run.

For the example XML inputs shown above, QMCPACK run on 6 MPI ranks would produce (at least)
the following output data files:

::

qmc.s000.scalar.dat
qmc.s000.stat.h5
qmc.s000.p000.wlogs.h5
qmc.s000.p001.wlogs.h5
qmc.s000.p002.wlogs.h5
qmc.s000.p003.wlogs.h5
qmc.s000.p004.wlogs.h5
qmc.s000.p005.wlogs.h5

qmc.s001.scalar.dat
qmc.s001.dmc.dat
qmc.s001.stat.h5
qmc.s001.p000.wlogs.h5
qmc.s001.p001.wlogs.h5
qmc.s001.p002.wlogs.h5
qmc.s001.p003.wlogs.h5
qmc.s001.p004.wlogs.h5
qmc.s001.p005.wlogs.h5

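The naming pattern in the listing above can be reproduced when post-processing scripts need to locate the file for a given series and MPI rank. The sketch below assumes that the series and rank indices are zero-padded to three digits, as the listing suggests; ``walkerLogFileName`` is a hypothetical helper, not a QMCPACK function.

```cpp
#include <cassert>
#include <iomanip>
#include <sstream>
#include <string>

// Build a name such as "qmc.s000.p003.wlogs.h5".
// Assumption: series and rank are zero-padded to three digits, matching the
// file listing in the documentation.
std::string walkerLogFileName(const std::string& project, int series, int rank)
{
  assert(series >= 0 && rank >= 0);
  std::ostringstream name;
  // std::setw applies only to the next inserted value, so it is repeated.
  name << project << ".s" << std::setw(3) << std::setfill('0') << series
       << ".p" << std::setw(3) << std::setfill('0') << rank << ".wlogs.h5";
  return name.str();
}
```

For example, ``walkerLogFileName("qmc", 1, 5)`` yields ``qmc.s001.p005.wlogs.h5``, the last DMC file in the listing.
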

A single wlogs.h5 file has several walker data buffers (names with underscores below):

::

# scalar (int/real) data for all walkers
walker_property_int walker_property_real

# scalar and per-particle data for min energy walkers
wmin_property_int wmin_property_real wmin_particle_real

# scalar and per-particle data for max energy walkers
wmax_property_int wmax_property_real wmax_particle_real

# scalar and per-particle data for median energy walkers
wmed_property_int wmed_property_real wmed_particle_real


Each data buffer contains packed walker data in the form of a large 2D array ("data" below):

::

>h5ls qmc.s000.p000.wlogs.h5/walker_property_int
data Dataset {512000/Inf, 4}
data_layout Group

>h5ls qmc.s000.p000.wlogs.h5/walker_property_real
data Dataset {512000/Inf, 15}
data_layout Group


Each row in the 2D data array/buffer contains data for a single walker at a single MC step.
In this case, 256 walkers were advanced through 200\*10=2000 steps for 512000 row entries total.
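
The row count can be checked with a short calculation. The sketch assumes a constant walker count and ``step_period`` of 1; it holds for the VMC example, whereas a DMC population fluctuates, so the product is only an estimate there. ``expectedLogRows`` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstdint>

// Number of rows expected in walker_property_*/data on one MPI rank.
// Assumption: every walker contributes one row per MC step and the walker
// count stays constant throughout the run.
std::int64_t expectedLogRows(std::int64_t walkers, std::int64_t blocks, std::int64_t steps_per_block)
{
  assert(walkers >= 0 && blocks >= 0 && steps_per_block >= 0);
  return walkers * blocks * steps_per_block;
}
```

For the VMC section above, ``expectedLogRows(256, 200, 10)`` gives 512000, matching the first dimension reported by h5ls.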

The location of each particular walker quantity in each row is listed in "data_layout":

::

>h5ls qmc.s000.p000.wlogs.h5/walker_property_int/data_layout
id Group # unique walker id
parent_id Group # id of parent (DMC branching)
step Group # MC step number
age Group # walker "age"

>h5ls qmc.s000.p000.wlogs.h5/walker_property_real/data_layout
weight Group # statistical weight of the walker
LocalEnergy Group # the local (total) energy
Kinetic Group # kinetic energy
LocalPotential Group # full potential energy (all terms)
ElecElec Group # electron-electron energy
LocalECP Group # energy for local channel of ECP
NonLocalECP Group # energy for non-local channels of ECP
logpsi Group # log of wavefunction modulus
phase Group # wavefunction phase
dlogpsi2 Group # squared gradient of wavefunction log-modulus
dphase2 Group # squared gradient of wavefunction phase
dr_node_min Group # estimate of min distance to wfn node along any dimension
multiplicity Group # branching multiplicity (DMC only)
R2Accepted Group # average diffusion of accepted MC moves
R2Proposed Group # average diffusion of proposed MC moves

From this we can see, e.g., that the value for the MC "step" is stored at column
index 0 in walker_property_int/data and the LocalEnergy is stored at column index 6
in walker_property_real/data:

::

>h5ls -d qmc.s000.p000.wlogs.h5/walker_property_int/data_layout/step/index_start
index_start Dataset {SCALAR}
Data:
(0) 0

>h5ls -d qmc.s000.p000.wlogs.h5/walker_property_real/data_layout/LocalEnergy/index_start
index_start Dataset {SCALAR}
Data:
(0) 6

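Once a buffer has been read into memory, a scalar quantity can be pulled out with its ``index_start``. The sketch below assumes the 2D array is already held as a flat row-major vector (reading the HDF5 file itself is not shown), and ``extractColumn`` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Copy one scalar column out of a row-major 2D buffer.
// `data` holds n_rows * n_cols values; `index_start` is the column index read
// from data_layout/<quantity>/index_start.
std::vector<double> extractColumn(const std::vector<double>& data, std::size_t n_cols, std::size_t index_start)
{
  assert(n_cols > 0 && index_start < n_cols && data.size() % n_cols == 0);
  const std::size_t n_rows = data.size() / n_cols;
  std::vector<double> column(n_rows);
  for (std::size_t row = 0; row < n_rows; ++row)
    column[row] = data[row * n_cols + index_start];
  return column;
}
```

With the layout shown above, passing ``n_cols = 15`` and ``index_start = 6`` for walker_property_real/data would return the LocalEnergy of every logged walker row.
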

The per-particle data is arranged similarly:

::

>h5ls qmc.s000.p000.wlogs.h5/wmin_particle_real/data_layout
R Group # electron coordinates
G Group # wavefunction gradient
L Group # wavefunction laplacian (per-particle)


However, more information is required in the data_layout to fully specify the location and
shape of the particle-level array data (simplified view for a run with 8 electrons and a
real-valued wavefunction):

::

>h5ls -d qmc.s000.p000.wlogs.h5/wmin_particle_real/data_layout/R
index_start 0 # data starts at column index 0
index_end 24 # data ends at column index 24
dimension 2 # array is 2-dimensional
size 24 # array has 24 elements total
shape 8, 3, 0, 0 # array has shape 8x3
unit_size 1 # each unit of data stored as 1 real value

>h5ls -d qmc.s000.p000.wlogs.h5/wmin_particle_real/data_layout/G
index_start 24 # data starts at column index 24
index_end 48 # data ends at column index 48
dimension 2 # array is 2-dimensional
size 24 # array has 24 elements total
shape 8, 3, 0, 0 # array has shape 8x3
unit_size 1 # data stored as single real values (2 if complex)

>h5ls -d qmc.s000.p000.wlogs.h5/wmin_particle_real/data_layout/L
index_start 48 # data starts at column index 48
index_end 56 # data ends at column index 56
dimension 1 # array is 1-dimensional
size 8 # array has 8 elements total
shape 8, 0, 0, 0 # array has linear shape, length 8
unit_size 1 # data stored as single real values (2 if complex)

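The layout entries above are enough to unpack a per-particle array from one row of the buffer. The following sketch assumes that ``index_end`` is exclusive, that values are stored in row-major order, and that ``unit_size`` is 1 (real-valued wavefunction); ``LayoutEntry`` and ``unpackParticleArray`` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical in-memory copy of one data_layout entry; field names follow
// the h5ls listing in the documentation.
struct LayoutEntry
{
  std::size_t index_start;
  std::size_t index_end;
  std::size_t dimension;
  std::array<std::size_t, 4> shape;
  std::size_t unit_size;
};

// Unpack a 2D per-particle quantity (e.g. R or G) from one row of
// wmin_particle_real/data.
// Assumptions: index_end is exclusive, storage is row-major, unit_size == 1.
std::vector<std::vector<double>> unpackParticleArray(const std::vector<double>& row, const LayoutEntry& entry)
{
  assert(entry.dimension == 2 && entry.unit_size == 1);
  const std::size_t n_particles = entry.shape[0];
  const std::size_t n_dims      = entry.shape[1];
  assert(entry.index_end - entry.index_start == n_particles * n_dims);
  assert(entry.index_end <= row.size());
  std::vector<std::vector<double>> result(n_particles, std::vector<double>(n_dims));
  for (std::size_t p = 0; p < n_particles; ++p)
    for (std::size_t d = 0; d < n_dims; ++d)
      result[p][d] = row[entry.index_start + p * n_dims + d];
  return result;
}
```

For the 8-electron example, an entry ``{24, 48, 2, {8, 3, 0, 0}, 1}`` unpacks G into 8 rows of 3 components. Complex-valued data (``unit_size`` of 2) would need each element read as a pair, which this sketch does not handle.
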



.. bibliography:: /bibs/methods.bib
2 changes: 1 addition & 1 deletion src/Estimators/tests/test_trace_manager.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@
//
// Copyright (c) 2018 Jeongnim Kim and QMCPACK developers.
//
// File developed by: Mark Dewing, mdewin@anl.gov, Argonne National Laboratory
// File developed by: Mark Dewing, mdewing@anl.gov, Argonne National Laboratory
//
// File created by: Mark Dewing, [email protected], Argonne National Laboratory
//////////////////////////////////////////////////////////////////////////////////////
Expand Down
8 changes: 7 additions & 1 deletion src/QMCApp/QMCMain.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -62,7 +62,8 @@ QMCMain::QMCMain(Communicate* c)
psi_pool_(std::make_unique<WaveFunctionPool>(my_project_.getRuntimeOptions(), *particle_set_pool_, myComm)),
ham_pool_(std::make_unique<HamiltonianPool>(*particle_set_pool_, *psi_pool_, myComm)),
qmc_system_(nullptr),
first_qmc_(true)
first_qmc_(true),
walker_logs_xml_(NULL)
#if !defined(REMOVE_TRACEMANAGER)
,
traces_xml_(NULL)
Expand Down Expand Up @@ -479,6 +480,10 @@ bool QMCMain::validateXML()
traces_xml_ = cur;
}
#endif
else if (cname == "walkerlogs")
Contributor

Parse it into an input class as soon as it is found. See the estimator manager input handling.

{
walker_logs_xml_ = cur;
}
else
{
//everything else goes to m_qmcaction
Expand Down Expand Up @@ -624,6 +629,7 @@ bool QMCMain::runQMC(xmlNodePtr cur, bool reuse)
#if !defined(REMOVE_TRACEMANAGER)
qmc_driver->putTraces(traces_xml_);
#endif
qmc_driver->putWalkerLogs(walker_logs_xml_);
{
ScopedTimer qmc_run_timer(createGlobalTimer(qmc_driver->getEngineName(), timer_level_coarse));
Timer process_and_run;
Expand Down
3 changes: 3 additions & 0 deletions src/QMCApp/QMCMain.h
Original file line number Diff line number Diff line change
Expand Up @@ -75,6 +75,9 @@ class QMCMain : public MPIObjectBase, public QMCAppBase
///xml mcwalkerset read-in elements
std::vector<xmlNodePtr> walker_set_in_;

///walkerlogs xml
xmlNodePtr walker_logs_xml_;
Contributor

please do something like

std::optional<WalkerLogInput> walker_log_input_;

Contributor

I also don't like storing XML; parsing the XML in place and storing the result seems cleaner. We can make some improvements in a later PR.
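
A minimal sketch of what the suggestion above might look like, for discussion only. ``WalkerLogInputSketch`` and ``MainSketch`` are hypothetical stand-ins, and attributes are passed as a string map so the example stays self-contained; the real code would read them from the xmlNodePtr.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

// Hypothetical stand-in for the suggested WalkerLogInput class.
struct WalkerLogInputSketch
{
  int step_period = 1;
  bool particle   = false;
  bool quantiles  = true;

  // Assumption: attributes arrive as name/value strings; the real code would
  // extract them from the <walkerlogs/> XML node.
  explicit WalkerLogInputSketch(const std::map<std::string, std::string>& attributes)
  {
    for (const auto& [name, value] : attributes)
    {
      if (name == "step_period")
        step_period = std::stoi(value);
      else if (name == "particle")
        particle = (value == "yes");
      else if (name == "quantiles")
        quantiles = (value == "yes");
    }
    assert(step_period > 0);
  }
};

// Hypothetical owner: an empty optional means no <walkerlogs/> element was found.
struct MainSketch
{
  std::optional<WalkerLogInputSketch> walker_log_input_;

  void onWalkerLogsElement(const std::map<std::string, std::string>& attributes)
  {
    walker_log_input_.emplace(attributes);
  }
};
```

Storing the parsed result rather than the raw node means validation happens once, at the point the element is found, and downstream code can test ``walker_log_input_.has_value()`` instead of comparing a pointer to NULL.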


///traces xml
xmlNodePtr traces_xml_;

Expand Down
2 changes: 2 additions & 0 deletions src/QMCDrivers/CloneManager.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -27,6 +27,7 @@
#else
using TraceManager = int;
#endif
#include "WalkerLogManager.h"

//comment this out to use only method to clone
#define ENABLE_CLONE_PSI_AND_H
Expand Down Expand Up @@ -86,6 +87,7 @@ CloneManager::~CloneManager()
#if !defined(REMOVE_TRACEMANAGER)
delete_iter(traceClones.begin(), traceClones.end());
#endif
delete_iter(wlog_collectors.begin(), wlog_collectors.end());
}

void CloneManager::makeClones(MCWalkerConfiguration& w, TrialWaveFunction& psi, QMCHamiltonian& ham)
Expand Down
2 changes: 2 additions & 0 deletions src/QMCDrivers/CloneManager.h
Original file line number Diff line number Diff line change
Expand Up @@ -75,6 +75,8 @@ class CloneManager : public QMCTraits
std::vector<EstimatorManagerBase*> estimatorClones;
///trace managers
std::vector<TraceManager*> traceClones;
///trace collectors
std::vector<WalkerLogCollector*> wlog_collectors;

//for correlated sampling.
static std::vector<UPtrVector<MCWalkerConfiguration>> WPoolClones_uptr;
Expand Down
9 changes: 9 additions & 0 deletions src/QMCDrivers/Crowd.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,7 @@
#include "Crowd.h"
#include "QMCHamiltonians/QMCHamiltonian.h"


namespace qmcplusplus
{
Crowd::Crowd(EstimatorManagerNew& emb,
Expand Down Expand Up @@ -81,8 +82,16 @@ void Crowd::startBlock(int num_steps)
// VMCBatched does no nonlocal moves
n_nonlocal_accept_ = 0;
estimator_manager_crowd_.startBlock(num_steps);
wlog_collector_.startBlock();
}

void Crowd::stopBlock() { estimator_manager_crowd_.stopBlock(); }

void Crowd::collectStepWalkerLog(int current_step)
Contributor

logWalkers(int current_step)
Seems to say everything necessary.

{
for (int iw = 0; iw < size(); ++iw)
wlog_collector_.collect(mcp_walkers_[iw], walker_elecs_[iw], walker_twfs_[iw], walker_hamiltonians_[iw], current_step);
}


} // namespace qmcplusplus
8 changes: 8 additions & 0 deletions src/QMCDrivers/Crowd.h
Original file line number Diff line number Diff line change
Expand Up @@ -16,12 +16,14 @@
#include "MultiWalkerDispatchers.h"
#include "DriverWalkerTypes.h"
#include "Estimators/EstimatorManagerCrowd.h"
#include "WalkerLogManager.h"

namespace qmcplusplus
{
// forward declaration
class ResourceCollection;
class EstimatorManagerNew;
class WalkerLogCollector;

/** Driver synchronized step context
*
Expand Down Expand Up @@ -83,6 +85,9 @@ class Crowd
estimator_manager_crowd_.accumulate(mcp_walkers_, walker_elecs_, walker_twfs_, walker_hamiltonians_, rng);
}

/// Collect walker log data
void collectStepWalkerLog(int current_step);

void setRNGForHamiltonian(RandomBase<FullPrecRealType>& rng);

auto beginWalkers() { return mcp_walkers_.begin(); }
Expand All @@ -98,6 +103,7 @@ class Crowd
const RefVector<QMCHamiltonian>& get_walker_hamiltonians() const { return walker_hamiltonians_; }

const EstimatorManagerCrowd& get_estimator_manager_crowd() const { return estimator_manager_crowd_; }
WalkerLogCollector& getWalkerLogCollector() { return wlog_collector_; }

DriverWalkerResourceCollection& getSharedResource() { return driverwalker_resource_collection_; }

Expand Down Expand Up @@ -129,6 +135,8 @@ class Crowd
DriverWalkerResourceCollection driverwalker_resource_collection_;
/// per crowd estimator manager
EstimatorManagerCrowd estimator_manager_crowd_;
// collector for walker logs
WalkerLogCollector wlog_collector_;

/** @name Step State
*
Expand Down