Preload small datasets to memory in Custom dataset #235
Conversation
@AntonioMirarchi please review!
The Custom dataset class is incredibly inefficient. It reloads the whole file every time.
In fact, possibly we should just make Custom create a temporary HDF5 file on startup and then load from it.
I really like this idea, Peter, thanks!
Precompute as much as possible by storing mmap'ed files.
h5py is slower than mmap'ed arrays. Maybe I am missing the narrative, but why are we doing something different from what we are already doing in the other dataloaders?
g
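For reference, a minimal sketch of the mmap-style access being referred to; the file name is illustrative, and the repository's other loaders may differ in detail:

```python
import numpy as np

# Illustrative only: np.load with mmap_mode keeps the array on disk and pages
# data in lazily, so opening the file does not read it all into RAM.
pos = np.load("pos.npy", mmap_mode="r")  # hypothetical file name
frame = np.array(pos[0])                 # copies a single frame into memory
```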
On Mon, Oct 16, 2023, Antonio Mirarchi commented on this pull request, in torchmdnet/datasets/custom.py (#235 (comment)):
> + with h5py.File(hdf5_dataset, "w") as f:
+ for i in range(len(files["pos"])):
+ # Create a group for each file
+ coord_data = np.load(files["pos"][i])
+ embed_data = np.load(files["z"][i]).astype(int)
+ group = f.create_group(str(i))
+ num_samples = coord_data.shape[0]
+ group["pos"] = coord_data
+ group["types"] = np.tile(embed_data, (num_samples, 1))
+ if "y" in files:
+ energy_data = np.load(files["y"][i])
+ group["energy"] = energy_data
+ if "neg_dy" in files:
+ force_data = np.load(files["neg_dy"][i])
+ group["forces"] = force_data
It's just the "proper way"; in theory you should be able to navigate across paths in the HDF5 file (not the case here). For example, if you use "create_dataset" you can retrieve the pos dataset using:
f = h5py.File(hdf5_dataset, "r")
f["0/pos"]
Whereas if you use the "dictionary way" you get this error:
KeyError: "Unable to open object (object 'pos' doesn't exist)"
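For illustration, a minimal self-contained round trip of the path-style access described above; the file name, shapes, and dtype are assumptions, not taken from the PR:

```python
import h5py
import numpy as np

# Hypothetical example: write one group with create_dataset, then read the
# "pos" dataset back through its full path.
with h5py.File("example.h5", "w") as f:
    group = f.create_group("0")
    group.create_dataset("pos", data=np.zeros((10, 5, 3), dtype=np.float32))

with h5py.File("example.h5", "r") as f:
    pos = f["0/pos"][:]  # path-style lookup across groups
    print(pos.shape)     # (10, 5, 3)
```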
This PR exists because Custom was calling np.load() at each call of get(). Even with mmap mode this was really slowing down training (it is I/O bound in the end...).
This is ready. @stefdoerr could you please review?
I think @raimis should review this since he worked on mmaps before. I am not too qualified, so I just commented on style and minor bugs.
I would like to merge this one before the next release. @raimis, could you take a look? Thanks
Is this ready to be shipped?
Yes!
perf
we should merge these
While replicating results from the torchmd-protein-thermodynamics repository, I experienced sluggish training speeds and low GPU usage (i.e., sitting at 0% and briefly spiking to 100% at each iteration) using the following configuration file:
The referenced files take approximately 300 MB.
Playing around with num_workers and batch size did not help.
Upon investigation, the issue arose from the Custom dataset's I/O-bound get method, which reads from disk every time it is invoked, causing the low GPU usage and slow training.
To resolve this, I implemented a preloading feature that loads the complete dataset into system memory if its size is below a user-configurable threshold, set by default at 1 GB. The data is stored as PyTorch tensors, which keeps it compatible with multi-worker data loaders (num_workers). Notably, this approach does not inflate RAM usage when increasing the number of workers.
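A rough sketch of the preloading idea described above; the threshold constant, function name, and file handling are illustrative rather than the PR's actual code:

```python
import numpy as np
import torch

MAX_PRELOAD_BYTES = 1 * 1024**3  # assumed 1 GB default threshold


def maybe_preload(npy_files):
    """Return the files as in-memory torch tensors if they fit under the threshold."""
    # mmap_mode="r" lets us query sizes without reading the data into RAM.
    total_bytes = sum(np.load(f, mmap_mode="r").nbytes for f in npy_files)
    if total_bytes > MAX_PRELOAD_BYTES:
        return None  # fall back to on-demand (mmap) loading in get()
    # With fork-based DataLoader workers the preloaded tensors are shared
    # copy-on-write, so RAM usage does not grow with num_workers.
    return [torch.from_numpy(np.load(f)) for f in npy_files]
```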
This optimization led to a 20x speedup in training time for this specific setup.
I also tweaked the DataLoader options a bit.
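The exact options aren't listed in this thread; purely as an illustration, these are common DataLoader settings for this kind of I/O-bound setup, with the values and the `dataset` object as assumptions:

```python
from torch.utils.data import DataLoader

# Hypothetical settings; the values actually used in the PR may differ.
loader = DataLoader(
    dataset,                   # any map-style dataset
    batch_size=64,
    num_workers=4,
    pin_memory=True,           # faster host-to-GPU transfers
    persistent_workers=True,   # avoid re-spawning workers every epoch
)
```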