
Preload small datasets to memory in Custom dataset #235

Merged
RaulPPelaez merged 31 commits into torchmd:main from custom_dataset on Jan 23, 2024

Conversation

RaulPPelaez
Collaborator

While replicating results from the torchmd-protein-thermodynamics repository, I experienced sluggish training speeds and low GPU utilization (the GPU sits at 0% and briefly spikes to 100% at each iteration) with the following configuration file:

activation: tanh
batch_size: 1024
inference_batch_size: 1024
dataset: Custom
coord_files: "chignolin_ca_coords.npy"
embed_files: "chignolin_ca_embeddings.npy"
force_files: "chignolin_ca_deltaforces.npy"
cutoff_upper: 12.0
cutoff_lower: 3.0
derivative: true
early_stopping_patience: 30
embedding_dimension: 128
lr: 0.0005
lr_factor: 0.8
lr_min: 1.0e-06
lr_patience: 10
lr_warmup_steps: 0
model: graph-network
neighbor_embedding: false
ngpus: -1
num_epochs: 100
num_layers: 4
num_nodes: 1
num_rbf: 18
num_workers: 4
rbf_type: expnorm
save_interval: 2
seed: 1
test_interval: 2
test_size: 0.1
trainable_rbf: true
val_size: 0.05
weight_decay: 0.0

The referenced files take approximately 300 MB.
Playing around with num_workers and batch_size did not help.
Upon investigation, the issue turned out to be the Custom dataset's I/O-bound get() method, which reads from disk every time it is invoked, causing the low GPU utilization and slow training.

To resolve this, I implemented a preloading feature that loads the complete dataset into system memory if its size is below a user-configurable threshold, set by default at 1 GB. The data is stored as PyTorch tensors, which keeps it compatible with multi-worker data loaders (num_workers): the preloaded tensors are shared with the worker processes rather than copied, so increasing the number of workers does not inflate RAM usage.
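
For illustration, a minimal sketch of this preload-or-mmap decision, assuming hypothetical file arguments and field names (pos, z, forces) rather than the actual torchmd-net implementation:

```python
import os
import numpy as np
import torch

# Hypothetical illustration of the preloading idea (not the actual torchmd-net code):
# if the .npy files are small enough, load everything into RAM as torch tensors;
# otherwise fall back to memory-mapped arrays.
PRELOAD_LIMIT_BYTES = 1024**3  # 1 GB default threshold, user-configurable

def maybe_preload(coord_file, embed_file, force_file, limit=PRELOAD_LIMIT_BYTES):
    total_size = sum(os.path.getsize(f) for f in (coord_file, embed_file, force_file))
    if total_size <= limit:
        # Small dataset: load once and convert to tensors so DataLoader workers
        # share the memory instead of re-reading from disk on every sample.
        return {
            "pos": torch.from_numpy(np.load(coord_file)),
            "z": torch.from_numpy(np.load(embed_file)),
            "forces": torch.from_numpy(np.load(force_file)),
        }
    # Large dataset: keep memory-mapped references instead of reloading the files.
    return {
        "pos": np.load(coord_file, mmap_mode="r"),
        "z": np.load(embed_file, mmap_mode="r"),
        "forces": np.load(force_file, mmap_mode="r"),
    }
```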

This optimization led to a 20x speedup in training time for this specific setup.

I also tweaked the DataLoader options a bit.
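
The PR does not spell out those options here; as a hedged example, these are the kinds of DataLoader settings one typically tunes in this situation (the values below are assumptions, not the actual changes):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Illustrative DataLoader settings one might tune in this context; the exact
# options changed in the PR are not listed here, so these are assumptions.
dataset = TensorDataset(torch.randn(4096, 3), torch.randn(4096, 3))  # stand-in dataset
loader = DataLoader(
    dataset,
    batch_size=1024,
    shuffle=True,
    num_workers=4,            # matches num_workers in the config above
    pin_memory=True,          # faster host-to-GPU copies
    persistent_workers=True,  # keep worker processes alive between epochs
)
```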

@RaulPPelaez
Collaborator Author

@AntonioMirarchi please review!

@peastman
Collaborator

The Custom dataset class is incredibly inefficient. It reloads the whole file every time get() is called to retrieve a single sample. Some intelligent caching would help. But a much better choice is to use the HDF5 dataset class. It is far more efficient.
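
As a rough illustration of the caching idea (not code from the PR), one could memoize the file loads so np.load() runs once per file instead of once per sample:

```python
from functools import lru_cache
import numpy as np

# One simple form of "intelligent caching": cache the opened arrays so the file
# is not re-read on every get() call. Illustrative sketch only.
@lru_cache(maxsize=None)
def load_array(path):
    return np.load(path, mmap_mode="r")
```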

@peastman
Collaborator

In fact, possibly we should just make Custom create a temporary HDF5 file on startup and then load from it.
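
A minimal sketch of that suggestion, assuming h5py and illustrative dataset names (the real HDF5 loader may expect a different layout):

```python
import tempfile
import h5py
import numpy as np

# Hedged sketch of "convert Custom's .npy files to a temporary HDF5 file on startup".
# Dataset names here are illustrative, not the layout torchmd-net's HDF5 class expects.
def npy_to_temp_hdf5(coord_file, embed_file, force_file):
    tmp = tempfile.NamedTemporaryFile(suffix=".h5", delete=False)
    with h5py.File(tmp.name, "w") as f:
        f.create_dataset("pos", data=np.load(coord_file))
        f.create_dataset("types", data=np.load(embed_file))
        f.create_dataset("forces", data=np.load(force_file))
    return tmp.name  # train from this file instead of the raw .npy files
```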

@RaulPPelaez
Collaborator Author

I really like this idea, Peter, thanks!
I do not like Custom at all either, but it offers a convenience/simplicity that makes people choose it, so I think it is worth improving.
I implemented it so that the user can instruct Custom to use HDF5 under the hood; let's see how it goes.

@giadefa
Contributor

giadefa commented Oct 16, 2023 via email

@RaulPPelaez
Collaborator Author

This PR exists because Custom was calling np.load() at each call of get(). Even with mmap mode this was really slowing down training (it is I/O bound in the end...).
I changed it so that:
1- If the dataset is small enough, just load it all into RAM.
2- Otherwise, store references to the mmap'ed arrays instead of reloading the file each time.
Alternatively, I added an option to transform the Custom files to HDF5, which seems to be just a little bit slower than mmap.
I went ahead and also implemented the same load-to-RAM functionality in HDF5.
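
For reference, a simplified sketch of point 2, with hypothetical class and attribute names rather than the actual Custom dataset code: the memory-mapped arrays are opened once in the constructor, and get() only indexes into them.

```python
import numpy as np
import torch

# Illustrative sketch of the fix described above: open the mmap'ed arrays once
# and only slice them per sample, instead of calling np.load() in every get().
class CustomLike:
    def __init__(self, coord_file, embed_file, force_file):
        self.pos = np.load(coord_file, mmap_mode="r")     # references kept, opened once
        self.z = np.load(embed_file, mmap_mode="r")
        self.forces = np.load(force_file, mmap_mode="r")

    def get(self, idx):
        # Copy out only the requested sample from the memory-mapped arrays.
        return {
            "pos": torch.tensor(np.array(self.pos[idx])),
            "z": torch.tensor(np.array(self.z[idx])),
            "forces": torch.tensor(np.array(self.forces[idx])),
        }
```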

(Review thread on torchmdnet/data.py: outdated, resolved)
@RaulPPelaez
Collaborator Author

This is ready. @stefdoerr could you please review?

(Review threads on tests/test_datasets.py and torchmdnet/data.py: outdated, resolved)
@stefdoerr
Collaborator

I think @raimis should review this since he worked on mmaps before. I am not too qualified, so I just commented on style and minor bugs.

@RaulPPelaez
Collaborator Author

I would like to merge this one before next release, @raimis could you take a look? Thanks

@guillemsimeon
Collaborator

Is this ready to be shipped?

@RaulPPelaez
Collaborator Author

Yes!

@raimis raimis removed their request for review January 16, 2024 14:39
Collaborator

@guillemsimeon guillemsimeon left a comment


perf

@guillemsimeon
Collaborator

we should merge these

@RaulPPelaez RaulPPelaez merged commit 140ca66 into torchmd:main Jan 23, 2024
2 checks passed
@RaulPPelaez RaulPPelaez deleted the custom_dataset branch January 31, 2024 09:19