Preload small datasets to memory in Custom dataset #235
Conversation
@AntonioMirarchi please review!
The Custom dataset class is incredibly inefficient. It reloads the whole file every time.
In fact, possibly we should just make Custom create a temporary HDF5 file on startup and then load from it.
I really like this idea, Peter, thanks!
Precompute as much as possible by storing mmap'ed files.
h5py is slower than mmap'ed arrays. Maybe I am missing the narrative, but why are we doing something different from what we are already doing in the other dataloaders?
g
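For reference, a minimal sketch of the mmap-style access being referred to; the file name is illustrative, and the repository's other loaders may differ in detail:

```python
import numpy as np

# Illustrative only: np.load with mmap_mode keeps the array on disk and pages
# data in lazily, so opening the file does not read it all into RAM.
pos = np.load("pos.npy", mmap_mode="r")  # hypothetical file name
frame = np.array(pos[0])                 # copies a single frame into memory
```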
On Mon, Oct 16, 2023, Antonio Mirarchi commented on this pull request, in torchmdnet/datasets/custom.py (#235 (comment)):
> + with h5py.File(hdf5_dataset, "w") as f:
+ for i in range(len(files["pos"])):
+ # Create a group for each file
+ coord_data = np.load(files["pos"][i])
+ embed_data = np.load(files["z"][i]).astype(int)
+ group = f.create_group(str(i))
+ num_samples = coord_data.shape[0]
+ group["pos"] = coord_data
+ group["types"] = np.tile(embed_data, (num_samples, 1))
+ if "y" in files:
+ energy_data = np.load(files["y"][i])
+ group["energy"] = energy_data
+ if "neg_dy" in files:
+ force_data = np.load(files["neg_dy"][i])
+ group["forces"] = force_data
It's just the "proper way"; in theory you should be able to navigate across paths in the HDF5 file (not the case here). For example, if you use "create_dataset" you can retrieve the pos dataset using:
f = h5py.File(hdf5_dataset, "r")
f["0/pos"]
Whereas if you use the "dictionary way" you get this error:
KeyError: "Unable to open object (object 'pos' doesn't exist)"
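For illustration, a minimal self-contained round trip of the path-style access described above; the file name, shapes, and dtype are assumptions, not taken from the PR:

```python
import h5py
import numpy as np

# Hypothetical example: write one group with create_dataset, then read the
# "pos" dataset back through its full path.
with h5py.File("example.h5", "w") as f:
    group = f.create_group("0")
    group.create_dataset("pos", data=np.zeros((10, 5, 3), dtype=np.float32))

with h5py.File("example.h5", "r") as f:
    pos = f["0/pos"][:]  # path-style lookup across groups
    print(pos.shape)     # (10, 5, 3)
```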
This PR exists because Custom was calling np.load() at each call of get(). Even with mmap mode this was really slowing down training (it is I/O bound in the end...).
This is ready. @stefdoerr could you please review?
I think @raimis should review this since he worked on mmaps before. I am not too qualified, so I just commented on style and minor bugs.
I would like to merge this one before the next release. @raimis, could you take a look? Thanks
Is this ready to be shipped?
Yes!
perf
we should merge these
While replicating results from the torchmd-protein-thermodynamics repository, I experienced sluggish training speeds and low GPU usage (i.e., sitting at 0% and briefly spiking to 100% at each iteration) using the following configuration file:
The referenced files take approximately 300 MB.
Playing around with num_workers and batch size did not help.
Upon investigation, the issue arose from the Custom dataset's I/O-bound get method, which reads from disk every time it is invoked, causing the low GPU usage and slow training.
To resolve this, I implemented a preloading feature that loads the complete dataset into system memory if its size is below a user-configurable threshold, set by default at 1 GB. The data is stored as PyTorch tensors, which keeps it compatible with multi-worker data loaders (num_workers). Notably, this approach does not inflate RAM usage when increasing the number of workers.
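A rough sketch of the preloading idea described above; the threshold constant, function name, and file handling are illustrative rather than the PR's actual code:

```python
import numpy as np
import torch

MAX_PRELOAD_BYTES = 1 * 1024**3  # assumed 1 GB default threshold


def maybe_preload(npy_files):
    """Return the files as in-memory torch tensors if they fit under the threshold."""
    # mmap_mode="r" lets us query sizes without reading the data into RAM.
    total_bytes = sum(np.load(f, mmap_mode="r").nbytes for f in npy_files)
    if total_bytes > MAX_PRELOAD_BYTES:
        return None  # fall back to on-demand (mmap) loading in get()
    # With fork-based DataLoader workers the preloaded tensors are shared
    # copy-on-write, so RAM usage does not grow with num_workers.
    return [torch.from_numpy(np.load(f)) for f in npy_files]
```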
This optimization led to a 20x speedup in training time for this specific setup.
I also tweaked the DataLoader options a bit.
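The exact options aren't listed in this thread; purely as an illustration, these are common DataLoader settings for this kind of I/O-bound setup, with the values and the `dataset` object as assumptions:

```python
from torch.utils.data import DataLoader

# Hypothetical settings; the values actually used in the PR may differ.
loader = DataLoader(
    dataset,                   # any map-style dataset
    batch_size=64,
    num_workers=4,
    pin_memory=True,           # faster host-to-GPU transfers
    persistent_workers=True,   # avoid re-spawning workers every epoch
)
```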