Statement of the problem and review of possible existing solutions #1
Have you looked at robinhood? We use it for one of our storage offerings. https://github.com/cea-hpc/robinhood/wiki
the ones i know are proprietary, but more importantly almost all of them rely on kernel-level and daemon processes. you can do a few things locally as a user, but they are relatively slow. it also depends on what the exact use cases are. nina's husband don is the CTO of https://starfishstorage.com/ - one of those proprietary offerings. a lot of this will depend on scale, filesystem, etc.
so i think it would be good to describe what your immediate scope is (number of inodes, userspace constraints, etc.).
Thank you @vsoch and @satra for feedback.
If you need a chonker repository to test, I can offer https://github.com/vsoch/dockerfiles - it doesn't have large files, but it has ~100K files (version 1 had ~130K) and they are Dockerfiles, so it's a consistent thing to search (or generally run functions) over. I think it would be worth poking robinhood to ask for a pointer to a sqlite implementation - that seems to be a "scaled production" solution that might be modified to work with a small, local filesystem. But the tradeoff for such a solution is that it requires you to add more layers of dependencies - using such a helper probably would require more than a
py3 uses os.scandir underneath, which significantly speeds up traversal. that's not too bad for 143K folders containing files.
yes, one would need to build up a few more things, but it seems even just a stat is relatively quick:
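For illustration, a minimal sketch of this kind of traversal -- assuming nothing more than os.walk (which uses os.scandir under py3) plus an os.stat per file; this is not the exact snippet from the comment:

```python
# Rough sketch (not the original one-liner): walk a tree and stat every file.
# In Python 3, os.walk is backed by os.scandir, which keeps this reasonably fast.
import os
import sys
import time


def walk_and_stat(top):
    """Walk `top`, stat every file, and return (n_dirs, n_files)."""
    n_dirs = n_files = 0
    for root, dirs, files in os.walk(top):
        n_dirs += 1
        for name in files:
            os.stat(os.path.join(root, name))
            n_files += 1
    return n_dirs, n_files


if __name__ == "__main__":
    t0 = time.time()
    n_dirs, n_files = walk_and_stat(sys.argv[1] if len(sys.argv) > 1 else ".")
    print(f"{n_dirs} dirs, {n_files} files in {time.time() - t0:.2f}s")
```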
Thank you @satra! I have tried it out on a number of cases of interest and it all looks super promising and possibly sufficient for what I wanted! FWIW, your nice one-liner has grown into a very ugly script, which I composed to quickly compare different possible tune-ups (it was late and I thought it would be just a quick hack, so I did it in plain vim ;)); it is at https://gist.github.com/yarikoptic/7284b634d8ab12277a3d316a117cc0db . Here is a sample output from a tricky case which, depending on the use case, could be made notably faster (by ignoring all
The cold run was quite long, but a) it is 106247 directories with 212494 files, and b) warm traversal was faster than the first warmish git status! (and that one doesn't care about all those objects, while walk.py did). I hope to get to it some time soon by RFing dandi-cli's helper to handle directory paths and/or explicit lists of paths; then I could try it on PyBIDS and/or DataLad's .config reload, which currently has its own dedicated implementation that imho could be replaced with this more "modular" one. edit 1: this exercise made me a believer that not all Python code is damn slow ;)
High level overview
An increasing number of projects need to operate on large and growing collections of files. Having (hundreds of) thousands of files within a file tree is no longer atypical, e.g. in neuroimaging. Many supplementary tools/functions need to "walk" the file tree and return the result of some function. Such a result should not change between invocations if nothing in the file tree changes (no file added/removed/changed), so the output of the function should, in principle, remain the same.
The situation scales down all the way to individual files: a function could operate on a specific file path and return a result which should not change if the file does not change.
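One plausible way to make such caching safe, sketched below with illustrative names (not any existing library's API), is to key the cached result on a cheap fingerprint of the path (mtime, size, inode) rather than on its content:

```python
# Illustrative sketch: derive a cheap fingerprint for a file so that a cached
# result is reused only while the file appears unchanged.
import os
from collections import namedtuple

Fingerprint = namedtuple("Fingerprint", "mtime size inode")


def fingerprint(path):
    """Return a tuple that changes whenever the file is modified or replaced."""
    st = os.stat(path)
    return Fingerprint(st.st_mtime_ns, st.st_size, st.st_ino)
```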
"Fulfilled" use-case
- `git` already does some level of caching to speed up `git status` (see the comment about DataLad below), so I think it should be possible to provide some generic helper which would cache the result of an operation on some path argument (a file or a tree).
- `dandi-cli`: loading metadata from a .nwb file, which takes a while. For that purpose, based on joblib.Memory, I came up with a `PersistentCache` class to provide a general persistent (across processes) cache, and a `.memoize_path` method which is a decorator for any function that takes a path to a file as its first argument.
Implementation
`PersistentCache` mentioned above can also account (see example) for versions of the relevant Python modules, placing them into the signature, so that if there is an update, the cached result will not be used.
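To make the idea concrete, here is a simplified, hypothetical sketch of such a `PersistentCache` with a `.memoize_path` decorator. It is not the actual dandi-cli implementation (which builds on joblib.Memory); this version uses a plain pickle-per-key store just to stay self-contained:

```python
# Hypothetical sketch of a persistent, cross-process cache with a memoize_path
# decorator; the real dandi-cli class differs in details.
import functools
import hashlib
import os
import pickle


class PersistentCache:
    def __init__(self, cachedir, token=""):
        # `token` can encode e.g. versions of relevant Python modules, so that
        # upgrading them invalidates previously cached results.
        self.cachedir = cachedir
        self.token = token
        os.makedirs(cachedir, exist_ok=True)

    def _key(self, func, path):
        # Key on the function identity, the absolute path, a cheap stat-based
        # fingerprint of the file, and the environment token.
        st = os.stat(path)
        fprint = (st.st_mtime_ns, st.st_size, st.st_ino)
        raw = repr((func.__module__, func.__qualname__,
                    os.path.abspath(path), fprint, self.token))
        return hashlib.sha1(raw.encode()).hexdigest()

    def memoize_path(self, func):
        @functools.wraps(func)
        def wrapper(path):
            cache_file = os.path.join(self.cachedir, self._key(func, path))
            if os.path.exists(cache_file):
                with open(cache_file, "rb") as f:
                    return pickle.load(f)
            result = func(path)
            with open(cache_file, "wb") as f:
                pickle.dump(result, f)
            return result

        return wrapper
```

Usage would then amount to decorating the expensive loader, e.g. a function that parses metadata from an .nwb path, with `@cache.memoize_path`, so repeated invocations on an unchanged file return the stored result.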
Target use-cases
- DataLad: `datalad status` and possibly some other commands' operation (e.g. `diff`). Whereas `git status` already uses some smart caching of the results, so that subsequent invocations take advantage of it, there is nothing like that within DataLad yet.
- PyBIDS: construction of the `BIDSLayout` could take a while. If the instance (or at least a list of walked paths) could be cached, it would speed up subsequent invocations. (ref: BIDSLayout performance on very large datasets, continued bids-standard/pybids#609 (comment))
- `dandi-cli`: ATM we cache per file, but as datasets larger in the number of files appear, we might like to cache results on full file trees (a rough sketch of such tree-level keying follows below).
If anyone (attn @kyleam @mih @vsoch @con @tyarkoni @effigies @satra) knows an already existing solution -- that would be great!
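For the tree-level case in the last item, a hypothetical sketch (not existing dandi-cli or DataLad code) of collapsing a whole directory into a single cache key by hashing lightweight stat() fingerprints of all files:

```python
# Hypothetical sketch: reduce a file tree to one cache key, so an expensive
# result computed over the tree can be reused while nothing in it changes.
import hashlib
import os


def tree_key(top):
    """Hash (relative path, mtime, size) of every file under `top`."""
    h = hashlib.sha1()
    for root, dirs, files in os.walk(top):
        dirs.sort()  # make the walk order deterministic
        for name in sorted(files):
            path = os.path.join(root, name)
            st = os.stat(path)
            rel = os.path.relpath(path, top)
            h.update(f"{rel}\0{st.st_mtime_ns}\0{st.st_size}\n".encode())
    return h.hexdigest()
```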