-
Notifications
You must be signed in to change notification settings - Fork 68
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
File storage interface improvements #256
base: master
Are you sure you want to change the base?
Conversation
Things that we think still need to be done to this:
|
b9280c6
to
067b71c
Compare
I think this is getting there. There's some documentation on what file storage backends now look like at https://github.com/inveniosoftware/invenio-files-rest/blob/9bc8a83163fc50d9b407e29a82d7ef6701770f48/docs/new-storage-backend.rst I've moved the old implementation into The general idea was:
Things still to do:
|
This adds overridable config variables for mapping storage backend names to classes (by default drawn from declared entry points) and setting a default backend for new files. It also adds a pyfs storage backend entry point, and makes it the default. Storage backends should be `django_files_rest.storage:FileStorage` subclasses, and declared with a `django_files_rest.storage` entrypoint.
…gress At the moment, the storage backend is responsible for computing checksums and reporting the size of files based on the number of bytes read. However, these are common concerns across all storage backends, and so this functionality can be pulled the other side of the storage backend interface. This class is therefore intended to wrap any file-like object, and to checksum the file as it is read. It also counts bytes and calls the progress callback if one is provided. In a subsequent commit we'll integrate it with FileInstance, and remove the functionality from the PyFS storage backend.
This new storage factory uses a `FileInstance.storage_backend` attribute to determine which backend to use. It uses the `FILES_REST_DEFAULT_STORAGE_BACKEND` config value when the file instance doesn't already have a storage backend. It can be extended to customize how the backend is chosen, and how it is initialised. The file path is chosen based on the `id` of the file instance, but this can be customised too. It's not set as the default storage factory to preserve backwards compatibility.
…nit classes This property can now instantiate class-based storage factories, i.e. the new factory in `invenio_files_rest.storage.factory`. It's possible the deprecation should be in the pyfs storage factory itself.
These aren't Py3.6-compatible, so will probably need to be removed later, or replaced with comment-based annotations.
This will mean we can later remove this functionality from the storage backend.
This is exposed as a class property on FileStorage and can be accessed as `FileStorage.backend_name`. It returns the associated backend name in the application config, which can then be used to find the class.
It no longer needs to do checksumming, and the filepath can be moved to the superclass.
All together because this has been very exploratory.
There's always a better option than metaclasses. In this case, a classmethod, which also simplifies getting the backend name on instances (`type(instance).backend_name` → `instance.get_backend_name()`)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Impressive job! Thanks a lot for working on this and keeping backward compatibility. I've made a couple of comments
I think the main comment is probably the removal of copy()
. It might be easiest to discuss over a telecon. Honestly, I think we can go ahead without adding it back in, but perhaps we make a comment in the code, that this could optimized in the code at a later stage.
FILES_REST_DEFAULT_STORAGE_BACKEND = 'pyfs' | ||
"""The default storage backend name.""" | ||
|
||
FILES_REST_STORAGE_BACKENDS = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: I see some potential troubles here.
How are users expected to provide storage backend?
- Only via specifying entry points? In this case, we should probably not make it config, but just a method on the Flask extension.
- Allow developers/instances to specify a dict of storage backends? in this case, it makes sense to have a config. However when
FILES_REST_STORAGE_BACKENDS
is overwritten, then below code will still execute causing all storage backends to be loaded, even if we don't want them.
Thus in both cases, I think it makes sense to move the entry point iteration to the Flask extension to avoid unnecessarily loading storage backends that we don't need.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was supporting both so that there was an obvious default behaviour (entry points), but that they could override this if desired. The "overriding" use case was that they wanted to change the behaviour of a backend they'd already used, but which they don't maintain (e.g. one in invenio-files-rest, or invenio-s3). In this case defining a new entrypoint with the same name wouldn't work, so they'd have to define the name→backend mapping explicitly as config.
However, yes, it's not great to do the entrypoint dereferencing here regardless of whether it's overridden. If you're happy with the rationale above I can move this to the extension loading code.
I should also write documentation for this.
@@ -791,13 +807,45 @@ def verify_checksum(self, progress_callback=None, chunk_size=None, | |||
else (self.checksum == real_checksum)) | |||
self.last_check_at = datetime.utcnow() | |||
return self.last_check | |||
return current_files_rest.storage_factory(fileinstance=self, **kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: Two returns - seems like a merge problem I think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, it does look that way 😬. Will fix!
preferred_location = ( # type: typing.Optional[Location] | ||
Location(uri=default_location) if default_location else None | ||
) | ||
return self.initialize( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: Previously the init_contents()
didn't have a return value. I assume init_contents()
is intended to be replaced completely with initialize()
, hence would it make sense to keep it like that, or is the return value need somewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right; there's no return from initialize()
, so the return
in init_contents()
is unnecessary. I'll get it removed :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And yep, I was intending that init_contents()
be deprecated.
self.set_uri( | ||
*self.storage(**kwargs).initialize(size=size), | ||
readable=False, writable=True) | ||
if hasattr(current_files_rest.storage_factory, 'initialize'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment: Perhaps better with isinstance(current_files_rest.storage_factory, StorageBackend)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep. I think this was a hangover from when I'd considered keeping the same base class between old and new approaches, but then decided against it.
'uri', 'size', 'checksum', 'writable', 'readable', 'storage_class' | ||
} | ||
|
||
def update_file_metadata( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: Should set_uri()
be deprecated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes please! Its use of positional arguments didn't seem particularly clear for people using it, and seems inflexible when trying to omit attribute names or add new attribute names later.
) | ||
self._size = result['size'] | ||
if not result['checksum']: | ||
result['checksum'] = self.checksum(chunk_size=chunk_size) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: is this for the case where the checksum is not calculated as part of write stream and we want to ask the backend to do it instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep. The idea was to make the backend implementation as light-touch as possible and for it to fill in the gaps around what you've provided, unless you want to take control of those things yourself.
In this case, setting the hash name to None will result in checksum not being set from the default write implementation, at which point it'll ask the backend to calculate the checkum explicitly.
'uri': self.uri, | ||
'readable': True, | ||
'writable': False, | ||
'storage_class': 'S', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: Should this be taken from FILES_REST_DEFAULT_STORAGE_CLASS
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep! WIll fix.
|
||
def copy(self, src, chunk_size=None, progress_callback=None): | ||
"""Copy data from another file instance. | ||
|
||
:param src: Source stream. | ||
:param chunk_size: Chunk size to read from source stream. | ||
""" | ||
fp = src.open(mode='rb') | ||
try: | ||
warnings.warn( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: The reason for having copy on the storage backend is that it could potentially perform the copy using the backend. Say if both files reside in the same storage system, it would be a lot more efficient to do an internal copy or that system, than having to pass all data through Invenio.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's true. It shouldn't be too much effort for me to reinstate it.
"""Storage initialization.""" | ||
self.fileurl = fileurl | ||
# if isinstance(args[0], str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: Leftover?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep; will remove.
tests/test_cli.py
Outdated
@@ -103,6 +103,7 @@ def test_simple_workflow(app, db, tmpdir): | |||
], obj=script_info) | |||
assert 0 == result.exit_code | |||
|
|||
print(tmpdir.listdir()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment: Left-over?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep; will remove.
@lnielsen All replies done. Will work on the improvements soon, if you're happy with the suggested approach. I'll do everything as fixups for easier reviewing, and then squash them once we're happy. |
abd43b6
to
d1fb4d2
Compare
Currently for preliminary review.
See inveniosoftware/rfc#37 for background.
Not all tests pass yet, as the changes to
PyFSFileStorage
haven't been reflected in the tests.