[kedro-datasets] fsspec mixin #4315
Replies: 6 comments
-
Hi @fdroessler, thanks a lot for your interest in Kedro! Sharing some thoughts here: Personally I'm not a big fan of mixins and multiple inheritance. If we went for a composition-over-inheritance approach maybe we could have something like an Beware that the engineering team might have a different opinion regarding this mixin. But more importantly, it's up to debate whether this relatively small refactor is worth taking in the face of #1778 Waiting for other colleagues to chime in. |
Beta Was this translation helpful? Give feedback.
-
Thanks @astrojuanlu, I was not aware of this major refactor. I guess this is some ways off though or do you have more details on the timeline? |
Beta Was this translation helpful? Give feedback.
-
@fdroessler https://github.com/kedro-org/kedro/milestone/12 Can you see this? I believe we open access for everyone but let me know if this is not the case. We are aware of the datasets problem, it's not quite flexible as you can see the fsspec wasn't part of the Abstract class contract, but we almost used it everywhere. Can you show an example why you need to refactor all datasets? In the past we have added new protocol and it's usually quite straight forward to do so, let's see if we can do something about it as this refactoring is not likely happen very soon. |
Beta Was this translation helpful? Give feedback.
-
@noklam Yes I can see that, thanks for pointing that out. The problem with azureml-fsspec is that they did not implement it similar to adfs or adl ... but such that you have to provide the url to the storage when instantiating the fs. Might be a bug but not sure tbh. This leads to the following modifications on my side: Also note that azureml-fsspec just supports reading at this point, they have an option to "upload" but that has to go through a separate method on the fs instance. add and then:
This is what I would have to add to all the datasets in order to support access to them for azureml. This got me thinking on how to simplify modifications like this and avoid code redundancy. However, at the time of first writing I was not aware of the major refactor of datasets. Nonetheless it would be great to have something added to support azureml-fsspec if possible. This does not have to be multiple inheritance or composition at this stage, I'm happy to add the code snippet in all the relevant datasets if necessary. Might also be that the read-only restriction is too limited for kedro 🤷 however I am working also to add some functionality to |
Beta Was this translation helpful? Give feedback.
-
@fdroessler Without thinking about it more, I generally like the idea of splitting some of these functionalities into mix-ins, as you propose. There's some precedent for that (see #15), even though the Haven't consider the adapters proposal from @astrojuanlu at all; need to learn design patterns again to give good insight into mix-in vs. adapter vs. something else. 😅 |
Beta Was this translation helpful? Give feedback.
-
I think excessive copy-pasting in datasets is still an issue (see loooong discussion in #1936) And yet, I stand by what I said in other places: I think inheritance is mostly bad, and multiple inheritance is borderline evil and difficult to reason about. For now, I'm moving this to Discussions. |
Beta Was this translation helpful? Give feedback.
-
Description and context
I am looking into integrating the relatively new
azureml-fsspec
bindings to allow loading (and potentially saving) of datasets directly from an azure machine learning studio datastore. This is slightly different to blob storage or adlfs and just got their 1.0.0 release recently. Due to their implementation of fsspec, when initialising the file system, one needs to pass theuri
of the dataset that is to be read. This is not necessary in any other fsspec implementation and I'm unsure if it is a feature or a but, this I am still investigating.However, as part of this I figured that changing something in the
fsspec
-filesystem initialisation, which is used in the majority of file-based datasets, would result in the need to copy and paste the changed code into all of the datasets that use that functionality. This is potentially error prone and could lead to accidental mistakes further down the line.One solution I could come up with, and this certainly is not necessarily the right one, was providing a
FileSystemMixin
class. This could be used to provide thefsspec
-filesystem capability in all datasets that need it while allowing for modifications if necessary. I guess one could use a function that returns the file system as well but I always liked the Mixin pattern for some reason. Also this would need further work in the datasets__init__
function that I have not thought through.Happy to contribute a PR if this is of interest.
Possible Implementation
A minimal working example that I used to try out the
azureml-fsspec
. Theif self._protocol == "azureml": ...
can be ignored but would be necessary to supportazureml-fsspec
in their current implementation.Modified
pandas.CSVDataset
omitting docstrings for brevity.Possible Alternatives
Not tried but I guess something like this could work too, although I guess this would need some additional definitions in the Dataset
__init__
such as :self._protocol
etc for use inload
orsave
etc.Beta Was this translation helpful? Give feedback.
All reactions