If data samples contain a Python list with a variable number of elements, type inference will fail #260
Comments
Hi! Thanks for your contribution, great first issue!
Hey @senarvi, would you mind making a PR to add a note in the README about this? Additionally, we could consider adding a placeholder to tell LitData not to decompose the pytree too deep. Like a …
Here's a PR with some documentation changes: #264. I didn't quite understand how the placeholder would work.
Hey @senarvi, I was thinking something like this:

```python
from litdata.types import List, Dict

def fn(...):
    return {"list": List([...])}
```

This way, the pytree won't be decomposed and the list would be encoded as a single element. But in reality any placeholder would do the job, as the pytree won't know how to decompose it further down. Example:

```python
class Placeholder:
    def __init__(self, object):
        self.object = object
```
@tchaton something like that could be an elegant solution. I guess then you need to write a serializer for the …
Yes, we could have something smart for it if we want to. Would you be interested in contributing such a feature?
I'm not able to at the moment. I was able to easily work around the problem by using NumPy arrays instead of Python lists. Now I'll have to move on and get something working to see if this framework is suitable for us.
Hey @senarvi. Thanks for trying LitData, let me know how it goes ;) Use the latest version too. I will make a new release tomorrow.
🐛 Bug
When optimizing a dataset, `BinaryWriter.serialize()` will first flatten the sample dictionary and infer the type of each element. Then it assumes that all data samples have the same types and the same number of elements. If a list contains a variable number of elements, the type information will be incorrect for subsequent data samples. This is not documented and can cause some confusion.
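As an illustration of the failure mode (assuming litdata flattens samples with a PyTorch-style pytree utility; the exact helper may differ), a shorter list in a later sample shifts every following element relative to the format cached from the first sample:

```python
# Illustration only: assumes samples are flattened with a pytree utility
# similar to torch.utils._pytree (the helper litdata actually uses may differ).
from torch.utils._pytree import tree_flatten

first_sample = {"data": [1, 2, 3], "name": "a"}
later_sample = {"data": [1, 2], "name": "b"}

flat_first, _ = tree_flatten(first_sample)  # [1, 2, 3, 'a'] -> format cached as int, int, int, str
flat_later, _ = tree_flatten(later_sample)  # [1, 2, 'b']    -> 'b' now sits where an int is expected
print(flat_first)
print(flat_later)
```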
To Reproduce

When storing the data in NumPy arrays, there's no problem, because each array is considered one element in the flattened type list. So this works:
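For illustration, a minimal sketch of the working setup, assuming litdata's `optimize()` entry point with `fn`, `inputs`, `output_dir`, and `chunk_bytes` arguments (the field names and sample sizes are invented):

```python
# A minimal sketch of the working case; field names and sizes are illustrative.
import numpy as np
from litdata import optimize


def create_sample(index):
    # The variable-length sequence is a single NumPy array, so it stays
    # a single element in the flattened sample.
    return {
        "data": np.arange(index + 1, dtype=np.int64),
        "name": f"sample-{index}",
    }


if __name__ == "__main__":
    optimize(
        fn=create_sample,
        inputs=list(range(10)),
        output_dir="optimized_numpy",
        chunk_bytes="64MB",
    )
```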
When storing the data in a Python list, the type of each list element is inferred separately. If a list contains a variable number of elements, the type information of one sample is not useful for other samples. Now, if we add an element that has a different type (say, a string) after a variable-length list, the function will crash:
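A hedged sketch of the crashing case, mirroring the example above but with a plain Python list of variable length followed by a string field (again, the field names are invented for illustration):

```python
# A sketch of the failing case: a variable-length Python list followed by a
# string field. Field names are illustrative.
from litdata import optimize


def create_sample(index):
    # The list flattens into a variable number of elements, so the string
    # lands at a different position in the flattened sample for each index.
    return {
        "data": list(range(index + 1)),
        "name": f"sample-{index}",
    }


if __name__ == "__main__":
    optimize(
        fn=create_sample,
        inputs=list(range(10)),
        output_dir="optimized_list",
        chunk_bytes="64MB",
    )
```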
The error:
Expected behavior
It should be documented that every data sample must have the same elements and that every list must be the same size.
I wonder if caching the data types is necessary. After all, the `optimize()` call doesn't have to be that fast. If the data types were instead inferred for every sample separately, it would be possible to use variable-length lists.

Would it make sense to at least check in `BinaryWriter.serialize()` that `self._data_format` has the same number of elements as the data sample?
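A hedged sketch of such a guard, written as a standalone helper because `BinaryWriter`'s real flattening call and attribute layout may differ:

```python
# Hypothetical helper illustrating the suggested check; BinaryWriter's actual
# internals may flatten samples and store the cached format differently.
from torch.utils._pytree import tree_flatten


def check_data_format(item, data_format):
    """Raise a clear error when a sample does not flatten to the cached format."""
    flattened, _ = tree_flatten(item)
    if data_format is not None and len(flattened) != len(data_format):
        raise ValueError(
            f"The cached data format has {len(data_format)} elements, but this sample "
            f"flattened to {len(flattened)} elements. Every sample must flatten to the "
            "same structure (same keys and the same list lengths)."
        )
```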
Environment

- How you installed (`conda`, `pip`, source): `pip`