
If data samples contain a Python list with a variable number of elements, type inference will fail #260

Open
senarvi opened this issue Jul 23, 2024 · 8 comments
Labels
bug Something isn't working help wanted Extra attention is needed

Comments

@senarvi
Contributor

senarvi commented Jul 23, 2024

🐛 Bug

When optimizing a dataset, BinaryWriter.serialize() first flattens the sample dictionary and infers the type of each element. It then assumes that all data samples have the same types and the same number of elements. If a list contains a variable number of elements, the type information inferred from the first sample will be incorrect for subsequent samples. This is not documented and can cause confusion.
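The mismatch can be illustrated with a toy flattener (a sketch of the general idea, not litdata's actual pytree code):

```python
# Rough illustration (not litdata's implementation) of why per-element
# type inference breaks with variable-length lists.

def flatten(sample):
    """Flatten a dict of scalars/lists into a flat list of leaves, pytree-style."""
    flat = []
    for value in sample.values():
        if isinstance(value, list):
            flat.extend(value)  # each list element becomes its own leaf
        else:
            flat.append(value)
    return flat

def infer_format(sample):
    return [type(leaf).__name__ for leaf in flatten(sample)]

sample_a = {"index": 0, "class": [3, 7, 1], "name": "a"}
sample_b = {"index": 1, "class": [5], "name": "b"}

print(infer_format(sample_a))  # ['int', 'int', 'int', 'int', 'str']
print(infer_format(sample_b))  # ['int', 'int', 'str']
```

The format cached from sample_a has five entries, so when sample_b's three leaves are serialized against it, the 'str' leaf lines up with an 'int' slot.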

To Reproduce

When storing the data in NumPy arrays, there's no problem, because each array is considered as one element in the flattened type list. So this works:

import numpy as np
from PIL import Image
import litdata as ld

def random_images(index):
    fake_images = Image.fromarray(np.random.randint(0, 256, (32, 32, 3), dtype=np.uint8))
    num_labels = np.random.randint(10)
    fake_labels = np.random.randint(10, size=num_labels)

    # use any key:value pairs
    data = {"index": index, "image": fake_images, "class": fake_labels}

    return data

if __name__ == "__main__":
    # the optimize function outputs data in an optimized format (chunked, binarized, etc...)
    ld.optimize(
        fn=random_images,                   # the function applied to each input
        inputs=list(range(1000)),           # the inputs to the function (here it's a list of numbers)
        output_dir="my_optimized_dataset",  # optimized data is stored here
        num_workers=4,                      # The number of workers on the same machine
        chunk_bytes="64MB"                  # size of each chunk
    )

When storing the data in a Python list, the type of each list element is inferred separately. If a list contains a variable number of elements, the type information of one sample is not useful for other samples. Now, if we add an element that has a different type (say, a string) after a variable-length list, the function will crash:

import numpy as np
from PIL import Image

def random_images(index):
    fake_images = Image.fromarray(np.random.randint(0, 256, (32, 32, 3), dtype=np.uint8))
    num_labels = np.random.randint(10)
    fake_labels = np.random.randint(10, size=num_labels).tolist()
    name = "name"

    # use any key:value pairs
    data = {"index": index, "image": fake_images, "class": fake_labels, "name": name}

    return data

The error:

Rank 0 inferred the following `['int', 'pil', 'int', 'int', 'int', 'str']` data format.                                                                                                                                                                         | 0/1000 [00:00<?, ?it/s]
Rank 1 inferred the following `['int', 'pil', 'int', 'int', 'int', 'str']` data format.
Worker 1 is done.
Traceback (most recent call last):
  File ".../example.py", line 20, in <module>
    ld.optimize(
  File ".../lib/python3.12/site-packages/litdata/processing/functions.py", line 432, in optimize
    data_processor.run(
  File ".../lib/python3.12/site-packages/litdata/processing/data_processor.py", line 1055, in run
    self._exit_on_error(error)
  File ".../lib/python3.12/site-packages/litdata/processing/data_processor.py", line 1119, in _exit_on_error
    raise RuntimeError(f"We found the following error {error}.")
RuntimeError: We found the following error Traceback (most recent call last):
  File ".../lib/python3.12/site-packages/litdata/processing/data_processor.py", line 665, in _handle_data_chunk_recipe
    chunk_filepath = self.cache._add_item(self._index_counter, item_data_or_generator)
  File ".../lib/python3.12/site-packages/litdata/streaming/cache.py", line 134, in _add_item
    return self._writer.add_item(index, data)
  File ".../lib/python3.12/site-packages/litdata/streaming/writer.py", line 305, in add_item
    data, dim = self.serialize(items)
  File ".../lib/python3.12/site-packages/litdata/streaming/writer.py", line 179, in serialize
    self._serialize_with_data_format(flattened, sizes, data, self._data_format)
  File ".../lib/python3.12/site-packages/litdata/streaming/writer.py", line 210, in _serialize_with_data_format
    serialized_item, _ = serializer.serialize(element)
  File ".../lib/python3.12/site-packages/litdata/streaming/serializers.py", line 347, in serialize
    return obj.encode("utf-8"), None
AttributeError: 'int' object has no attribute 'encode'

Expected behavior

It should be documented that every data sample must contain the same keys and every list must have the same size.

I wonder if caching the data types is necessary. After all, the optimize() call doesn't have to be that fast. If the data types were instead inferred for every sample separately, it would be possible to use variable-length lists.

Would it make sense to at least check in BinaryWriter.serialize() that self._data_format has the same number of elements as the data sample?
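The suggested check could look roughly like this (a sketch only; check_data_format is a hypothetical helper, not part of litdata's BinaryWriter):

```python
# Hypothetical sanity check: fail fast with a clear message when a
# flattened sample doesn't match the data format cached from the first
# sample, instead of crashing deep inside a serializer.

def check_data_format(flattened, data_format):
    if len(flattened) != len(data_format):
        raise ValueError(
            f"The data sample flattens to {len(flattened)} elements, but the "
            f"format inferred from the first sample has {len(data_format)} "
            f"elements ({data_format}). Every sample must flatten to the same "
            "structure; variable-length Python lists are a common cause of "
            "this mismatch."
        )

check_data_format([1, 2, 3], ["int", "int", "int"])  # OK, same length
```

This would turn the confusing AttributeError above into an actionable error message.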

Environment

  • PyTorch Version (e.g., 1.0): 2.3.1
  • OS (e.g., Linux): Linux
  • How you installed PyTorch (conda, pip, source): pip
  • Python version: 3.12.3
@senarvi senarvi added bug Something isn't working help wanted Extra attention is needed labels Jul 23, 2024

Hi! Thanks for your contribution, great first issue!

@tchaton
Collaborator

tchaton commented Jul 23, 2024

Hey @senarvi,

Would you mind making a PR to add a note in the README about this?

Additionally, we could consider adding a placeholder to tell LitData not to decompose the pytree too deeply, like a List or Dict object that would be serialized as a whole.

@senarvi
Contributor Author

senarvi commented Jul 24, 2024

Here's a PR with some documentation changes: #264

I didn't quite understand how the placeholder would work.

@tchaton
Collaborator

tchaton commented Jul 24, 2024

Hey @senarvi,

I was thinking something like this.

from litdata.types import List, Dict

def fn(...):
    return {"list": List([...])}

This way, the pytree won't be decomposed and the list would be encoded as a single element.

But in reality, any placeholder would do the job, since pytree won't know how to decompose it further.

Example:

class Placeholder:

    def __init__(self, object):
        self.object = object
@senarvi
Contributor Author

senarvi commented Jul 24, 2024

@tchaton something like that could be an elegant solution. I guess then you need to write a serializer for the Placeholder class. Was your idea that all the elements of List would have a specific data type?

@tchaton
Collaborator

tchaton commented Jul 24, 2024

Yes, we could have something smart for it if we want to. Would you be interested in contributing such a feature?

@senarvi
Contributor Author

senarvi commented Jul 24, 2024

I don't have the time at the moment. I was able to work around the problem easily by using NumPy arrays instead of Python lists. Now I need to move on and get something working to see whether this framework is suitable for us.
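For reference, the workaround amounts to converting variable-length lists to NumPy arrays before returning the sample, so each array is a single leaf whose shape travels with its serialized bytes (random_sample is illustrative, not part of litdata):

```python
import numpy as np

# Workaround sketch: variable-length data as an ndarray (one leaf),
# instead of a Python list (one leaf per element).
def random_sample(index):
    num_labels = np.random.randint(1, 10)
    labels = np.random.randint(10, size=num_labels)  # ndarray, not .tolist()
    return {"index": index, "class": labels, "name": "name"}

sample = random_sample(0)
print(type(sample["class"]))  # <class 'numpy.ndarray'>
```

Because the flattened structure is now three leaves for every sample regardless of num_labels, the inferred data format stays valid.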

@tchaton
Collaborator

tchaton commented Jul 24, 2024

Hey @senarvi. Thanks for trying LitData, let me know how it goes ;) Use the latest version too. I will make a new release tomorrow.
