The config isn't consistent between chunks #370

AugustDev · 2024-09-17T03:18:42Z

I was processing large files and received the following error. It failed at roughly 80% of the data, after about 1h 20min. The full error is really long, but this is the beginning of it. I'm essentially storing 5 columns, where each column holds a numpy array. The arrays are of variable length.

🐛 Bug

   File "/root/.nextflow-bin/litdata_dataset.py", line 91, in <module>
      main(
    File "/root/.nextflow-bin/litdata_dataset.py", line 37, in main
      ld.optimize(
    File "/usr/local/lib/python3.12/site-packages/litdata/processing/functions.py", line 445, in optimize
      data_processor.run(
    File "/usr/local/lib/python3.12/site-packages/litdata/processing/data_processor.py", line 1134, in run
      result = data_recipe._done(len(user_items), self.delete_cached_files, self.output_dir)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/usr/local/lib/python3.12/site-packages/litdata/processing/data_processor.py", line 802, in _done
      merge_cache._merge_no_wait(node_rank if num_nodes > 1 else None, getattr(self, "existing_index", None))
    File "/usr/local/lib/python3.12/site-packages/litdata/streaming/cache.py", line 156, in _merge_no_wait
      self._writer._merge_no_wait(node_rank=node_rank, existing_index=existing_index)
    File "/usr/local/lib/python3.12/site-packages/litdata/streaming/writer.py", line 470, in _merge_no_wait
      raise Exception(
  Exception: The config isn't consistent between chunks. This shouldn't have happened.Found {'chunk_bytes': 64000000, 'chunk_size': None, 'compression': 'zstd', 'data_format': ['int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 
'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int', 'int'], 'data_spec': '[1, {"type": "builtins.dict", "context": "[\\"input_ids\\", \\"chromosome_idx\\", \\"pos_in_chr_ones\\",...

To Reproduce

Unfortunately I'm not sure how to show how to reproduce without sharing ~100gb dataset.

Code sample

import glob

import litdata as ld
import pyarrow.parquet as pq

# input_dir, num_workers, dir_prefix and generate_file_row_groups are defined
# elsewhere in the script and are not shown in this report.

def get_data_from_file_row_group(group):
    """
    Yield samples from one parquet row group; safe to call concurrently.
    """
    file_path, row_group = group
    with pq.ParquetFile(file_path) as pf:
        # to_pylist() returns one dict per row; array columns become Python lists.
        yield from pf.read_row_groups([row_group]).to_pylist()

file_paths = glob.glob(f"{input_dir}/*.parquet")
groups = generate_file_row_groups(file_paths)

ld.optimize(
    fn=get_data_from_file_row_group,
    inputs=groups,
    chunk_bytes="64MB",
    num_workers=num_workers,
    output_dir=f"./output/{dir_prefix}",
    compression="zstd",
)

Additional context

Environment detail

PyTorch Version: 2.4.1
OS (e.g., Linux): Debian 11
Lit data version: 0.2.26
Python version: 3.10

deependujha · 2024-09-17T03:44:11Z

Hi @AugustDev sorry that it failed at ~80%.

Btw, were you using use_checkpoint = True? It can help you in case of any failure.

Also, when the "The config isn't consistent between chunks" exception is raised, it should print both the config and the data[config] that mismatched. If the logs are still available, can you check what caused the mismatch?
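For anyone reading the logs for this error, a minimal sketch of how the two printed configs could be compared once they are copied out as Python dicts. The helper name diff_configs and the two example dicts are hypothetical and not part of LitData; the keys mirror the ones visible in the traceback above.

```python
def diff_configs(first, second):
    """Return {key: (first_value, second_value)} for every key whose values differ."""
    diffs = {}
    for key in sorted(set(first) | set(second)):
        a, b = first.get(key), second.get(key)
        if a != b:
            diffs[key] = (a, b)
    return diffs


# Hypothetical configs: identical except for the length of data_format.
config_a = {"chunk_bytes": 64000000, "compression": "zstd", "data_format": ["int"] * 3}
config_b = {"chunk_bytes": 64000000, "compression": "zstd", "data_format": ["int"] * 5}

for key, (a, b) in diff_configs(config_a, config_b).items():
    if isinstance(a, list) and isinstance(b, list):
        # Long lists are summarised by length to keep the output readable.
        print(f"{key}: list lengths {len(a)} vs {len(b)}")
    else:
        print(f"{key}: {a!r} vs {b!r}")
```

If only data_format differs, and only in length, that points at samples whose flattened structure varies from one sample to the next.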

tchaton · 2024-09-17T14:23:15Z

Yes, LitData encodes each leaf of the pytree as a single object, and therefore it doesn't know that a Python list is meant to be a single sample.

You can convert it to numpy or torch tensor directly to inform LitData this is a single item and not a list of items.
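A minimal sketch of that conversion, assuming each row comes back from to_pylist() as a dict whose array columns have become Python lists. The helper names lists_to_arrays and count_leaves are hypothetical, and count_leaves is only an illustration of how dicts, lists and tuples are treated as containers when a sample is flattened; it is not LitData's own code. The field names are taken from the data_spec in the traceback, the values are made up.

```python
import numpy as np


def lists_to_arrays(sample):
    """Return a copy of `sample` with every list value converted to a NumPy array."""
    return {
        key: np.asarray(value) if isinstance(value, list) else value
        for key, value in sample.items()
    }


def count_leaves(obj):
    """Illustration only: count leaves, treating dicts, lists and tuples as containers."""
    if isinstance(obj, dict):
        return sum(count_leaves(v) for v in obj.values())
    if isinstance(obj, (list, tuple)):
        return sum(count_leaves(v) for v in obj)
    return 1  # anything else (int, np.ndarray, ...) is a single leaf


# Hypothetical row with two variable-length columns.
row = {"input_ids": [5, 9, 2, 7], "chromosome_idx": [1, 1, 1, 1]}

print(count_leaves(row))                   # one leaf per list element
print(count_leaves(lists_to_arrays(row)))  # one leaf per column
```

With plain lists the number of leaves depends on the length of each array, so variable-length samples produce differently sized data_format lists. After conversion there is one leaf per column regardless of length. In the reported script this would mean yielding lists_to_arrays(r) for each row r returned by to_pylist(), instead of yielding the rows directly.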

bhimrazy · 2024-09-22T17:32:46Z

Hi @AugustDev, I wanted to follow up and see if the solution recommended by @tchaton was helpful for you.

AugustDev · 2024-09-25T13:24:03Z

Hi @bhimrazy @tchaton, thank you for the reply. Would you say that as long as I am saving data as torch.tensor or numpy array there should be no problems loading the data?

bhimrazy · 2024-09-26T11:38:21Z

> Hi @bhimrazy @tchaton, thank you for the reply. Would you say that as long as I am saving data as torch.tensor or numpy array there should be no problems loading the data?

Yes, @AugustDev. Let us know how it goes.

Also, if you could recommend any similar publicly available data for testing on my end, that would be helpful.
Thank you! 😊

AugustDev added the labels bug (Something isn't working) and help wanted (Extra attention is needed) on Sep 17, 2024
