training hangs with lightning ddp and cloud dir? #408

Open
rxqy opened this issue Nov 1, 2024 · 3 comments
Labels
bug (Something isn't working), help wanted (Extra attention is needed)

Comments

@rxqy

rxqy commented Nov 1, 2024

🐛 Bug

Hi, we are using Lightning with litdata on our local machines and an AWS S3 bucket. However, with DDP and a remote cloud directory, training hangs randomly during the very first iterations.

I tried several different configurations, but I'm not sure what I should check next.
GPUs / Strategy / Files on / Result
1 / no DDP / local SSD / OK
1 / no DDP / remote (S3) / OK
8 / DDP / local SSD / OK
8 / DDP / remote (S3) / stuck

To Reproduce

I'm following the exact steps from the ImageNet demo, and I wrote a small trainer myself (below). Running python train.py with different CUDA_VISIBLE_DEVICES settings is enough to reproduce it.

Code sample
# train.py
import numpy as np
import lightning as L
import torch, torch.nn as nn, torch.utils.data as data, torchvision as tv, torch.nn.functional as F

class LitAutoEncoder(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.decoder = nn.Sequential(nn.Linear(32, 128))

    def training_step(self, batch, batch_idx):
        loss = self.decoder(batch).mean()
        print(self.trainer.local_rank, loss)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return optimizer


from lightning.data import StreamingDataset, StreamingDataLoader

class ImageNetStreaming(StreamingDataset):
    def __init__(self):
        # toggle between the remote S3 directory (hangs) and a local copy (works);
        # cache_dir is currently unused
        if 1:
            input_dir = "s3:// xxxxx"
            cache_dir = None
        else:
            input_dir = "val"
            cache_dir = None

        max_cache_size = "200GB"
        super().__init__(
            input_dir = input_dir,
            max_cache_size = max_cache_size,
            shuffle = True,
        )

    def __getitem__(self, idx):
        # read the sample from the stream, but return a constant so the model
        # input itself cannot be the source of the hang
        data = super().__getitem__(idx)
        return np.float32(123.)

dataset = ImageNetStreaming()
dataloader = StreamingDataLoader(
    dataset,
    batch_size = 32,
    num_workers = 2,
    pin_memory = True,
    shuffle = True,
    drop_last = True
)

autoencoder = LitAutoEncoder()
trainer = L.Trainer()
trainer.fit(autoencoder, dataloader)

Expected behavior

Training should finish

Additional context

Due to some regulations here we cannot put our data or training scripts on Lightning Studio. I'm not sure if something's wrong with our S3 bucket or our network configuration.
One thing I noticed is that even when training gets stuck at some iteration (<50), we can still observe high network throughput on our machine (around 100 MB/s), but the local chunk directory (~/.lightning/chunks) stops growing.
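To make this easier to check, a tiny watcher like the one below can be used (a throwaway debugging helper, not part of litdata); it just logs the size of the ~/.lightning/chunks directory mentioned above every few seconds.

# watch_cache.py -- hypothetical helper: log the size of the local chunk cache over time
import os
import time

CACHE_DIR = os.path.expanduser("~/.lightning/chunks")

def cache_size_bytes(root):
    total = 0
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            try:
                total += os.path.getsize(os.path.join(dirpath, name))
            except FileNotFoundError:
                # chunks can be deleted concurrently by the reader's cleanup thread
                pass
    return total

if __name__ == "__main__":
    prev = cache_size_bytes(CACHE_DIR)
    while True:
        time.sleep(10)
        cur = cache_size_bytes(CACHE_DIR)
        print(f"cache size: {cur / 1e6:.1f} MB (changed by {cur - prev:+d} bytes)")
        prev = cur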

Current environment
  • CUDA:
    • GPU:
      • Tesla V100-SXM2-32GB
      • Tesla V100-SXM2-32GB
    • available: True
    • version: 12.1
  • Lightning:
    • lightning: 2.3.0
    • lightning-utilities: 0.11.1
    • pytorch-lightning: 2.2.1
    • torch: 2.2.1
    • torchaudio: 2.2.1
    • torchmetrics: 1.3.2
    • torchvision: 0.17.1
  • Packages:
    • absl-py: 2.1.0
    • accelerate: 0.30.1
    • aiofiles: 23.2.1
    • aiohttp: 3.9.3
    • aiosignal: 1.3.1
    • angle-emb: 0.3.10
    • annotated-types: 0.7.0
    • anyio: 4.4.0
    • async-timeout: 4.0.3
    • attrs: 23.2.0
    • auto-gptq: 0.7.1
    • av: 12.3.0
    • awscli: 1.32.70
    • backports-datetime-fromisoformat: 2.0.1
    • bitsandbytes: 0.43.1
    • blessed: 1.20.0
    • blinker: 1.7.0
    • boltons: 24.0.0
    • boto3: 1.34.143
    • botocore: 1.34.143
    • braceexpand: 0.1.7
    • brotli: 1.0.9
    • certifi: 2024.2.2
    • charset-normalizer: 2.0.4
    • click: 8.1.7
    • colorama: 0.4.4
    • coloredlogs: 15.0.1
    • contourpy: 1.2.1
    • cos-python-sdk-v5: 1.9.30
    • crcmod: 1.7
    • cycler: 0.12.1
    • datasets: 2.14.6
    • decord: 0.6.0
    • deepspeed: 0.14.0
    • dill: 0.3.7
    • dnspython: 2.6.1
    • docker-pycreds: 0.4.0
    • docstring-parser: 0.16
    • docutils: 0.16
    • einops: 0.7.0
    • email-validator: 2.2.0
    • et-xmlfile: 1.1.0
    • exceptiongroup: 1.2.2
    • faiss-gpu: 1.7.2
    • fastapi: 0.111.1
    • fastapi-cli: 0.0.4
    • ffmpy: 0.3.2
    • filelock: 3.13.1
    • fire: 0.6.0
    • flash-attn: 2.5.7
    • flask: 3.0.3
    • fonttools: 4.51.0
    • frozenlist: 1.4.1
    • fsspec: 2023.10.0
    • gekko: 1.2.1
    • gitdb: 4.0.11
    • gitpython: 3.1.43
    • gmpy2: 2.1.2
    • gpustat: 1.1.1
    • gradio: 4.39.0
    • gradio-client: 1.1.1
    • grpcio: 1.62.1
    • h11: 0.14.0
    • hide-warnings: 0.17
    • hjson: 3.1.0
    • httpcore: 1.0.5
    • httptools: 0.6.1
    • httpx: 0.27.0
    • huggingface-hub: 0.23.4
    • humanfriendly: 10.0
    • idna: 3.4
    • importlib-resources: 6.4.0
    • influxdb: 5.3.2
    • itsdangerous: 2.1.2
    • jinja2: 3.1.3
    • jmespath: 1.0.1
    • joblib: 1.4.0
    • jsonargparse: 4.27.7
    • kafka-python: 2.0.2
    • kiwisolver: 1.4.5
    • lightning: 2.3.0
    • lightning-utilities: 0.11.1
    • litdata: 0.2.29
    • llava: 1.7.0.dev0
    • llmtuner: 0.6.3.dev0
    • m3u8: 4.0.0
    • markdown: 3.6
    • markdown-it-py: 3.0.0
    • markupsafe: 2.1.3
    • matplotlib: 3.8.4
    • mdurl: 0.1.2
    • media-metric: 0.2.0.10
    • mkl-fft: 1.3.8
    • mkl-random: 1.2.4
    • mkl-service: 2.4.0
    • mmidls: 2.0.3
    • mpmath: 1.3.0
    • msgpack: 1.1.0
    • multidict: 6.0.5
    • multiprocess: 0.70.15
    • networkx: 3.1
    • ninja: 1.11.1.1
    • nssdk: 0.0.1
    • numpy: 1.26.4
    • nvidia-ml-py: 12.535.133
    • onnx: 1.16.0
    • onnxconverter-common: 1.14.0
    • opencv-python-headless: 4.9.0.80
    • openpyxl: 3.1.5
    • optimum: 1.21.1
    • orjson: 3.10.6
    • packaging: 24.0
    • pandas: 2.2.1
    • peft: 0.11.1
    • pillow: 10.2.0
    • pip: 23.3.1
    • platformdirs: 4.2.2
    • ply: 3.11
    • prettytable: 3.10.0
    • protobuf: 3.20.2
    • psutil: 5.9.8
    • py: 1.11.0
    • py-cpuinfo: 9.0.0
    • pyarrow: 15.0.2
    • pyarrow-hotfix: 0.6
    • pyasn1: 0.5.1
    • pycryptodome: 3.20.0
    • pydantic: 2.7.1
    • pydantic-core: 2.18.2
    • pydub: 0.25.1
    • pygments: 2.18.0
    • pynvml: 11.5.0
    • pyparsing: 3.1.2
    • pyrootutils: 1.0.4
    • pysocks: 1.7.1
    • python-dateutil: 2.9.0.post0
    • python-dotenv: 1.0.1
    • python-multipart: 0.0.9
    • pytorch-lightning: 2.2.1
    • pytz: 2024.1
    • pyyaml: 6.0.1
    • redis: 5.0.3
    • regex: 2023.12.25
    • requests: 2.31.0
    • rich: 13.7.1
    • rocketmq-client-python: 2.0.0
    • rouge: 1.0.1
    • rsa: 4.7.2
    • ruff: 0.5.4
    • s3transfer: 0.10.1
    • safetensors: 0.4.2
    • scikit-learn: 1.4.2
    • scipy: 1.13.0
    • seaborn: 0.13.2
    • semantic-version: 2.10.0
    • sentencepiece: 0.2.0
    • sentry-sdk: 2.5.1
    • setproctitle: 1.3.3
    • setuptools: 68.2.2
    • shellingham: 1.5.4
    • shtab: 1.7.1
    • six: 1.16.0
    • smmap: 5.0.1
    • sniffio: 1.3.1
    • sse-starlette: 2.1.2
    • starlette: 0.37.2
    • sympy: 1.12
    • tabulate: 0.9.0
    • taxonomy: 0.10.0
    • tensorboard: 2.16.2
    • tensorboard-data-server: 0.7.2
    • termcolor: 2.4.0
    • threadpoolctl: 3.4.0
    • thrift: 0.20.0
    • thriftpy2: 0.4.20
    • tiktoken: 0.7.0
    • timm: 1.0.3
    • tokenizers: 0.19.1
    • tomlkit: 0.12.0
    • torch: 2.2.1
    • torchaudio: 2.2.1
    • torchmetrics: 1.3.2
    • torchvision: 0.17.1
    • tqdm: 4.66.2
    • transformers: 4.42.4
    • transformers-stream-generator: 0.0.5
    • triton: 2.2.0
    • trl: 0.9.6
    • typer: 0.12.3
    • typeshed-client: 2.5.1
    • typing-extensions: 4.9.0
    • tyro: 0.8.5
    • tzdata: 2024.1
    • urllib3: 2.1.0
    • uvicorn: 0.30.3
    • uvloop: 0.19.0
    • videollama2: 1.0
    • wandb: 0.17.1
    • watchfiles: 0.22.0
    • wcwidth: 0.2.13
    • webdataset: 0.2.93
    • websockets: 11.0.3
    • werkzeug: 3.0.1
    • wheel: 0.41.2
    • xlrd: 2.0.1
    • xmltodict: 0.13.0
    • xxhash: 3.4.1
    • yarl: 1.9.4
  • System:
rxqy added the bug (Something isn't working) and help wanted (Extra attention is needed) labels on Nov 1, 2024

github-actions bot commented Nov 1, 2024

Hi! Thanks for your contribution, great first issue!

@deependujha
Collaborator

deependujha commented Nov 3, 2024

Hi @rxqy, thanks for opening the issue. A similar issue is also open for SageMaker.

We're looking into it and will try to fix it ASAP.

@rxqy
Author

rxqy commented Nov 4, 2024

@deependujha, many thanks. BTW, the above code sometimes raises the FileNotFoundError below (the training loop then continues for several iterations before hanging), and sometimes it just hangs. Not sure whether it helps, but I'm pasting it here anyway.

Epoch 0:   0%|                                              | 3/10008 [00:04<4:05:42,  0.68it/s, v_num=3]1 tensor(-45.3273, device='cuda:1', grad_fn=<MeanBackward0>)
0 tensor(-45.3273, device='cuda:0', grad_fn=<MeanBackward0>)
Epoch 0:   0%|                                              | 4/10008 [00:05<3:41:55,  0.75it/s, v_num=3]1 tensor(-61.0723, device='cuda:1', grad_fn=<MeanBackward0>)
0 tensor(-61.0723, device='cuda:0', grad_fn=<MeanBackward0>)
Epoch 0:   0%|                                              | 5/10008 [00:05<2:57:38,  0.94it/s, v_num=3]Exception in thread Thread-3:
Traceback (most recent call last):
  File "/data/miniconda3/envs/pl/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/data/miniconda3/envs/pl/lib/python3.10/site-packages/litdata/streaming/reader.py", line 153, in run
    self._maybe_delete_chunks()
  File "/data/miniconda3/envs/pl/lib/python3.10/site-packages/litdata/streaming/reader.py", line 117, in _maybe_delete_chunks
    self._apply_delete(self._chunks_index_to_be_deleted.pop(0))
  File "/data/miniconda3/envs/pl/lib/python3.10/site-packages/litdata/streaming/reader.py", line 91, in _apply_delete
    os.remove(locak_chunk_path)
FileNotFoundError: [Errno 2] No such file or directory: '/data/.lightning/chunks/b515aeecb3a09f152677fce166405b10/1730182031.4955728/chunk-4-1309.bin.lock'
0 tensor(-76.8173, device='cuda:0', grad_fn=<MeanBackward0>)
Epoch 0:   0%|                                              | 6/10008 [00:05<2:45:20,  1.01it/s, v_num=3]1 tensor(-76.8173, device='cuda:1', grad_fn=<MeanBackward0>)
1 tensor(-92.5623, device='cuda:1', grad_fn=<MeanBackward0>)
0 tensor(-92.5623, device='cuda:0', grad_fn=<MeanBackward0>)
Epoch 0:   0%|                                              | 7/10008 [00:08<3:18:18,  0.84it/s, v_num=3]1 tensor(-108.3073, device='cuda:1', grad_fn=<MeanBackward0>)
0 tensor(-108.3073, device='cuda:0', grad_fn=<MeanBackward0>)
Epoch 0:   0%|                                              | 8/10008 [00:08<2:53:34,  0.96it/s, v_num=3]1 tensor(-124.0523, device='cuda:1', grad_fn=<MeanBackward0>)
0 tensor(-124.0523, device='cuda:0', grad_fn=<MeanBackward0>)
Epoch 0:   0%|                                              | 9/10008 [00:08<2:34:20,  1.08it/s, v_num=3]1 tensor(-139.7973, device='cuda:1', grad_fn=<MeanBackward0>)
0 tensor(-139.7973, device='cuda:0', grad_fn=<MeanBackward0>)
Epoch 0:   0%|                                             | 10/10008 [00:09<2:38:37,  1.05it/s, v_num=3]1 tensor(-155.5423, device='cuda:1', grad_fn=<MeanBackward0>)
0 tensor(-155.5423, device='cuda:0', grad_fn=<MeanBackward0>)
Epoch 0:   0%|                                             | 11/10008 [00:09<2:30:48,  1.10it/s, v_num=3]0 tensor(-171.2873, device='cuda:0', grad_fn=<MeanBackward0>)
Epoch 0:   0%|                                             | 12/10008 [00:09<2:18:17,  1.20it/s, v_num=3]1 tensor(-171.2873, device='cuda:1', grad_fn=<MeanBackward0>)
0 tensor(-187.0323, device='cuda:0', grad_fn=<MeanBackward0>)
1 tensor(-187.0323, device='cuda:1', grad_fn=<MeanBackward0>)
Epoch 0:   0%|                                             | 13/10008 [00:10<2:13:49,  1.24it/s, v_num=3]1 tensor(-202.7773, device='cuda:1', grad_fn=<MeanBackward0>)
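From the traceback, the reader's cleanup thread seems to call os.remove on a chunk/.lock file that another rank or dataloader worker already deleted. Just to illustrate the idea (this is only a sketch, not litdata's actual code or a proposed patch), a deletion that tolerates that race would look like:

# sketch only: tolerate a concurrent delete of the same chunk / .lock file
import contextlib
import os

def remove_if_exists(path):
    # if another rank/worker already removed the file, treat it as already done
    with contextlib.suppress(FileNotFoundError):
        os.remove(path)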
