Training on WSL with a 4090 #167

johnmccabe · 2023-08-07T21:40:34Z

Hi,
I've been hitting a brick wall trying to get the training step to run successfully in Docker under WSL on Windows 11 with an RTX 4090.

I can get as far as running piper_train, but it bombs out after a few seconds with the following (see the full error in piper_train_error.txt).

RuntimeError: cuFFT error: CUFFT_INTERNAL_ERROR

My Dockerfile looks like this (the base image ships with Python 3.8, which wasn't working, so it installs Python 3.10 on top):

FROM nvcr.io/nvidia/pytorch:22.03-py3

RUN pip3 install \
    'pytorch-lightning'

ENV NUMBA_CACHE_DIR=.numba_cache

COPY . /app

RUN apt update \
    && apt upgrade -y \
    && DEBIAN_FRONTEND=noninteractive apt install -y software-properties-common \
    && apt install -y build-essential \
        checkinstall \
        libreadline-gplv2-dev \
        libncursesw5-dev \
        libssl-dev \
        libsqlite3-dev \
        tk-dev \
        libgdbm-dev \
        libc6-dev \
        libbz2-dev \
        curl \
        cmake \
        espeak-ng \
    && add-apt-repository -y ppa:deadsnakes/ppa \
    && apt update \
    && apt install -y python3.10 python3.10-dev python3.10-venv \
    && curl -sS https://bootstrap.pypa.io/get-pip.py | python3.10

I then run the container with the following:

docker run --rm -ti --gpus all -v ${PWD}/training:/training -v ${PWD}/checkpoints:/checkpoints piper-training bash

I then finish the setup by hand inside the container (I'm trying to get the process working before fully containerising the environment).

I also add torchmetrics==0.11.4 to requirements.txt to get past the ImportError: cannot import name '_compare_version' from 'torchmetrics.utilities.imports' error.

cd /app/src/python
python3.10 -m venv .venv
source .venv/bin/activate
pip3 install -e .
./build_monotonic_align.sh
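To confirm that the torchmetrics pin actually took effect inside the venv, the installed versions can be read back with the standard library. This is only a sketch: the helper names `installed_version` and `check_pins` are made up for illustration, and the only pin taken from this post is torchmetrics==0.11.4.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pins):
    """Return {package: installed_version} for every pin that is not satisfied."""
    mismatches = {}
    for package, wanted in pins.items():
        found = installed_version(package)
        if found != wanted:
            mismatches[package] = found
    return mismatches


# The torchmetrics pin comes from the post; run this with the venv's python.
print(check_pins({"torchmetrics": "0.11.4"}))
```

An empty dict means the pin is in place; otherwise the dict shows what is installed instead (None if the package is missing).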

Finally, I run the training:

.venv/bin/python3.10 -m piper_train \
    --dataset-dir /training/ \
    --accelerator 'gpu' \
    --devices 1 \
    --batch-size 32 \
    --validation-split 0.0 \
    --num-test-examples 0 \
    --max_epochs 10000 \
    --resume_from_checkpoint /checkpoints/epoch%3D9029-step%3D2261720.ckpt \
    --checkpoint-epochs 1 \
    --precision 32 \
    --max-phoneme-ids 400

The PyTorch project references the error in pytorch/pytorch#88038, but jumping to a newer CUDA version would pull in torch 2.x, which would just get blown away when installing the piper_train module.
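One way to narrow this down is to check whether cuFFT fails outside of piper_train by running a single small FFT on the GPU. The sketch below is not part of the original steps; the function name `cufft_smoke_test` is hypothetical, and it assumes that `torch.fft.rfft` on a CUDA tensor is dispatched to cuFFT.

```python
def cufft_smoke_test():
    """Run one small FFT on the GPU and return a short status string."""
    try:
        import torch
    except ImportError:
        return "skipped: torch is not installed"
    if not torch.cuda.is_available():
        return "skipped: no CUDA device visible"

    major, minor = torch.cuda.get_device_capability(0)
    info = (
        f"torch {torch.__version__}, built for CUDA {torch.version.cuda}, "
        f"device sm_{major}{minor}"
    )
    try:
        x = torch.randn(1024, device="cuda")
        torch.fft.rfft(x)  # assumed to go through cuFFT on CUDA tensors
        torch.cuda.synchronize()  # surface any asynchronous CUDA error here
    except RuntimeError as err:
        return f"failed ({info}): {err}"
    return f"ok ({info})"


print(cufft_smoke_test())
```

If this fails with the same CUFFT_INTERNAL_ERROR, the problem sits in the torch/CUDA/GPU combination rather than in piper_train or the dataset.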

Does anything obvious jump out as being wrong in my steps? Or can you share the steps that have worked for you so I can reproduce them?

Thanks!!

johnmccabe (Author) · 2023-08-07T22:18:54Z

Some additional info (output of torch.utils.collect_env from inside the venv):

(.venv) root@a054dda2f472:/app/src/python# .venv/bin/python3.10 -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 1.13.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.22.3
Libc version: glibc-2.31

Python version: 3.10.12 (main, Jun  7 2023, 12:45:35) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.15.90.1-microsoft-standard-WSL2-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 11.6.112
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 4090
Nvidia driver version: 536.67
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.3.3
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.3.3
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.3.3
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.3.3
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.3.3
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.3.3
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.3.3
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.24.4
[pip3] pytorch-lightning==1.7.7
[pip3] torch==1.13.1
[pip3] torchmetrics==0.11.4
[conda] magma-cuda110             2.5.2                         5    local
[conda] mkl                       2019.5                      281    conda-forge
[conda] mkl-include               2019.5                      281    conda-forge
[conda] numpy                     1.22.3           py38h05e7239_0    conda-forge
[conda] pytorch-lightning         2.0.6                    pypi_0    pypi
[conda] pytorch-quantization      2.1.2                    pypi_0    pypi
[conda] torch                     1.12.0a0+2c916ef          pypi_0    pypi
[conda] torch-tensorrt            1.1.0a0                  pypi_0    pypi
[conda] torchmetrics              1.0.2                    pypi_0    pypi
[conda] torchtext                 0.12.0a0                 pypi_0    pypi
[conda] torchvision               0.13.0a0                 pypi_0    pypi
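The output above shows a torch build for CUDA 11.7 running on an RTX 4090. As a rough illustration, the following sketch parses collect_env output and flags that combination. It rests on an assumption not stated in this thread: that CUDA 11.8 is the first toolkit release with support for RTX 40-series (Ada) GPUs. All function names here are hypothetical.

```python
import re

# Assumption of this sketch: CUDA 11.8 is the first release supporting Ada GPUs.
MIN_CUDA_FOR_ADA = (11, 8)


def parse_collect_env(text):
    """Collect the 'key: value' lines of collect_env output into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def cuda_build_version(fields):
    """Return the CUDA version PyTorch was built with as (major, minor), or None."""
    match = re.match(r"(\d+)\.(\d+)", fields.get("CUDA used to build PyTorch", ""))
    return (int(match.group(1)), int(match.group(2))) if match else None


def looks_too_old_for_ada(text):
    """True if an RTX 40xx GPU is paired with a torch build older than CUDA 11.8."""
    fields = parse_collect_env(text)
    gpu = fields.get("GPU models and configuration", "")
    built = cuda_build_version(fields)
    is_rtx_40 = re.search(r"RTX 40\d\d", gpu) is not None
    return is_rtx_40 and built is not None and built < MIN_CUDA_FOR_ADA


# Three lines copied from the output above.
sample = """\
PyTorch version: 1.13.1+cu117
CUDA used to build PyTorch: 11.7
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 4090
"""
print(looks_too_old_for_ada(sample))  # True
```

The same output also lists torch==1.13.1 under pip3 and torch 1.12.0a0 under conda, which suggests the Python 3.10 venv uses its own pip-installed torch rather than the one shipped in the base image.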

2 replies

bobur-amirov May 3, 2024

I have the same problem. Do you have a solution for this?

johnmccabe May 31, 2024
Author

Moved on to some other side projects and never got back to it I'm afraid.
