Training on WSL with a 4090 #167
johnmccabe
started this conversation in
General
Some additional info
Hi,
I've been hitting a brick wall trying to get the training step to run successfully in Docker WSL in Windows 11 with a 4090.
I can get as far as running piper_train, but it bombs out with the following after a few seconds (see the full error in piper_train_error.txt).
My Dockerfile looks like this (the base image ships with Python 3.8, which wasn't working):
And I then run it with the following:
Before finishing the setup (trying to get the process working before fully containerising the environment).
I also add torchmetrics==0.11.4 to requirements.txt to get past the ImportError: cannot import name '_compare_version' from 'torchmetrics.utilities.imports' error.
And finally I run the training:
.venv/bin/python3.10 -m piper_train \
    --dataset-dir /training/ \
    --accelerator 'gpu' \
    --devices 1 \
    --batch-size 32 \
    --validation-split 0.0 \
    --num-test-examples 0 \
    --max_epochs 10000 \
    --resume_from_checkpoint /checkpoints/epoch%3D9029-step%3D2261720.ckpt \
    --checkpoint-epochs 1 \
    --precision 32 \
    --max-phoneme-ids 400
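As an aside on the checkpoint path in the command above: the %3D in the filename is the percent-encoding of "=", so the name decodes to the usual epoch=...-step=... pattern. A minimal sketch of decoding it with the standard library follows; the helper name parse_checkpoint_name is hypothetical and not part of piper_train.

```python
import re
from urllib.parse import unquote


def parse_checkpoint_name(filename: str) -> dict:
    """Decode a percent-encoded checkpoint filename and pull out epoch and step.

    Hypothetical helper for illustration only; it is not part of piper_train.
    """
    decoded = unquote(filename)  # "%3D" decodes to "="
    match = re.search(r"epoch=(\d+)-step=(\d+)", decoded)
    if match is None:
        raise ValueError(f"unrecognised checkpoint name: {decoded}")
    return {
        "decoded": decoded,
        "epoch": int(match.group(1)),
        "step": int(match.group(2)),
    }


# Filename taken from the --resume_from_checkpoint argument above.
info = parse_checkpoint_name("epoch%3D9029-step%3D2261720.ckpt")
print(info)
```

This only confirms which epoch and step the checkpoint was saved at; it does not touch the checkpoint file itself.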
The PyTorch project references the error in pytorch/pytorch#88038, but jumping to a newer CUDA version would pull in torch 2.x, which would just get blown away when installing the piper_train module.
Does anything obvious jump out as being wrong in my steps, or can you share the steps that have worked for you so I can reproduce them?
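Since the question turns on which torch and torchmetrics versions actually end up in the venv, and which CUDA version that torch build targets, a small report like the one below might help when comparing setups. It is only a sketch: the function names are hypothetical, and it assumes the script is run with the same interpreter as the training (for example .venv/bin/python3.10). It falls back gracefully if torch is not installed.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def environment_report() -> dict:
    """Collect version and GPU details relevant to the setup described above."""
    report = {name: installed_version(name) for name in ("torch", "torchmetrics")}
    try:
        import torch
    except ImportError:
        # torch is not importable in this interpreter; nothing more to query.
        report["cuda_available"] = None
        return report

    # CUDA version the torch wheel was built against (None for CPU-only builds).
    report["torch_cuda_build"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        report["device_name"] = torch.cuda.get_device_name(0)
        # (major, minor) compute capability of the first visible GPU.
        report["compute_capability"] = torch.cuda.get_device_capability(0)
    return report


report = environment_report()
for key, value in report.items():
    print(f"{key}: {value}")
```

Running it once before and once after installing the piper_train module would show whether the install replaces the torch build that was there.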
Thanks !!