We use the template from https://github.com/ashleve/lightning-hydra-template. Please read the instructions there to understand the repo structure.
To train GPT-2 on OpenWebText with 8 GPUs:
python run.py experiment=owt/gpt2s-flash trainer.devices=8
python run.py experiment=owt/gpt2m-flash trainer.devices=8
python run.py experiment=owt/gpt2l-flash trainer.devices=8
To train with bf16 instead of fp16, add trainer.precision=bf16 to the command.
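For example, to train the small model with bf16:
python run.py experiment=owt/gpt2s-flash trainer.devices=8 trainer.precision=bf16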
Requirements: Python 3.8+, PyTorch 1.9+, torchvision, torchtext, pytorch-fast-transformers, munch, einops, timm, hydra-core, hydra-colorlog, python-dotenv, rich, pytorch-lightning, triton. We recommend CUDA 11.8 (e.g., using Nvidia's PyTorch Docker image from https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch).
We provide a Dockerfile that lists all the required packages.
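As a sketch (assuming PyTorch, torchvision, and the CUDA toolkit come from the NGC container recommended above), the remaining Python dependencies listed above can be installed with:
pip install torchtext pytorch-fast-transformers munch einops timm hydra-core hydra-colorlog python-dotenv rich pytorch-lightning triton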
This repo includes the following CUDA extensions:
- Fused dropout + residual + LayerNorm, adapted from Apex's FastLayerNorm:
  cd csrc/layer_norm && pip install .
- Fused matmul + bias (forward and backward) and fused matmul + bias + gelu (forward and backward), adapted from Apex's FusedDense:
  cd csrc/fused_dense_lib && pip install .
- Optimized cross-entropy loss, adapted from Apex's Xentropy:
  cd csrc/xentropy && pip install .
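To build all three extensions from the repo root in one pass (a sketch, assuming nvcc and a CUDA toolkit matching your PyTorch build are on the path):
cd csrc/layer_norm && pip install . && cd ../..
cd csrc/fused_dense_lib && pip install . && cd ../..
cd csrc/xentropy && pip install . && cd ../..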