
Add LLM example #28

Merged · 15 commits · Sep 15, 2023
3 changes: 3 additions & 0 deletions .gitignore
@@ -0,0 +1,3 @@
.vscode
llm/.env
llm/condor_log/*.txt
21 changes: 21 additions & 0 deletions llm/.env.example
@@ -0,0 +1,21 @@
# CHTC staging directory
# This is necessary for saving your checkpoints
STAGING_DIR=/staging/your-uid/your-project-dir


# WANDB (Optional)
# If you want to use wandb for tracking, set all of the WANDB_* variables
# You can obtain your API key at https://wandb.ai/authorize
WANDB_API_KEY=1234567890
# You can check your username at https://wandb.ai/settings
WANDB_ENTITY=your-wandb-user-name
WANDB_PROJECT=your-project-name


# GitHub Container Registry credentials (optional)
# To build your own container and store it on the GitHub Container Registry, set the variables below
# see: https://docs.github.com/en/packages/working-with-a-github-packages-registry/working-with-the-container-registry
# These variables are used by the build_push_container.sh script.
CR_PAT=1234
GH_USERNAME=your-github-username
GH_CONTAINER_NAME=your-container-name
7 changes: 7 additions & 0 deletions llm/Dockerfile
@@ -0,0 +1,7 @@
FROM huggingface/transformers-pytorch-gpu

RUN pip install --upgrade pip

COPY requirements.txt /tmp/requirements.txt

RUN pip install -r /tmp/requirements.txt
79 changes: 79 additions & 0 deletions llm/README.md
@@ -0,0 +1,79 @@
# Personal CHTC Submit Template for LLM Fine-Tuning

Use case: Fine-tune large language models on CHTC and optionally monitor training with Weights & Biases. This example is based on the fine-tuning tutorial in the [Hugging Face documentation](https://huggingface.co/docs/transformers/training).

![WANDB](wandb.png)

## Quick start

1. Store your WANDB credentials and `STAGING_DIR` path in an environment file named `.env`; see the provided [example](.env.example). (If you do not have a CHTC `/staging` directory, contact the facilitation team at [email protected], as described [here](https://chtc.cs.wisc.edu/uw-research-computing/file-avail-largedata).)
1. Update the `run_name` in the submit file (in the `arguments = ` line). It is used as the WANDB tracking ID, and checkpoints are saved in `STAGING_DIR/results/run_name/`.
1. (Optional) Build your own training container, see details below.
1. Modify `run.sub` as necessary.
1. If it doesn't already exist, create a `condor_log` directory: `mkdir condor_log`.
1. Submit your job with `condor_submit run.sub`. (The full command sequence is sketched below.)
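
Assuming you start from this `llm/` directory on a CHTC submit node, the sequence looks roughly like this:

```bash
# Copy the template and fill in STAGING_DIR (and, optionally, the WANDB_*
# and container variables) with your own values
cp .env.example .env

# Edit the run name in the `arguments = ` line of run.sub, then make sure
# the log directory exists
mkdir -p condor_log

# Submit the job
condor_submit run.sub
```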

## Building your own container (Optional)

Note: Perform this step on your local machine, not on a CHTC submit node.

Example resources for building a training container:

- [Dockerfile](Dockerfile)
- [requirements.txt](requirements.txt)
- [Helper script](build_push_container.sh)
- [.env](.env.example)

Users should consider building their own container to match their specific needs.
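
For example, once the `CR_PAT`, `GH_USERNAME`, and `GH_CONTAINER_NAME` variables are set in `.env`, building and pushing reduces to running the helper script on your local machine (this assumes Docker is installed there):

```bash
# Builds ghcr.io/$GH_USERNAME/$GH_CONTAINER_NAME:latest and pushes it
bash build_push_container.sh
```

Then point the `docker_image` line in `run.sub` at your new image.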

An example container image built from these files is available on [ghcr.io](https://github.com/users/jasonlo/packages/container/package/chtc_condor).

## Stack

- Docker
- GitHub Container Registry (ghcr.io)
- Hugging Face Transformers
- Weights & Biases (WANDB)

## CHTC/HTCondor Features Used

- Docker Universe
- Checkpointing
- Staging (for storing checkpoints)
- GPU
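
These features map to the following lines in [run.sub](run.sub):

```
universe = docker
checkpoint_exit_code = 85
+is_resumable = true
Requirements = (Target.HasCHTCStaging == true) && (Target.CUDADriverVersion >= 10.1)
request_gpus = 2
+GPUJobLength = "short"
```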

## FAQ

1. Why shouldn't I run `python3 train.py` directly in `run.sub`?

> The Hugging Face cache directories need to be exported to `_CONDOR_SCRATCH_DIR` as environment variables in a global scope before training starts, and I'm unaware of a simple method to do this in Python. Please let me know if you have a solution. (One possible approach is sketched after this list.)

1. Why is `+GPUJobLength = "short"` present in `run.sub`?

> The queuing duration for `long` is excessive, and since we perform checkpointing, it's more efficient to use `short`. CHTC [policy](https://chtc.cs.wisc.edu/uw-research-computing/gpu-jobs) also allows users to run far more simultaneous `short` jobs than `long` jobs.

1. Can I use additional GPUs?

> Absolutely! Just modify the `request_gpus` value in `run.sub` to your desired number. HuggingFace's [trainer](https://huggingface.co/docs/transformers/main_classes/trainer) will then automatically use all available GPUs.

1. How long does the model train?

> This example trains for a single epoch. In a research setting, you would modify the fine-tuning to train for more epochs or until training converges.
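
Regarding the first question: one possible, untested alternative (not what this example does) is to set the cache locations with `os.environ` at the very top of the training script, before `transformers` or `datasets` are imported, since both libraries resolve their cache paths from the environment:

```python
import os

# Hypothetical in-Python equivalent of the exports in run.sh: point the
# Hugging Face caches at the job's scratch directory *before* the libraries
# that read these variables are imported.
scratch = os.environ["_CONDOR_SCRATCH_DIR"]
os.environ["TRANSFORMERS_CACHE"] = os.path.join(scratch, "models")
os.environ["HF_DATASETS_CACHE"] = os.path.join(scratch, "datasets")
os.environ["HF_MODULES_CACHE"] = os.path.join(scratch, "modules")
os.environ["HF_METRICS_CACHE"] = os.path.join(scratch, "metrics")

from transformers import AutoTokenizer  # noqa: E402 (import after env setup)
```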

## To-Do list

- Consolidate all configurations into a single location? They are currently dispersed across `.env`, `run.sh`, and `run.sub`.
- Implement `wandb` hyperparameter `sweep` functionality.
- Integrate `DeepSpeed` support.
- Is it feasible or quicker to store the Docker image in `staging`?
- Experiment with a training-optimized container, such as [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch).

## Disclaimer

Please note that while Weights & Biases is a popular third-party service for logging and visualizing model training, it is not officially supported by CHTC. Though included in this example, its use does not constitute an official endorsement, and users must troubleshoot any W&B issues independently.

## About the author

Contributed by [Jason from Data Science Institute, UW-Madison](https://datascience.wisc.edu/staff/lo-jason/).
9 changes: 9 additions & 0 deletions llm/build_push_container.sh
@@ -0,0 +1,9 @@
# Load secrets (CR_PAT, GH_USERNAME, GH_CONTAINER_NAME)
source .env

# Log in to the GitHub Container Registry
echo $CR_PAT | docker login ghcr.io -u $GH_USERNAME --password-stdin

# Build and push
docker build -t ghcr.io/$GH_USERNAME/$GH_CONTAINER_NAME:latest .
docker push ghcr.io/$GH_USERNAME/$GH_CONTAINER_NAME:latest
Empty file added llm/condor_log/.gitkeep
2 changes: 2 additions & 0 deletions llm/requirements.txt
@@ -0,0 +1,2 @@
python-dotenv==1.0.0
wandb==0.15.8
13 changes: 13 additions & 0 deletions llm/run.sh
@@ -0,0 +1,13 @@
#!/bin/bash

echo "Running job on `hostname`"
echo "GPUs assigned: $CUDA_VISIBLE_DEVICES"
echo "Run name: $1"
echo "Use wandb: $2"

export TRANSFORMERS_CACHE=$_CONDOR_SCRATCH_DIR/models
export HF_DATASETS_CACHE=$_CONDOR_SCRATCH_DIR/datasets
export HF_MODULES_CACHE=$_CONDOR_SCRATCH_DIR/modules
export HF_METRICS_CACHE=$_CONDOR_SCRATCH_DIR/metrics

python3 train.py $1 $2
40 changes: 40 additions & 0 deletions llm/run.sub
@@ -0,0 +1,40 @@
JobBatchName = "LLM training template"
# Update your run name here and whether to use wandb
arguments = demo_run --use_wandb

universe = docker
docker_image = ghcr.io/jasonlo/chtc_condor:latest
docker_network_type = host

# Input files
# (the staging requirement is combined with the CUDA requirement under
# "Extra GPU settings" below, because a later Requirements line overrides
# an earlier one)
executable = run.sh
transfer_input_files = train.py, .env
should_transfer_files = YES

# Checkpoint
checkpoint_exit_code = 85
+is_resumable = true

# Logging
stream_output = true
output = condor_log/output.$(Cluster)-$(Process).txt
error = condor_log/error.$(Cluster)-$(Process).txt
log = condor_log/log.$(Cluster)-$(Process).txt

# Compute resources
request_cpus = 2
request_memory = 8GB
request_disk = 100GB

# Extra GPU settings
request_gpus = 2
Requirements = (Target.HasCHTCStaging == true) && (Target.CUDADriverVersion >= 10.1)
+WantGPULab = true
# change to true if *not* using staging for checkpoints and interested in accessing GPUs beyond CHTC
+WantFlocking = false
+WantGlidein = false
+GPUJobLength = "short"

# Runs
queue 1
88 changes: 88 additions & 0 deletions llm/train.py
@@ -0,0 +1,88 @@
import argparse
import logging
import os
from pathlib import Path

import wandb
from datasets import load_dataset
from dotenv import load_dotenv
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)
from transformers.trainer_utils import get_last_checkpoint


def train(run_name: str, use_wandb: bool = False):
    """Test training script with basic wandb logging."""

    # Fail fast with a clear error if the staging path is not configured
    staging_dir = os.getenv("STAGING_DIR")
    if staging_dir is None:
        raise RuntimeError("STAGING_DIR is not set; see .env.example")
    STAGING_DIR = Path(staging_dir)
    RESULTS_DIR = STAGING_DIR / "results" / run_name

    if use_wandb:
        print("Using wandb for logging.")
        # Reuse the run name as the run id so a resumed job continues
        # logging to the same wandb run
        wandb.init(name=run_name, id=run_name, resume="allow")
    else:
        print("Not using wandb for logging.")

    # Main training section
    dataset = load_dataset("yelp_review_full")
    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

    def tokenize_function(examples):
        return tokenizer(examples["text"], padding="max_length", truncation=True)

    dataset = dataset.map(tokenize_function, batched=True)

    train_dataset = dataset["train"].shuffle(seed=42)
    eval_dataset = dataset["test"].shuffle(seed=42)

    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-cased", num_labels=5
    )

    training_args = TrainingArguments(
        output_dir=RESULTS_DIR,  # checkpoints land in staging
        evaluation_strategy="steps",
        num_train_epochs=1,
        report_to="wandb" if use_wandb else "none",
        save_strategy="steps",
        save_total_limit=3,
        # deepspeed="deepspeed_config.json",
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
    )

    # Resume from the latest checkpoint if one exists (None on a fresh run,
    # when the results directory has not been created yet)
    last_checkpoint = (
        get_last_checkpoint(training_args.output_dir)
        if os.path.isdir(training_args.output_dir)
        else None
    )
    trainer.train(resume_from_checkpoint=last_checkpoint)

    if use_wandb:
        wandb.finish()


def main():
    """Run training script."""

    load_dotenv()
    logging.basicConfig(level=logging.INFO)

    # Get arguments from command line
    parser = argparse.ArgumentParser()
    parser.add_argument("run_name", type=str, help="Name of run.")
    parser.add_argument(
        "-w", "--use_wandb", action="store_true", help="Use wandb for logging."
    )
    args = parser.parse_args()

    train(args.run_name, args.use_wandb)


if __name__ == "__main__":
    main()
Binary file added llm/wandb.png