
Add LLM example #28

Merged · 15 commits · Sep 15, 2023
3 changes: 3 additions & 0 deletions .gitignore
@@ -0,0 +1,3 @@
.vscode
llm/.env
llm/condor_log/*.txt
21 changes: 21 additions & 0 deletions llm/.env.example
@@ -0,0 +1,21 @@
# CHTC staging directory
# This is necessary for saving your checkpoints
STAGING_DIR=/staging/your-uid/your-project-dir


# WANDB (Optional)
# If you want to use wandb for tracking, set all of the WANDB_* variables
# You can obtain your API key at https://wandb.ai/authorize
WANDB_API_KEY=1234567890
# You can check your username at https://wandb.ai/settings
WANDB_ENTITY=your-wandb-user-name
WANDB_PROJECT=your-project-name


# GitHub Container Registry credentials (optional)
# To build your own container and store it on the GitHub Container Registry, set the variables below
# see: https://docs.github.com/en/packages/working-with-a-github-packages-registry/working-with-the-container-registry
# These variables are used by the build_push_container.sh script.
CR_PAT=1234
GH_USERNAME=your-github-username
GH_CONTAINER_NAME=your-container-name
7 changes: 7 additions & 0 deletions llm/Dockerfile
@@ -0,0 +1,7 @@
FROM huggingface/transformers-pytorch-gpu

RUN pip install --upgrade pip

COPY requirements.txt /tmp/requirements.txt

RUN pip install -r /tmp/requirements.txt
79 changes: 79 additions & 0 deletions llm/README.md
@@ -0,0 +1,79 @@
# Personal CHTC Submit Template for LLM Fine-Tuning

Use case: Fine-tune large language models on CHTC and optionally monitor training with Weights & Biases. This example is based on the fine-tuning tutorial in the [Hugging Face documentation](https://huggingface.co/docs/transformers/training).

![WANDB](wandb.png)

## Quick start

1. Store your WANDB credentials and `STAGING_DIR` path in an environment file named `.env`; see the provided [example](.env.example). (If you do not have a CHTC `/staging` directory, contact the facilitation team at [email protected], as described [here](https://chtc.cs.wisc.edu/uw-research-computing/file-avail-largedata).)
1. Update the `run_name` in the submit file (in the `arguments = ` line). It is used as the WANDB tracking ID, and checkpoints are saved in `STAGING_DIR/results/run_name/`.
1. (Optional) Build your own training container, see details below.
1. Modify `run.sub` as necessary.
1. If it doesn't already exist, create a `condor_log` directory: `mkdir condor_log`.
1. Submit your job with `condor_submit run.sub`. (The full command sequence is sketched below.)
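
Assuming you start from this `llm/` directory on a CHTC submit node, the sequence looks roughly like this:

```bash
# Copy the template and fill in STAGING_DIR (and, optionally, the WANDB_*
# and container variables) with your own values
cp .env.example .env

# Edit the run name in the `arguments = ` line of run.sub, then make sure
# the log directory exists
mkdir -p condor_log

# Submit the job
condor_submit run.sub
```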

## Building your own container (Optional)

Note: Perform this step on your local machine, not on a CHTC submit node.

Example resources for building a training container:

- [Dockerfile](Dockerfile)
- [requirements.txt](requirements.txt)
- [Helper script](build_push_container.sh)
- [.env](.env.example)

Users should consider building their own container to match their specific needs.
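
For example, once the `CR_PAT`, `GH_USERNAME`, and `GH_CONTAINER_NAME` variables are set in `.env`, building and pushing reduces to running the helper script on your local machine (this assumes Docker is installed there):

```bash
# Builds ghcr.io/$GH_USERNAME/$GH_CONTAINER_NAME:latest and pushes it
bash build_push_container.sh
```

Then point the `docker_image` line in `run.sub` at your new image.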

An example container image built from these files is available on [ghcr.io](https://github.com/users/jasonlo/packages/container/package/chtc_condor).

## Stack

- Docker
- GitHub Container Registry (ghcr.io)
- Hugging Face Transformers
- Weights & Biases (WANDB)

## CHTC/HTCondor Features Used

- Docker Universe
- Checkpointing
- Staging (for storing checkpoints)
- GPU
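
These features map to the following lines in [run.sub](run.sub):

```
universe = docker
checkpoint_exit_code = 85
+is_resumable = true
Requirements = (Target.HasCHTCStaging == true) && (Target.CUDADriverVersion >= 10.1)
request_gpus = 2
+GPUJobLength = "short"
```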

## FAQ

1. Why shouldn't I run `python3 train.py` directly in `run.sub`?

> The Hugging Face cache directories need to be exported to `_CONDOR_SCRATCH_DIR` as environment variables in a global scope before training starts, and I'm unaware of a simple method to do this in Python. Please let me know if you have a solution. (One possible approach is sketched after this list.)

1. Why is `+GPUJobLength = "short"` present in `run.sub`?

> The queuing duration for `long` is excessive, and since we perform checkpointing, it's more efficient to use `short`. CHTC [policy](https://chtc.cs.wisc.edu/uw-research-computing/gpu-jobs) also allows users to run far more simultaneous `short` jobs than `long` jobs.

1. Can I use additional GPUs?

> Absolutely! Just modify the `request_gpus` value in `run.sub` to your desired number. HuggingFace's [trainer](https://huggingface.co/docs/transformers/main_classes/trainer) will then automatically use all available GPUs.

1. How long does the model train?

> This example trains for a single epoch. In a research setting, you would modify the fine-tuning to train for more epochs or until training converges.
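
Regarding the first question: one possible, untested alternative (not what this example does) is to set the cache locations with `os.environ` at the very top of the training script, before `transformers` or `datasets` are imported, since both libraries resolve their cache paths from the environment:

```python
import os

# Hypothetical in-Python equivalent of the exports in run.sh: point the
# Hugging Face caches at the job's scratch directory *before* the libraries
# that read these variables are imported.
scratch = os.environ["_CONDOR_SCRATCH_DIR"]
os.environ["TRANSFORMERS_CACHE"] = os.path.join(scratch, "models")
os.environ["HF_DATASETS_CACHE"] = os.path.join(scratch, "datasets")
os.environ["HF_MODULES_CACHE"] = os.path.join(scratch, "modules")
os.environ["HF_METRICS_CACHE"] = os.path.join(scratch, "metrics")

from transformers import AutoTokenizer  # noqa: E402 (import after env setup)
```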

## To-Do list

- Consolidate all configurations into a single location? They are currently dispersed across `.env`, `run.sh`, and `run.sub`.
- Implement `wandb` hyperparameter `sweep` functionality.
- Integrate `DeepSpeed` support.
- Is it feasible or quicker to store the Docker image in `staging`?
- Experiment with a training-optimized container, such as [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch).

## Disclaimer

Please note that while Weights & Biases is a popular third-party service for logging and visualizing model training, it is not officially supported by CHTC. Though included in this example, its use does not constitute an official endorsement, and users must troubleshoot any W&B issues independently.

## About the author

Contributed by [Jason from Data Science Institute, UW-Madison](https://datascience.wisc.edu/staff/lo-jason/).
9 changes: 9 additions & 0 deletions llm/build_push_container.sh
@@ -0,0 +1,9 @@
# Load secrets (CR_PAT, GH_USERNAME, GH_CONTAINER_NAME)
source .env

# Log in to the GitHub Container Registry
echo $CR_PAT | docker login ghcr.io -u $GH_USERNAME --password-stdin

# Build and push
docker build -t ghcr.io/$GH_USERNAME/$GH_CONTAINER_NAME:latest .
docker push ghcr.io/$GH_USERNAME/$GH_CONTAINER_NAME:latest
Empty file added llm/condor_log/.gitkeep
2 changes: 2 additions & 0 deletions llm/requirements.txt
@@ -0,0 +1,2 @@
python-dotenv==1.0.0
wandb==0.15.8
13 changes: 13 additions & 0 deletions llm/run.sh
@@ -0,0 +1,13 @@
#!/bin/bash

echo "Running job on `hostname`"
echo "GPUs assigned: $CUDA_VISIBLE_DEVICES"
echo "Run name: $1"
echo "Use wandb: $2"

export TRANSFORMERS_CACHE=$_CONDOR_SCRATCH_DIR/models
export HF_DATASETS_CACHE=$_CONDOR_SCRATCH_DIR/datasets
export HF_MODULES_CACHE=$_CONDOR_SCRATCH_DIR/modules
export HF_METRICS_CACHE=$_CONDOR_SCRATCH_DIR/metrics

python3 train.py $1 $2
40 changes: 40 additions & 0 deletions llm/run.sub
@@ -0,0 +1,40 @@
JobBatchName = "LLM training template"
# Update your run name here and whether to use wandb
arguments = demo_run --use_wandb

universe = docker
docker_image = ghcr.io/jasonlo/chtc_condor:latest
docker_network_type = host

# Input files
# (the staging requirement is combined with the CUDA requirement under
# "Extra GPU settings" below, because a later Requirements line overrides
# an earlier one)
executable = run.sh
transfer_input_files = train.py, .env
should_transfer_files = YES

# Checkpoint
checkpoint_exit_code = 85
+is_resumable = true

# Logging
stream_output = true
output = condor_log/output.$(Cluster)-$(Process).txt
error = condor_log/error.$(Cluster)-$(Process).txt
log = condor_log/log.$(Cluster)-$(Process).txt

# Compute resources
request_cpus = 2
request_memory = 8GB
request_disk = 100GB

# Extra GPU settings
request_gpus = 2
Requirements = (Target.HasCHTCStaging == true) && (Target.CUDADriverVersion >= 10.1)
+WantGPULab = true
# change to true if *not* using staging for checkpoints and interested in accessing GPUs beyond CHTC
+WantFlocking = false
+WantGlidein = false
+GPUJobLength = "short"

# Runs
queue 1
88 changes: 88 additions & 0 deletions llm/train.py
@@ -0,0 +1,88 @@
import argparse
import logging
import os
from pathlib import Path

import wandb
from datasets import load_dataset
from dotenv import load_dotenv
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)
from transformers.trainer_utils import get_last_checkpoint


def train(run_name: str, use_wandb: bool = False):
    """Test training script with basic wandb logging."""

    # Fail fast with a clear error if the staging path is not configured
    staging_dir = os.getenv("STAGING_DIR")
    if staging_dir is None:
        raise RuntimeError("STAGING_DIR is not set; see .env.example")
    STAGING_DIR = Path(staging_dir)
    RESULTS_DIR = STAGING_DIR / "results" / run_name

    if use_wandb:
        print("Using wandb for logging.")
        # Reuse the run name as the run id so a resumed job continues
        # logging to the same wandb run
        wandb.init(name=run_name, id=run_name, resume="allow")
    else:
        print("Not using wandb for logging.")

    # Main training section
    dataset = load_dataset("yelp_review_full")
    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

    def tokenize_function(examples):
        return tokenizer(examples["text"], padding="max_length", truncation=True)

    dataset = dataset.map(tokenize_function, batched=True)

    train_dataset = dataset["train"].shuffle(seed=42)
    eval_dataset = dataset["test"].shuffle(seed=42)

    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-cased", num_labels=5
    )

    training_args = TrainingArguments(
        output_dir=RESULTS_DIR,  # checkpoints land in staging
        evaluation_strategy="steps",
        num_train_epochs=1,
        report_to="wandb" if use_wandb else "none",
        save_strategy="steps",
        save_total_limit=3,
        # deepspeed="deepspeed_config.json",
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
    )

    # Resume from the latest checkpoint if one exists (None on a fresh run,
    # when the results directory has not been created yet)
    last_checkpoint = (
        get_last_checkpoint(training_args.output_dir)
        if os.path.isdir(training_args.output_dir)
        else None
    )
    trainer.train(resume_from_checkpoint=last_checkpoint)

    if use_wandb:
        wandb.finish()


def main():
    """Run training script."""

    load_dotenv()
    logging.basicConfig(level=logging.INFO)

    # Get arguments from command line
    parser = argparse.ArgumentParser()
    parser.add_argument("run_name", type=str, help="Name of run.")
    parser.add_argument(
        "-w", "--use_wandb", action="store_true", help="Use wandb for logging."
    )
    args = parser.parse_args()

    train(args.run_name, args.use_wandb)


if __name__ == "__main__":
    main()
Binary file added llm/wandb.png