
KhoomeiK/complexity-scaling


*Figure: Comparison of parameter-data scaling contours for datasets of two different gzip-compressibilities.*

🐦 Twitter   •   📄 Arxiv   •   🤗 Datasets

🔗 Multimodal CodeGen for Web Data Extraction

gzip Predicts Data-dependent Scaling Laws

This is the official code for gzip Predicts Data-dependent Scaling Laws (under review at NeurIPS 2024).

We find that:

  1. scaling laws are sensitive to differences in data complexity
  2. gzip, a compression algorithm, is an effective predictor of how data complexity impacts scaling properties
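A minimal sketch of measuring gzip-compressibility as a compressed-to-raw byte ratio (the repo's `data_utils.py` implements this measurement; the exact normalization here is an assumption, not necessarily the paper's definition):

```python
import gzip
import random

def gzip_compressibility(text: str) -> float:
    """Compressed-to-raw byte ratio; lower means more compressible,
    i.e. less complex under gzip. (Normalization is an assumption.)"""
    raw = text.encode("utf-8")
    return len(gzip.compress(raw)) / len(raw)

# Repetitive text compresses far better than high-entropy text.
repetitive = "the cat sat on the mat. " * 200
random.seed(0)
varied = "".join(chr(random.randint(33, 126)) for _ in range(4800))

print(gzip_compressibility(repetitive))  # small ratio
print(gzip_compressibility(varied))      # ratio near 1
```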

Our data-dependent scaling law's compute-optimal frontier shifts toward preferring dataset size over parameter count as training data becomes more complex (i.e., harder to compress).
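To make the frontier claim concrete, here is a sketch of the standard closed-form compute-optimal allocation for a Chinchilla-style parametric law L(N, D) = E + A/N^alpha + B/D^beta under the budget C ≈ 6ND; the constants below are hypothetical, not fitted values from the paper:

```python
def optimal_allocation(A, B, alpha, beta, C):
    """Closed-form minimizer of A/N**alpha + B/D**beta subject to
    the compute constraint C = 6 * N * D (FLOPs ~ 6ND heuristic)."""
    G = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    N = G * (C / 6) ** (beta / (alpha + beta))
    D = (C / 6) / N
    return N, D

# Hypothetical constants; more complex (harder-to-compress) data would
# change the fitted exponents so more of the budget goes to data (D).
N_opt, D_opt = optimal_allocation(A=400.0, B=1500.0, alpha=0.34, beta=0.28, C=1e21)
print(f"N* = {N_opt:.3e}, D* = {D_opt:.3e}")
```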

Code Overview

  • data_gen.py: create PCFGs with specified syntactic properties and sample text datasets from them
  • data_utils.py: gzip-compressibility measurement, tokenization & HuggingFace tooling, dataloaders, etc.
  • training.py: run a single training run for a given model and dataset, returning the loss at each training step
  • main.py: run a set of training runs across datasets & model sizes (hackily GPU-parallelized with threading)
  • fsdp_training.py: for running bigger jobs with cleaner data loading & FSDP training

Upon request via email, we can also provide:

  • JSONL records of all training runs (too large to host on GitHub)
  • the Jupyter Notebook used to fit scaling laws from training runs and generate all visuals
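The notebook itself is available on request, but fitting a Chinchilla-style law to (N, D, loss) triples can be sketched with `scipy.optimize.curve_fit`; the synthetic runs and initial guesses below are illustrative only, not the paper's data:

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(X, E, A, B, alpha, beta):
    """Parametric loss surface L(N, D) = E + A/N**alpha + B/D**beta."""
    N, D = X
    return E + A / N**alpha + B / D**beta

# Synthetic "training runs" drawn from known (hypothetical) parameters.
rng = np.random.default_rng(0)
N = rng.uniform(1e6, 1e8, size=64)
D = rng.uniform(1e7, 1e9, size=64)
y = scaling_law((N, D), 1.7, 400.0, 1500.0, 0.34, 0.28)

popt, _ = curve_fit(scaling_law, (N, D), y,
                    p0=(2.0, 500.0, 2000.0, 0.3, 0.3), maxfev=50000)
print(popt)
```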
