`gzip` Predicts Data-dependent Scaling Laws

This is the official code for gzip Predicts Data-dependent Scaling Laws (under review at NeurIPS 2024).

We find that:

scaling laws are sensitive to differences in data complexity
gzip, a compression algorithm, is an effective predictor of how data complexity impacts scaling properties
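As a rough illustration of using gzip as a complexity measure, the sketch below computes the ratio of compressed size to raw size for a string. The function name `gzip_ratio` and the two sample strings are hypothetical, chosen only for illustration; this is not the measurement code in `data_utils.py`, which may operate on tokens or use different settings.

```python
import gzip
import random


def gzip_ratio(text: str) -> float:
    """Return compressed bytes / raw bytes; lower means easier to compress."""
    raw = text.encode("utf-8")
    if not raw:
        raise ValueError("text must be non-empty")
    return len(gzip.compress(raw)) / len(raw)


# Hypothetical sample data: a highly repetitive string and a pseudo-random one.
rng = random.Random(0)
repetitive = "ab" * 5000
noisy = "".join(rng.choice("abcdefghijklmnopqrstuvwxyz") for _ in range(10000))

print(f"repetitive: {gzip_ratio(repetitive):.3f}")
print(f"noisy:      {gzip_ratio(noisy):.3f}")
```

The repetitive string should yield a much smaller ratio than the pseudo-random one, which is the sense in which harder-to-compress data is treated as more complex.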

Under our data-dependent scaling law, the compute-optimal frontier shifts toward favoring dataset size over parameter count as training data becomes more complex (harder to compress).
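To make the compute-optimal frontier statement concrete, here is a sketch that assumes a Chinchilla-style parametric loss. This form is an assumption for illustration and is not stated in this README. The symbols $N$ (parameter count), $D$ (training tokens), $C$ (compute), and the fitted constants $E, A, B, \alpha, \beta$ are likewise assumptions.

```latex
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\qquad C \approx 6 N D

% Minimizing L subject to fixed C gives the compute-optimal allocation:
N_{\mathrm{opt}}(C) \propto C^{\frac{\beta}{\alpha + \beta}},
\qquad
D_{\mathrm{opt}}(C) \propto C^{\frac{\alpha}{\alpha + \beta}}
```

In this form, a data-dependent law lets the fitted constants vary with a compressibility measure of the training data. A growing preference for dataset size corresponds to the exponent $\alpha / (\alpha + \beta)$ increasing as the data becomes harder to compress.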

Code Overview

data_gen.py: create PCFGs with specified syntactic properties and sample text datasets from them
data_utils.py: gzip-compressibility measurement, tokenization & HuggingFace tooling, dataloaders, etc.
training.py: run a single training run for a given model and dataset, returning the loss at each training step
main.py: run a set of training runs across datasets & model sizes (hackily GPU-parallelized with threading)
fsdp_training.py: for running bigger jobs with cleaner data loading & FSDP training
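For readers unfamiliar with sampling text from a PCFG, the following minimal sketch shows the idea. The toy grammar, the `sample` function, and the seed are all hypothetical; `data_gen.py` generates grammars with controlled syntactic properties and is not reproduced here.

```python
import random

# Hypothetical toy PCFG: nonterminal -> list of (right-hand side, probability).
# Any symbol that is not a key in GRAMMAR is treated as a terminal token.
GRAMMAR = {
    "S": [(("NP", "VP"), 1.0)],
    "NP": [(("the", "N"), 0.7), (("the", "ADJ", "N"), 0.3)],
    "VP": [(("V", "NP"), 0.6), (("V",), 0.4)],
    "N": [(("cat",), 0.5), (("dog",), 0.5)],
    "ADJ": [(("small",), 1.0)],
    "V": [(("sees",), 0.5), (("sleeps",), 0.5)],
}


def sample(symbol: str, rng: random.Random) -> list[str]:
    """Expand a symbol top-down, picking each production by its probability."""
    if symbol not in GRAMMAR:
        return [symbol]  # terminal: emit as-is
    rules = GRAMMAR[symbol]
    rhs = rng.choices([r for r, _ in rules], weights=[p for _, p in rules], k=1)[0]
    tokens: list[str] = []
    for child in rhs:
        tokens.extend(sample(child, rng))
    return tokens


rng = random.Random(0)
sentences = [" ".join(sample("S", rng)) for _ in range(5)]
for s in sentences:
    print(s)
```

Changing the number of nonterminals, the number of productions per nonterminal, or the length of right-hand sides in a grammar like this changes how compressible the sampled text is.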

Upon request via email, we can also provide:

JSONL records of all training runs (this is large and can't fit on GitHub)
the Jupyter Notebook used to fit scaling laws from training runs and generate all visuals

KhoomeiK/complexity-scaling

Folders and files

Latest commit

History

Repository files navigation

gzip Predicts Data-dependent Scaling Laws

Code Overview

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

`gzip` Predicts Data-dependent Scaling Laws

Packages