
Peteish13 #739

Open: wants to merge 303 commits into base: main

Changes from all commits (303 commits):
4824c8b
Running on a quarter of the nodes, not half :-(
dirkgr Oct 3, 2024
46f907f
Can't go this fast
dirkgr Oct 3, 2024
35ae040
rewrite runs
soldni Oct 4, 2024
161f59a
Add a config for medium lr
dirkgr Oct 4, 2024
99aff8e
New metrics
dirkgr Oct 4, 2024
89ef109
Adds the "dynamic" option to our `torch.compile()` support.
dirkgr Oct 4, 2024
acf8975
Stop at a multiple of 1000 steps while we have a perf issue after eval
dirkgr Oct 4, 2024
0656ce5
New metrics
dirkgr Oct 4, 2024
b699753
compile.dynamic=false
dirkgr Oct 4, 2024
ca8c485
Don't stop
dirkgr Oct 4, 2024
5f1a369
No more torch version pin
dirkgr Oct 5, 2024
d6cdc0c
Run scripts to run on Augusta
dirkgr Oct 5, 2024
57cc09b
This runs out of memory on eval.
dirkgr Oct 5, 2024
88a06bc
Reproduce the eval we're now missing.
dirkgr Oct 5, 2024
0f6b896
Run 1000 steps
dirkgr Oct 5, 2024
ce11f9f
Let's see if dynamic=true solves the problem.
dirkgr Oct 5, 2024
c1d3ffe
It did not.
dirkgr Oct 5, 2024
d3d39a0
Set device when initializing the process group
dirkgr Oct 5, 2024
d8f2aac
Proper way of setting a device_id
dirkgr Oct 5, 2024
2c0d11a
Maybe this incantation
dirkgr Oct 5, 2024
6cc6f62
Don't eval with a compiled model
dirkgr Oct 5, 2024
43184f3
Medium LR on Weka
dirkgr Oct 5, 2024
435c3e6
Revert "Don't eval with a compiled model"
dirkgr Oct 5, 2024
a74ab6e
Let's try this instead.
dirkgr Oct 5, 2024
08718b9
Updated metrics
dirkgr Oct 5, 2024
b093945
This might be faster yet.
dirkgr Oct 5, 2024
51d0a39
Makes the launch_train.sh script work
dirkgr Oct 5, 2024
7cf11d8
Script to run something on all nodes
dirkgr Oct 5, 2024
212f55f
Back to regular compile
dirkgr Oct 5, 2024
68b022f
new config
soldni Oct 6, 2024
2594a2a
Show all errors when something fails
dirkgr Oct 6, 2024
4888fab
Peteish13 config for running in Google
dirkgr Oct 6, 2024
59f2247
Specify the number of nodes on the command line
dirkgr Oct 6, 2024
746d394
Specify the number of nodes on the command line, part 2
dirkgr Oct 6, 2024
ea5ca3f
Correct new path of the venv
dirkgr Oct 6, 2024
b52f228
Silence warning
dirkgr Oct 6, 2024
def6987
More informative logs
dirkgr Oct 6, 2024
0b70025
Set the number of nodes correctly
dirkgr Oct 6, 2024
23f743c
Disable files to do a speed test
dirkgr Oct 6, 2024
a9c7daa
Set gcloud project in environment variable
dirkgr Oct 6, 2024
a9bb215
New metrics
dirkgr Oct 6, 2024
174da42
Have to hold on to the GCS client so we don't overwhelm the GCS metad…
dirkgr Oct 6, 2024
1678ef6
Run with NCCL debugging
dirkgr Oct 7, 2024
fce511b
Merge remote-tracking branch 'origin/peteish13' into peteish13-augusta
dirkgr Oct 7, 2024
129426a
Too much NCCL noise
dirkgr Oct 7, 2024
9858048
Merge remote-tracking branch 'origin/peteish13-augusta' into peteish1…
dirkgr Oct 7, 2024
e2a6840
All files are there now.
dirkgr Oct 7, 2024
ca0abef
New metrics
dirkgr Oct 7, 2024
2f9a726
Launch script for a big peteish13-medlr on Augusta
dirkgr Oct 7, 2024
7e8ba3f
This will have to load from the remote save folder. No local storage …
dirkgr Oct 7, 2024
695f53d
Kill other processes before we start
dirkgr Oct 7, 2024
aba734b
Show the output of run_all_nodes properly.
dirkgr Oct 7, 2024
7a76b03
Comment out DCLM to see if this is what kills the startup
dirkgr Oct 7, 2024
2f94412
Revert "Comment out DCLM to see if this is what kills the startup"
dirkgr Oct 7, 2024
1b020e6
added code back
soldni Oct 7, 2024
a424b25
Make the first host configurable
dirkgr Oct 8, 2024
2f3bddf
Scripts for med and high lr
dirkgr Oct 8, 2024
e66bce7
Send the WandB API key
dirkgr Oct 8, 2024
c40635c
mask
soldni Oct 8, 2024
054dc2d
Enable compilation
dirkgr Oct 9, 2024
71284d1
Peteish7 for Google
dirkgr Oct 9, 2024
5871379
New metrics
dirkgr Oct 9, 2024
05ead0e
Make hostfiles an argument to the scripts
dirkgr Oct 9, 2024
e42206c
Launch config for peteish7-highlr
dirkgr Oct 9, 2024
c151681
Load the correct step0
dirkgr Oct 9, 2024
eeba9cc
Don;t bind to CPUs
dirkgr Oct 9, 2024
48f7b78
More nodes
dirkgr Oct 9, 2024
3f7bd8f
Peteish7 with medium LR
dirkgr Oct 11, 2024
2316171
Don't bind to CPUs
dirkgr Oct 11, 2024
10c468b
Make the rendezvous work when the hostfile is jumbled
dirkgr Oct 11, 2024
29b9b9c
Enable NCCL debug
dirkgr Oct 11, 2024
cbe9781
Use GS more
dirkgr Oct 11, 2024
e7c212b
Shuffle nodes
dirkgr Oct 11, 2024
892d224
Eval every 500 steps
dirkgr Oct 11, 2024
5315e68
Eval every 500 steps
dirkgr Oct 11, 2024
f5badaf
New metrics
dirkgr Oct 11, 2024
ce06d18
Launch script to continue the low LR 13B
dirkgr Oct 11, 2024
c2abab4
Make the launch script work even when some nodes are commented out in…
dirkgr Oct 12, 2024
e717f44
Adds all_reduce_bench from Stas
dirkgr Oct 15, 2024
935d695
Fix all_reduce_bench
dirkgr Oct 15, 2024
bc4f6b9
Metrics interval makes a big difference
dirkgr Oct 15, 2024
01669d8
Settings for more fast peteish13
dirkgr Oct 15, 2024
81a2fa9
Several jobs have made progress
dirkgr Oct 15, 2024
f91cebd
Script that checks node pairs for speed
dirkgr Oct 15, 2024
6f068fa
New way of launching jobs for the peteish13 config
dirkgr Oct 15, 2024
78f9c6f
Experimental script to launch jobs without torchrun
dirkgr Oct 15, 2024
b329bab
Switch to the MPI version
dirkgr Oct 15, 2024
b15bcc4
Metrics
dirkgr Oct 15, 2024
aff2124
Commit to hostpatterns
dirkgr Oct 15, 2024
09f447c
run_all_nodes does not need hostpatterns, since it runs on all nodes
dirkgr Oct 15, 2024
e4a103d
Forgot some more settings
dirkgr Oct 15, 2024
c0a130c
Put NCCL_DEBUG somewhere else
dirkgr Oct 15, 2024
21c8fa3
Set the first host properly
dirkgr Oct 15, 2024
f8ef856
Reshuffle for better logs
dirkgr Oct 15, 2024
35300c0
Maybe high ports don't work
dirkgr Oct 15, 2024
c130118
Faster timeouts for debugging
dirkgr Oct 15, 2024
7f12d39
First is first now
dirkgr Oct 15, 2024
a3db348
Merge branch 'peteish13-augusta' of https://github.com/allenai/LLM in…
dirkgr Oct 15, 2024
3ea5a35
Revert "Faster timeouts for debugging"
dirkgr Oct 15, 2024
88ec63d
Make the launcher a little more flexible
dirkgr Oct 15, 2024
cfb7f4c
new flan cleaned
soldni Oct 16, 2024
19b5ad8
name-fix
soldni Oct 16, 2024
8ff9ebd
More retries for GCS
dirkgr Oct 16, 2024
b79afee
Metrics
dirkgr Oct 16, 2024
ee08b92
Turns out Google APIs don't work that way.
dirkgr Oct 17, 2024
0b3329e
Makes finding the latest checkpoint work on GCS
dirkgr Oct 17, 2024
0263c7a
Remove unused code
dirkgr Oct 17, 2024
3e3e2e3
Fix imports
dirkgr Oct 17, 2024
d24a198
Make the code less readable
dirkgr Oct 17, 2024
82aa577
Make mypy happy
dirkgr Oct 17, 2024
32b869b
more decon
soldni Oct 17, 2024
58cd108
New metrics :-(
dirkgr Oct 19, 2024
5ed69ed
decon-hard-train
soldni Oct 21, 2024
db18672
More metrics
dirkgr Oct 21, 2024
fdc998c
Merge commit '68b022f0bf081891704777f2d48bf836b0934d72' into peteish1…
dirkgr Oct 21, 2024
e90423e
Config for Peteish annealing
dirkgr Oct 23, 2024
4f6e96b
Correct path
dirkgr Oct 23, 2024
b4652d4
Clean up launcher scripts
dirkgr Oct 23, 2024
2144b57
New metrics
dirkgr Oct 23, 2024
ef3dca1
Annealing config for peteish7-highlr
dirkgr Oct 23, 2024
aa84ed4
Running on half the nodes for now
dirkgr Oct 23, 2024
992b884
Merge remote-tracking branch 'origin/peteish13-augusta' into peteish1…
dirkgr Oct 23, 2024
de3142a
Make sure we don't train to 5T tokens
dirkgr Oct 23, 2024
c5f2d47
Anneal script for medium LR
dirkgr Oct 23, 2024
b03f23d
Permissions
dirkgr Oct 24, 2024
b76eb1a
Merge branch 'peteish13-augusta' of https://github.com/allenai/LLM in…
dirkgr Oct 24, 2024
d22347c
More robust GCS downloads
dirkgr Oct 25, 2024
683afed
Metrics
dirkgr Oct 25, 2024
e52cd73
Config to run the low LR annealing experiment
dirkgr Oct 25, 2024
f6c1eea
Actually go to 4T
dirkgr Oct 28, 2024
05957f2
Metrics
dirkgr Oct 28, 2024
810e23d
Peteish7 XHigh
dirkgr Oct 29, 2024
2d12eb6
Peteish7 XHigh Anneal
dirkgr Oct 29, 2024
61bdf2b
Scripts for anneals on Beakerized Augusta
dirkgr Oct 29, 2024
9692409
Fix dangerous oversight
dirkgr Oct 29, 2024
aa3fd74
100B anneals
dirkgr Oct 29, 2024
cf6a87f
Metrics
dirkgr Oct 29, 2024
37d57f9
Second epoch for the 13B
dirkgr Oct 29, 2024
9015cf1
Merge branch 'peteish13-augusta' of https://github.com/allenai/OLMo i…
dirkgr Oct 29, 2024
76763ad
Inside the container, this path is different
dirkgr Oct 29, 2024
e23a829
Peteish13 anneal with unchanged data
dirkgr Oct 29, 2024
51fa643
Merge branch 'peteish13-augusta' of https://github.com/allenai/LLM in…
dirkgr Oct 29, 2024
793bf63
Need to specify epoch
dirkgr Oct 29, 2024
ddea087
Metrics
dirkgr Oct 29, 2024
11f11c8
Leftover variable
dirkgr Oct 29, 2024
af2c4fd
One more variable
dirkgr Oct 29, 2024
2bb3de8
Ugh
dirkgr Oct 29, 2024
e69e577
This is why you shouldn't write code when you're tired.
dirkgr Oct 29, 2024
9cda3de
Don't need profiles
dirkgr Oct 29, 2024
e7f3bf5
We should not skip this.
dirkgr Oct 29, 2024
e056744
Install more stuff with conda
dirkgr Oct 29, 2024
8c710e8
Maybe this?
dirkgr Oct 29, 2024
b899c18
combining flan and math
soldni Oct 29, 2024
888a9ff
Pete says this is faster without flash anyways!
dirkgr Oct 29, 2024
1abccdd
Don't install flash
dirkgr Oct 29, 2024
d2ca20c
Launch script for 13B anneals
dirkgr Oct 30, 2024
ab4ceb6
Best guess annealing config for 13B
dirkgr Oct 30, 2024
23f8f14
Better mix
dirkgr Oct 30, 2024
2246140
Merge remote-tracking branch 'origin/main' into peteish13-augusta
dirkgr Oct 30, 2024
43c8480
Better evals for the 13B anneals
dirkgr Oct 30, 2024
9a77b60
Epoch is not 1
dirkgr Oct 30, 2024
eb3d754
Script to clear older steps
dirkgr Oct 30, 2024
eafdc3e
Fix the name, and therefore the place where we save things.
dirkgr Oct 30, 2024
8726e45
Merge branch 'peteish13-augusta' of https://github.com/allenai/LLM in…
dirkgr Oct 30, 2024
605a3c0
Eval twice as fast
dirkgr Oct 30, 2024
4ea0d94
We need FA after all.
dirkgr Oct 30, 2024
5408e7e
Try to load from a checkpoint
dirkgr Oct 30, 2024
39c964a
Load latest save
dirkgr Oct 30, 2024
6eec513
We have to restore the dataloader now!
dirkgr Oct 30, 2024
7ea0112
Peteish7 10xlr
dirkgr Oct 31, 2024
d7dad11
Continuing this run
dirkgr Oct 31, 2024
7dba688
Merge commit 'b899c18fb92e14641dc84ce2a8ebf167e54d40c8' into peteish1…
dirkgr Nov 1, 2024
8031acd
7B anneal on Augusta/Beaker
dirkgr Nov 1, 2024
7973bef
Merge branch 'peteish13-augusta' of https://github.com/allenai/LLM in…
dirkgr Nov 1, 2024
c35f8c5
GS authentication is a fucking joke
dirkgr Nov 1, 2024
ae111f4
Revert "GS authentication is a fucking joke"
dirkgr Nov 1, 2024
c141a35
Try load latest save
dirkgr Nov 1, 2024
acba910
Set retries
dirkgr Nov 1, 2024
75a1395
Merge remote-tracking branch 'origin/main' into peteish13-augusta
dirkgr Nov 2, 2024
f8f8a1e
Fix merge gore
dirkgr Nov 2, 2024
4f5866b
So urgent!
dirkgr Nov 2, 2024
21edc84
Don't wait so long for startup
dirkgr Nov 2, 2024
2e1fd4a
Don't overwrite checkpoints!
dirkgr Nov 2, 2024
a1271eb
Cleanup of this config
dirkgr Nov 2, 2024
cb9c22a
Consolidate the 1B config
dirkgr Nov 3, 2024
acf1ccc
Higher priority
dirkgr Nov 3, 2024
3de3c8e
Smaller timeout
dirkgr Nov 3, 2024
e598e0c
Let's just all run with high so we don't preempt each other.
dirkgr Nov 3, 2024
cb740a6
Peteish1 configs
dirkgr Nov 3, 2024
74e30ca
More data loading workers
dirkgr Nov 4, 2024
de19d65
Change peteish7 lr anneals so that they resume from the middle
dirkgr Nov 4, 2024
a6b1c36
With retries!
dirkgr Nov 4, 2024
bdb6c13
Need to specify this now :-(
dirkgr Nov 4, 2024
1ac0437
Bounce epoch
dirkgr Nov 4, 2024
e87265c
Settings for speed!
dirkgr Nov 5, 2024
02ae870
We are urgently important.
dirkgr Nov 5, 2024
62927bd
We are only highly important.
dirkgr Nov 5, 2024
0e8246e
Another config from Pete
dirkgr Nov 5, 2024
8b9b0e4
Try making more GPUs warm
dirkgr Nov 5, 2024
9b1c670
Don't eval so often
dirkgr Nov 5, 2024
f33abac
Even less frequently
dirkgr Nov 5, 2024
cdf4319
High LR config for the 1B
dirkgr Nov 5, 2024
6ca55e7
muP LR
dirkgr Nov 6, 2024
ede2464
Merge remote-tracking branch 'origin/main' into peteish13-augusta
dirkgr Nov 7, 2024
2ba7632
Anneal with different random seeds
dirkgr Nov 7, 2024
57dac1c
Fix settings
dirkgr Nov 7, 2024
d78fb17
Peteish7 medlr on Beaker
dirkgr Nov 7, 2024
2ee05eb
Forgot to substitute variable
dirkgr Nov 7, 2024
6819484
Need different settings for Augusta
dirkgr Nov 7, 2024
e688f07
We can get away with this.
dirkgr Nov 7, 2024
1959960
Peteish13 on Beaker
dirkgr Nov 8, 2024
a489d84
Fix variable
dirkgr Nov 8, 2024
21143d4
Set the remote folder sensibly
dirkgr Nov 8, 2024
cdc544c
Forgot to put the seed in the run name
dirkgr Nov 8, 2024
76e42d7
Wrangle the variables some more
dirkgr Nov 8, 2024
df55ddb
Fix wandb
dirkgr Nov 9, 2024
3dbc034
Peteish13 on Beaker
dirkgr Nov 9, 2024
dbaa442
Another annealing config
dirkgr Nov 9, 2024
2e635cf
Anneal for peteish7 medlr
dirkgr Nov 9, 2024
842ad6a
Actually run to 5T
dirkgr Nov 9, 2024
f18381e
Try with flash
dirkgr Nov 11, 2024
3a2d62e
Dockerfile
dirkgr Nov 12, 2024
0e7b088
Change image
dirkgr Nov 12, 2024
320fd7b
Can't have retries when it's configured like this.
dirkgr Nov 12, 2024
6099826
So urgent!
dirkgr Nov 12, 2024
30eb3a6
Document what we're doing.
dirkgr Nov 12, 2024
f6c4cb8
We can retry again!
dirkgr Nov 12, 2024
2ec3fa0
Run peteish1 on fewer nodes
dirkgr Nov 12, 2024
17b2bca
Actually use all the nodes
dirkgr Nov 12, 2024
eba9075
No retries while we're debugging
dirkgr Nov 12, 2024
7a35ce9
Try longer to start up
dirkgr Nov 12, 2024
34d12fe
Anneal the anneal
dirkgr Nov 13, 2024
8908d65
Annealing config for the 1B
dirkgr Nov 13, 2024
c5b0369
Tix fypo
dirkgr Nov 13, 2024
b3668f4
Turn flash attention back on for 7B anneals
dirkgr Nov 13, 2024
d657a19
No whammy 3 config
Nov 13, 2024
35aaaf3
Urgent
dirkgr Nov 13, 2024
7dc73b3
Merge branch 'peteish13-augusta' of https://github.com/allenai/LLM in…
dirkgr Nov 13, 2024
bca921f
added config
soldni Nov 14, 2024
7b0c0f8
Run with retries
dirkgr Nov 15, 2024
949d80c
Don't run out of space.
dirkgr Nov 15, 2024
ab7d870
Merge branch 'peteish13-augusta' of https://github.com/allenai/LLM in…
dirkgr Nov 15, 2024
93df396
Remove all Augusta specific stuff
dirkgr Nov 15, 2024
bad96cd
Fix paths
dirkgr Nov 15, 2024
214aea5
Remove unused config
dirkgr Nov 15, 2024
200bd1f
Delete all the LUMI scripts
dirkgr Nov 15, 2024
5d8da46
Remove metrics notebook
dirkgr Nov 15, 2024
8b709b9
Changelog
dirkgr Nov 15, 2024
c7c0c5b
Productivity through formatting
dirkgr Nov 15, 2024
6f4a49a
Config for more 13B anneals
dirkgr Nov 16, 2024

Files changed:

3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -14,6 +14,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 - `one_in_eight` configuration for activation checkpointing
 - New tokenizer in the source instead of from huggingface
 - Improved support for GCS
+- `torch.compile()` now only compiles each block, not the whole model.
+- Support for `torch.compile()` with `dynamic=True`.
+- Resetting the `torch.compile()` state after every evaluation, because evaluation messes with the compiled versions.

 ## [v0.5.1](https://github.com/allenai/OLMo/releases/tag/v0.5.1) - 2024-10-17
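The first of the added changelog entries, compiling each block rather than the whole model, keeps each compiled graph small, so compilation is faster and a recompile only touches one block. A minimal sketch of the idea (not OLMo's actual trainer code), assuming the model exposes its transformer blocks as an `nn.ModuleList` attribute named `blocks`:

```python
from typing import Optional

import torch
import torch.nn as nn


def compile_blocks(model: nn.Module, dynamic: Optional[bool] = None) -> nn.Module:
    # Compile each transformer block separately instead of wrapping the whole
    # model in a single torch.compile() call: the graphs stay small, and a
    # shape change recompiles one block graph rather than the entire model.
    for i in range(len(model.blocks)):  # `blocks` is an assumed attribute name
        model.blocks[i] = torch.compile(model.blocks[i], dynamic=dynamic)
    return model
```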

1,206 changes: 1,206 additions & 0 deletions configs/annealing/peteish7-weka-anneal-from-928646-50B-nowup-refine-rw.yaml


1,381 changes: 1,381 additions & 0 deletions configs/peteish1-google.yaml


2 changes: 1 addition & 1 deletion configs/peteish1-weka.yaml
@@ -84,7 +84,7 @@ save_num_unsharded_checkpoints_to_keep: -1
 load_path: null

 max_duration: 1ep
-global_train_batch_size: 1024
+global_train_batch_size: 512
 device_train_microbatch_size: 4

 precision: amp_bf16
1,380 changes: 1,380 additions & 0 deletions configs/peteish13-google.yaml


2 changes: 1 addition & 1 deletion configs/peteish13-s3.yaml
@@ -84,7 +84,7 @@ save_num_unsharded_checkpoints_to_keep: -1
 load_path: null

 max_duration: 1ep
-global_train_batch_size: 1024
+global_train_batch_size: 2048
 device_train_microbatch_size: 2

 precision: amp_bf16
1,380 changes: 1,380 additions & 0 deletions configs/peteish13-weka.yaml


1,382 changes: 1,382 additions & 0 deletions configs/peteish7-google.yaml


11 changes: 11 additions & 0 deletions olmo/config.py
@@ -696,6 +696,17 @@ class CompilerConfig(BaseConfig):
     The backend to use.
     """

+    dynamic: Optional[bool] = None
+    """
+    From the torch docs:
+
+    Use dynamic shape tracing. When this is True, we will up-front attempt to generate a kernel that is as dynamic
+    as possible to avoid recompilations when sizes change. This may not always work as some
+    operations/optimizations will force specialization; use TORCH_LOGS=dynamic to debug overspecialization. When
+    this is False, we will NEVER generate dynamic kernels, we will always specialize. By default (None), we
+    automatically detect if dynamism has occurred and compile a more dynamic kernel upon recompile.
+    """
+

 class DistributedStrategy(StrEnum):
     ddp = "ddp"
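For quick reference, the three values of this config field map directly onto `torch.compile()`'s `dynamic` argument; a small illustration (not taken from this PR):

```python
import torch
import torch.nn as nn

model = nn.Linear(1024, 1024)

# None (the default): specialize on the first shapes seen, then detect
# dynamism and compile a more dynamic kernel upon recompile.
auto = torch.compile(model, dynamic=None)

# True: attempt up front to generate a kernel that is as dynamic as possible,
# so a change in input size (e.g. a different eval batch size) does not
# trigger recompilation.
dyn = torch.compile(model, dynamic=True)

# False: never generate dynamic kernels; every new shape specializes again.
static = torch.compile(model, dynamic=False)
```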
4 changes: 4 additions & 0 deletions olmo/train.py
@@ -1036,6 +1036,10 @@ def eval(self) -> Dict[str, Any]:

         del eval_batches

+        # Eval compiles a bunch more versions, and the result is terrible. This way we get back to zero.
+        if self.cfg.compile is not None:
+            torch.compiler.reset()
+
         return eval_metrics

     def check_if_cancelled(self) -> Tuple[bool, int]:

Review thread on this change:

Member: What do you mean, that the result is terrible?

Member: So this prompted me to look into this a bit more, and I think I've found a better solution: just mark the model input sizes as dynamic. I tested this out in OLMo-core and it appears to work well. allenai/OLMo-core#105

Member Author: I think it compiles a bunch of versions for different batch sizes, because that's how we call it during eval, and then they stick around. In all of my early runs I had high tps until the first eval, and then low tps afterwards. This is what fixed it.

Member Author: I tried dynamic and it was bad. I don't remember the way in which it was bad, but it didn't work. That's why I added that version in the first place.

Member: Ok, oh well. I tested with nightly, so maybe it's just better now with recent compiler advances.
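The reviewer's alternative (allenai/OLMo-core#105) is to mark the batch dimension of the model inputs as dynamic, so eval's varying batch sizes share one kernel instead of each leaving a cached specialization behind. A rough sketch of that approach, assuming the batch dimension is dimension 0; it is not what this PR merges:

```python
import torch


def run_eval_batch(compiled_model: torch.nn.Module, input_ids: torch.Tensor):
    # Mark the batch dimension as dynamic before the compiled forward pass, so
    # different eval batch sizes reuse one dynamic kernel instead of compiling
    # and caching a new specialization per size.
    torch._dynamo.mark_dynamic(input_ids, 0)
    with torch.no_grad():
        return compiled_model(input_ids)
```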
10 changes: 3 additions & 7 deletions olmo/util.py
@@ -432,9 +432,7 @@ def _gcs_is_retriable(exception: Exception) -> bool:


 def _gcs_upload(source: Path, bucket_name: str, key: str, save_overwrite: bool = False):
-    from google.cloud import storage as gcs
-
-    storage_client = gcs.Client()
+    storage_client = _get_gcs_client()
     bucket = storage_client.bucket(bucket_name)
     blob = bucket.blob(key)
     if not save_overwrite and blob.exists():
@@ -444,9 +442,8 @@ def _gcs_upload(source: Path, bucket_name: str, key: str, save_overwrite: bool = False):

 def _gcs_file_size(bucket_name: str, key: str) -> int:
     from google.api_core.exceptions import NotFound
-    from google.cloud import storage as gcs

-    storage_client = gcs.Client()
+    storage_client = _get_gcs_client()
     bucket = storage_client.bucket(bucket_name)
     blob = bucket.blob(key)
     try:
@@ -459,9 +456,8 @@ def _gcs_file_size(bucket_name: str, key: str) -> int:

 def _gcs_get_bytes_range(bucket_name: str, key: str, bytes_start: int, num_bytes: int) -> bytes:
     from google.api_core.exceptions import NotFound
-    from google.cloud import storage as gcs

-    storage_client = gcs.Client()
+    storage_client = _get_gcs_client()
     bucket = storage_client.bucket(bucket_name)
     blob = bucket.blob(key)
     try:
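The `_get_gcs_client()` helper itself is in a part of the diff that is not rendered here. Going by the commit message "Have to hold on to the GCS client so we don't overwhelm the GCS metad[ata server]", its job is to create one client and reuse it across calls; a minimal sketch of such a helper, under that assumption only:

```python
from typing import Optional

from google.cloud import storage as gcs

_gcs_client: Optional[gcs.Client] = None


def _get_gcs_client() -> gcs.Client:
    # Create the storage client once and return the same instance afterwards,
    # so repeated uploads and range reads don't each spin up a new client and
    # hammer the GCS metadata endpoints.
    global _gcs_client
    if _gcs_client is None:
        _gcs_client = gcs.Client()
    return _gcs_client
```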
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -14,7 +14,7 @@ requires-python = ">=3.8"
 license = { file = "LICENSE" }
 dependencies = [
     "numpy<2",
-    "torch>=2.1,<2.5",
+    "torch>=2.1",
     "ai2-olmo-core==0.1.0",
     "omegaconf",
     "rich",
79 changes: 79 additions & 0 deletions scripts/augusta/Dockerfile
@@ -0,0 +1,79 @@
FROM --platform=linux/amd64 nvidia/cuda:12.1.0-cudnn8-devel-ubuntu20.04

ARG DEBIAN_FRONTEND="noninteractive"
ENV TZ="America/Los_Angeles"

# Install base tools.
RUN apt-get update && apt-get install -y \
build-essential \
curl \
git \
jq \
language-pack-en \
make \
sudo \
unzip \
vim \
wget \
parallel \
iputils-ping \
tmux

ARG BEAKER_VERSION
RUN curl --silent \
--connect-timeout 5 \
--max-time 10 \
--retry 5 \
--retry-delay 0 \
--retry-max-time 40 \
--output beaker.tar.gz \
"https://beaker.org/api/v3/release/cli?os=linux&arch=amd64&version=${BEAKER_VERSION}" \
&& tar -zxf beaker.tar.gz -C /usr/local/bin/ ./beaker \
&& rm beaker.tar.gz

# This ensures the dynamic linker (or NVIDIA's container runtime, I'm not sure)
# puts the right NVIDIA things in the right place
ENV NVIDIA_DRIVER_CAPABILITIES=graphics,utility,compute

# Install conda. We give anyone in the users group the ability to run
# conda commands and install packages in the base (default) environment.
# Things installed into the default environment won't persist, but we prefer
# convenience in this case and try to make sure the user is aware of this
# with a message that's printed when the session starts.
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-py310_23.1.0-1-Linux-x86_64.sh \
&& echo "32d73e1bc33fda089d7cd9ef4c1be542616bd8e437d1f77afeeaf7afdb019787 Miniconda3-py310_23.1.0-1-Linux-x86_64.sh" \
| sha256sum --check \
&& bash Miniconda3-py310_23.1.0-1-Linux-x86_64.sh -b -p /opt/miniconda3 \
&& rm Miniconda3-py310_23.1.0-1-Linux-x86_64.sh

ENV PATH=/opt/miniconda3/bin:/opt/miniconda3/condabin:$PATH
ENV LD_LIBRARY_PATH=/usr/local/cuda/lib:/usr/local/cuda/lib64:$LD_LIBRARY_PATH

RUN conda install -y pytorch::pytorch==2.5.1 packaging "numpy<2"

# Ensure users can modify their container environment.
RUN echo '%users ALL=(ALL) NOPASSWD:ALL' >> /etc/sudoers

# Install MLNX OFED user-space drivers
# See https://docs.nvidia.com/networking/pages/releaseview.action?pageId=15049785#Howto:DeployRDMAacceleratedDockercontaineroverInfiniBandfabric.-Dockerfile
ENV MOFED_VER 5.8-1.1.2.1
ENV OS_VER ubuntu20.04
ENV PLATFORM x86_64
RUN wget --quiet https://content.mellanox.com/ofed/MLNX_OFED-${MOFED_VER}/MLNX_OFED_LINUX-${MOFED_VER}-${OS_VER}-${PLATFORM}.tgz && \
tar -xvf MLNX_OFED_LINUX-${MOFED_VER}-${OS_VER}-${PLATFORM}.tgz && \
MLNX_OFED_LINUX-${MOFED_VER}-${OS_VER}-${PLATFORM}/mlnxofedinstall --basic --user-space-only --without-fw-update -q && \
rm -rf MLNX_OFED_LINUX-${MOFED_VER}-${OS_VER}-${PLATFORM} && \
rm MLNX_OFED_LINUX-${MOFED_VER}-${OS_VER}-${PLATFORM}.tgz

RUN apt-get install ninja-build -y

ENV HF_HUB_ENABLE_HF_TRANSFER=1
RUN pip install --no-cache-dir --upgrade pip "setuptools<70.0.0" wheel
# TODO, unpin setuptools when this issue in flash attention is resolved
RUN pip install --no-cache-dir flash-attn==2.6.3 --no-build-isolation
RUN python -c "import torch; print(torch.__version__)"

RUN pip install --no-cache-dir ai2-olmo-core==0.1.0 omegaconf rich boto3 google-cloud-storage tokenizers "cached_path>=1.6.2" transformers importlib_resources py-spy wandb beaker-gantry click torchmetrics safetensors datasets scikit-learn "msgspec>=0.14.0" "smashed[remote]>=0.21.1"

RUN apt-get clean
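
Building the image requires passing the Beaker CLI version as a build argument; an invocation might look like the following, where the version string and image tag are illustrative:

```bash
docker build \
  --build-arg BEAKER_VERSION=v1.5.125 \
  -f scripts/augusta/Dockerfile \
  -t olmo-augusta \
  .
```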

37 changes: 37 additions & 0 deletions scripts/augusta/peteish1-launch.sh
@@ -0,0 +1,37 @@
#!/usr/bin/env bash

set -ex

NUM_NODES=$1
shift

gantry run \
--workspace ai2/13B \
--task-name peteish1 \
--description "Peteish1" \
--priority urgent \
--preemptible \
--beaker-image michalg/cuda11.8-ubuntu20.04-arb \
--cluster ai2/augusta-google-1 \
--gpus 8 \
--replicas "${NUM_NODES}" \
--leader-selection \
--host-networking \
--budget ai2/oe-training \
--no-nfs \
--propagate-failure \
--propagate-preemption \
--synchronized-start-timeout 15m \
--no-python \
--env LOG_FILTER_TYPE=local_rank0_only \
--env OMP_NUM_THREADS=8 \
--env OLMO_TASK=model \
--env-secret WANDB_API_KEY=DIRKG_WANDB_API_KEY \
--env-secret AWS_ACCESS_KEY_ID=DIRKG_AWS_ACCESS_KEY_ID \
--env-secret AWS_SECRET_ACCESS_KEY=DIRKG_AWS_SECRET_ACCESS_KEY \
--shared-memory 10GiB \
--yes \
--timeout=-1 \
--allow-dirty \
--retries 10 \
-- /bin/bash -c "scripts/augusta/peteish1.sh \$BEAKER_LEADER_REPLICA_HOSTNAME \$BEAKER_REPLICA_RANK"
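
The only argument is the node count; gantry starts one replica per node, and Beaker fills in `BEAKER_LEADER_REPLICA_HOSTNAME` and `BEAKER_REPLICA_RANK` for the inner `peteish1.sh`. For example (node count illustrative):

```bash
bash scripts/augusta/peteish1-launch.sh 8
```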
37 changes: 37 additions & 0 deletions scripts/augusta/peteish1-muplr-launch.sh
@@ -0,0 +1,37 @@
#!/usr/bin/env bash

set -ex

NUM_NODES=$1
shift

gantry run \
--workspace ai2/13B \
--task-name peteish1-muplr \
--description "Peteish1 muP LR" \
--priority high \
--preemptible \
--beaker-image michalg/cuda11.8-ubuntu20.04-arb \
--cluster ai2/augusta-google-1 \
--gpus 8 \
--replicas "${NUM_NODES}" \
--leader-selection \
--host-networking \
--budget ai2/oe-training \
--no-nfs \
--propagate-failure \
--propagate-preemption \
--synchronized-start-timeout 15m \
--no-python \
--env LOG_FILTER_TYPE=local_rank0_only \
--env OMP_NUM_THREADS=8 \
--env OLMO_TASK=model \
--env-secret WANDB_API_KEY=DIRKG_WANDB_API_KEY \
--env-secret AWS_ACCESS_KEY_ID=DIRKG_AWS_ACCESS_KEY_ID \
--env-secret AWS_SECRET_ACCESS_KEY=DIRKG_AWS_SECRET_ACCESS_KEY \
--shared-memory 10GiB \
--yes \
--timeout=-1 \
--allow-dirty \
--retries 10 \
-- /bin/bash -c "scripts/augusta/peteish1-muplr.sh \$BEAKER_LEADER_REPLICA_HOSTNAME \$BEAKER_REPLICA_RANK"
87 changes: 87 additions & 0 deletions scripts/augusta/peteish1-muplr.sh
@@ -0,0 +1,87 @@
#!/usr/bin/env bash

set -exuo pipefail
IFS=$'\n\t'

BEAKER_LEADER_REPLICA_HOSTNAME=$1
shift

BEAKER_REPLICA_RANK=$1
shift

# augusta specific environment
export LD_LIBRARY_PATH="/var/lib/tcpxo/lib64:${LD_LIBRARY_PATH}"
export NCCL_CROSS_NIC=0
export NCCL_ALGO=Ring,Tree
export NCCL_PROTO=Simple
export NCCL_MIN_NCHANNELS=4
export NCCL_P2P_NET_CHUNKSIZE=524288
export NCCL_P2P_PCI_CHUNKSIZE=524288
export NCCL_P2P_NVL_CHUNKSIZE=1048576
export NCCL_FASTRAK_NUM_FLOWS=2
export NCCL_FASTRAK_ENABLE_CONTROL_CHANNEL=0
export NCCL_BUFFSIZE=8388608
export NCCL_FASTRAK_USE_SNAP=1
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export NCCL_NET_GDR_LEVEL=PIX
export NCCL_FASTRAK_ENABLE_HOTPATH_LOGGING=0
export NCCL_TUNER_PLUGIN=libnccl-tuner.so
export NCCL_TUNER_CONFIG_PATH=/var/lib/tcpxo/lib64/a3plus_tuner_config.textproto
export NCCL_SHIMNET_GUEST_CONFIG_CHECKER_CONFIG_FILE=/var/lib/tcpxo/lib64/a3plus_guest_config.textproto
export NCCL_FASTRAK_PLUGIN_ACCEPT_TIMEOUT_MS=600000
export NCCL_NVLS_ENABLE=0
export NCCL_DEBUG=WARN
export NCCL_FASTRAK_CTRL_DEV=enp0s12
export NCCL_FASTRAK_IFNAME=enp6s0,enp7s0,enp13s0,enp14s0,enp134s0,enp135s0,enp141s0,enp142s0
export NCCL_SOCKET_IFNAME=enp0s12
export NCCL_USE_SNAP=1
export NCCL_FASTRAK_USE_LLCM=1
export NCCL_FASTRAK_LLCM_DEVICE_DIRECTORY=/dev/aperture_devices

# Install flash-attn
#conda install -y pytorch-cuda==12.4 packaging ninja cccl cuda-nvcc libcusolver-dev cuda-profiler-api libcusparse-dev libcublas-dev -c pytorch -c nvidia
#pip install flash-attn==2.5.9.post1 --no-build-isolation
pip install '.[train]'
pip freeze

# Force processes to synchronize at init_process_group
export TORCH_DIST_INIT_BARRIER=1
# Better error handling from Python
export PYTHONFAULTHANDLER=1

NAME=${GANTRY_TASK_NAME// /_}
RUN_NAME=$NAME-$(date -u +"%Y%m%d_%H%M%S")
SAVE_FOLDER=/data/$RUN_NAME
mkdir -p $SAVE_FOLDER

torchrun \
--nnodes "${BEAKER_REPLICA_COUNT}:${BEAKER_REPLICA_COUNT}" \
--nproc-per-node 8 \
--rdzv_id 12348 \
--rdzv_backend static \
--rdzv_endpoint "${BEAKER_LEADER_REPLICA_HOSTNAME}:29400" \
--node_rank "${BEAKER_REPLICA_RANK}" \
--rdzv_conf 'read_timeout=420' \
scripts/train.py \
configs/peteish1-google.yaml \
--run_name=$RUN_NAME \
--wandb.group=$NAME \
--optimizer.learning_rate=7.81e-3 \
--save_interval_ephemeral=10000 \
--eval_interval=10000 \
--fsdp.sharding_strategy=HYBRID_SHARD \
--fsdp.hybrid_sharding_num_model_replicas="${BEAKER_REPLICA_COUNT}" \
--fsdp.wrapping_strategy=by_block_and_size \
--save_folder=$SAVE_FOLDER \
--remote_save_folder="gs://ai2-llm/checkpoints/OLMo-medium/$NAME/" \
--try_load_latest_save \
--save_overwrite \
--sharded_checkpointer=olmo_core \
--device_train_microbatch_size=4 \
--device_eval_batch_size=8 \
--compile.fullgraph=false \
--fused_loss=false \
--model.flash_attention=false \
--data.num_workers=32 \
--optimizer.metrics_log_interval=10 \
--data.prefetch_factor=8
41 changes: 41 additions & 0 deletions scripts/augusta/peteish1-seed-anneal-launch.sh
@@ -0,0 +1,41 @@
#!/usr/bin/env bash

set -ex

NUM_NODES=$1
shift

NAME=$1
shift

SEED=$1
shift

gantry run \
--workspace ai2/13B \
--task-name $NAME \
--description "Peteish1 annealing : $NAME with seed $SEED" \
--priority urgent \
--preemptible \
--beaker-image dirkg/OLMo \
--cluster ai2/augusta-google-1 \
--gpus 8 \
--replicas "${NUM_NODES}" \
--leader-selection \
--host-networking \
--budget ai2/oe-training \
--no-nfs \
--propagate-failure \
--propagate-preemption \
--synchronized-start-timeout 15m \
--no-python \
--env LOG_FILTER_TYPE=local_rank0_only \
--env OMP_NUM_THREADS=8 \
--env OLMO_TASK=model \
--env-secret WANDB_API_KEY=DIRKG_WANDB_API_KEY \
--env-secret AWS_ACCESS_KEY_ID=DIRKG_AWS_ACCESS_KEY_ID \
--env-secret AWS_SECRET_ACCESS_KEY=DIRKG_AWS_SECRET_ACCESS_KEY \
--shared-memory 10GiB \
--yes \
--timeout=-1 \
-- /bin/bash -c "scripts/augusta/peteish1-seed-anneal.sh \$BEAKER_LEADER_REPLICA_HOSTNAME \$BEAKER_REPLICA_RANK $SEED"
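
This launcher takes three positional arguments: the node count, a task name, and the random seed that is forwarded to `peteish1-seed-anneal.sh`. For example (values illustrative):

```bash
bash scripts/augusta/peteish1-seed-anneal-launch.sh 8 peteish1-anneal-seed42 42
```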