Peteish13 #739

dirkgr · 2024-10-21T22:24:22Z

Peteish13 configs
More options for torch.compile()
Apply compile() to one block at a time.
Fixes to run in the Google cloud
Scripts to run on the Augusta cluster
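
For readers unfamiliar with the "one block at a time" item, here is a minimal sketch of the idea. It is not the code from this PR: `TinyModel` and `compile_blocks` are hypothetical names, and `backend="eager"` is used only so the example runs without a C++ toolchain.

```python
import torch
from torch import nn


class TinyModel(nn.Module):
    """Hypothetical stand-in for a transformer: a stack of identical blocks."""

    def __init__(self, d_model=16, n_blocks=3):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_model), nn.GELU()) for _ in range(n_blocks)]
        )

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x


def compile_blocks(model, **compile_kwargs):
    """Wrap each block in torch.compile instead of compiling the whole model."""
    for i, block in enumerate(model.blocks):
        model.blocks[i] = torch.compile(block, **compile_kwargs)
    return model


torch.manual_seed(0)
model = TinyModel()
x = torch.randn(4, 16)
expected = model(x)  # output before any compilation

compile_blocks(model, backend="eager")  # "eager" backend: trace only, no codegen
actual = model(x)  # output with every block compiled separately
```

Note that wrapping a module with `torch.compile` changes its `state_dict` key names (an `_orig_mod` prefix appears), so checkpoint code may need to account for that.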


epwalsh · 2024-11-15T20:08:50Z

olmo/train.py

@@ -1036,6 +1036,10 @@ def eval(self) -> Dict[str, Any]:

            del eval_batches

+        # Eval compiles a bunch more versions, and the result is terrible. This way we get back to zero.


What do you mean that the result is terrible?

So this prompted me to look into this a bit more, and I think I've found a better solution: just mark the model input sizes as dynamic. I tested this out in OLMo-core and it appears to work well.
allenai/OLMo-core#105

I think it compiles a bunch of versions for different batch sizes, because that's how we call it during eval, and then they stick around. In all of my early runs I had high tps until the first eval, and then low tps afterwards. This is what fixed it.

I tried dynamic and it was bad. I don't remember the way in which it was bad, but it didn't work. That's why I added that version in the first place.

Ok, oh well. I tested with nightly so maybe it's just better now with recent compiler advances.

dirkgr and others added 30 commits October 3, 2024 16:05

Running on a quarter of the nodes, not half :-(

4824c8b

Can't go this fast

46f907f

rewrite runs

35ae040

Add a config for medium lr

161f59a

New metrics

99aff8e

Adds the "dynamic" option to our torch.compile() support.

89ef109

Stop at a multiple of 1000 steps while we have a perf issue after eval

acf8975

New metrics

0656ce5

compile.dynamic=false

b699753

Don't stop

ca8c485

No more torch version pin

5f1a369

Run scripts to run on Augusta

d6cdc0c

This runs out of memory on eval.

57cc09b

Reproduce the eval we're now missing.

88a06bc

Run 1000 steps

0f6b896

Let's see if dynamic=true solves the problem.

ce11f9f

It did not.

c1d3ffe

Set device when initializing the process group

d3d39a0

Proper way of setting a device_id

d8f2aac

Maybe this incantation

2c0d11a

Don't eval with a compiled model

6cc6f62

Medium LR on Weka

43184f3

Revert "Don't eval with a compiled model"

435c3e6

This reverts commit 6cc6f62.

Let's try this instead.

a74ab6e

Updated metrics

08718b9

This might be faster yet.

b093945

Makes the launch_train.sh script work

51d0a39

Script to run something on all nodes

7cf11d8

Back to regular compile

212f55f

new config

68b022f

dirkgr and others added 22 commits November 12, 2024 10:18

Document what we're doing.

30eb3a6

We can retry again!

f6c4cb8

Run peteish1 on fewer nodes

2ec3fa0

Actually use all the nodes

17b2bca

No retries while we're debugging

eba9075

Try longer to start up

7a35ce9

Anneal the anneal

34d12fe

Annealing config for the 1B

8908d65

Tix fypo

c5b0369

Turn flash attention back on for 7B anneals

b3668f4

No whammy 3 config

d657a19

Urgent

35aaaf3

Merge branch 'peteish13-augusta' of https://github.com/allenai/LLM in…

7dc73b3

…to peteish13-augusta

added config

bca921f

Run with retries

7b0c0f8

Don't run out of space.

949d80c

Merge branch 'peteish13-augusta' of https://github.com/allenai/LLM in…

ab7d870

…to peteish13-augusta

Remove all Augusta specific stuff

93df396

Fix paths

bad96cd

Remove unused config

214aea5

Delete all the LUMI scripts

200bd1f

Remove metrics notebook

5d8da46

dirkgr requested review from epwalsh and 2015aroras November 15, 2024 19:49

dirkgr added 2 commits November 15, 2024 11:52

Changelog

8b709b9

Productivity through formatting

c7c0c5b

dirkgr marked this pull request as ready for review November 15, 2024 19:57

epwalsh reviewed Nov 15, 2024


epwalsh approved these changes Nov 15, 2024


Config for more 13B anneals

6f4a49a
