
Documentation Improvements #745

Open · wants to merge 1 commit into main
Conversation

@aman-17 (Member) commented Nov 12, 2024

Documentation Improvements

Changes Made

  • Fixed grammar and improved documentation clarity throughout.
  • Restructured training instructions for better readability.
  • Enhanced checkpoint download documentation.
    • Added new script scripts/download_checkpoints.py to automate checkpoint downloads.
    • Removed manual URL conversion by automating R2 to public URL conversion.
  • Fixed a bug when loading unsharded checkpoints in scripts/train.py.
  • Improved data inspection instructions.

New Features

The new scripts/download_checkpoints.py script:

  • Automatically handles URL conversions between R2 and public formats.
  • Downloads checkpoint files with progress tracking.
  • Supports specific step selection and directory listing.
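The PR does not show the conversion itself, so as a rough sketch: assuming checkpoints live in an `olmo-checkpoints` R2 bucket mirrored at https://olmo-checkpoints.org (the public host used in this PR's examples), the R2-to-public mapping could be a simple prefix swap. The `r2://` scheme and bucket name here are assumptions, not taken from the script.

```python
def r2_to_public(url: str) -> str:
    """Map an internal R2 URL to its public mirror (assumed bucket/host mapping)."""
    r2_prefix = "r2://olmo-checkpoints/"             # hypothetical internal scheme/bucket
    public_prefix = "https://olmo-checkpoints.org/"  # public host from this PR's examples
    if url.startswith(r2_prefix):
        return public_prefix + url[len(r2_prefix):]
    return url  # already a public URL; pass through unchanged
```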

@aman-17 aman-17 added the type/documentation An issue or pull request related to documentation label Nov 12, 2024
@aman-17 aman-17 requested a review from dirkgr November 12, 2024 16:00

On readme_misc.md:

Member: What is this?

Member Author: This was my readme with grammatical mistakes; I will remove it from .gitignore.

```bash
python scripts/download_checkpoints.py checkpoints/official/OLMo-1B.csv --save-dir ./checkpoints/ --step 2000
```
**Note**: All checkpoints in `checkpoints/official/` are unsharded files.
Member: Suggested change:

```diff
-**Note**: All checkpoints in `checkpoints/official/` are unsharded files.
+**Note**: All checkpoints in `checkpoints/official/` are unsharded.
```

Member: They aren't just files. Even in unsharded format, a checkpoint still consists of multiple files.
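For context, an unsharded OLMo checkpoint directory holds several files; the exact set below is an assumption based on what the trainer typically writes, not something stated in this PR. A small helper could verify completeness:

```python
from pathlib import Path

# Assumed contents of an unsharded checkpoint directory; adjust if the layout differs.
EXPECTED = {"config.yaml", "model.pt", "optim.pt", "train.pt"}

def missing_files(checkpoint_dir: str) -> set:
    """Return the expected checkpoint files that are absent from the directory."""
    root = Path(checkpoint_dir)
    present = {p.name for p in root.iterdir()} if root.is_dir() else set()
    return EXPECTED - present
```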


```bash
torchrun --nproc_per_node=8 scripts/train.py configs/official/OLMo-1B.yaml --load_path=https://olmo-checkpoints.org/ai2-llm/olmo-small/w1r5xfzt/step1000-unsharded
torchrun --nproc_per_node=8 scripts/train.py configs/official/OLMo-1B.yaml --load_path=checkpoints/step2000 --save_folder=./new_checkpoints --run_name=olmo_test --save_overwrite
```

Member: Suggested change:

```diff
-torchrun --nproc_per_node=8 scripts/train.py configs/official/OLMo-1B.yaml --load_path=checkpoints/step2000 --save_folder=./new_checkpoints --run_name=olmo_test --save_overwrite
+torchrun --nproc_per_node=8 scripts/train.py configs/official/OLMo-1B.yaml --load_path=checkpoints/step2000 --save_folder=./new_checkpoints --run_name=olmo_test
```

Member Author: Without --save_overwrite, the program throws an error.

Member: Only if the directory already exists.
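That existing-directory guard can be sketched as follows. This is a guess at the shape of the check, not OLMo's actual code; the function name and exact condition are hypothetical.

```python
import os

def prepare_save_folder(save_folder: str, save_overwrite: bool) -> None:
    """Create the save folder, refusing to reuse a non-empty one unless overwrite is allowed."""
    if os.path.isdir(save_folder) and os.listdir(save_folder) and not save_overwrite:
        raise FileExistsError(
            f"{save_folder} already exists and is not empty; pass --save_overwrite to replace it"
        )
    os.makedirs(save_folder, exist_ok=True)
```

So `--save_overwrite` is only needed when re-running into a folder that already has contents; a fresh `--save_folder` works without it.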

We provide tools to do this, but first you'll need to download the data as above (unless you have an R2 API key) and update the corresponding config accordingly.

Then take note of the URL of the data order file you want, which can be found in the [Models Overview](#models-overview) table. For example, the data order file for the first epoch of the OLMo-7B model is [https://olmo-checkpoints.org/ai2-llm/olmo-medium/wvc30anm/train_data/global_indices.npy](https://olmo-checkpoints.org/ai2-llm/olmo-small/46zc5fly/train_data/global_indices.npy).
To inspect the exact tokens used in training batches for OLMo models, first download the training data. If you don't have an R2 API key, use the public HTTP URLs and update your configuration file with the local data paths. After completing this setup, you can use the inspection tools to examine the training batches.
Member: Nobody external would ever have an R2 key. I think we can skip that part of the instructions.
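Separately, once the data order file is downloaded, it can be inspected directly. A minimal sketch, assuming `global_indices.npy` is a raw memmap of uint32 instance indices in epoch order (as in OLMo's training code); the helper name and batch arithmetic here are illustrative:

```python
import numpy as np

def batch_instance_indices(path: str, batch_idx: int, batch_size: int) -> np.ndarray:
    """Return the dataset instance indices that made up one training batch."""
    # Assumed format: a flat raw uint32 array, one entry per training instance.
    indices = np.memmap(path, mode="r", dtype=np.uint32)
    start = batch_idx * batch_size
    return indices[start : start + batch_size]
```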

Comment on lines +52 to +53:

```python
except requests.exceptions.RequestException:
    continue
```

Member: Why would you swallow these exceptions?

Member: Oh, because you don't expect all files to be there? Then at least catch only 404 errors. But better yet, list the contents of the directory in one call to check what's there, instead of making six calls every time we have to check a directory.
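Catching only genuine 404s, per the suggestion, could look like the sketch below (using stdlib `urllib` rather than `requests` so the example is self-contained; the function name is hypothetical):

```python
from typing import Optional
from urllib.request import urlopen
from urllib.error import HTTPError

def fetch_optional(url: str) -> Optional[bytes]:
    """Fetch a URL; return None only when the file is genuinely absent (HTTP 404)."""
    try:
        with urlopen(url) as resp:
            return resp.read()
    except HTTPError as e:
        if e.code == 404:
            return None  # an optional file is simply not there
        raise  # any other failure should surface, not be swallowed
```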

```python
parser.add_argument('--save-dir', type=str, default='./checkpoints',
                    help='Base directory to save downloaded checkpoints')
parser.add_argument('--step', type=str, help='Specific step number to download (optional)')
parser.add_argument('--list-steps', action='store_true', help='List available step numbers and exit')
```

Member: If you have a tool that can perform multiple different actions, use subcommands.
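With argparse that means subparsers, roughly like this (subcommand names and arguments are suggestions, not the PR's actual interface):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="download_checkpoints.py")
    sub = parser.add_subparsers(dest="command", required=True)

    # `list` replaces the --list-steps flag.
    list_cmd = sub.add_parser("list", help="List available step numbers")
    list_cmd.add_argument("csv", help="Checkpoint CSV, e.g. checkpoints/official/OLMo-1B.csv")

    # `download` replaces the optional --step flag with a required one.
    dl_cmd = sub.add_parser("download", help="Download one checkpoint step")
    dl_cmd.add_argument("csv", help="Checkpoint CSV listing step URLs")
    dl_cmd.add_argument("--step", type=int, required=True, help="Step number to download")
    dl_cmd.add_argument("--save-dir", default="./checkpoints", help="Where to save the checkpoint")
    return parser
```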

Comment on lines +126 to +129:

```python
proceed = input("\nDo you want to proceed with the download? (y/n): ")
if proceed.lower() != 'y':
    print("Download cancelled.")
    return
```

Member: No, we don't ask for permission. The tools just do the thing. What if we'd want to script it?

However, that means we have to make sure the tools never do anything dangerous by accident.

Comment on lines +131 to +136:

```python
for step, url in urls:
    save_path = os.path.join(args.save_dir, f"step{step}")
    try:
        download_checkpoint(url, save_path)
    except Exception as e:
        print(f"Error during download of step {step}: {e}")
```

Member: Do you think anyone will ever want to download all steps? That's a lot of data. I think it's better if we give one command to list steps, and another to download one step, and let them deal with the rest.
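Restricting the loop above to one requested step might look like the following hypothetical helper (not from the PR), which resolves a single step's URL and fails loudly otherwise:

```python
def url_for_step(urls, step: int) -> str:
    """Pick the URL for one requested step instead of downloading every step."""
    for s, u in urls:  # urls: iterable of (step, url) pairs, as parsed from the CSV
        if s == step:
            return u
    available = ", ".join(str(s) for s, _ in urls)
    raise SystemExit(f"step {step} not found; available steps: {available}")
```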

Comment on lines +271 to +274:

```python
# checkpoint_type = (
#     CheckpointType.sharded if cfg.save_num_checkpoints_to_keep != 0 else CheckpointType.unsharded
# )
checkpoint_type = CheckpointType.unsharded
```

Member: What's this?

Comment on lines -300 to +303:

```diff
-sharded_checkpointer=cfg.load_path_sharded_checkpointer,
+# sharded_checkpointer=cfg.load_path_sharded_checkpointer,
+sharded_checkpointer=False,
+checkpoint_type=CheckpointType.unsharded
```

Member: Same question here.
