
Add LLM example #28

Merged: 15 commits, merged into CHTC:master on Sep 15, 2023

Conversation

@JasonLo (Contributor) commented Aug 11, 2023

Thanks for the amazing OSG 2023 workshop.
Hopefully this example is helpful.

@agitter (Contributor) commented Aug 11, 2023

Thanks a lot for the contribution @JasonLo. Fine-tuning LLMs should be a high-demand example.

We'll discuss who can review this.

@agitter (Contributor) left a comment

Thanks again for the excellent example. I read through all the files to get familiar with the organization and left initial comments. I still haven't run the example and may not have staging access set up on my account.

I'd like CHTC feedback (@ChristinaLK or @jhiemstrawisc perhaps?) as well. One of us should also independently confirm we can run the example.

(Resolved review threads: llm/README.md, llm/run.sub, llm/requirements.txt, llm/run.sh)
llm/README.md (outdated diff):
- [Helper script](build.sh)
- [.env](.env.example) for Github container registry credentials (`CR_PAT`)

Users should consider building their own container to match their specific needs.
Contributor:

If this step is optional, we can move this to the top and explicitly note it as an optional step.

Contributor:

I should also mention that when I tested building my own container, (a) it is a whopping 22 GB, and (b) GitHub defaulted to making it private, so my job couldn't fetch it. Is the token in .env supposed to be passed to condor somehow so it can fetch the ghcr container? Otherwise, we should mention the step of making sure the container is set to public. I did verify that the token has read/write: packages set, so it should have the permissions it needs to get the container.

@ChristinaLK As for the container being quite large, which is better: using condor to move a large container over the network by pulling straight from ghcr.io, or setting up /staging/gpu-examples like we did with /squid/gpu-examples for moving other large execution environments? (I assume 22 GB is too large for squid to handle.)
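For concreteness, the two delivery options under discussion might look like this in an HTCondor submit file (a hypothetical sketch with placeholder names and paths, not the PR's actual run.sub):

```
# Option A: pull the image straight from ghcr.io (condor fetches it per job)
universe        = container
container_image = docker://ghcr.io/OWNER/llm-example:latest

# Option B: stage a converted .sif once on /staging and reference it locally
# container_image = file:///staging/gpu-examples/llm-example.sif
```

Staging avoids re-pulling 22 GB over the network for every job, at the cost of keeping the staged copy in sync with the registry.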

Author (@JasonLo):

Please be aware that using a private container may incur charges. More details can be found at GitHub's billing information for packages.

Contributor:

> it is a whopping 22GB

How did it get so big? The latest version of huggingface/transformers-pytorch-gpu has a compressed size of 10.38 GB on DockerHub. I would expect it to already include all the large dependencies like CUDA, PyTorch, etc.

https://hub.docker.com/r/huggingface/transformers-pytorch-gpu/tags shows recent versions that are smaller, only 8 GB.

Author (@JasonLo):

Yes, you are right: the Dockerfile only installs a few small Python dependencies on top of the original huggingface/transformers-pytorch-gpu image. I tried rebuilding it today, but the size is still around 22 GB... Maybe they are using more aggressive compression? Feel free to swap my image for any CHTC-provided one if you see fit.
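The customization under discussion is just a thin layer over the upstream image; a hypothetical sketch (the base tag and requirements file are placeholders, not necessarily the PR's Dockerfile):

```dockerfile
# Thin layer over the upstream Hugging Face image; the base already ships
# CUDA, PyTorch, and transformers, which dominates the final image size.
FROM huggingface/transformers-pytorch-gpu:latest
COPY requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/requirements.txt
```

Using --no-cache-dir keeps the added layer small; the bulk of the 22 GB comes from the base image itself, which is why the rebuild stays large.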

(Resolved review threads: llm/README.md, llm/.env.example)
@JasonLo (author) left a comment:

Addressing some comments.

(Resolved review threads: llm/run.sub, llm/README.md)
(Resolved review threads: llm/.env.example, llm/README.md, llm/requirements.txt)
@agitter (Contributor) left a comment:

A few more comments based on my testing.

(Resolved review threads: llm/run.sh, llm/run.sub, llm/README.md)
JasonLo and others added 5 commits on August 25, 2023 (co-authored by Anthony Gitter <[email protected]>)
@agitter (Contributor) commented Sep 7, 2023

Thanks for continuing to make changes @JasonLo.

This comment from my testing was buried in a commit suggestion. When I removed the --use_wandb argument for my testing, it did not disable logging. My error file contained

wandb: W&B API key is configured. Use `wandb login --relogin` to force relogin
wandb: ERROR Error while calling W&B API: user is not logged in (<Response [401]>)
wandb: ERROR The API key you provided is either invalid or missing.  If the `WANDB_API_KEY` environment variable is set, make sure it is correct. Otherwise, to resolve this issue, you may try running the 'wandb login --relogin' command. If you are using a local server, make sure that you're using the correct hostname. If you're not sure, you can try logging in again using the 'wandb login --relogin --host [hostname]' command.(Error 401: Unauthorized)
Traceback (most recent call last):
  File "train.py", line 84, in <module>
    main()
  File "train.py", line 80, in main
    train(args.run_name, args.use_wandb)
  File "train.py", line 61, in train
    trainer.train(resume_from_checkpoint=last_checkpoint)
  File "/transformers/src/transformers/trainer.py", line 1544, in train
    return inner_training_loop(
  File "/transformers/src/transformers/trainer.py", line 1760, in _inner_training_loop
    self.control = self.callback_handler.on_train_begin(args, self.state, self.control)
  File "/transformers/src/transformers/trainer_callback.py", line 353, in on_train_begin
    return self.call_event("on_train_begin", args, state, control)
  File "/transformers/src/transformers/trainer_callback.py", line 397, in call_event
    result = getattr(callback, event)(
  File "/transformers/src/transformers/integrations.py", line 760, in on_train_begin
    self.setup(args, state, model, **kwargs)
  File "/transformers/src/transformers/integrations.py", line 734, in setup
    self._wandb.init(
  File "/usr/local/lib/python3.8/dist-packages/wandb/sdk/wandb_init.py", line 1166, in init
    raise e
  File "/usr/local/lib/python3.8/dist-packages/wandb/sdk/wandb_init.py", line 1147, in init
    run = wi.init()
  File "/usr/local/lib/python3.8/dist-packages/wandb/sdk/wandb_init.py", line 762, in init
    raise error
wandb.errors.AuthenticationError: The API key you provided is either invalid or missing.  If the `WANDB_API_KEY` environment variable is set, make sure it is correct. Otherwise, to resolve this issue, you may try running the 'wandb login --relogin' command. If you are using a local server, make sure that you're using the correct hostname. If you're not sure, you can try logging in again using the 'wandb login --relogin --host [hostname]' command.(Error 401: Unauthorized)

Have you gotten this to run without the W&B logging?
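A common pattern for making such a flag truly disable W&B is sketched below. This is an assumption about how train.py could wire its use_wandb flag into the Trainer, not the PR's actual code; transformers' W&B integration honors both the WANDB_DISABLED environment variable and the report_to argument.

```python
import os

def configure_logging(use_wandb: bool) -> dict:
    """Return logging-related keyword arguments for a TrainingArguments-style setup.

    When W&B is not requested, set WANDB_DISABLED so the transformers
    WandbCallback never tries to authenticate, and report to no backend.
    """
    if not use_wandb:
        os.environ["WANDB_DISABLED"] = "true"  # checked by transformers' W&B integration
        return {"report_to": "none"}
    return {"report_to": "wandb"}
```

Passing the returned dict into TrainingArguments(report_to=...) would then avoid the 401 above even when an API key is absent.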

@ChristinaLK (Collaborator) left a comment:

@JasonLo finally reviewing this...sorry for the delay. Mainly adding little bits and bobs.

(Resolved review threads: llm/run.sh, llm/run.sub, llm/README.md)
@JasonLo (author) commented Sep 13, 2023

> When I removed the --use_wandb argument for my testing, it did not disable logging. [...] Have you gotten this to run without the W&B logging?

This is fixed in c759b99.

I've tested it; wandb is no longer activated accidentally.
Job ID for the run with c759b99: 17010138

@JasonLo (author) commented Sep 13, 2023

@agitter @ChristinaLK
No more changes from my end. Your team can handle the merge.

@agitter (Contributor) left a comment:

Thanks for the updates @JasonLo and @ChristinaLK. My test job 17011164.0 is now training successfully without W&B logging. I'll check on it Friday and merge if everything is still looking good.

I didn't see any training convergence criteria. Does this continue saving checkpoints and training until the short GPU job limit is reached? That's fine and doesn't need to prevent merging.

@JasonLo (author) commented Sep 14, 2023

> I didn't see any training convergence criteria. Does this continue saving checkpoints and training until the short GPU job limit is reached? That's fine and doesn't need to prevent merging.

I set the training to run for only 1 epoch so it serves as a demonstration without consuming too many resources. You can see this setting in the code in train.py.

(Resolved review threads: llm/README.md)
@agitter (Contributor) left a comment:

> I set the training to run for only 1 epoch to serve as a demonstration without consuming too many resources.

Thanks, I see that now. I added a FAQ question about this.

My test job finished, and a second test job seemed to use the checkpoint because it finished quickly. I'll merge.

(Resolved review threads: llm/README.md)
@agitter agitter merged commit f062490 into CHTC:master Sep 15, 2023