Add LLM example #28
Conversation
Thanks a lot for the contribution @JasonLo. Fine-tuning LLMs should be a high-demand example. We'll discuss who can review this.
Thanks again for the excellent example. I read through all the files to get familiar with the organization and left initial comments. I still haven't run the example and may not have staging access set up on my account.
I'd like CHTC feedback (@ChristinaLK or @jhiemstrawisc perhaps?) as well. One of us should also independently confirm we can run the example.
llm/README.md (outdated)
- [Helper script](build.sh)
- [.env](.env.example) for Github container registry credentials (`CR_PAT`)

Users should consider building their own container to match their specific needs.
If this step is optional, we can move this to the top and explicitly note it as an optional step.
I should also mention that when I tested building my own container, (a) it is a whopping 22 GB, and (b) GitHub defaulted to making it private, so my job couldn't fetch it. Is the token in `.env` supposed to be passed to Condor somehow so it can fetch the ghcr container? Otherwise, we should mention the step of making sure the container is set to public. I did verify that the token has `write/read: packages` set, so it should have the permission it needs to get the container.
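For reference, a rough sketch of what the build-and-push step with that token might look like; the owner, image name, and tag below are placeholders, and the example's actual build.sh may differ:

```bash
#!/bin/bash
# Illustrative only -- not the repo's build.sh. Assumes .env defines CR_PAT
# and that OWNER/llm-example is replaced with a real owner and image name.
set -euo pipefail

source .env   # provides CR_PAT (token with read/write: packages)

# Authenticate to the GitHub container registry
echo "$CR_PAT" | docker login ghcr.io -u OWNER --password-stdin

# Build on top of the upstream Hugging Face image and push
docker build -t ghcr.io/OWNER/llm-example:latest .
docker push ghcr.io/OWNER/llm-example:latest

# Note: ghcr.io packages default to private; make the package public in the
# GitHub UI (or pass credentials to the job) so HTCondor can pull it.
```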
@ChristinaLK As for the container being quite large, which is better: using Condor to move a large container over the network by pulling straight from ghcr.io, or should we set up /staging/gpu-examples like we did with /squid/gpu-examples for moving other large execution environments? (I assume 22 GB is too large for squid to handle.)
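To make the first option concrete, here is a minimal sketch of a container-universe submit file that pulls straight from ghcr.io; the image name and resource requests are placeholders, not the example's actual submit file:

```
# Illustrative HTCondor submit file sketch
universe         = container
container_image  = docker://ghcr.io/OWNER/llm-example:latest

executable       = run.sh
request_gpus     = 1
request_cpus     = 4
request_memory   = 16GB
request_disk     = 40GB

log              = llm_$(Cluster).log
output           = llm_$(Cluster).out
error            = llm_$(Cluster).err

queue
```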
Please be aware that using a private container may incur charges. More details can be found at GitHub's billing information for packages.
> it is a whopping 22GB

How did it get so big? The latest version of huggingface/transformers-pytorch-gpu has a compressed size of 10.38 GB on Docker Hub. I would expect it already had all the large dependencies like CUDA, PyTorch, etc. https://hub.docker.com/r/huggingface/transformers-pytorch-gpu/tags shows recent versions that are smaller, only 8 GB.
Yes, you are right, the Dockerfile only installs a few small Python dependencies on top of the original huggingface/transformers-pytorch-gpu. I tried to rebuild it today, but the size is still around 22 GB... Maybe they are using some more aggressive compression? Feel free to swap my image to any CHTC-provided one if you see fit.
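One likely source of the discrepancy (an assumption on my part, not something verified in this thread): registry pages such as Docker Hub and ghcr.io report compressed layer sizes, while `docker images` reports the unpacked size on disk, which can easily be about twice as large. A quick way to compare the two views:

```bash
# Unpacked size on disk (what `docker build` / `docker images` reports)
docker images huggingface/transformers-pytorch-gpu

# Compressed layer sizes (what the registry's web page reports)
docker manifest inspect huggingface/transformers-pytorch-gpu:latest | grep '"size"'
```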
Addressing some comments.
A few more comments based on my testing.
Thanks for continuing to make changes @JasonLo. This comment from my testing was buried in a commit suggestion. When I removed the […]

Have you gotten this to run without the W&B logging?
@JasonLo finally reviewing this...sorry for the delay. Mainly adding little bits and bobs.
This is fixed in c759b99. I've tested it; now it won't accidentally activate wandb.
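For anyone reading along, a minimal sketch of one way to keep W&B from auto-activating with the Hugging Face Trainer; this is illustrative and may not match exactly what c759b99 does:

```python
import os

# Keep the wandb integration off even if the wandb package is installed
# and an API key happens to be configured in the environment.
os.environ.setdefault("WANDB_DISABLED", "true")

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="checkpoints",
    report_to="none",  # do not report to wandb/tensorboard/etc.
)
```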
@agitter @ChristinaLK
Thanks for the updates @JasonLo and @ChristinaLK. My test job 17011164.0 is now training successfully without W&B logging. I'll check on it Friday and merge if everything is still looking good.
I didn't see any training convergence criteria. Does this continue saving checkpoints and training until the short GPU job limit is reached? That's fine and doesn't need to prevent merging.
I set the training to run for only 1 epoch to serve as a demonstration without consuming too many resources. You can see this setting in the code here: GitHub Link to train.py.
> I set the training to run for only 1 epoch to serve as a demonstration without consuming too many resources.
Thanks, I see that now. I added a FAQ question about this.
My test job finished, and a second test job seemed to use the checkpoint because it finished quickly. I'll merge.
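For reference, a rough sketch of how a single-epoch run with checkpoint reuse could look with the Trainer API; the argument names and paths here are illustrative, not taken from the example's actual train.py:

```python
import os

from transformers import TrainingArguments
from transformers.trainer_utils import get_last_checkpoint

args = TrainingArguments(
    output_dir="checkpoints",
    num_train_epochs=1,     # one epoch keeps the demonstration cheap
    save_strategy="epoch",  # leave a checkpoint behind for later runs
)

# On a rerun, locate the newest checkpoint (None on a fresh run) and pass it
# to trainer.train(resume_from_checkpoint=last_ckpt) so training picks up
# where the previous job stopped.
last_ckpt = get_last_checkpoint(args.output_dir) if os.path.isdir(args.output_dir) else None
```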
Thanks for the amazing OSG 2023 workshop.
Hopefully this example is helpful.