Add LLM example #28
Conversation
Thanks a lot for the contribution @JasonLo. Fine-tuning LLMs should be a high-demand example. We'll discuss who can review this.
Thanks again for the excellent example. I read through all the files to get familiar with the organization and left initial comments. I still haven't run the example and may not have staging access set up on my account.
I'd like CHTC feedback (@ChristinaLK or @jhiemstrawisc perhaps?) as well. One of us should also independently confirm we can run the example.
llm/README.md (outdated)
- [Helper script](build.sh)
- [.env](.env.example) for Github container registry credentials (`CR_PAT`)

Users should consider building their own container to match their specific needs.
If this step is optional, we can move this to the top and explicitly note it as an optional step.
I should also mention that when I tested building my own container, (a) it is a whopping 22 GB, and (b) GitHub defaulted to making it private, so my job couldn't fetch it. Is the token in `.env` supposed to be passed to Condor somehow so it can fetch the ghcr container? Otherwise, we should mention the step of making sure the container is set to public. I did verify that the token has `write/read: packages` set, so it should have the permission it needs to get the container.
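For reference, a rough sketch of what the build-and-push step with that token might look like; the owner, image name, and tag below are placeholders, and the example's actual build.sh may differ:

```bash
#!/bin/bash
# Illustrative only -- not the repo's build.sh. Assumes .env defines CR_PAT
# and that OWNER/llm-example is replaced with a real owner and image name.
set -euo pipefail

source .env   # provides CR_PAT (token with read/write: packages)

# Authenticate to the GitHub container registry
echo "$CR_PAT" | docker login ghcr.io -u OWNER --password-stdin

# Build on top of the upstream Hugging Face image and push
docker build -t ghcr.io/OWNER/llm-example:latest .
docker push ghcr.io/OWNER/llm-example:latest

# Note: ghcr.io packages default to private; make the package public in the
# GitHub UI (or pass credentials to the job) so HTCondor can pull it.
```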
@ChristinaLK As for the container being quite large, which is better: using Condor to move a large container over the network by pulling straight from ghcr.io, or should we set up /staging/gpu-examples like we did with /squid/gpu-examples for moving other large execution environments? (I assume 22 GB is too large for squid to handle.)
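To make the first option concrete, here is a minimal sketch of a container-universe submit file that pulls straight from ghcr.io; the image name and resource requests are placeholders, not the example's actual submit file:

```
# Illustrative HTCondor submit file sketch
universe         = container
container_image  = docker://ghcr.io/OWNER/llm-example:latest

executable       = run.sh
request_gpus     = 1
request_cpus     = 4
request_memory   = 16GB
request_disk     = 40GB

log              = llm_$(Cluster).log
output           = llm_$(Cluster).out
error            = llm_$(Cluster).err

queue
```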
Please be aware that using a private container may incur charges. More details can be found at GitHub's billing information for packages.
> it is a whopping 22GB

How did it get so big? The latest version of huggingface/transformers-pytorch-gpu has a compressed size of 10.38 GB on Docker Hub. I would expect it already had all the large dependencies like CUDA, PyTorch, etc. https://hub.docker.com/r/huggingface/transformers-pytorch-gpu/tags shows recent versions that are smaller, only 8 GB.
Yes, you are right, the Dockerfile only installs a few small Python dependencies on top of the original huggingface/transformers-pytorch-gpu. I tried to rebuild it today, but the size is still around 22 GB... Maybe they are using some more aggressive compression? Feel free to swap my image to any CHTC-provided one if you see fit.
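One likely source of the discrepancy (an assumption on my part, not something verified in this thread): registry pages such as Docker Hub and ghcr.io report compressed layer sizes, while `docker images` reports the unpacked size on disk, which can easily be about twice as large. A quick way to compare the two views:

```bash
# Unpacked size on disk (what `docker build` / `docker images` reports)
docker images huggingface/transformers-pytorch-gpu

# Compressed layer sizes (what the registry's web page reports)
docker manifest inspect huggingface/transformers-pytorch-gpu:latest | grep '"size"'
```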
Addressing some comments.
A few more comments based on my testing.
Thanks for continuing to make changes @JasonLo. This comment from my testing was buried in a commit suggestion. When I removed the […]

Have you gotten this to run without the W&B logging?
@JasonLo finally reviewing this...sorry for the delay. Mainly adding little bits and bobs.
This is fixed in c759b99. I've tested it; now it won't accidentally activate wandb.
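For anyone reading along, a minimal sketch of one way to keep W&B from auto-activating with the Hugging Face Trainer; this is illustrative and may not match exactly what c759b99 does:

```python
import os

# Keep the wandb integration off even if the wandb package is installed
# and an API key happens to be configured in the environment.
os.environ.setdefault("WANDB_DISABLED", "true")

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="checkpoints",
    report_to="none",  # do not report to wandb/tensorboard/etc.
)
```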
@agitter @ChristinaLK
Thanks for the updates @JasonLo and @ChristinaLK. My test job 17011164.0 is now training successfully without W&B logging. I'll check on it Friday and merge if everything is still looking good.
I didn't see any training convergence criteria. Does this continue saving checkpoints and training until the short GPU job limit is reached? That's fine and doesn't need to prevent merging.
I set the training to run for only 1 epoch to serve as a demonstration without consuming too many resources. You can see this setting in the code here: GitHub Link to train.py.
> I set the training to run for only 1 epoch to serve as a demonstration without consuming too many resources.
Thanks, I see that now. I added a FAQ question about this.
My test job finished, and a second test job seemed to use the checkpoint because it finished quickly. I'll merge.
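For reference, a rough sketch of how a single-epoch run with checkpoint reuse could look with the Trainer API; the argument names and paths here are illustrative, not taken from the example's actual train.py:

```python
import os

from transformers import TrainingArguments
from transformers.trainer_utils import get_last_checkpoint

args = TrainingArguments(
    output_dir="checkpoints",
    num_train_epochs=1,     # one epoch keeps the demonstration cheap
    save_strategy="epoch",  # leave a checkpoint behind for later runs
)

# On a rerun, locate the newest checkpoint (None on a fresh run) and pass it
# to trainer.train(resume_from_checkpoint=last_ckpt) so training picks up
# where the previous job stopped.
last_ckpt = get_last_checkpoint(args.output_dir) if os.path.isdir(args.output_dir) else None
```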
Thanks for the amazing OSG 2023 workshop.
Hopefully this example is helpful.