
WIP: Trainer #23

Draft: wants to merge 2 commits into main
Conversation

isamu-isozaki

This PR is a WIP, but it is the conceptual idea for Issue #18.

@isamu-isozaki isamu-isozaki marked this pull request as draft October 26, 2023 03:40
@isamu-isozaki
Author

One issue so far is that for the dataloader, the default transformers Trainer handles it in the get_train_dataloader function, which handles parallelizing via

return self.accelerator.prepare(DataLoader(train_dataset, **dataloader_params))

so we might want to override it for now, but in the future have an API compatible with accelerate.
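A rough sketch of what that override could look like (untested; PipegooseTrainer is just an illustrative name, and the exact Trainer attributes may differ):

```python
from torch.utils.data import DataLoader
from transformers import Trainer


class PipegooseTrainer(Trainer):
    def get_train_dataloader(self) -> DataLoader:
        # Return a plain DataLoader instead of
        # self.accelerator.prepare(DataLoader(...)), so the dataloader
        # is not wrapped by accelerate and pipegoose stays in control.
        return DataLoader(
            self.train_dataset,
            batch_size=self.args.per_device_train_batch_size,
            collate_fn=self.data_collator,
            shuffle=True,
        )
```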

@xrsrke
Owner

xrsrke commented Oct 26, 2023

return self.accelerator.prepare(DataLoader(train_dataset, **dataloader_params))

so we might want to override it for now, but in the future have an API compatible with accelerate.

@isamu-isozaki Thank you so much for the PR. This is also the reason we don't want to use Trainer from transformers: because we implement our own 3D parallelism, we don't want it to be wrapped by accelerate.
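For reference, a minimal sketch of a non-accelerate dataloader, sharding only across the data-parallel process group (the function name and dp_group argument are hypothetical, not pipegoose's actual API):

```python
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler


def build_train_dataloader(train_dataset, batch_size, dp_group=None):
    # Shard across data-parallel ranks ourselves instead of letting
    # accelerate wrap the DataLoader, so 3D parallelism stays in our hands.
    sampler = DistributedSampler(
        train_dataset,
        num_replicas=dist.get_world_size(group=dp_group),
        rank=dist.get_rank(group=dp_group),
        shuffle=True,
    )
    return DataLoader(train_dataset, batch_size=batch_size, sampler=sampler)
```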

@isamu-isozaki
Author

isamu-isozaki commented Oct 26, 2023

@xrsrke Sounds good. Let me rewrite it as a minimal training example. In the future, though, it might be better to inherit from the transformers Trainer so that we can use its updated versions without having to maintain compatibility with it ourselves.

@isamu-isozaki
Author

isamu-isozaki commented Oct 26, 2023

Made it a minimal-version proof of concept (not tested yet).
If we want to expand further, the two main options are

  1. Reimplement pretty much all of Trainer in pipegoose.
  2. Override some of the methods in Trainer to make it work.

Overall, this might not be a quick PR if we do either of the two, but ideally, if option 2 works, that would probably be the best choice. A rough sketch of the minimal loop I have in mind follows.
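Sketch of the minimal proof-of-concept loop (illustrative only, not pipegoose's real API; assumes the model returns a HF-style output with a .loss attribute):

```python
class MinimalTrainer:
    def __init__(self, model, optimizer, train_dataloader, num_epochs=1):
        self.model = model
        self.optimizer = optimizer
        self.train_dataloader = train_dataloader
        self.num_epochs = num_epochs

    def train(self):
        self.model.train()
        for _ in range(self.num_epochs):
            for batch in self.train_dataloader:
                self.optimizer.zero_grad()
                outputs = self.model(**batch)
                loss = outputs.loss  # assumes a HF-style ModelOutput
                loss.backward()
                self.optimizer.step()
```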

@xrsrke
Owner

xrsrke commented Oct 26, 2023

@isamu-isozaki, the PR looks great. I think we should prefer option 1, because one potential direction for the future is supporting the parallelization of any arbitrary transformer torch module, not just models from transformers. transformers is a hub where people push trained models, whereas our library is the one people use to start training from scratch. I also recommend checking out the Lightning trainer [link]. They have excellent abstractions, like separating CallbackHandler (the thing that connects the callbacks and the trainer), Callback, and Trainer.

Here are my learning notes on Lightning's trainer: https://projectfoundation.notion.site/Lightning-f027845e720d4f74aa876b045e58669b. They could be helpful for you.
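Roughly, the separation looks like this (an illustrative sketch of the idea only, not Lightning's or pipegoose's actual classes):

```python
class Callback:
    # Hooks that concrete callbacks can override.
    def on_train_start(self, trainer):
        pass

    def on_batch_end(self, trainer, loss):
        pass


class CallbackHandler:
    # Connects the callbacks to the trainer: fans one event out to all of them.
    def __init__(self, callbacks):
        self.callbacks = list(callbacks)

    def fire(self, event, trainer, **kwargs):
        for callback in self.callbacks:
            getattr(callback, event)(trainer, **kwargs)


class Trainer:
    def __init__(self, callbacks=()):
        self.callback_handler = CallbackHandler(callbacks)

    def training_step(self, batch):
        # Concrete trainers implement the forward/backward/step here.
        raise NotImplementedError

    def fit(self, dataloader):
        self.callback_handler.fire("on_train_start", self)
        for batch in dataloader:
            loss = self.training_step(batch)
            self.callback_handler.fire("on_batch_end", self, loss=loss)
```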

@xrsrke
Owner

xrsrke commented Oct 26, 2023

I will assign the task to you! Thank you. Sometimes, we also hold discussions on our Discord. Do you have a Discord account? https://discord.gg/nSyGZB6Gpp

@xrsrke xrsrke assigned xrsrke and isamu-isozaki and unassigned xrsrke Oct 26, 2023
@xrsrke xrsrke self-requested a review October 26, 2023 21:10
@isamu-isozaki
Author

Ah, sounds good. How many features do you want for the initial Trainer? Do you have some tests in mind for preliminary use?
And haha, I think we are already friends on Discord. Let me send a message.
