
Trainer #18

Open · 3 tasks
xrsrke opened this issue Oct 25, 2023 · 4 comments

Labels: good first issue (Good for newcomers)

xrsrke (Owner) commented Oct 25, 2023

Notes

  • Implement a Trainer that wraps the low-level DataParallel, TensorParallel, and PipelineParallel modules. The user just plugs in their model and dataloader and trains, similar to transformers.
  • Use pipegoose's DistributedDataLoader in the Trainer.
  • DistributedDataLoader just takes a regular DataLoader and adds a distributed sampler to it, as in pipegoose's README.

APIs

Trainer

from pipegoose.trainer import Trainer, TrainingArguments

config = {
    "tensor_parallelism": {"parallel_size": 2},
    "pipeline_parallelism": {
        "parallel_size": 4,
        "params": {"num_microbatches": 5}
    },
    "data_parallelism": {
        "parallel_size": 2,
        "zero_1": True
    },
    "mixed_precision": {"fp16": True}, # or bf16
    "fusion": {
        "optim": True,
        "model": True
    }
}

args = TrainingArguments(
    optim="adam",
    learning_rate=1e-3,
    lr_scheduler="",
    num_train_epochs=100,
    num_eval_steps=50,
    seed=42,
    config=config
)

trainer = Trainer(
    model=model, # loaded from `transformers`
    tokenizer=tokenizer,
    args=args, # the TrainingArguments defined above, carrying the parallelism config
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    callbacks=[PrintResultCallback(), SaveCheckpointCallback()]
)

trainer.train()
trainer.eval()

Trainer Callback

from pipegoose.trainer import Callback

class LoggingCallback(Callback):
    def on_train_start(
        self, trainer, model, optim,
        train_dataloader, eval_dataloader
    ):
        print("Training is starting")

    def on_train_end(
        self, trainer, model, optim,
        train_dataloader, eval_dataloader
    ):
        print("Training is ending")

DistributedDataLoader

from torch.utils.data import DataLoader
from pipegoose.utils.data import DistributedDataLoader

dataloader = DataLoader(dataset, batch_size=1024, shuffle=False)
dataloader = DistributedDataLoader(dataloader, parallel_context)
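
For reference, a minimal sketch of how this wrapper could be implemented, following the note above (re-create the wrapped DataLoader with a DistributedSampler so each data-parallel rank sees a distinct shard). The data_parallel_size / data_parallel_rank accessors on parallel_context are assumptions, not the actual pipegoose API:

from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

class DistributedDataLoader:
    """Sketch: rebuilds the wrapped DataLoader with a DistributedSampler."""

    def __init__(self, dataloader: DataLoader, parallel_context):
        # NOTE: these two accessors are assumed; adapt to the real parallel_context API
        num_replicas = parallel_context.data_parallel_size
        rank = parallel_context.data_parallel_rank

        # Shard the underlying dataset across data-parallel ranks
        sampler = DistributedSampler(
            dataloader.dataset, num_replicas=num_replicas, rank=rank, shuffle=False
        )
        self.dataloader = DataLoader(
            dataloader.dataset,
            batch_size=dataloader.batch_size,
            sampler=sampler,
            num_workers=dataloader.num_workers,
            collate_fn=dataloader.collate_fn,
        )

    def __iter__(self):
        return iter(self.dataloader)

    def __len__(self):
        return len(self.dataloader)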

TODOs

  • Trainer
  • Trainer's Callbacks
  • DistributedDataLoader
xrsrke converted this from a draft issue on Oct 25, 2023
xrsrke added the good first issue label on Oct 25, 2023
isamu-isozaki commented

I think I'll do this tonight since it seems the easiest

xrsrke (Owner, Author) commented Oct 25, 2023

@isamu-isozaki Awesome, thank you! I will get back to you in a few hours with all the details!!

isamu-isozaki commented

@xrsrke I was thinking of maybe just inheriting from transformers' Trainer. wdyt?

xrsrke (Owner, Author) commented Oct 26, 2023

@isamu-isozaki Nope, I just checked Trainer from transformers. It modifies the model's devices and other internals. We prefer implementing our own so we can incorporate distributed logging, callbacks on a specific rank, ParallelMode... and future changes. I just added some demo code (link).

Also one note: we only apply a given parallelism based on the parallel_context. For example, if data_parallel_size is greater than 1, then we wrap the model with DataParallel.
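
A rough sketch of that dispatch logic inside the Trainer is below. The parallel-size accessors on parallel_context and the placement of PipelineParallel are assumptions; the wrap-then-parallelize() pattern follows the style of pipegoose's README, not a final design:

from pipegoose.nn import DataParallel, PipelineParallel, TensorParallel

def _parallelize(model, parallel_context):
    # Only apply a parallelism if its group size in parallel_context is greater than 1
    if parallel_context.tensor_parallel_size > 1:
        model = TensorParallel(model, parallel_context).parallelize()
    if parallel_context.pipeline_parallel_size > 1:
        model = PipelineParallel(model, parallel_context).parallelize()
    if parallel_context.data_parallel_size > 1:
        model = DataParallel(model, parallel_context).parallelize()
    return model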

xrsrke moved this from Pending to In Progress in pipegoose v1 on Oct 26, 2023