In `trainer.py`, there are three `.map` calls where `num_proc` is not specified.
It should be possible to set this, because it speeds up tokenization, spreading, and normalization by a significant amount.
Example from `trainer.py`, line 227:
    with tokenizer.entity_tracker(split=dataset_name):
        dataset = dataset.map(
            tokenizer,
            batched=True,
            remove_columns=set(dataset.column_names) - set(self.OPTIONAL_COLUMNS),
            desc=f"Tokenizing the {dataset_name} dataset",
            fn_kwargs={"return_num_words": is_evaluate},
            num_proc=4,  # Added this - should be specifiable
        )
This sped up tokenization by about 4 times.
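For reference, here is a minimal sketch of how the request could look once exposed, assuming a hypothetical `num_proc` parameter that the trainer forwards to its `.map` calls. The `tokenize_dataset` helper and the toy tokenizer below are illustrative stand-ins, not the existing API:

```python
from typing import Callable, Optional

from datasets import Dataset


def tokenize_dataset(
    dataset: Dataset,
    tokenizer: Callable,
    dataset_name: str,
    num_proc: Optional[int] = None,
) -> Dataset:
    """Illustrative helper: forward a user-supplied num_proc to Dataset.map."""
    return dataset.map(
        tokenizer,
        batched=True,
        desc=f"Tokenizing the {dataset_name} dataset",
        num_proc=num_proc,  # None keeps the current single-process behaviour
    )


def toy_tokenizer(batch):
    # Stand-in for the real tokenizer callable; just counts words per example.
    return {"num_words": [len(text.split()) for text in batch["text"]]}


if __name__ == "__main__":  # guard is needed when num_proc spawns worker processes
    data = Dataset.from_dict({"text": ["hello world", "foo bar"] * 1000})
    tokenized = tokenize_dataset(data, toy_tokenizer, "train", num_proc=4)
    print(tokenized)
```

In the actual trainer, the same effect could be achieved by threading a `num_proc` argument from the trainer's constructor (or training arguments) down to each of the three `.map` calls, defaulting to `None` so current behaviour is unchanged.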