In `trainer.py`, there are three `.map` calls where `num_proc` is not specified.
It should be possible to set this, because it speeds up tokenization, spreading, and normalization by a significant amount.
Example from `trainer.py`, line 227:
    with tokenizer.entity_tracker(split=dataset_name):
        dataset = dataset.map(
            tokenizer,
            batched=True,
            remove_columns=set(dataset.column_names) - set(self.OPTIONAL_COLUMNS),
            desc=f"Tokenizing the {dataset_name} dataset",
            fn_kwargs={"return_num_words": is_evaluate},
            num_proc=4,  # Added this - should be specifiable
        )
This sped up tokenization by about 4 times.
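For reference, here is a minimal sketch of how the request could look once exposed, assuming a hypothetical `num_proc` parameter that the trainer forwards to its `.map` calls. The `tokenize_dataset` helper and the toy tokenizer below are illustrative stand-ins, not the existing API:

```python
from typing import Callable, Optional

from datasets import Dataset


def tokenize_dataset(
    dataset: Dataset,
    tokenizer: Callable,
    dataset_name: str,
    num_proc: Optional[int] = None,
) -> Dataset:
    """Illustrative helper: forward a user-supplied num_proc to Dataset.map."""
    return dataset.map(
        tokenizer,
        batched=True,
        desc=f"Tokenizing the {dataset_name} dataset",
        num_proc=num_proc,  # None keeps the current single-process behaviour
    )


def toy_tokenizer(batch):
    # Stand-in for the real tokenizer callable; just counts words per example.
    return {"num_words": [len(text.split()) for text in batch["text"]]}


if __name__ == "__main__":  # guard is needed when num_proc spawns worker processes
    data = Dataset.from_dict({"text": ["hello world", "foo bar"] * 1000})
    tokenized = tokenize_dataset(data, toy_tokenizer, "train", num_proc=4)
    print(tokenized)
```

In the actual trainer, the same effect could be achieved by threading a `num_proc` argument from the trainer's constructor (or training arguments) down to each of the three `.map` calls, defaulting to `None` so current behaviour is unchanged.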