
How to preprocess a large text dataset (approximately 80 GB) #51

Open
SandyPanda-MLDL opened this issue Jun 13, 2024 · 1 comment

@SandyPanda-MLDL

I am trying to preprocess a huge non-English text dataset following the code in preprocess.ipynb provided in this repo. To do so, I split the large dataset into small chunks of roughly 1.26 GB each and preprocess them one at a time. However, I am getting errors (such as segmentation faults) and am unable to complete the preprocessing for all the chunks. Can anyone suggest anything regarding this?

@Respaired

You don't have to do it the way the author preprocessed their dataset. Just use the regular .map() and set num_proc to whatever your CPU can handle.
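For reference, here is a minimal sketch of that approach, assuming the Hugging Face `datasets` library (where `.map()` and `num_proc` come from). The data path and `preprocess_fn` are placeholders; substitute whatever preprocessing preprocess.ipynb actually performs.

```python
# Minimal sketch, assuming the Hugging Face `datasets` library.
# `data/*.txt` and `preprocess_fn` are hypothetical placeholders.
from datasets import load_dataset

def preprocess_fn(batch):
    # Placeholder: replace with the actual cleaning / phonemization /
    # tokenization logic from preprocess.ipynb.
    batch["text"] = [t.strip() for t in batch["text"]]
    return batch

# Load the raw text files. The resulting dataset is Arrow-backed and
# memory-mapped from disk, so the full 80 GB does not need to fit in RAM.
dataset = load_dataset("text", data_files={"train": "data/*.txt"})

# Run the preprocessing in parallel; set num_proc to however many CPU
# cores your machine can comfortably spare.
processed = dataset.map(
    preprocess_fn,
    batched=True,
    num_proc=8,
)

processed.save_to_disk("processed_dataset")
```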
