yl4579 / PL-BERT

Phoneme-Level BERT for Enhanced Prosody of Text-to-Speech with Grapheme Predictions
MIT License

How to preprocess a large text dataset (approximately 80 GB) #51

Open SandyPanda-MLDL opened 4 months ago

SandyPanda-MLDL commented 4 months ago

I am trying to preprocess a huge (non-English) text dataset following the code in preprocess.ipynb provided in the repo. To do so, I split the large dataset into smaller chunks of approximately 1.26 GB each and tried to preprocess them. However, I am getting errors (segmentation faults, etc.) and am unable to complete the preprocessing for all the chunks. Can anyone suggest anything regarding this?

SoshyHayami commented 2 months ago

You don't have to do it the way the author preprocessed their dataset. Just use the regular `.map()` from Hugging Face `datasets` and set `num_proc` to whatever your CPU can handle.