yaya-sy / SpeechAya


Add English Multilingual Librispeech #7

Open yaya-sy opened 1 month ago

yaya-sy commented 1 month ago

Extract the English subset of the Multilingual LibriSpeech dataset (https://huggingface.co/datasets/facebook/multilingual_librispeech). Each example in the resulting data must look like this:

{
   'audio': the audio file,
   'text': the transcription of the audio
}

Then you can push the dataset to the Hugging Face Hub.

mariadhakal commented 1 month ago

@yaya-sy It seems like this dataset doesn't have English data.

yaya-sy commented 1 month ago

@mariadhakal, you're right! For English, we have a separate dataset for Librispeech: openslr/librispeech_asr. We need the 'clean' subset and the 'train-360' split.
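
A minimal sketch of loading that split with the `datasets` library. The config name `clean` and the split name `train.360` are assumptions about how the Hub repository spells them (the comment above writes 'train-360'), so check the dataset card if loading fails. The `load_fn` parameter is hypothetical and exists only so the function can be tried without network access.

```python
REPO_ID = "openslr/librispeech_asr"


def load_clean_360(load_fn=None, streaming=True):
    """Load the 'clean' config, 'train.360' split of LibriSpeech.

    load_fn defaults to datasets.load_dataset. It is a parameter only so
    the call can be exercised with a stand-in loader (an assumption made
    for this sketch, not part of the project).
    """
    if load_fn is None:
        # Imported lazily; requires `pip install datasets`.
        from datasets import load_dataset as load_fn
    # streaming=True avoids downloading the whole split just to inspect it.
    return load_fn(REPO_ID, "clean", split="train.360", streaming=streaming)


# Usage (needs network access):
#   ds = load_clean_360()
#   print(next(iter(ds)).keys())
```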

[Screenshot, 2024-08-03 23:34:26]

mariadhakal commented 1 month ago

Thanks!

mariadhakal commented 1 month ago

I have uploaded the dataset. It was my first time doing such a task. Can you please check and let me know if everything is okay? https://huggingface.co/datasets/mariadhakal/dataset_librispeech_english_clean_train360

yaya-sy commented 1 month ago

Thank you! The 'audio' column in the pushed data contains a list of floats, but we need a proper audio dataset on Hugging Face. See here: https://huggingface.co/docs/datasets/audio_dataset. For the LibriSpeech dataset you're working on, I think we can achieve our goal by removing all columns except 'audio' and 'text'.
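
Something like the following might work as a rough sketch. It assumes `ds` is a `datasets.Dataset` obtained from `load_dataset`, so that 'audio' is already an `Audio` feature rather than a raw list of floats. The helper names and the repository name in the usage comment are hypothetical.

```python
KEEP = ("audio", "text")


def columns_to_drop(column_names, keep=KEEP):
    """Return every column name that is not in `keep`, preserving order."""
    return [name for name in column_names if name not in keep]


def to_audio_text(ds, sampling_rate=16_000):
    """Keep only 'audio' and 'text' and make the audio feature explicit.

    Do not replace 'audio' with example["audio"]["array"]: that stores
    plain floats and loses the Audio feature.
    """
    # Imported lazily; requires `pip install datasets`.
    from datasets import Audio

    ds = ds.remove_columns(columns_to_drop(ds.column_names))
    # LibriSpeech is sampled at 16 kHz; the cast only makes that explicit.
    return ds.cast_column("audio", Audio(sampling_rate=sampling_rate))


# Usage (needs network access and a Hub login; repo name is a placeholder):
#   from datasets import load_dataset
#   ds = load_dataset("openslr/librispeech_asr", "clean", split="train.360")
#   to_audio_text(ds).push_to_hub("your-username/your-dataset")
```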

And since we don't have as much data for English, in addition to train.clean.360, I think we will also need to add train.clean.100 and train.other.500. Maybe @abheesht17 can look into that.
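
One possible way to combine the three training sets, as a sketch only. It assumes the Hub repository exposes them as config/split pairs `clean`/`train.100`, `clean`/`train.360`, and `other`/`train.500`; those names should be checked against the dataset card. The `load_fn` and `concat_fn` parameters are hypothetical hooks that default to `datasets.load_dataset` and `datasets.concatenate_datasets`.

```python
REPO_ID = "openslr/librispeech_asr"

# Assumed (config, split) names for train.clean.100, train.clean.360
# and train.other.500.
TRAIN_PARTS = [("clean", "train.100"), ("clean", "train.360"), ("other", "train.500")]


def build_english_train(load_fn=None, concat_fn=None, keep=("audio", "text")):
    """Load each training part, keep only `keep` columns, and concatenate."""
    if load_fn is None or concat_fn is None:
        # Imported lazily; requires `pip install datasets`.
        from datasets import concatenate_datasets, load_dataset

        load_fn = load_fn or load_dataset
        concat_fn = concat_fn or concatenate_datasets

    parts = []
    for config, split in TRAIN_PARTS:
        ds = load_fn(REPO_ID, config, split=split)
        # Dropping the extra columns first keeps the schemas identical,
        # which concatenation requires.
        ds = ds.remove_columns([c for c in ds.column_names if c not in keep])
        parts.append(ds)
    return concat_fn(parts)


# Usage (downloads all three parts, which is large):
#   full_train = build_english_train()
```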