yaya-sy / SpeechAya

Tokenize the French Librispeech #8

Open yaya-sy opened 1 month ago

yaya-sy commented 1 month ago

The goal is to discretize the speech data from the French Librispeech dataset that you (@abheesht17 & Adithiya) previously worked on: https://huggingface.co/datasets/abheesht/librispeech_fr. Remember that the accents are still missing, so we need to address that as well.

The following notebook shows how to do this and will help you get started: https://colab.research.google.com/drive/1rgehRA-Aw65c2gJM0LTf6SL4g2WpQqT4?usp=sharing. You only need to push the dataset in the specified format; the audio is not required. The expected format is:

{
  'speech_tokens': The discrete tokens of the speech,
  'text': The text transcription of the speech.
}
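
As a rough illustration of the format above, here is a minimal sketch of building and sanity-checking one record. The helper names (`make_record`, `is_valid_record`, `has_accents`) and the token values are hypothetical and do not come from the notebook; the real tokens would come from whatever speech tokenizer the notebook uses. The accent check is only a heuristic that looks for non-ASCII characters.

```python
from typing import Dict, Sequence


def make_record(speech_tokens: Sequence[int], text: str) -> Dict[str, object]:
    """Build one example with the two fields requested in this issue."""
    return {"speech_tokens": [int(t) for t in speech_tokens], "text": text}


def is_valid_record(record: Dict[str, object]) -> bool:
    """Check that a record has exactly the two expected keys and sane types."""
    if set(record) != {"speech_tokens", "text"}:
        return False
    tokens = record["speech_tokens"]
    return (
        isinstance(tokens, list)
        and all(isinstance(t, int) for t in tokens)
        and isinstance(record["text"], str)
    )


def has_accents(text: str) -> bool:
    """Heuristic: True if the text contains any non-ASCII character (é, à, ç, ...)."""
    return any(ord(ch) > 127 for ch in text)


# Placeholder values for illustration only; not real tokenizer output.
example = make_record([12, 7, 7, 301], "l'été dernier")
```

A list of such records could then probably be turned into a dataset with `datasets.Dataset.from_list(records)` and uploaded with `push_to_hub`, assuming the Hugging Face `datasets` library is used as in the linked dataset. Running `has_accents` over the transcriptions is one quick way to see whether the accents have been restored.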
yaya-sy commented 4 weeks ago

I fixed a mistake in the notebook; please use the new version: https://colab.research.google.com/drive/1rgehRA-Aw65c2gJM0LTf6SL4g2WpQqT4?usp=sharing