Tokenize the French Librispeech

The goal is to discretize the speech data from the French Librispeech dataset you previously worked on (@abheesht17 & Adithiya): https://huggingface.co/datasets/abheesht/librispeech_fr. Remember that the accents are still missing, so we need to address this issue.

The following notebook shows how to do this and should help you get started: https://colab.research.google.com/drive/1rgehRA-Aw65c2gJM0LTf6SL4g2WpQqT4?usp=sharing. You only need to push the dataset in the specified format; the audio is not required. The expected format is:

{
  'speech_tokens': The discrete tokens of the speech,
  'text': The text transcription of the speech.
}
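
As a rough illustration of the expected format, here is a minimal sketch of how the records could be assembled and pushed. It is not the notebook's code: `tokenize` is a placeholder for whatever speech tokenizer the notebook uses, the column names `audio` and `text` are assumptions about the source dataset, and `repo_id` is a hypothetical Hub repository name. The only `datasets` calls used are `Dataset.from_list` and `push_to_hub`.

```python
from typing import Any, Callable, Dict, Iterable, List, Sequence


def build_records(
    examples: Iterable[Dict[str, Any]],
    tokenize: Callable[[Any], Sequence[int]],
    audio_key: str = "audio",  # assumed column name in the source dataset
    text_key: str = "text",    # assumed column name in the source dataset
) -> List[Dict[str, Any]]:
    """Turn source examples into {'speech_tokens', 'text'} records.

    `tokenize` is a stand-in for the speech tokenizer from the notebook;
    it must map one audio example to a sequence of integer token ids.
    The audio itself is dropped, since only tokens and text are required.
    """
    records = []
    for example in examples:
        records.append(
            {
                "speech_tokens": [int(t) for t in tokenize(example[audio_key])],
                "text": example[text_key],
            }
        )
    return records


def push_records(records: List[Dict[str, Any]], repo_id: str) -> None:
    """Upload the records to the Hugging Face Hub (requires `datasets` and a login)."""
    from datasets import Dataset  # imported lazily so build_records has no dependency

    Dataset.from_list(records).push_to_hub(repo_id)
```

For example, `build_records(dataset, tokenize=my_tokenizer)` followed by `push_records(records, "your-username/your-dataset")` would upload the tokenized split, where both `my_tokenizer` and the repository name are placeholders to replace with your own.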
