est31 opened 5 years ago
Cool, thanks. I'll integrate that when I have time.
They have 237 hours of labeled German data from the Librivox project: http://www.m-ailabs.bayern/en/the-mailabs-speech-dataset/
Why can't I visit the site?
@aikekeaishenghuo for some reason, their entire website is down. I guess it'll be up again soon.
@aikekeaishenghuo @est31 : Did you create a script to use the M-AILABS dataset with Mozilla DeepSpeech? If so, could you please share it?
I haven't used it for DeepSpeech yet. But I have created a library to read/write audio datasets (https://github.com/ynop/audiomate). There is a reader to load the dataset (https://audiomate.readthedocs.io/en/latest/reference/io.html#m-ailabs-speech-dataset). You can just add it in the same way as the others: https://github.com/ynop/deepspeech-german/blob/7d252e798734beb6d0073b849b06dd0a9f37e96a/prepare_data.py#L47-L50
Just change the reader argument to reader='mailabs'.
@ynop : Thanks! It worked. I note that the library also supports the Common Voice dataset. Could you please point out how to use it for Common Voice?
It basically works the same way, just with reader='common-voice'.
Let me know which part is not clear; then I can improve the documentation of the audiomate library.
@ynop : I tried using the library, replicating the same steps for common-voice, but I get the below error.
Traceback (most recent call last):
File "./prepare_data.py", line 68, in <module>
cv_corpus = audiomate.Corpus.load(cv_path, reader='common-voice')
File "/home/agarwal/python-environments/german-asr/lib/python3.5/site-packages/audiomate/corpus/corpus.py", line 122, in load
return reader.load(path)
File "/home/agarwal/python-environments/german-asr/lib/python3.5/site-packages/audiomate/corpus/io/base.py", line 81, in load
return self._load(path)
File "/home/agarwal/python-environments/german-asr/lib/python3.5/site-packages/audiomate/corpus/io/common_voice.py", line 69, in _load
CommonVoiceReader.load_subset(corpus, path, subset_idx)
File "/home/agarwal/python-environments/german-asr/lib/python3.5/site-packages/audiomate/corpus/io/common_voice.py", line 97, in load_subset
age = CommonVoiceReader.map_age(entry[4])
IndexError: list index out of range
Did you use the most recent version of Common Voice? And only the German part?
Yes, I used the most recent German Common Voice Corpus.
@ynop : I realized the corpus was not properly un-tarred. Now the code runs fine, but the Mozilla corpus is not being appended to the test, train and dev files, while the other corpora are there. I have attached prepare_data.py: prepare_data.zip
Here is my Common Voice directory structure.
(env) agarwal:~/german-speech-corpus/mozilla/tmp$ ls
clips de.tar.gz dev.tsv invalidated.tsv other.tsv test.tsv train.tsv validated.tsv
Command:
(env) agarwal:~/deepspeech-german$ ./prepare_data.py --cv ../german-speech-corpus/mozilla/tmp/ --voxforge ../german-speech-corpus/voxforge/ ../german-speech-corpus/testing
Also, when I try to process just the Common Voice dataset, I get the below error.
(env) agarwal:~/deepspeech-german$ ./prepare_data.py --cv ../german-speech-corpus/mozilla/tmp/ ../german-speech-corpus/testing
Traceback (most recent call last):
File "./prepare_data.py", line 80, in <module>
merged_corpus.import_subview('train', splits['train'])
KeyError: 'train'
Could you please advise whether I missed anything or whether it's an issue with the library?
Found the problem. The latest release of audiomate does not contain the updated common-voice reader. You could just install audiomate from the master branch for now. I will check whether everything is fine for a new release.
I created a new release (https://pypi.org/project/audiomate/4.0.0/). Let me know if it works.
I checked with the master branch and got the below error. Checking now with https://pypi.org/project/audiomate/4.0.0/
(env) agarwal:~/deepspeech-german$ ./prepare_data.py --cv ../german-speech-corpus/mozilla/tmp/ --voxforge ../german-speech-corpus/voxforge/ ../german-speech-corpus/testing/
Traceback (most recent call last):
File "./prepare_data.py", line 75, in <module>
clean_transcriptions(merged_corpus)
File "./prepare_data.py", line 27, in clean_transcriptions
transcription = utterance.label_lists[audiomate.corpus.LL_WORD_TRANSCRIPT][0].value
TypeError: 'LabelList' object does not support indexing
(env) agarwal:~/deepspeech-german$
(env) agarwal:~/deepspeech-german$ ./prepare_data.py --cv ../german-speech-corpus/mozilla/tmp/ ../german-speech-corpus/testing/
Traceback (most recent call last):
File "./prepare_data.py", line 75, in <module>
clean_transcriptions(merged_corpus)
File "./prepare_data.py", line 27, in clean_transcriptions
transcription = utterance.label_lists[audiomate.corpus.LL_WORD_TRANSCRIPT][0].value
TypeError: 'LabelList' object does not support indexing
I get the same error with https://pypi.org/project/audiomate/4.0.0/ as well.
I fixed another error. But now it seems that there are some corrupt files in the common-voice dataset.
return rawread.RawAudioFile(path)
File "/Users/matthi/Repos/deepspeech-german/.venv/lib/python3.7/site-packages/audioread/rawread.py", line 64, in __init__
self._file = aifc.open(self._fh)
File "/usr/local/Cellar/python/3.7.2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/aifc.py", line 917, in open
return Aifc_read(f)
File "/usr/local/Cellar/python/3.7.2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/aifc.py", line 358, in __init__
self.initfp(f)
File "/usr/local/Cellar/python/3.7.2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/aifc.py", line 314, in initfp
chunk = Chunk(file)
File "/usr/local/Cellar/python/3.7.2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/chunk.py", line 63, in __init__
raise EOFError
If you want, you can check that yourself; otherwise I'll get to it another time.
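Not part of the thread, but a hedged first pass at tracking down such corrupt clips: the EOFError above is audioread's raw fallback choking on a file it cannot parse, and one common culprit is zero-byte or truncated files. A minimal sketch for finding the zero-byte ones (the directory path in the usage comment is a placeholder):

```python
import os

def find_empty_files(directory):
    """Recursively collect paths of zero-byte files, which commonly
    trigger decode errors like the EOFError above."""
    empty = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            if os.path.getsize(path) == 0:
                empty.append(path)
    return sorted(empty)

# Usage (placeholder path for the Common Voice clips directory):
# print(find_empty_files('../german-speech-corpus/mozilla/tmp/clips'))
```

Truncated-but-nonempty files would still slip through this check; they would need an actual decode attempt to detect.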
@ynop : I parsed Tuda-De, Voxforge and M-AILABS using audiomate, and Common Voice using the script from the DeepSpeech project, then combined everything into one file (all.csv). Now I want to take a random subset of 10%, 20%, etc. and split it into test, train and dev sets. I used pandas for this, but I am getting some issues when running DeepSpeech.py. Probably I am messing up the format somewhere; not sure. Could you please guide me, or point me to some code? test_10_tuda+voxforge+mozilla.zip
import numpy as np
import pandas as pd

file = pd.read_csv('all_tuda+voxforge+mozilla.csv')
random_subset = file.sample(frac=0.001)
# Shuffle, then cut at 70% and 85% for a train/validate/test split.
train, validate, test = np.split(random_subset.sample(frac=1),
                                 [int(.7 * len(random_subset)), int(.85 * len(random_subset))])
print(int(random_subset.shape[0]))
train.to_csv('train_10_tuda+voxforge+mozilla.csv', encoding='utf-8', index=False)
validate.to_csv('validate_10_tuda+voxforge+mozilla.csv', encoding='utf-8', index=False)
test.to_csv('test_10_tuda+voxforge+mozilla.csv', encoding='utf-8', index=False)
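As a self-contained sanity check of that split logic, here is the same 70/15/15 cut on synthetic stand-in data (column names are illustrative). I use iloc slicing rather than np.split on the DataFrame, since the latter relies on pandas behaviour that recent versions deprecate:

```python
import pandas as pd

# Synthetic stand-in for all.csv (column names are illustrative).
df = pd.DataFrame({
    'wav_filename': ['clip_%d.wav' % i for i in range(100)],
    'transcript': ['satz %d' % i for i in range(100)],
})

# Shuffle reproducibly, then cut at 70% and 85% for a 70/15/15 split.
shuffled = df.sample(frac=1, random_state=38).reset_index(drop=True)
cut1, cut2 = int(.7 * len(shuffled)), int(.85 * len(shuffled))
train = shuffled.iloc[:cut1]
validate = shuffled.iloc[cut1:cut2]
test = shuffled.iloc[cut2:]

print(len(train), len(validate), len(test))  # → 70 15 15
```

Fixing random_state makes the split reproducible between runs, which helps when comparing training results.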
Cannot see any error in the output. The only difference from my implementation is that I sorted by column 1 (https://github.com/ynop/audiomate/blob/master/audiomate/corpus/io/mozilla_deepspeech.py).
@ynop : When I create the CSVs using the code I pasted above, I get the below error. The CSVs look OK, but I don't know where the issue is. I saw your code; you apply several heuristics, e.g. based on utterance length. I am creating a random split; not sure if that's the issue.
tensorflow.python.framework.errors_impl.InvalidArgumentError: Input indices should be a matrix but received shape [0]
[[{{node SerializeSparse}}]]
[[node tower_1/IteratorGetNext (defined at ./DeepSpeech.py:183) ]]
[[node tower_1/IteratorGetNext (defined at ./DeepSpeech.py:183) ]]
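For what it's worth, an InvalidArgumentError of this shape is the kind of error DeepSpeech training tends to raise when a batch contains an empty label sequence. I can't confirm that is the cause here, but filtering empty or missing transcripts out of the generated CSVs is a cheap thing to rule out. A sketch (column names follow the DeepSpeech CSV format; the rows are made up):

```python
import pandas as pd

def drop_empty_transcripts(df):
    """Keep only rows whose transcript is non-empty after stripping whitespace."""
    mask = df['transcript'].fillna('').astype(str).str.strip() != ''
    return df[mask]

# Made-up rows in the DeepSpeech CSV format (wav_filename, wav_filesize, transcript).
df = pd.DataFrame({
    'wav_filename': ['a.wav', 'b.wav', 'c.wav'],
    'wav_filesize': [1000, 2000, 3000],
    'transcript': ['hallo welt', '', None],
})
clean = drop_empty_transcripts(df)
print(clean['wav_filename'].tolist())  # → ['a.wav']
```

In practice one would read each of the train/dev/test CSVs, apply the filter, and write them back out before starting DeepSpeech.py.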
I want to re-use your methods for creating splits, but in my case the merged corpus is a CSV. How can I use your code to create a merged corpus and create the splits? Please advise.
splitter = subset.Splitter(merged_corpus, random_seed=38)
splits = splitter.split_by_length_of_utterances(
    {'train': 0.7, 'dev': 0.15, 'test': 0.15}, separate_issuers=True)
That's already the splitting code I used. Did it work?
@ynop : No. My objects are CSVs (one obtained by pre-processing the Common Voice dataset with the DeepSpeech script, the other by pre-processing the M-AILABS and Tuda-De datasets with audiomate), and the Splitter method expects an object of type Corpus.
Also, I need to pre-process the SWC dataset. I followed your documentation for audiomate. On the 4th step (https://audiomate.readthedocs.io/en/latest/documentation/indirect_support.html), i.e. building the Java tools (mvn package), I get an exception, for which I opened an issue (https://bitbucket.org/natsuhh/swc/issues/68/missing). Did you manage to get past this step earlier? If yes, do you have an old SWC build that you could share? I don't see any tagged releases to proceed further. Could you please guide me?
For splitting, the only way would be to implement a reader for the common voice csv (https://audiomate.readthedocs.io/en/latest/documentation/new_dataset_format.html#corpus-reader).
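A cruder route than the reader suggested above (my own workaround sketch, not audiomate functionality): since the DeepSpeech CSVs all share the same three-column format, they can simply be concatenated with pandas and then split at the CSV level. The caveat is that this format carries no speaker column, so separate_issuers-style splitting is not possible this way:

```python
import io
import pandas as pd

# Stand-ins for the per-dataset DeepSpeech CSVs (in practice, pass file paths
# such as the Common Voice and M-AILABS exports to pd.read_csv instead).
csv_common_voice = io.StringIO(
    'wav_filename,wav_filesize,transcript\n'
    'cv_1.wav,100,hallo\n'
    'cv_2.wav,200,welt\n')
csv_mailabs = io.StringIO(
    'wav_filename,wav_filesize,transcript\n'
    'ml_1.wav,300,guten tag\n')

merged = pd.concat([pd.read_csv(csv_common_voice), pd.read_csv(csv_mailabs)],
                   ignore_index=True)
print(len(merged))  # → 3
```

The merged frame can then be shuffled and sliced into train/dev/test, accepting that utterances from one speaker may end up in different sets.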
I haven't checked SWC for a while, so I can't really help you there.