est31 opened 5 years ago
Cool, thanks. I'll integrate that when I have time.
They have 237 hours of labeled German data from the Librivox project: http://www.m-ailabs.bayern/en/the-mailabs-speech-dataset/
Why can't I visit the site?
@aikekeaishenghuo for some reason, their entire website is down. I guess it'll be up again soon.
@aikekeaishenghuo @est31 : Did you create a script to use the M-AILABS dataset with Mozilla DeepSpeech? If so, could you please share it?
I haven't used it for DeepSpeech yet. But I have created a library to read/write audio datasets (https://github.com/ynop/audiomate). There is a reader to load the dataset (https://audiomate.readthedocs.io/en/latest/reference/io.html#m-ailabs-speech-dataset). You can just add it in the same way as the others: https://github.com/ynop/deepspeech-german/blob/7d252e798734beb6d0073b849b06dd0a9f37e96a/prepare_data.py#L47-L50
Just change the reader argument to reader='mailabs'.
@ynop : Thanks! It worked. I note that the library also supports the Common Voice dataset. Could you please point out how to use it for Common Voice?
It basically works the same way, just with reader='common-voice'.
Let me know which part is not clear; then I can improve the documentation of the audiomate library.
@ynop : I tried using the library, replicating the same steps for common-voice, but I get the below error.
Traceback (most recent call last):
File "./prepare_data.py", line 68, in <module>
cv_corpus = audiomate.Corpus.load(cv_path, reader='common-voice')
File "/home/agarwal/python-environments/german-asr/lib/python3.5/site-packages/audiomate/corpus/corpus.py", line 122, in load
return reader.load(path)
File "/home/agarwal/python-environments/german-asr/lib/python3.5/site-packages/audiomate/corpus/io/base.py", line 81, in load
return self._load(path)
File "/home/agarwal/python-environments/german-asr/lib/python3.5/site-packages/audiomate/corpus/io/common_voice.py", line 69, in _load
CommonVoiceReader.load_subset(corpus, path, subset_idx)
File "/home/agarwal/python-environments/german-asr/lib/python3.5/site-packages/audiomate/corpus/io/common_voice.py", line 97, in load_subset
age = CommonVoiceReader.map_age(entry[4])
IndexError: list index out of range
Did you use the most recent version of Common Voice? And only the German part?
Yes, I used the most recent German Common Voice Corpus.
@ynop : I realized the corpus was not properly un-tarred. Now the code runs fine, but the Mozilla corpus is not being appended to the test, train and dev files, while the other corpora are there. I have attached prepare_data.py: prepare_data.zip
Here is my Common Voice directory structure.
(env) agarwal:~/german-speech-corpus/mozilla/tmp$ ls
clips de.tar.gz dev.tsv invalidated.tsv other.tsv test.tsv train.tsv validated.tsv
Command:
(env) agarwal:~/deepspeech-german$ ./prepare_data.py --cv ../german-speech-corpus/mozilla/tmp/ --voxforge ../german-speech-corpus/voxforge/ ../german-speech-corpus/testing
Also, when I try to process just the Common Voice dataset, I get the below error.
(env) agarwal:~/deepspeech-german$ ./prepare_data.py --cv ../german-speech-corpus/mozilla/tmp/ ../german-speech-corpus/testing
Traceback (most recent call last):
File "./prepare_data.py", line 80, in <module>
merged_corpus.import_subview('train', splits['train'])
KeyError: 'train'
Could you please advise whether I missed anything or whether it's an issue with the library?
Found the problem. The latest release of audiomate does not contain the updated common-voice reader. You could just install audiomate from the master branch for now. I will check whether everything is fine for a new release.
I created a new release (https://pypi.org/project/audiomate/4.0.0/). Let me know if it works.
I checked with the master branch and got the below error. Checking now with https://pypi.org/project/audiomate/4.0.0/
(env) agarwal:~/deepspeech-german$ ./prepare_data.py --cv ../german-speech-corpus/mozilla/tmp/ --voxforge ../german-speech-corpus/voxforge/ ../german-speech-corpus/testing/
Traceback (most recent call last):
File "./prepare_data.py", line 75, in <module>
clean_transcriptions(merged_corpus)
File "./prepare_data.py", line 27, in clean_transcriptions
transcription = utterance.label_lists[audiomate.corpus.LL_WORD_TRANSCRIPT][0].value
TypeError: 'LabelList' object does not support indexing
(env) agarwal:~/deepspeech-german$
(env) agarwal:~/deepspeech-german$ ./prepare_data.py --cv ../german-speech-corpus/mozilla/tmp/ ../german-speech-corpus/testing/
Traceback (most recent call last):
File "./prepare_data.py", line 75, in <module>
clean_transcriptions(merged_corpus)
File "./prepare_data.py", line 27, in clean_transcriptions
transcription = utterance.label_lists[audiomate.corpus.LL_WORD_TRANSCRIPT][0].value
TypeError: 'LabelList' object does not support indexing
I get the same error with https://pypi.org/project/audiomate/4.0.0/ as well.
I fixed another error. But now it seems that there are some corrupt files in the common-voice dataset.
return rawread.RawAudioFile(path)
File "/Users/matthi/Repos/deepspeech-german/.venv/lib/python3.7/site-packages/audioread/rawread.py", line 64, in __init__
self._file = aifc.open(self._fh)
File "/usr/local/Cellar/python/3.7.2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/aifc.py", line 917, in open
return Aifc_read(f)
File "/usr/local/Cellar/python/3.7.2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/aifc.py", line 358, in __init__
self.initfp(f)
File "/usr/local/Cellar/python/3.7.2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/aifc.py", line 314, in initfp
chunk = Chunk(file)
File "/usr/local/Cellar/python/3.7.2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/chunk.py", line 63, in __init__
raise EOFError
If you want, you can check that yourself; otherwise I'll get to it another time.
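Not part of the thread, but a hedged first pass at tracking down such corrupt clips: the EOFError above is audioread's raw fallback choking on a file it cannot parse, and one common culprit is zero-byte or truncated files. A minimal sketch for finding the zero-byte ones (the directory path in the usage comment is a placeholder):

```python
import os

def find_empty_files(directory):
    """Recursively collect paths of zero-byte files, which commonly
    trigger decode errors like the EOFError above."""
    empty = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            if os.path.getsize(path) == 0:
                empty.append(path)
    return sorted(empty)

# Usage (placeholder path for the Common Voice clips directory):
# print(find_empty_files('../german-speech-corpus/mozilla/tmp/clips'))
```

Truncated-but-nonempty files would still slip through this check; they would need an actual decode attempt to detect.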
@ynop : I parsed Tuda-De, Voxforge and M-AILABS using audiomate, and Common Voice using the script from the DeepSpeech project, then combined everything into one file (all.csv). Now I want to take a random subset of 10%, 20%, etc. and split it into test, train and dev sets. I used pandas for this, but I am getting some issues when running DeepSpeech.py. Probably I am messing up the format somewhere; not sure. Could you please guide me, or point me to some code? test_10_tuda+voxforge+mozilla.zip
import numpy as np
import pandas as pd

file = pd.read_csv('all_tuda+voxforge+mozilla.csv')
random_subset = file.sample(frac=0.001)
# Shuffle, then cut at 70% and 85% for a train/validate/test split.
train, validate, test = np.split(random_subset.sample(frac=1),
                                 [int(.7 * len(random_subset)), int(.85 * len(random_subset))])
print(int(random_subset.shape[0]))
train.to_csv('train_10_tuda+voxforge+mozilla.csv', encoding='utf-8', index=False)
validate.to_csv('validate_10_tuda+voxforge+mozilla.csv', encoding='utf-8', index=False)
test.to_csv('test_10_tuda+voxforge+mozilla.csv', encoding='utf-8', index=False)
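As a self-contained sanity check of that split logic, here is the same 70/15/15 cut on synthetic stand-in data (column names are illustrative). I use iloc slicing rather than np.split on the DataFrame, since the latter relies on pandas behaviour that recent versions deprecate:

```python
import pandas as pd

# Synthetic stand-in for all.csv (column names are illustrative).
df = pd.DataFrame({
    'wav_filename': ['clip_%d.wav' % i for i in range(100)],
    'transcript': ['satz %d' % i for i in range(100)],
})

# Shuffle reproducibly, then cut at 70% and 85% for a 70/15/15 split.
shuffled = df.sample(frac=1, random_state=38).reset_index(drop=True)
cut1, cut2 = int(.7 * len(shuffled)), int(.85 * len(shuffled))
train = shuffled.iloc[:cut1]
validate = shuffled.iloc[cut1:cut2]
test = shuffled.iloc[cut2:]

print(len(train), len(validate), len(test))  # → 70 15 15
```

Fixing random_state makes the split reproducible between runs, which helps when comparing training results.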
Cannot see any error in the output. The only difference from my implementation is that I sorted by column 1 (https://github.com/ynop/audiomate/blob/master/audiomate/corpus/io/mozilla_deepspeech.py).
@ynop : When I create the CSVs using the code I pasted above, I get the below error. The CSVs look OK, but I don't know where the issue is. I saw your code; you apply several heuristics, e.g. based on utterance length. I am creating a random split; not sure if that's the issue.
tensorflow.python.framework.errors_impl.InvalidArgumentError: Input indices should be a matrix but received shape [0]
[[{{node SerializeSparse}}]]
[[node tower_1/IteratorGetNext (defined at ./DeepSpeech.py:183) ]]
[[node tower_1/IteratorGetNext (defined at ./DeepSpeech.py:183) ]]
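For what it's worth, an InvalidArgumentError of this shape is the kind of error DeepSpeech training tends to raise when a batch contains an empty label sequence. I can't confirm that is the cause here, but filtering empty or missing transcripts out of the generated CSVs is a cheap thing to rule out. A sketch (column names follow the DeepSpeech CSV format; the rows are made up):

```python
import pandas as pd

def drop_empty_transcripts(df):
    """Keep only rows whose transcript is non-empty after stripping whitespace."""
    mask = df['transcript'].fillna('').astype(str).str.strip() != ''
    return df[mask]

# Made-up rows in the DeepSpeech CSV format (wav_filename, wav_filesize, transcript).
df = pd.DataFrame({
    'wav_filename': ['a.wav', 'b.wav', 'c.wav'],
    'wav_filesize': [1000, 2000, 3000],
    'transcript': ['hallo welt', '', None],
})
clean = drop_empty_transcripts(df)
print(clean['wav_filename'].tolist())  # → ['a.wav']
```

In practice one would read each of the train/dev/test CSVs, apply the filter, and write them back out before starting DeepSpeech.py.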
I want to re-use your methods for creating splits, but in my case the merged corpus is a CSV. How can I use your code to create a merged corpus and create the splits? Please advise.
splitter = subset.Splitter(merged_corpus, random_seed=38)
splits = splitter.split_by_length_of_utterances(
    {'train': 0.7, 'dev': 0.15, 'test': 0.15}, separate_issuers=True)
That's already the splitting code I used. Did it work?
@ynop : No. My objects are CSVs (one obtained by pre-processing the Common Voice dataset with the DeepSpeech script, the other by pre-processing the M-AILABS and Tuda-De datasets with audiomate), and the Splitter method expects an object of type Corpus.
Also, I need to pre-process the SWC dataset. I followed your documentation for audiomate. On the 4th step (https://audiomate.readthedocs.io/en/latest/documentation/indirect_support.html), i.e. building the Java tools (mvn package), I get an exception, for which I opened an issue (https://bitbucket.org/natsuhh/swc/issues/68/missing). Did you manage to get past this step earlier? If yes, do you have an old SWC build that you could share? I don't see any tagged releases to proceed further. Could you please guide me?
For splitting, the only way would be to implement a reader for the common voice csv (https://audiomate.readthedocs.io/en/latest/documentation/new_dataset_format.html#corpus-reader).
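A cruder route than the reader suggested above (my own workaround sketch, not audiomate functionality): since the DeepSpeech CSVs all share the same three-column format, they can simply be concatenated with pandas and then split at the CSV level. The caveat is that this format carries no speaker column, so separate_issuers-style splitting is not possible this way:

```python
import io
import pandas as pd

# Stand-ins for the per-dataset DeepSpeech CSVs (in practice, pass file paths
# such as the Common Voice and M-AILABS exports to pd.read_csv instead).
csv_common_voice = io.StringIO(
    'wav_filename,wav_filesize,transcript\n'
    'cv_1.wav,100,hallo\n'
    'cv_2.wav,200,welt\n')
csv_mailabs = io.StringIO(
    'wav_filename,wav_filesize,transcript\n'
    'ml_1.wav,300,guten tag\n')

merged = pd.concat([pd.read_csv(csv_common_voice), pd.read_csv(csv_mailabs)],
                   ignore_index=True)
print(len(merged))  # → 3
```

The merged frame can then be shuffled and sliced into train/dev/test, accepting that utterances from one speaker may end up in different sets.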
I haven't checked SWC for a while, so I can't really help you there.