wenet-e2e / wenet

Production First and Production Ready End-to-End Speech Recognition Toolkit
https://wenet-e2e.github.io/wenet/
Apache License 2.0

WeDataset: List of (OpenSource data) + (Crawler Resources) #2094

Open xingchensong opened 8 months ago

xingchensong commented 8 months ago

A running list of open-source datasets and crawler sources, continuously updated. Additions and edits are welcome.

xingchensong commented 8 months ago

The Dataset of Speech Recognition (ASR) / Speech Translation (ST)

Chinese

name duration/h address remark
THCHS-30 30 https://openslr.org/18/
Aishell 150 https://openslr.org/33/
ST-CMDS 110 https://openslr.org/38/
Primewords 99 https://openslr.org/47/
aidatatang 200 https://openslr.org/62/
MagicData 755 https://openslr.org/68/
ASR&SD 160 http://ncmmsc2021.org/competition2.html if available
Aishell2 1000 http://www.aishelltech.com/aishell_2 if available
TAL ASR 100 https://ai.100tal.com/dataset
Common Voice 63 https://commonvoice.mozilla.org/zh-CN/datasets Common Voice Corpus 7.0
ASRU2019 ASR 500 https://www.datatang.com/competition if available
2021 SLT CSRC 398 https://www.data-baker.com/csrc_challenge.html if available
aidatatang_1505zh 1505 https://datatang.com/opensource if available
WenetSpeech 10000 https://github.com/wenet-e2e/WenetSpeech
KeSpeech 1542 https://openreview.net/forum?id=b3Zoeq2sCLq speech recognition, speaker verification, subdialect identification, voice conversion
MagicData-RAMC 180 https://arxiv.org/pdf/2203.16844.pdf conversational speech data recorded from native speakers of Mandarin Chinese
Mandarin Heavy Accent Conversational Speech Corpus 58.78 https://magichub.com/datasets/mandarin-heavy-accent-conversational-speech-corpus/
Free ST Chinese Mandarin Corpus - https://openslr.org/38/
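
For anyone who wants to total the hours in a table like the one above, here is a minimal sketch. It assumes each row follows the `name duration/h address remark` layout and that the address is the first token starting with `http`. `Row` and `parse_row` are hypothetical helper names for illustration only; they are not part of WeNet. Rows whose duration is `-` are kept but excluded from the sum.

```python
from typing import NamedTuple, Optional


class Row(NamedTuple):
    name: str
    hours: Optional[float]  # None when the duration column is "-"
    address: str
    remark: str


def parse_row(line: str) -> Row:
    """Split one 'name duration/h address remark' row (assumed layout)."""
    tokens = line.split()
    # Assumption: the address column is the first token that looks like a URL.
    url_idx = next(i for i, t in enumerate(tokens) if t.startswith("http"))
    duration = tokens[url_idx - 1]
    try:
        hours = float(duration)
    except ValueError:
        hours = None  # e.g. "-" for unknown duration
    name = " ".join(tokens[: url_idx - 1])
    remark = " ".join(tokens[url_idx + 1:])
    return Row(name, hours, tokens[url_idx], remark)


# A few rows copied from the Chinese table above.
sample = [
    "THCHS-30 30 https://openslr.org/18/",
    "Aishell 150 https://openslr.org/33/",
    "Aishell2 1000 http://www.aishelltech.com/aishell_2 if available",
    "Free ST Chinese Mandarin Corpus - https://openslr.org/38/",
]
rows = [parse_row(line) for line in sample]
total_hours = sum(r.hours for r in rows if r.hours is not None)
print(f"{len(rows)} datasets, {total_hours:g} hours with known duration")
```

The English table has an extra `speakers` column, so it would need one more field between the duration and the address.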

English

name duration/h speakers address remark
Common Voice 2015 - https://commonvoice.mozilla.org/zh-CN/datasets Common Voice Corpus 7.0, Narrated Wikipedia; CC0-1.0
LibriSpeech 960 2480 https://openslr.org/12/ Audiobooks; CC-BY-4.0
ST-AEDS-20180100 4.7 - http://www.openslr.org/45/
TED-LIUM Release 3 430 2030 https://openslr.org/51/ TED talks; CC-BY-NC-ND 3.0
Multilingual LibriSpeech 44659 - https://openslr.org/94/ limited supervision
SPGISpeech 5000 - https://datasets.kensho.com/datasets/scribe if available
Speech Commands 10 - https://www.kaggle.com/c/tensorflow-speech-recognition-challenge/data
2020AESRC 160 - https://datatang.com/INTERSPEECH2020 if available
GigaSpeech 10000 - https://github.com/SpeechColab/GigaSpeech Audiobook, podcast, YouTube; apache-2.0
The People’s Speech 31400 - https://openreview.net/pdf?id=R8CwidgJ0yT Government, interviews; CC-BY-SA-4.0
Earnings-21 39 - https://arxiv.org/abs/2104.11348
VoxPopuli 24100+543 1310 https://arxiv.org/pdf/2101.00390.pdf, github 24100(unlabeled), 543(transcribed), European Parliament; CC0
CMU Wilderness Multilingual Speech Dataset 13 - http://festvox.org/cmu_wilderness/ Multilingual
How-2 Dataset 2000 - https://github.com/srvk/how2-dataset 2000(english asr) 300(english->portuguese st); Creative Commons BY-SA 4.0
AMI 100 - https://openslr.org/16/ meetings; CC-BY-4.0
SwitchBoard 260 540 https://catalog.ldc.upenn.edu/LDC97S62 Telephone conversations; LDC
Fisher 1960 11917 https://catalog.ldc.upenn.edu/LDC2004T19 telephone conversations; LDC

Chinese-English

name duration/h address remark
SEAME 30 https://www.isca-speech.org/archive_v0/archive_papers/interspeech_2010/i10_1986.pdf
TAL CSASR 587 https://ai.100tal.com/dataset
ASRU2019 CSASR 200 https://www.datatang.com/competition if available
ASCEND 10.62 https://arxiv.org/pdf/2112.06223.pdf

Japanese (ja-JP)

name duration/h address remark
Common Voice 26 https://commonvoice.mozilla.org/zh-CN/datasets Common Voice Corpus 7.0
Japanese_Scripted_Speech_Corpus_Daily_Use_Sentence 18 https://magichub.io/cn/datasets/japanese-scripted-speech-corpus-daily-use-sentence/
LaboroTVSpeech 2000 https://arxiv.org/pdf/2103.14736.pdf
CSJ 650 https://github.com/kaldi-asr/kaldi/tree/master/egs/csj
JTubeSpeech 1300 https://arxiv.org/pdf/2112.09323.pdf

Korean (ko-KR)

name duration/h address remark
korean-scripted-speech-corpus-daily-use-sentence 4.3 https://magichub.io/cn/datasets/korean-scripted-speech-corpus-daily-use-sentence/
korean-conversational-speech-corpus 5.22 https://magichub.io/cn/datasets/korean-conversational-speech-corpus/

Russian (ru-RU)

name duration/h address remark
Common Voice 148 https://commonvoice.mozilla.org/zh-CN/datasets Common Voice Corpus 7.0
OpenSTT 20000 https://arxiv.org/pdf/2006.08274.pdf limited supervision

French (fr-FR)

name duration/h address remark
MediaSpeech 10 https://arxiv.org/pdf/2103.16193.pdf ASR system evaluation dataset

Spanish (es-ES)

name duration/h address remark
MediaSpeech 10 https://arxiv.org/pdf/2103.16193.pdf ASR system evaluation dataset

Turkish (tr-TR)

name duration/h address remark
MediaSpeech 10 https://arxiv.org/pdf/2103.16193.pdf ASR system evaluation dataset

Arabic (ar)

name duration/h address remark
MediaSpeech 10 https://arxiv.org/pdf/2103.16193.pdf ASR system evaluation dataset

noise & nonspeech

name duration/h address remark
MUSAN - https://openslr.org/17/
Room Impulse Response and Noise Database - https://openslr.org/28/
AudioSet - https://ieeexplore.ieee.org/document/7952261

xingchensong commented 8 months ago

The Dataset of Speech Synthesis

Chinese

name duration/h address remark
Aishell3 85 https://openslr.org/93/
Opencpop - https://wenet.org.cn/opencpop/download/ Singing Voice Synthesis

English

name duration/h address remark
Hi-Fi Multi-Speaker English TTS Dataset 291.6 https://openslr.org/109/
LibriTTS corpus 585 https://openslr.org/60/
Speechocean762 - https://www.openslr.org/101/
RyanSpeech 10 http://mohammadmahoor.com/ryanspeech/

xingchensong commented 8 months ago

The Dataset of Speech Recognition & Speaker Diarization

Chinese

name duration/h address remark application
Aishell4 120 https://openslr.org/111/ 8-channel, conference scenarios speech recognition, speaker diarization
ASR&SD 160 http://ncmmsc2021.org/competition2.html if available speech recognition, speaker diarization
zhijiangcup - https://zhijiangcup.zhejianglab.com/zhijiang/match/details/id/6.html if available speech recognition, speaker diarization
M2MET 120 https://arxiv.org/pdf/2110.07393.pdf 8-channel, conference scenarios speech recognition, speaker diarization

English

name duration/h address remark application
CHiME-6 - https://chimechallenge.github.io/chime6/download.html if available speech recognition, speaker diarization

xingchensong commented 8 months ago

The Dataset of Speaker Recognition

Chinese

name duration/h address remark application
CN-Celeb - https://openslr.org/82/
KeSpeech 1542 https://openreview.net/forum?id=b3Zoeq2sCLq speech recognition, speaker verification, subdialect identification, voice conversion
MTASS 55.6 https://github.com/Windstudent/Complex-MTASSNet
THCHS-30 40 http://www.openslr.org/18/

English

name duration/h address remark
VoxCeleb Data - http://www.robots.ox.ac.uk/~vgg/data/voxceleb/

xingchensong commented 8 months ago

The Resource of Crawler

name type address remark application
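voicetube video https://tw.voicetube.com/ An online English-learning platform based in Taiwan; every video comes with subtitles in English and in the user's native language (usually Chinese)
Chinese-Podcasts collection of video & podcast https://github.com/alaskasquirrel/Chinese-Podcasts A curated collection of Chinese-language videos, podcasts, radio programs, and similar sources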

Mddct commented 6 months ago

https://www.atr-p.com/products/sdb.html#DIGI