WeDataset: List of (OpenSource data) + (Crawler Resources)

xingchensong commented 8 months ago

Collecting open-source datasets and crawler sources; continuously updated. Additions and edits are welcome.

xingchensong commented 8 months ago

Datasets for Speech Recognition (ASR) / Speech Translation (ST)

Chinese

name	duration/h	address	remark
THCHS-30	30	https://openslr.org/18/
Aishell	150	https://openslr.org/33/
ST-CMDS	110	https://openslr.org/38/
Primewords	99	https://openslr.org/47/
aidatatang	200	https://openslr.org/62/
MagicData	755	https://openslr.org/68/
ASR&SD	160	http://ncmmsc2021.org/competition2.html	if available
Aishell2	1000	http://www.aishelltech.com/aishell_2	if available
TAL ASR	100	https://ai.100tal.com/dataset
Common Voice	63	https://commonvoice.mozilla.org/zh-CN/datasets	Common Voice Corpus 7.0
ASRU2019 ASR	500	https://www.datatang.com/competition	if available
2021 SLT CSRC	398	https://www.data-baker.com/csrc_challenge.html	if available
aidatatang_1505zh	1505	https://datatang.com/opensource	if available
WenetSpeech	10000	https://github.com/wenet-e2e/WenetSpeech
KeSpeech	1542	https://openreview.net/forum?id=b3Zoeq2sCLq	speech recognition, speaker verification, subdialect identification, voice conversion
MagicData-RAMC	180	https://arxiv.org/pdf/2203.16844.pdf	conversational speech data recorded from native speakers of Mandarin Chinese
Mandarin Heavy Accent Conversational Speech Corpus	58.78	https://magichub.com/datasets/mandarin-heavy-accent-conversational-speech-corpus/
Free ST Chinese Mandarin Corpus	-	https://openslr.org/38/	same address as ST-CMDS above
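
The tables in this thread are plain tab-separated rows, so they can be loaded directly for quick bookkeeping. The following is a minimal sketch, not part of the original list: it assumes a few rows have been copied into a string (the `TABLE` constant and the `total_hours` helper are hypothetical names) and sums the `duration/h` column, skipping entries such as `-` that are not plain numbers.

```python
import csv
import io

# A few rows copied from the Chinese ASR table above (tab-separated).
# TABLE and total_hours are illustrative names, not part of any library.
TABLE = (
    "name\tduration/h\taddress\tremark\n"
    "THCHS-30\t30\thttps://openslr.org/18/\t\n"
    "Aishell\t150\thttps://openslr.org/33/\t\n"
    "WenetSpeech\t10000\thttps://github.com/wenet-e2e/WenetSpeech\t\n"
    "Free ST Chinese Mandarin Corpus\t-\thttps://openslr.org/38/\t\n"
)


def total_hours(tsv_text):
    """Sum the duration/h column; collect names whose duration is not a plain number."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    total = 0.0
    skipped = []
    for row in reader:
        value = (row["duration/h"] or "").strip()
        try:
            total += float(value)
        except ValueError:
            # Entries like "-" or "24100+543" are reported instead of guessed at.
            skipped.append(row["name"])
    return total, skipped


total, skipped = total_hours(TABLE)
print(f"total hours: {total}, no usable duration: {skipped}")
```

Rows with composite durations (for example the VoxPopuli entry in the English table) are returned in the skipped list rather than parsed, so they can be handled by hand.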

English

name	duration/h	speakers	address	remark
Common Voice	2015	-	https://commonvoice.mozilla.org/zh-CN/datasets	Common Voice Corpus 7.0, Narrated Wikipedia; CC0-1.0
LibriSpeech	960	2480	https://openslr.org/12/	Audiobooks; CC-BY-4.0
ST-AEDS-20180100	4.7	-	http://www.openslr.org/45/
TED-LIUM Release 3	430	2030	https://openslr.org/51/	TED talks; CC-BY-NC-ND 3.0
Multilingual LibriSpeech	44659	-	https://openslr.org/94/	limited supervision
SPGISpeech	5000	-	https://datasets.kensho.com/datasets/scribe	if available
Speech Commands	10	-	https://www.kaggle.com/c/tensorflow-speech-recognition-challenge/data
2020AESRC	160	-	https://datatang.com/INTERSPEECH2020	if available
GigaSpeech	10000	-	https://github.com/SpeechColab/GigaSpeech	Audiobook, podcast, YouTube; Apache-2.0
The People’s Speech	31400	-	https://openreview.net/pdf?id=R8CwidgJ0yT	Government, interviews; CC-BY-SA-4.0
Earnings-21	39	-	https://arxiv.org/abs/2104.11348
VoxPopuli	24100+543	1310	https://arxiv.org/pdf/2101.00390.pdf, github	24100(unlabeled), 543(transcribed), European Parliament; CC0
CMU Wilderness Multilingual Speech Dataset	13	-	http://festvox.org/cmu_wilderness/	Multilingual
How-2 Dataset	2000	-	https://github.com/srvk/how2-dataset	2000 h (English ASR), 300 h (English->Portuguese ST); CC-BY-SA-4.0
AMI	100	-	https://openslr.org/16/	meetings; CC-BY-4.0
SwitchBoard	260	540	https://catalog.ldc.upenn.edu/LDC97S62	Telephone conversations; LDC
Fisher	1960	11917	https://catalog.ldc.upenn.edu/LDC2004T19	telephone conversations; LDC

Chinese-English

name	duration/h	address	remark
SEAME	30	https://www.isca-speech.org/archive_v0/archive_papers/interspeech_2010/i10_1986.pdf
TAL CSASR	587	https://ai.100tal.com/dataset
ASRU2019 CSASR	200	https://www.datatang.com/competition	if available
ASCEND	10.62	https://arxiv.org/pdf/2112.06223.pdf

Japanese (ja-JP)

name	duration/h	address	remark
Common Voice	26	https://commonvoice.mozilla.org/zh-CN/datasets	Common Voice Corpus 7.0
Japanese_Scripted_Speech_Corpus_Daily_Use_Sentence	18	https://magichub.io/cn/datasets/japanese-scripted-speech-corpus-daily-use-sentence/
LaboroTVSpeech	2000	https://arxiv.org/pdf/2103.14736.pdf
CSJ	650	https://github.com/kaldi-asr/kaldi/tree/master/egs/csj
JTubeSpeech	1300	https://arxiv.org/pdf/2112.09323.pdf

Korean (ko-KR)

name	duration/h	address	remark
korean-scripted-speech-corpus-daily-use-sentence	4.3	https://magichub.io/cn/datasets/korean-scripted-speech-corpus-daily-use-sentence/
korean-conversational-speech-corpus	5.22	https://magichub.io/cn/datasets/korean-conversational-speech-corpus/

Russian (ru-RU)

name	duration/h	address	remark
Common Voice	148	https://commonvoice.mozilla.org/zh-CN/datasets	Common Voice Corpus 7.0
OpenSTT	20000	https://arxiv.org/pdf/2006.08274.pdf	limited supervision

French (fr-FR)

name	duration/h	address	remark
MediaSpeech	10	https://arxiv.org/pdf/2103.16193.pdf	ASR system evaluation dataset

Spanish (es-ES)

name	duration/h	address	remark
MediaSpeech	10	https://arxiv.org/pdf/2103.16193.pdf	ASR system evaluation dataset

Turkish (tr-TR)

name	duration/h	address	remark
MediaSpeech	10	https://arxiv.org/pdf/2103.16193.pdf	ASR system evaluation dataset

Arabic (ar)

name	duration/h	address	remark
MediaSpeech	10	https://arxiv.org/pdf/2103.16193.pdf	ASR system evaluation dataset

noise & nonspeech

name	duration/h	address
MUSAN	-	https://openslr.org/17/
Room Impulse Response and Noise Database	-	https://openslr.org/28/
AudioSet	-	https://ieeexplore.ieee.org/document/7952261

xingchensong commented 8 months ago

Datasets for Speech Synthesis

Chinese

name	duration/h	address	remark
Aishell3	85	https://openslr.org/93/
Opencpop	-	https://wenet.org.cn/opencpop/download/	Singing Voice Synthesis

English

name	duration/h	address	remark
Hi-Fi Multi-Speaker English TTS Dataset	291.6	https://openslr.org/109/
LibriTTS corpus	585	https://openslr.org/60/
Speechocean762	-	https://www.openslr.org/101/
RyanSpeech	10	http://mohammadmahoor.com/ryanspeech/

xingchensong commented 8 months ago

Datasets for Speech Recognition & Speaker Diarization

Chinese

name	duration/h	address	remark	application
Aishell4	120	https://openslr.org/111/	8-channel, conference scenarios	speech recognition, speaker diarization
ASR&SD	160	http://ncmmsc2021.org/competition2.html	if available	speech recognition, speaker diarization
zhijiangcup	-	https://zhijiangcup.zhejianglab.com/zhijiang/match/details/id/6.html	if available	speech recognition, speaker diarization
M2MET	120	https://arxiv.org/pdf/2110.07393.pdf	8-channel, conference scenarios	speech recognition, speaker diarization

English

name	duration/h	address	remark	application
CHiME-6	-	https://chimechallenge.github.io/chime6/download.html	if available	speech recognition, speaker diarization

xingchensong commented 8 months ago

Datasets for Speaker Recognition

Chinese

name	duration/h	address	remark
CN-Celeb	-	https://openslr.org/82/
KeSpeech	1542	https://openreview.net/forum?id=b3Zoeq2sCLq	speech recognition, speaker verification, subdialect identification, voice conversion
MTASS	55.6	https://github.com/Windstudent/Complex-MTASSNet
THCHS-30	40	http://www.openslr.org/18/

English

name	duration/h	address	remark
VoxCeleb Data	-	http://www.robots.ox.ac.uk/~vgg/data/voxceleb/

xingchensong commented 8 months ago

Crawler Resources

name	type	address	remark	application
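voicetube	video	https://tw.voicetube.com/	Taiwan-based online English learning platform; every video comes with subtitles in English and in the user's native language (usually Chinese)
Chinese-Podcasts	collection of video & podcast	https://github.com/alaskasquirrel/Chinese-Podcasts	A curated collection of Chinese videos, podcasts, radio programs, and similar sources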

Mddct commented 6 months ago

https://www.atr-p.com/products/sdb.html#DIGI

wenet-e2e / wenet

WeDataset: List of (OpenSource data) + (Crawler Resources) #2094
