yh1008 / speech-to-text

mixlingual speech recognition system; hybrid (GMM+NNet) model; Kaldi + Keras
http://llcao.net/cu-deeplearning17/project.html
70 stars, 19 forks

dataset? #1

Closed by yh1008 7 years ago

yh1008 commented 7 years ago

  1. Where can we find Mandarin and English speech datasets?

wendywangwwt commented 7 years ago

Chinese speech data from CSLT at Tsinghua University: http://cslt.riit.tsinghua.edu.cn/resources.php?Public%20data
There are two databases we may need:
(a) SUD-12, a short-utterance database: http://data.cslt.org/susr/SUB12/index.html
(b) THCHS-30, a Chinese speech database: http://data.cslt.org/thchs30/README.html (this database has an open call for a competition, which is interesting)

wendywangwwt commented 7 years ago

Mandarin-English Code-Switching in South-East Asia

https://catalog.ldc.upenn.edu/LDC2015S04 (not sure whether it's free)

yh1008 commented 7 years ago

Columbia University is a member of LDC, so we get the data for free from Julia and Brenda.