yh1008 commented 7 years ago

Potential solution: use supervised learning to classify whether a time frame (a word/phrase) is Chinese or not (binary class, 1 :: Chinese, 0 :: English)
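To make the idea concrete, here is a rough sketch of the binary setup: one feature vector per time frame, label 1 for Chinese and 0 for English, fitted with a plain logistic regression. Everything in it is an assumption for illustration only: the 2-D toy features stand in for real acoustic features (e.g. MFCCs), and `train_logreg` / `predict` are hypothetical helper names, not code from this repo.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logreg(X, y, lr=0.1, epochs=200):
    # Plain stochastic gradient descent on the logistic loss.
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b


def predict(w, b, x):
    # 1 :: Chinese, 0 :: English
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if sigmoid(score) >= 0.5 else 0


# Toy data (assumption): each row is the feature vector of one time frame.
rng = random.Random(0)
chinese = [[rng.gauss(2.0, 0.5), rng.gauss(2.0, 0.5)] for _ in range(50)]
english = [[rng.gauss(-2.0, 0.5), rng.gauss(-2.0, 0.5)] for _ in range(50)]
X = chinese + english
y = [1] * 50 + [0] * 50

w, b = train_logreg(X, y)
frame_labels = [predict(w, b, x) for x in [[2.1, 1.9], [-1.8, -2.2], [2.0, 2.3]]]
```

On real audio the open question from the issue title still applies: the frame (or word/phrase) length decides what one row of `X` actually represents.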

wendywangwwt commented 7 years ago

For speech recognition, NOT detection: we can check Baidu's Deep Speech for some insights. It is trained with CTC, and Baidu's Warp-CTC implementation of the CTC loss is an open-source project on GitHub (Deep Speech itself is now at Deep Speech 2). https://github.com/baidu-research/warp-ctc
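For anyone unfamiliar with CTC, a small sketch of how a CTC output is read back into text: take the best label per frame, merge consecutive repeats, then drop the blank symbol. This is not the warp-ctc API (that library implements the training loss); `ctc_greedy_decode`, the toy alphabet, and the probabilities below are all made up for illustration.

```python
def ctc_greedy_decode(frame_probs, alphabet, blank=0):
    # Pick the most likely label index for every frame.
    best = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    out = []
    prev = None
    for idx in best:
        # Emit a symbol only when it differs from the previous frame
        # and is not the blank label.
        if idx != prev and idx != blank:
            out.append(alphabet[idx])
        prev = idx
    return "".join(out)


# Toy example (assumption): index 0 is the blank, then two output symbols.
alphabet = ["-", "a", "b"]
frame_probs = [
    [0.1, 0.8, 0.1],    # a
    [0.2, 0.7, 0.1],    # a (repeat, merged)
    [0.9, 0.05, 0.05],  # blank (separates the two a's)
    [0.1, 0.6, 0.3],    # a
    [0.1, 0.2, 0.7],    # b
    [0.2, 0.1, 0.7],    # b (repeat, merged)
]
decoded = ctc_greedy_decode(frame_probs, alphabet)
```

The blank is what lets the model output the same symbol twice in a row, which is why it has to sit between the two runs of `a` above.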

wendywangwwt commented 7 years ago

OC16-CE80: A Chinese-English Mixlingual Database and A Speech Recognition Baseline

https://arxiv.org/pdf/1609.08412.pdf

wendywangwwt commented 7 years ago

Automatic Recognition of Cantonese-English Code-Mixing Speech http://aclclp.org.tw/clclp/v14n3/v14n3a3.pdf

wendywangwwt commented 7 years ago

Speech Recognition on English-Mandarin Code-Switching Data using Factored Language Models http://www.csl.uni-bremen.de/cms/images/documents/publications/DA_JanGebhardt.pdf

moonnee commented 5 years ago

OC16-CE80: A Chinese-English Mixlingual Database and A Speech Recognition Baseline

https://arxiv.org/pdf/1609.08412.pdf

But this corpus is not publicly available now.

yh1008 commented 5 years ago

The dataset I used is from the LDC (Linguistic Data Consortium). Schools usually pay for an LDC membership, which allows their students to use LDC's datasets for academic purposes. You can go to their website and contact your school (if you are currently enrolled in a program) to see whether you are eligible to use it for free; otherwise you might need to pay for the membership yourself, which is expensive...

moonnee commented 5 years ago

I also use SEAME from LDC, but that corpus has Singaporean and Malaysian accents. OC16-CE80 is Putonghua-English code-mixed. I'm just wondering where I can get this corpus.

yh1008 commented 5 years ago

Ah, I see. We only used SEAME. I don't know much about the whereabouts of this Putonghua-English code-switching dataset. You may want to contact the authors listed on the paper for the details.

yh1008 commented 5 years ago

Maybe contact speechOcean (海天瑞声). I remember getting in touch with them briefly about access to this dataset, but it was also expensive... Things might have changed since then, but I would still recommend you at least try and see whether they offer some kind of collaboration.

moonnee commented 5 years ago

I emailed them, but they said this corpus is not currently for sale. What a pity!

yh1008 commented 5 years ago

Ohhhhh.... SAD

yh1008 / speech-to-text

How to build language detection model? what would be the unit of an audio? #3