yl4579 / AuxiliaryASR

Joint CTC-S2S Phoneme-level ASR for Voice Conversion and TTS (Text-Mel Alignment)
MIT License

Why is " " used as the blank in the CTCLoss? #10

Open jamesparsloe opened 1 year ago

jamesparsloe commented 1 year ago

Hey @yl4579, thank you for your great work on this (and StyleTTS).

Is there a reason for using " " as the blank token in the CTCLoss, rather than a token distinct from anything G2p can return, as is suggested here? I was thinking of using a dedicated id, something like 80, appended to the vocab defined here.
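To make the concern concrete, here is a minimal sketch of greedy CTC decoding. It assumes a toy four-symbol vocabulary and hand-written per-frame predictions (both hypothetical, not the repo's actual symbol table or model output), and the helper names are my own:

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, blank_id):
    """Apply the CTC collapse rule: merge repeated ids, then drop the blank id."""
    collapsed = [key for key, _ in groupby(frame_ids)]
    return [i for i in collapsed if i != blank_id]


# Hypothetical toy vocabulary, NOT the real one used in the repo.
vocab = [" ", "h", "a", "i"]
space_id = vocab.index(" ")
dedicated_blank_id = len(vocab)  # appended id, analogous to "id 80" above


def to_text(ids):
    return "".join(vocab[i] for i in ids)


# Hand-written per-frame argmax ids spelling "hi a": h h i <space> <space> a
frames = [1, 1, 3, 0, 0, 2]

# Blank shares the id of " ": the word boundary is removed with the blanks.
shared = to_text(ctc_greedy_decode(frames, blank_id=space_id))

# Blank has its own id: the space survives as a real symbol.
distinct = to_text(ctc_greedy_decode(frames, blank_id=dedicated_blank_id))

print(repr(shared))    # 'hia'
print(repr(distinct))  # 'hi a'
```

With a shared id, a space in the target cannot be told apart from a blank frame. In PyTorch, `torch.nn.CTCLoss` takes a `blank` argument (default 0), so switching to a dedicated id would also mean growing the model's output dimension by one.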

Would this affect the downstream training of StyleTTS much, or does the aligner just have to be a "good enough" starting point?

Thanks!