yl4579 / AuxiliaryASR

Joint CTC-S2S Phoneme-level ASR for Voice Conversion and TTS (Text-Mel Alignment)
MIT License
111 stars, 30 forks

About the loss #4

Closed: Charlottecuc closed this issue 2 years ago

Charlottecuc commented 2 years ago

Hi. Could you kindly share the training loss of the model (maybe a TensorBoard screenshot)? Thank you very much.

yl4579 commented 2 years ago

[image: training loss plot]

Ruinmou commented 2 years ago

Why are there negative values? Shouldn't the loss be positive in general? [image: WeChat Image_20220615150238] @yl4579

yl4579 commented 2 years ago

@hai8023 This happens when a lot of blank tokens appear in the targets for the CTC loss. Check your labels to make sure they don't contain the space ( ). However, even if they don't, the optimal CTC loss will be negative, because a blank token is added to the beginning and the end of each target to indicate the silence before and after the speech. This is intended behavior unless your audio is trimmed to have no silence at the beginning and the end of the speech. If you don't add these blank tokens, the loss won't be negative, but the model may not be able to recognize the silence at the beginning and the end of the speech.
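The label check mentioned above could look something like the sketch below. The function name `find_blank_in_targets`, the blank index of 0, and the sample targets are assumptions for illustration, not code from this repository; set `blank_index` to whatever index your token dictionary assigns to the blank (or to the space, if the two share an index).

```python
from typing import Dict, List, Sequence


def find_blank_in_targets(
    targets: Sequence[Sequence[int]], blank_index: int = 0
) -> Dict[int, List[int]]:
    """Return {utterance index: positions} for every target that contains the blank index.

    Hypothetical helper: blank_index=0 is an assumption, not a value taken from the repo.
    """
    offenders: Dict[int, List[int]] = {}
    for i, seq in enumerate(targets):
        positions = [j for j, tok in enumerate(seq) if tok == blank_index]
        if positions:
            offenders[i] = positions
    return offenders


# Made-up token-index sequences, one per utterance.
targets = [
    [5, 12, 7],     # no blank
    [0, 5, 12, 0],  # blank at the start and the end
    [5, 0, 7],      # blank in the middle
]
report = find_blank_in_targets(targets, blank_index=0)
print(report)  # {1: [0, 3], 2: [1]}
```

Blanks only at position 0 and the last position match the intentional padding described above; blanks anywhere else are the ones worth investigating.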

Technically speaking, the padding tokens at the beginning and the end should be <SOS> and <EOS> instead, but for voice conversion applications the input data is randomly sliced and does not form a sentence, so we abuse the notation and use the blank token to represent the silence, which is what it is meant to be used for in CTC loss.

For more explanation, please see https://discuss.pytorch.org/t/negative-ctc-loss/79548/5
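As a rough illustration of the mechanism, here is a minimal pure-Python sketch of the textbook CTC forward recursion. It is not PyTorch's implementation, and the exact values from `torch.nn.CTCLoss` may differ; the function name `ctc_neg_log_likelihood`, the blank index of 0, and the toy two-class probabilities are all assumptions. When the target itself contains the blank, the same frame-level path is counted under several alignments, so the summed "likelihood" can exceed 1 and its negative log drops below zero.

```python
import math
from typing import List, Sequence


def ctc_neg_log_likelihood(
    probs: Sequence[Sequence[float]], target: Sequence[int], blank: int = 0
) -> float:
    """Textbook CTC forward recursion in probability space (fine for tiny inputs only).

    probs[t][k] is the probability of class k at frame t; each row sums to 1.
    """
    # Extended target: a blank before, between, and after every label.
    ext: List[int] = [blank]
    for tok in target:
        ext += [tok, blank]
    size = len(ext)

    # A path may start in the first blank or in the first label.
    alpha = [0.0] * size
    alpha[0] = probs[0][ext[0]]
    if size > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        new = [0.0] * size
        for s in range(size):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # Skipping a state is allowed only between two different non-blank labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new

    # A path may end in the last label or in the final blank.
    likelihood = alpha[-1] + (alpha[-2] if size > 1 else 0.0)
    return -math.log(likelihood)


# Toy data: class 0 is the blank, class 1 is a single phoneme.
# Two "silent" frames, one "speech" frame, two "silent" frames.
frames = [[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.9, 0.1], [0.9, 0.1]]

loss_plain = ctc_neg_log_likelihood(frames, [1])         # target without blanks
loss_padded = ctc_neg_log_likelihood(frames, [0, 1, 0])  # blank added at both ends

print(loss_plain > 0)   # True: an ordinary negative log-likelihood
print(loss_padded < 0)  # True: overcounted paths make the "likelihood" exceed 1
```

In this toy setting the padded target only reaches a negative loss because the frames at both ends are confidently blank, which is consistent with the remark above that the effect depends on silence being present in the audio.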

Ruinmou commented 2 years ago

Thank you very much for your reply