Closed seaplus296 closed 13 hours ago
It seems the T5 embedding from FrozenT5 has shape (B, max_length, D).
Is the text_feature used for the semantic loss in the quantizer a mean-pooled T5 embedding from FrozenT5?
- Neural codecs and vocoders are usually trained on random segments of audio. Is LLM-Codec also trained on random segments, or on whole audio?
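For reference, mean-pooling a padded (B, max_length, D) embedding usually has to respect the attention mask so padding tokens don't dilute the average. A minimal sketch (not the repo's actual code; the function name is mine):

```python
import torch

def masked_mean_pool(emb: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Mean-pool token embeddings (B, T, D) over non-padding positions.

    mask: (B, T) with 1 for real tokens, 0 for padding.
    """
    mask = mask.unsqueeze(-1).to(emb.dtype)   # (B, T, 1)
    summed = (emb * mask).sum(dim=1)          # sum only real tokens -> (B, D)
    counts = mask.sum(dim=1).clamp(min=1.0)   # (B, 1), avoid divide-by-zero
    return summed / counts

# toy check: padding positions must not affect the pooled vector
emb = torch.zeros(1, 4, 2)
emb[0, 0] = torch.tensor([2.0, 4.0])
emb[0, 1] = torch.tensor([0.0, 2.0])
emb[0, 2:] = 99.0                             # garbage in padding slots
mask = torch.tensor([[1, 1, 0, 0]])
pooled = masked_mean_pool(emb, mask)          # → tensor([[1., 3.]])
```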
@yangdongchao Thanks for the fast reply. So the T5 embedding comes from the padded transcript or caption of the whole utterance, and the quantized latent comes from a random crop?
By the way, I really like this approach of injecting subword- and word-level information directly into the codec.
Yes, you are right. I am sorry for the late reply; I did not notice this message over the past few days.
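The random-segment training mentioned above is typically just a fixed-length crop at a random offset, with zero-padding for clips shorter than the segment. A sketch of what that looks like (my own illustration, not code from this repo):

```python
import torch
import torch.nn.functional as F

def random_segment(wav: torch.Tensor, segment_len: int) -> torch.Tensor:
    """Crop a random fixed-length segment from a waveform (C, T).

    If the audio is shorter than segment_len, zero-pad on the right instead.
    """
    c, t = wav.shape
    if t <= segment_len:
        return F.pad(wav, (0, segment_len - t))
    start = torch.randint(0, t - segment_len + 1, (1,)).item()
    return wav[:, start:start + segment_len]

wav = torch.randn(1, 48_000)          # e.g. 3 s of 16 kHz audio
seg = random_segment(wav, 16_000)     # 1 s training crop
```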
It seems the T5 embedding from FrozenT5 has shape (B, max_length, D):
https://github.com/yangdongchao/LLM-Codec/blob/e21c1bff56fa40d46e42f2906838129aa4f2003d/codec/MSCodec.py#L73-L78
Is the text_feature used for the semantic loss in the quantizer a mean-pooled T5 embedding from FrozenT5?
https://github.com/yangdongchao/LLM-Codec/blob/e21c1bff56fa40d46e42f2906838129aa4f2003d/codec/vq.py#L113
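I have not verified the exact loss at the linked line in vq.py, but a semantic loss between a quantized latent and a pooled text feature is commonly a cosine-distance term. A generic sketch under that assumption (the function name and pooling choice are mine):

```python
import torch
import torch.nn.functional as F

def semantic_loss(quantized: torch.Tensor, text_feature: torch.Tensor) -> torch.Tensor:
    """Cosine-distance loss between a quantized latent (B, T, D),
    pooled over time, and a mean-pooled text embedding (B, D)."""
    pooled = quantized.mean(dim=1)                         # (B, D)
    cos = F.cosine_similarity(pooled, text_feature, dim=-1)
    return (1.0 - cos).mean()                              # 0 when aligned

q = torch.randn(2, 50, 8)     # fake quantized codec latents
txt = torch.randn(2, 8)       # fake pooled T5 text features
loss = semantic_loss(q, txt)
```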