Closed seaplus296 closed 13 hours ago
It seems the T5 embedding from FrozenT5 has shape (B, max_length, D).
Is the text_feature used for the semantic loss in the quantizer a mean-pooled T5 embedding from FrozenT5?
- Neural codecs and vocoders are usually trained on random segments of audio. Is LLM-Codec also trained on random segments, or on whole audio?
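For reference, mean-pooling a padded (B, max_length, D) embedding usually has to respect the attention mask so padding tokens don't dilute the average. A minimal sketch (not the repo's actual code; the function name is mine):

```python
import torch

def masked_mean_pool(emb: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Mean-pool token embeddings (B, T, D) over non-padding positions.

    mask: (B, T) with 1 for real tokens, 0 for padding.
    """
    mask = mask.unsqueeze(-1).to(emb.dtype)   # (B, T, 1)
    summed = (emb * mask).sum(dim=1)          # sum only real tokens -> (B, D)
    counts = mask.sum(dim=1).clamp(min=1.0)   # (B, 1), avoid divide-by-zero
    return summed / counts

# toy check: padding positions must not affect the pooled vector
emb = torch.zeros(1, 4, 2)
emb[0, 0] = torch.tensor([2.0, 4.0])
emb[0, 1] = torch.tensor([0.0, 2.0])
emb[0, 2:] = 99.0                             # garbage in padding slots
mask = torch.tensor([[1, 1, 0, 0]])
pooled = masked_mean_pool(emb, mask)          # → tensor([[1., 3.]])
```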
@yangdongchao Thanks for the fast reply. So the T5 embedding comes from the padded transcript or caption of the whole utterance, and the quantized latent comes from a random crop?
By the way, I really like this approach of injecting subword- and word-level information directly into the codec.
Yes, you are right. I am sorry for the late reply; I did not notice this message over the past few days.
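The random-segment training mentioned above is typically just a fixed-length crop at a random offset, with zero-padding for clips shorter than the segment. A sketch of what that looks like (my own illustration, not code from this repo):

```python
import torch
import torch.nn.functional as F

def random_segment(wav: torch.Tensor, segment_len: int) -> torch.Tensor:
    """Crop a random fixed-length segment from a waveform (C, T).

    If the audio is shorter than segment_len, zero-pad on the right instead.
    """
    c, t = wav.shape
    if t <= segment_len:
        return F.pad(wav, (0, segment_len - t))
    start = torch.randint(0, t - segment_len + 1, (1,)).item()
    return wav[:, start:start + segment_len]

wav = torch.randn(1, 48_000)          # e.g. 3 s of 16 kHz audio
seg = random_segment(wav, 16_000)     # 1 s training crop
```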
It seems the T5 embedding from FrozenT5 has shape (B, max_length, D):
https://github.com/yangdongchao/LLM-Codec/blob/e21c1bff56fa40d46e42f2906838129aa4f2003d/codec/MSCodec.py#L73-L78
Is the text_feature used for the semantic loss in the quantizer a mean-pooled T5 embedding from FrozenT5?
https://github.com/yangdongchao/LLM-Codec/blob/e21c1bff56fa40d46e42f2906838129aa4f2003d/codec/vq.py#L113
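I have not verified the exact loss at the linked line in vq.py, but a semantic loss between a quantized latent and a pooled text feature is commonly a cosine-distance term. A generic sketch under that assumption (the function name and pooling choice are mine):

```python
import torch
import torch.nn.functional as F

def semantic_loss(quantized: torch.Tensor, text_feature: torch.Tensor) -> torch.Tensor:
    """Cosine-distance loss between a quantized latent (B, T, D),
    pooled over time, and a mean-pooled text embedding (B, D)."""
    pooled = quantized.mean(dim=1)                         # (B, D)
    cos = F.cosine_similarity(pooled, text_feature, dim=-1)
    return (1.0 - cos).mean()                              # 0 when aligned

q = torch.randn(2, 50, 8)     # fake quantized codec latents
txt = torch.randn(2, 8)       # fake pooled T5 text features
loss = semantic_loss(q, txt)
```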