About the input on ARCH benchmark

zhenye234 / xcodec

Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model


About the input on ARCH benchmark #14

Open dzr1026 opened 1 week ago

dzr1026 commented 1 week ago

Thank you for your work; it is a very innovative piece of research. I have a question about the ARCH benchmark results (Table 5): what is the input for these results? Specifically, what is the "semantic representation"? Is it the latent representation after RVQ (residual vector quantization), or is it the sum of the latents from all eight quantizers?

zhenye234 commented 1 week ago

It is the quantized semantic feature, computed here: https://github.com/zhenye234/xcodec/blob/a2e52d30b1ea424f76bb6b88357484d8021f3ab3/models/soundstream_semantic.py#L114
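
For readers comparing the two options raised in the question above: in a standard residual vector quantizer, the output after quantization is by construction the sum of the codewords selected at each stage, so "the latent after RVQ" and "the sum over all eight quantizers" are the same vector. The following is a minimal sketch in plain Python under that assumption. It is not the xcodec implementation, and the names, codebook sizes, and random codebooks are hypothetical.

```python
import random


def nearest(codebook, vec):
    """Return the index of the codeword closest to vec (squared Euclidean distance)."""
    def dist(codeword):
        return sum((a - b) ** 2 for a, b in zip(codeword, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_quantize(codebooks, vec):
    """Residual vector quantization of a single feature vector.

    Each stage quantizes the residual left over by the previous stages.
    Returns (indices, quantized), where quantized is the running sum of
    the codewords chosen at every stage.
    """
    residual = list(vec)
    quantized = [0.0] * len(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        code = codebook[idx]
        indices.append(idx)
        quantized = [q + c for q, c in zip(quantized, code)]
        residual = [r - c for r, c in zip(residual, code)]
    return indices, quantized


# Hypothetical toy setup: 8 stages, 16 codewords per stage, 4-dimensional features.
rng = random.Random(0)
NUM_QUANTIZERS, CODEBOOK_SIZE, DIM = 8, 16, 4
codebooks = [
    [[rng.gauss(0, 1.0 / (stage + 1)) for _ in range(DIM)]
     for _ in range(CODEBOOK_SIZE)]
    for stage in range(NUM_QUANTIZERS)
]
feature = [rng.gauss(0, 1) for _ in range(DIM)]

indices, quantized = rvq_quantize(codebooks, feature)

# Looking up the selected codeword in every stage and summing them
# reproduces the RVQ output.
summed = [
    sum(codebooks[stage][idx][d] for stage, idx in enumerate(indices))
    for d in range(DIM)
]
```

Here `quantized` and `summed` hold the same values, which is why a single "quantized feature" can serve as the representation regardless of how many quantizer stages contributed to it.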

dzr1026 commented 1 week ago

Thank you for your reply!

ggiggit commented 3 days ago

@zhenye234 Thanks for your previous response! I have a couple more questions about Table 5, if you don't mind:

Could you please clarify the semantic representations for DAC, Encodec, and the Baseline Acoustic Codec in Table 5?
Also, why was SpeechTokenizer excluded from the comparison?

Thanks so much for your help!