zhenye234 / xcodec

Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model

The details on HuBERT-General-Audio #7

Open vican9000 opened 1 week ago

vican9000 commented 1 week ago

Hey, first of all, great work!

Two things bug me though:

  1. What's the semantic value of the HuBERT model you trained if it's using the first RVQ layer of the acoustic tokenizer? I.e. the acoustic model is already exposed to that.
  2. What was the sampling rate of the input audio for the semantic model? Is it the same as for the acoustic model?

zhenye234 commented 2 days ago

Thank you for your interest in our work.

  1. Our training follows the HuBERT recipe, with one modification: the target produced by the acoustic unit discovery step. Instead of running k-means clustering on MFCCs, we use the first VQ (vector quantization) layer of the codec.
  2. 16 kHz.
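To make the difference in targets concrete, here is a minimal sketch of how frame-level labels could be read off a first VQ layer by nearest-neighbour lookup, in place of k-means cluster IDs over MFCCs. It is not taken from the xcodec code: the function name `first_vq_layer_targets`, the toy codebook, and the array shapes are all assumptions made for illustration, and the real codec's encoder and quantizer are not shown.

```python
import numpy as np


def first_vq_layer_targets(latents: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Hypothetical helper: map each frame to its nearest codebook entry.

    latents:  (T, D) array of per-frame encoder outputs (assumed shape).
    codebook: (K, D) array of first-layer VQ code vectors (assumed shape).
    Returns a (T,) array of integer code indices, usable as discrete
    pseudo-labels for HuBERT-style masked prediction.
    """
    # Squared Euclidean distance between every frame and every code vector.
    diffs = latents[:, None, :] - codebook[None, :, :]
    distances = (diffs ** 2).sum(axis=-1)  # shape (T, K)
    # The index of the closest code vector is the frame's target label.
    return distances.argmin(axis=1)


# Toy codebook (illustrative only): entry k is the vector [k, k, k, k].
codebook = np.arange(8, dtype=np.float64)[:, None] * np.ones((1, 4))

# Three toy frames, each a slightly shifted copy of entries 3, 1 and 7.
latents = codebook[[3, 1, 7]] + 0.1

targets = first_vq_layer_targets(latents, codebook)
print(targets.tolist())  # [3, 1, 7]
```

In a HuBERT-style setup these indices would play the role that k-means cluster assignments play in the original recipe: the model is trained to predict them at masked positions.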