Open zhixhan opened 1 month ago
Could you please specify the version of the xcodec model?
Thank you for your reply. I tested with the model named xcodec_hubert_librispeech.
Maybe you can try the continuous representation here: https://github.com/zhenye234/xcodec/blob/60cf2046d03fe60a5aefd64f1347076c061a4460/models/soundstream_semantic.py#L114
Thank you for your reply! I have tested the XCodec model with the o_semnatic representation and got ABX error rates of 4.4% and 5.5%, which still differ somewhat from the results reported in your paper (3.3% and 4.3%).
When I extracted the o_semnatic representation with the SoundStream.forward method, I got the error "e_acoustic and e_semantic have different shape in dim2" at https://github.com/zhenye234/xcodec/blob/main/models/soundstream_semantic.py#L102. I therefore added the same pad operation that is used in the encode method. I don't think this is the cause of the inconsistent results, and I did not make any other changes to the source code. Do you have any other suggestions? Thanks again for your reply.
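For readers who hit the same shape error, here is a minimal sketch of one generic way to reconcile a length mismatch along dim 2 before concatenating two feature maps. It is an illustration only: the helper name `match_time_dim`, the toy shapes, and the choice of right-side zero padding are assumptions, and this is not necessarily the padding that the repository's encode method applies. NumPy is used so the snippet runs without the model; the same idea carries over to tensors.

```python
import numpy as np


def match_time_dim(e_acoustic, e_semantic):
    """Zero-pad the shorter of two (batch, channels, time) arrays on the right.

    Hypothetical helper for illustration; not taken from the xcodec code.
    """
    target = max(e_acoustic.shape[2], e_semantic.shape[2])

    def pad_to_target(x):
        missing = target - x.shape[2]
        # Pad only the last (time) axis, and only at the end.
        return np.pad(x, ((0, 0), (0, 0), (0, missing)))

    return pad_to_target(e_acoustic), pad_to_target(e_semantic)


# Toy shapes (assumed): the two branches differ by one frame in dim 2.
e_acoustic = np.ones((1, 4, 49))
e_semantic = np.ones((1, 3, 50))

a, s = match_time_dim(e_acoustic, e_semantic)
fused = np.concatenate([a, s], axis=1)  # concatenate along the channel axis
print(fused.shape)
```

Padding the features changes only the final frame, so it is unlikely by itself to shift an utterance-level metric much, but padding the input waveform instead would affect every frame's receptive field slightly.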
Thanks for your amazing work.
I evaluated the released xcodec model on the LibriSpeech test-clean set using the ABX error rate metric. I performed the evaluation with the continuous representations before RVQ and after RVQ, but got 9.9% and 13.2% for within ABX and cross ABX respectively, which are much higher than those reported in the paper. However, when I evaluate SpeechTokenizer in the same way, I get consistent results (3.6% and 4.7%).
Could you please give me some suggestions? Thank you so much!
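For context on the metric, here is a simplified sketch of how an ABX error rate can be computed from frame-level features. In each triple, A and X belong to the same phonetic category and B to a different one; the triple counts as an error when X is closer to B than to A. In the "within" condition all three items come from the same speaker, while in the "cross" condition X comes from a different speaker. The function names, the toy features, and the normalisation of the DTW cost by the longer sequence length are assumptions made for illustration, not the exact protocol of any evaluation toolkit.

```python
import math


def angular_distance(u, v):
    """Angle between two feature frames, scaled to the range [0, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cos) / math.pi


def dtw_distance(x, y):
    """DTW cost between two frame sequences.

    Normalised by the longer length, which is a simplification.
    """
    n, m = len(x), len(y)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = angular_distance(x[i - 1], y[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] / max(n, m)


def abx_error_rate(triples):
    """Fraction of (A, B, X) triples where X is not closer to A than to B."""
    errors = 0.0
    for a, b, x in triples:
        d_ax, d_bx = dtw_distance(a, x), dtw_distance(b, x)
        if d_ax > d_bx:
            errors += 1.0
        elif d_ax == d_bx:
            errors += 0.5  # ties count as half an error
    return errors / len(triples)


# Toy 2-dimensional features (made up): X resembles A in the first triple
# and resembles B in the second one.
triples = [
    ([[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]], [[1.0, 0.1], [1.0, 0.0]]),
    ([[1.0, 0.0]], [[0.0, 1.0]], [[0.1, 1.0]]),
]
rate = abx_error_rate(triples)
print(f"ABX error rate: {rate:.1%}")
```

Because the score depends on the frame-level distance and on which layer the features are taken from, differences in either choice between two evaluation setups can produce different numbers for the same model.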