Confusion about the ABX error rate

zhixhan commented 1 month ago

Thanks for your amazing work.

I evaluated the released xcodec model on the LibriSpeech test-clean set using the ABX error rate metric. I ran the evaluation on the continuous representations before and after RVQ, but got 9.9% and 13.2% for within ABX and cross ABX respectively, which are much higher than the values reported in the paper. However, the same procedure gives consistent results for SpeechTokenizer (3.6% and 4.7%).

Could you please give me some suggestions? Thank you so much!
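
For readers unfamiliar with the metric, here is a minimal sketch of how an ABX error rate can be computed from frame-level representations. It is not the evaluation script used in this issue or in the paper: the function names (angular_distance, dtw_distance, abx_error) are hypothetical, the distance is a simplified DTW over angular frame distances, and the within/cross speaker conditions (which add constraints on the speakers of A, B and X) are not modelled.

```python
import math
from itertools import product


def angular_distance(u, v):
    """Angle in radians between two frame vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        # Assumption: treat an all-zero frame as orthogonal to everything.
        return math.pi / 2
    cos = max(-1.0, min(1.0, dot / norm))
    return math.acos(cos)


def dtw_distance(x, y):
    """Simplified DTW distance between two frame sequences.

    Picks the alignment path with the lowest accumulated angular
    distance and divides that cost by the path length.
    """
    n, m = len(x), len(y)
    inf = float("inf")
    # cost[i][j] holds (accumulated cost, path length) for the best path to (i, j).
    cost = [[(inf, 0)] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = (0.0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = angular_distance(x[i - 1], y[j - 1])
            prev = min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
            cost[i][j] = (prev[0] + d, prev[1] + 1)
    total, length = cost[n][m]
    return total / length


def abx_error(items):
    """ABX error rate over all valid (A, B, X) triples.

    items is a list of (category, frames) pairs, where frames is a
    list of feature vectors. A and X share a category, B does not.
    A triple is an error when X is closer to B than to A; ties count
    as half an error.
    """
    errors, total = 0.0, 0
    for (ca, a), (cb, b), (cx, x) in product(items, repeat=3):
        if a is x or ca != cx or cb == ca:
            continue
        d_ax = dtw_distance(a, x)
        d_bx = dtw_distance(b, x)
        if d_ax > d_bx:
            errors += 1.0
        elif d_ax == d_bx:
            errors += 0.5
        total += 1
    return errors / total if total else float("nan")
```

Because the score depends only on relative distances between representations, extracting features from a different layer (for example before versus after quantization) can change the error rate noticeably even with an identical scoring procedure.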

zhenye234 commented 1 month ago

Could you please specify the version of the xcodec model?

zhixhan commented 1 month ago

Could you please specify the version of the xcodec model?

Thank you for your reply. I tested with the model named xcodec_hubert_librispeech.

zhenye234 commented 1 month ago

Maybe you can try the continuous representation here https://github.com/zhenye234/xcodec/blob/60cf2046d03fe60a5aefd64f1347076c061a4460/models/soundstream_semantic.py#L114

zhixhan commented 1 month ago

Maybe you can try the continuous representation here

https://github.com/zhenye234/xcodec/blob/60cf2046d03fe60a5aefd64f1347076c061a4460/models/soundstream_semantic.py#L114

Thank you for your reply! I tested the XCodec model with the o_semantic representation and got ABX error rates of 4.4% and 5.5%, which are still somewhat higher than the results reported in your paper (3.3% and 4.3%).

When I extracted the o_semantic representation with the SoundStream.forward method, I got the error "e_acoustic and e_semantic have different shape in dim2" at https://github.com/zhenye234/xcodec/blob/main/models/soundstream_semantic.py#L102. To work around it, I added the same pad operation that the encode method uses. I don't think this is the cause of the inconsistent results, and I did not make any other changes to the source code. Do you have any other suggestions? Thanks again for your reply.

zhenye234 / xcodec

Confusion about the ABX error rate #9