yochaiye / LipVoicer

Official Code implementation for the ICLR paper "LipVoicer: Generating Speech from Silent Videos Guided by Lip Reading"
MIT License

The LSE-C result for GT (Ground Truth) on the LRS2 dataset that I tested is 8.248 instead of 6.840 #5

Open MyBeautiful-Fantasy opened 2 months ago

MyBeautiful-Fantasy commented 2 months ago

Excellent work! Amazing LipVoicer!

I have a small question about the sync evaluation metrics, LSE-C and LSE-D.

In "LipVoicer: Generating Speech from Silent Videos Guided by Lip Reading", the LSE-C reported for the LRS2 ground truth is 6.840 and the LSE-D is 7.194 (see Table 3 on page 8). Following the Wav2Lip evaluation guidance, I measured an LSE-C of 8.248 and an LSE-D of 6.258 for the LRS2 ground truth.

I found that my result (LSE-C of 8.248) is close to the one reported in "Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert" (see Table 1 on page 7). By the way, my evaluation set is the test split of LRS2, 1243 videos in total. Did the author of LipVoicer fine-tune syncnet_v2.model (./syncnet_python/data/)?

yochaiye commented 2 months ago

Hi,

Thank you for the positive feedback! It is always heartwarming to hear. As for your question, I don't remember it as accurately as I would like (it has been a while), but I think I ran the Wav2Lip recipe for real videos on LRS2. Basically, the Wav2Lip repo has two scripts for calculating the metrics: calculate_score_LRS and calculate_score_real_videos. As far as I remember, when I used the LRS recipe on ground-truth videos of LRS3 it matched the results obtained by the real-videos script. However, on LRS2 the two scripts led to disagreeing results. Since calculate_score_real_videos adds preprocessing steps on top of what you can find in calculate_score_LRS, it is likely to yield the more accurate results, albeit being more time-consuming as well.

MyBeautiful-Fantasy commented 2 months ago

Thank you very much for your reply, I understand. I will try the second option, calculate_score_real_videos.

By the way, do you think the result calculated with the first method (calculate_score_LRS) is acceptable? Or is the second method needed to get accurate results on the LRS2 dataset?


A kind tip for anyone reproducing: renaming the audio_dir key on line 24 of ./configs/config.yaml to audios_dir prevents the error "TypeError: LipVoicerDataset.__init__() got an unexpected keyword argument 'audio_dir'", which appears to be a spelling error (see the snippet below). Thanks again to the author for proposing such excellent work :)
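For illustration, this is roughly the change I mean (the path value is just a placeholder, not the actual one from the repo):

```yaml
# configs/config.yaml, line 24
# audio_dir: /path/to/audios    # old key name, triggers the TypeError above
audios_dir: /path/to/audios     # renamed key matches the keyword argument LipVoicerDataset expects
```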

yochaiye commented 2 months ago

Hi,

Sorry for the delay in my response. I think it would be OK to use calculate_score_LRS, as it is the recipe prescribed by the authors of Wav2Lip.

Thank you for pointing out the typo in the code; I'll fix it.

MyBeautiful-Fantasy commented 2 months ago

Dear author,

I hope this message finds you well and that I’m not causing any inconvenience. I have what may be my final question for a while.

Could you kindly provide the complete config.yaml file for the GRID dataset?


Alternatively, is it sufficient to only modify the w_video, w_asr, and asr_start parameters in the existing config.yaml file, while keeping the other configurations (e.g., [diffusion][T], [diffusion][beta_0], ...) the same as in the config.yaml for LRS? A rough sketch of what I mean follows below.
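To make the question concrete, something along these lines (the values are placeholders I made up, not the actual GRID settings, and the exact nesting in config.yaml may differ):

```yaml
# Hypothetical sketch: only the guidance-related keys change for GRID;
# everything else (diffusion: T, beta_0, ..., model settings) stays as in the LRS config.
w_video: 2.0      # placeholder value
w_asr: 1.5        # placeholder value
asr_start: 100    # placeholder value
```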

Thank you for your assistance!

Best regards

yochaiye commented 1 month ago

It is sufficient to change the values that you stated in the config file.