wngh1187 / Diff-SV

PyTorch implementation of Diff-SV: A Unified Hierarchical Framework for Noise-Robust Speaker Verification Using Score-Based Diffusion Probabilistic Models
MIT License

some training problems #1

Open Catherinehui opened 8 months ago

Catherinehui commented 8 months ago

Hello, I tried to retrain the model to reproduce your results, but the error below is reported every time, no matter which server I train on. I did not modify the code. Have you come across this before, and could you give me some advice? Thank you very much!

[Attached screenshot: device_issue]

wngh1187 commented 8 months ago

Hello, @Catherinehui

First of all, thank you very much for your interest in our study. I'm sorry to hear about the issues you're experiencing during your retraining process. Based on the information you've provided, it seems the error occurs in the evaluation stage. There are mainly two possible causes that I suspect:

  1. Insufficient GPU memory during the all-gather operation after extracting the evaluation embeddings from each GPU. https://github.com/wngh1187/Diff-SV/blob/aa654b0e5a2089eaaebb7fcab9809816476f3edb/code/diff_sv/trainers/train.py#L228
  2. Lack of RAM when calculating the cosine similarity among all pairs of trials. https://github.com/wngh1187/Diff-SV/blob/aa654b0e5a2089eaaebb7fcab9809816476f3edb/code/diff_sv/trainers/train.py#L260

If the first cause applies, you may need to modify the DDP-based embedding extraction process. If the second applies, one solution is to compute the cosine similarity one trial pair at a time instead of for all pairs at once.
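
For the second case, a minimal sketch of per-trial scoring might look like the following. It is not the code in `train.py`. It assumes the evaluation embeddings have already been moved to CPU as NumPy arrays in a dictionary keyed by utterance ID, and that the trials are a list of (enrollment, test) ID pairs. The name `score_trials` and both argument names are hypothetical.

```python
import numpy as np


def score_trials(embeddings, trials, eps=1e-12):
    """Cosine similarity for each (enroll_id, test_id) trial, one pair at a time.

    `embeddings` maps an utterance ID to a 1-D vector; `trials` is a sequence
    of (enroll_id, test_id) pairs. Both layouts are assumptions made for this
    sketch, not the repository's actual data structures.
    """
    scores = np.empty(len(trials), dtype=np.float32)
    for i, (enroll_id, test_id) in enumerate(trials):
        a = np.asarray(embeddings[enroll_id], dtype=np.float32)
        b = np.asarray(embeddings[test_id], dtype=np.float32)
        # Guard against a zero-norm embedding to avoid division by zero.
        denom = max(float(np.linalg.norm(a)) * float(np.linalg.norm(b)), eps)
        scores[i] = float(np.dot(a, b)) / denom
    return scores


# Toy usage with made-up IDs and 2-D vectors.
toy_embeddings = {
    "utt_a": np.array([1.0, 0.0]),
    "utt_b": np.array([0.0, 1.0]),
    "utt_c": np.array([2.0, 0.0]),
}
toy_trials = [("utt_a", "utt_b"), ("utt_a", "utt_c"), ("utt_a", "utt_a")]
toy_scores = score_trials(toy_embeddings, toy_trials)
```

Scoring this way keeps the extra memory proportional to the number of trials rather than to the square of the number of utterances, at the cost of a slower Python loop.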

Please consider these potential solutions and see if they help resolve your issue. If you continue to encounter problems, please don't hesitate to provide more details so I can offer more targeted assistance.