some training problems - Githubissues

Hello, @Catherinehui

First of all, thank you very much for your interest in our study. I'm sorry to hear about the issues you're experiencing during retraining. Based on the information you've provided, the error seems to occur in the evaluation stage. I suspect one of two main causes:

1. Insufficient GPU memory during the all-gather operation after extracting the evaluation embeddings from each GPU. https://github.com/wngh1187/Diff-SV/blob/aa654b0e5a2089eaaebb7fcab9809816476f3edb/code/diff_sv/trainers/train.py#L228
2. Lack of RAM when calculating the cosine similarity among all pairs of trials. https://github.com/wngh1187/Diff-SV/blob/aa654b0e5a2089eaaebb7fcab9809816476f3edb/code/diff_sv/trainers/train.py#L260

If the first cause applies, you might need to modify the DDP-based embedding extraction process. If the second is the problem, one solution might be to compute the cosine similarity separately for each trial pair instead of for all pairs at once.
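For the second case, here is a rough sketch of what per-trial scoring could look like. It is not taken from the repository's `train.py`; the function name `score_trials`, the `embeddings` dictionary, the `trials` list, and the `batch_size` parameter are all hypothetical, and it uses NumPy only to keep the example self-contained. The idea is to score the listed trial pairs in small batches so that memory grows with the batch size rather than with the full pairwise similarity matrix.

```python
import numpy as np


def score_trials(embeddings, trials, batch_size=4096):
    """Cosine similarity for each (enroll_id, test_id) pair, computed in batches.

    embeddings: dict mapping an utterance id to a 1-D NumPy array (hypothetical layout)
    trials:     list of (enroll_id, test_id) tuples
    """
    eps = 1e-12  # guards against division by zero for an all-zero embedding
    # Normalize each embedding once so cosine similarity becomes a dot product.
    normed = {
        key: vec / max(np.linalg.norm(vec), eps)
        for key, vec in embeddings.items()
    }

    scores = np.empty(len(trials), dtype=np.float32)
    for start in range(0, len(trials), batch_size):
        batch = trials[start:start + batch_size]
        enroll = np.stack([normed[a] for a, _ in batch])
        test = np.stack([normed[b] for _, b in batch])
        # Row-wise dot product: one score per trial, no N x N matrix is built.
        scores[start:start + len(batch)] = np.sum(enroll * test, axis=1)
    return scores


if __name__ == "__main__":
    emb = {
        "a": np.array([1.0, 0.0]),
        "b": np.array([0.0, 1.0]),
        "c": np.array([1.0, 1.0]),
    }
    print(score_trials(emb, [("a", "b"), ("a", "c"), ("a", "a")], batch_size=2))
```

A smaller `batch_size` lowers peak memory at the cost of more loop iterations, so it can be tuned to whatever RAM is available on your machine.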

Please consider these potential solutions and see if they help resolve your issue. If you continue to encounter problems, please don't hesitate to provide more details so I can offer more targeted assistance.

wngh1187 / Diff-SV

some training problems #1