yangcaoai / CoDA_NeurIPS2023

Official code for NeurIPS2023 paper: CoDA: Collaborative Novel Box Discovery and Cross-modal Alignment for Open-vocabulary 3D Object Detection
https://yangcaoai.github.io/publications/CoDA.html
MIT License

Training Problem #15

Open · JiangXiaobai00 opened 1 week ago

JiangXiaobai00 commented 1 week ago

Hi, I encountered two issues during training:

1. The memory usage gradually increases during training, eventually resulting in an "out of memory" error.
2. The second phase of training is missing a data file: Data/sunrgb_d/sunrgbd_v1_revised_0415/sunrgbd_pc_bbox_votes_50k_v1_all_classes_revised_0415_noveltrain_pseudo_labels_setting0/007974_novel_bbox.npy.

yangcaoai commented 1 week ago

Hi, (1) Which GPU do you use, and how much memory does it have? (2) Could you please check whether the path 'Data/sunrgb_d/sunrgbd_v1_revised_0415/sunrgbd_pc_bbox_votes_50k_v1_all_classes_revised_0415_noveltrain_pseudo_labels_setting0/' exists?

JiangXiaobai00 commented 1 week ago

Hello. Regarding the first issue, I am using 8 I20-restricted GPUs with 48 GB each, and memory usage has gradually increased from around 16 GB. After retraining, usage has now reached 32 GB and looks like it will keep growing. As for the second issue, the path indeed does not exist. Does it need to be generated, and if so, how?

yangcaoai commented 1 week ago

Hi, (1) That memory is more than enough for training. I'm not sure why memory keeps increasing in your runs; my training and other users' training haven't had this issue, so you may need to debug the training process. (2) Yes, the discovered novel boxes are stored in that path during the second stage, so you'll need to create the empty directory 'Data/sunrgb_d/sunrgbd_v1_revised_0415/sunrgbd_pc_bbox_votes_50k_v1_all_classes_revised_0415_noveltrain_pseudo_labels_setting0/'.
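
For example, from the repository root (a minimal sketch; adjust the relative path if your data directory lives elsewhere):

```python
import os

# Directory where the discovered novel boxes are written during the
# second training stage (path taken from the error message above).
pseudo_label_dir = (
    "Data/sunrgb_d/sunrgbd_v1_revised_0415/"
    "sunrgbd_pc_bbox_votes_50k_v1_all_classes_revised_0415_noveltrain_pseudo_labels_setting0"
)
os.makedirs(pseudo_label_dir, exist_ok=True)
```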

JiangXiaobai00 commented 1 week ago

Understood. For the first issue, based on the debugging results, it seems that some variables from the validation phase are not being released: there is a large jump in memory usage after validation. Memory also grows during the training phase, but by a much smaller amount.
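
Roughly, the kind of logging used to see this looks like the sketch below (the `evaluate` call is a placeholder, not the actual CoDA entry point):

```python
import torch

def log_gpu_memory(tag: str) -> None:
    # Report allocated and reserved GPU memory (in MB) on the current device.
    allocated = torch.cuda.memory_allocated() / 1024 ** 2
    reserved = torch.cuda.memory_reserved() / 1024 ** 2
    print(f"[{tag}] allocated: {allocated:.1f} MB | reserved: {reserved:.1f} MB")

# Usage around the validation step in the training loop:
#   log_gpu_memory("before validation")
#   evaluate(model, val_loader)        # placeholder validation call
#   log_gpu_memory("after validation")
```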

yangcaoai commented 1 week ago

Interesting observation. Does the validation complete fully during training? (You could check whether the performance is successfully printed and whether the numbers look reasonable.)

JiangXiaobai00 commented 1 week ago

Thank you for your response. The performance looks normal. It also seems that even without the testing phase, memory is still insufficient. I saw the same behavior when training with 4x 2080 Ti GPUs: memory usage increases by about 10 MB every 10 training iterations.

yangcaoai commented 1 week ago

That's a bit strange. I already call optimizer.zero_grad() during training, and use model.eval() with @torch.no_grad() during testing, to prevent gradients from accumulating. Besides, 8 I20-restricted GPUs with 48 GB each should be more than enough to train the model. The key is to find what is causing the memory growth.
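
In simplified form, the pattern is roughly the following (a sketch of the general structure, not the actual CoDA training code; names like `train_one_epoch` and `criterion` are placeholders):

```python
import torch

def train_one_epoch(model, loader, optimizer, criterion, device):
    model.train()
    for points, targets in loader:
        points, targets = points.to(device), targets.to(device)
        optimizer.zero_grad()          # drop gradients from the previous step
        loss = criterion(model(points), targets)
        loss.backward()
        optimizer.step()

@torch.no_grad()                       # no graph is built, so nothing can accumulate
def evaluate(model, loader, device):
    model.eval()                       # switch off dropout, use running BN stats
    for points, targets in loader:
        outputs = model(points.to(device))
        # ... compute metrics without keeping tensors across iterations
```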

JiangXiaobai00 commented 1 week ago

Yes, in theory it should be fine, but training keeps stopping due to out-of-memory errors.

yangcaoai commented 1 week ago

Do you use PyTorch 1.8.1, torchvision 0.9.1, CUDA 10.1, and Python 3.7?
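
A quick way to compare is to print the installed versions, e.g.:

```python
import sys
import torch
import torchvision

# Versions that matter for matching the reference environment
# (PyTorch 1.8.1, torchvision 0.9.1, CUDA 10.1, Python 3.7).
print("Python:     ", sys.version.split()[0])
print("PyTorch:    ", torch.__version__)
print("torchvision:", torchvision.__version__)
print("CUDA build: ", torch.version.cuda)
```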

JiangXiaobai00 commented 1 week ago

No. Since my CUDA version is 11.7, I use the corresponding Python 3.8.19, PyTorch 2.0.1, and torchvision 0.15.2.

yangcaoai commented 1 week ago

The cause may be the difference between the experiment environments.

JiangXiaobai00 commented 1 week ago

Thanks. I will switch to the same environment, rerun the training, and monitor the memory usage.

yangcaoai commented 1 week ago

Yeah, my pleasure. If you have any further questions, feel free to continue the discussion.

JiangXiaobai00 commented 1 week ago

Thanks very much for your patient responses.