yssjglg-elder opened this issue 2 years ago
Thanks for your interest in our work. We currently only support single-GPU training with a batch size of 32. Are you able to train fine on a single GPU?
Multi-GPU training is currently not supported in this codebase, since we are able to train in a reasonable time, i.e., around 1-2 days (both synthetic training and real fine-tuning) with 13 GB of GPU memory. Having said this, please feel free to open a pull request or feature request for multi-GPU training. We can try to look into it but cannot promise that this feature will be available soon.
Hope it helps!
Ok, I see, thank you for your help.
Thanks for the first issue! As I found in the author's ShAPO project, there is the same issue: training has to be run on a GPU with more than 13 GB of memory. Could you address this limitation? Because my GPUs each have only 12 GB, I can't train the ShAPO or CenterSnap models. Thanks!
Hi @dongho-Han,
Are you able to train with a smaller batch size than 32? We were able to fit CenterSnap and ShAPO on 13 GB memory with a batch size of 32. Let us know if a smaller batch size works for you.
For multi-GPU training, apologies, we don't support it currently, but please feel free to open a pull request if you are able to make this enhancement. I would start by looking into adding PyTorch Lightning's default distributed-training functionality via a flag here, but we are using a slightly outdated PL version, so this might break things. Please feel free to create a PR if that works on your end.
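For anyone attempting this, a minimal sketch of what wiring such a flag might look like. The flag name `--gpus` and the Trainer keyword arguments here are assumptions for illustration: older PL releases enabled DDP via `gpus=`/`accelerator='ddp'` (or `distributed_backend`), newer ones use `accelerator`/`devices`, so check the exact API for the pinned PL version before opening a PR.

```python
import argparse

def build_trainer_kwargs(argv):
    """Translate a hypothetical --gpus flag into PyTorch Lightning
    Trainer keyword arguments (names assumed; verify against your
    PL version before use)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--gpus", type=int, default=1,
                        help="number of GPUs to train on (assumed flag)")
    args = parser.parse_args(argv)
    kwargs = {"gpus": args.gpus}
    if args.gpus > 1:
        # Older PL releases selected distributed data parallel this way.
        kwargs["accelerator"] = "ddp"
    return kwargs

# Usage: pl.Trainer(**build_trainer_kwargs(["--gpus", "2"]), ...)
```

The point of isolating this in one helper is that the PL API churned a lot across versions; keeping the flag-to-kwargs mapping in one place makes it easy to adapt to whatever PL version the repo pins.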
Thank you for the answer.
For multi-GPU training, I got your point.
In fact, I have a question about the ShAPO project. The ShAPO model is trained on the pickle files output by prepare_data/distributed_generate_data.py, unlike the CenterSnap models. If I want to change the batch size, is it enough to change configs/net_config.txt and run net_train.py? I thought prepare_data/distributed_generate_data.py would have to be run again, so I hesitated to do that. But for CenterSnap, as you mentioned, I think changing configs/net_config.txt and running net_train.py will be enough.
Thanks!
You can change the batch size in configs/net_config.txt for both CenterSnap and ShAPO. We load individual pickle files in both models, and irrespective of how the data was generated, you can train with any batch size by changing it in the respective config file.
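To illustrate why the batch size is decoupled from data generation: each datapoint lives in its own pickle file, and the loader simply decides how many files to group per batch at training time. A toy sketch (the file naming and grouping here are illustrative, not the repo's actual loader):

```python
# Toy sketch: datapoints are individual files; the batch size only
# controls how many of them the loader groups at a time, so the
# on-disk data never has to be regenerated.
datapoint_files = [f"datapoint_{i:05d}.pickle" for i in range(100)]

def batches(files, batch_size):
    """Yield successive groups of `batch_size` datapoint files."""
    for start in range(0, len(files), batch_size):
        yield files[start:start + batch_size]

# The same 100 files support any batch size.
assert len(list(batches(datapoint_files, 32))) == 4  # 32 + 32 + 32 + 4
assert len(list(batches(datapoint_files, 16))) == 7  # 16 * 6 + 4
```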
Wow! Thank you for the meaningful advice! I have a few questions.
Thank you!
Awesome, great to know that lower batch size works for you.
We do indeed store RGB, depth (for input) and poses and masks (for supervision only) in these datapoint pickle files. It is a compact way to store all information we need for training.
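A minimal sketch of what such a datapoint pickle might look like. The key names and toy values below are assumptions for illustration, not the repo's exact schema:

```python
import os
import pickle
import tempfile

# Hypothetical datapoint: inputs (rgb, depth) plus supervision (poses, masks).
# Real datapoints would hold numpy arrays; plain lists stand in here.
datapoint = {
    "rgb":   [[0, 0, 0]],   # stand-in for an HxWx3 image
    "depth": [[0.0]],       # stand-in for an HxW depth map
    "poses": [[1.0, 0.0]],  # stand-in for per-object pose parameters
    "masks": [[1]],         # stand-in for instance masks
}

path = os.path.join(tempfile.mkdtemp(), "datapoint_00000.pickle")
with open(path, "wb") as f:
    pickle.dump(datapoint, f)

with open(path, "rb") as f:
    loaded = pickle.load(f)
```

Packing everything one training step needs into a single file like this keeps the loader simple: one `pickle.load` per datapoint, no joins across separate image/label directories.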
The difference is described in each paper. For ShAPO we store SDF latent codes and texture latent codes in these datapoint pickles, while for CenterSnap we store pointcloud latent codes. The rest of the information is the same.
Correct, in CenterSnap we do not perform any post-optimization; this is one of the contributions of ShAPO.
Yes, please feel free to play around with the batch size and choose the number that works for you.
RuntimeError: Sizes of tensors must match except in dimension 0. Got 32 and 16 (The offending index is 0)
This error appears during the `Validation sanity check` when I use multi-GPU training. I think it's a bug in generating the data rather than a dimension mismatch, but I can't fix it. Do you have any idea about that?
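For what it's worth, a 32-vs-16 mismatch like this often comes from a short final (or per-GPU) batch being stacked against full-size ones, rather than from the data itself. A quick sketch of how the split arises and why `drop_last=True` on the DataLoader is a common workaround (this is a guess about the cause, not a confirmed fix for this repo):

```python
def batch_sizes(num_samples, batch_size, drop_last=False):
    """Sizes of the batches a loader would produce from num_samples items."""
    full, rem = divmod(num_samples, batch_size)
    sizes = [batch_size] * full
    if rem and not drop_last:
        sizes.append(rem)  # the short batch that can trip fixed-size stacking
    return sizes

# e.g. 48 validation samples with batch size 32 -> one full batch, one short:
assert batch_sizes(48, 32) == [32, 16]
# drop_last discards the remainder, keeping every batch the same size:
assert batch_sizes(48, 32, drop_last=True) == [32]
```

If any code path assumes a fixed batch dimension of 32 (e.g. preallocated buffers or a reshape), the trailing batch of 16 would raise exactly this kind of size error.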