zubair-irshad / CenterSnap

Pytorch code for ICRA'22 paper: "Single-Shot Multi-Object 3D Shape Reconstruction and Categorical 6D Pose and Size Estimation"
https://zubair-irshad.github.io/projects/CenterSnap.html

Multiple GPU errors #9

Open yssjglg-elder opened 2 years ago

yssjglg-elder commented 2 years ago

RuntimeError: Sizes of tensors must match except in dimension 0. Got 32 and 16 (The offending index is 0)

This error appears during the 'Validation sanity check'.

The error appears when I train on multiple GPUs. I think it's a bug in how the data is generated rather than a real dimension mismatch, but I can't fix it. Do you have any idea about that?
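For reference, this is the message PyTorch raises when concatenating tensors whose sizes disagree outside the concatenation dimension, e.g. a tensor sized for the full batch of 32 next to one sized for a per-GPU split of 16 (the shapes below are made up, purely to illustrate):

```python
import torch

# Illustrative only: a tensor sized for the full batch (32) alongside one
# sized for a per-GPU split (16) raises the same RuntimeError as above.
a = torch.zeros(2, 32)
b = torch.zeros(2, 16)
torch.cat([a, b], dim=0)  # RuntimeError: Sizes of tensors must match except in dimension 0. Got 32 and 16
```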

zubair-irshad commented 2 years ago

Thanks for your interest in our work. We currently only support single-GPU training with a batch size of 32. Are you able to train fine on a single GPU?

Multi-GPU training is currently not supported in this codebase, since we are able to train in a reasonable time, i.e., around 1-2 days (both synthetic training and real fine-tuning) with 13 GB of GPU memory. Having said this, please feel free to open a pull request or feature request for multi-GPU training. We can try to look into it, but cannot promise this feature will be available soon.

Hope it helps!

yssjglg-elder commented 2 years ago

Ok, I see, thank you for your help.

dongho-Han commented 1 year ago

Thanks for the first issue! I found the same problem in the author's ShAPO project: training has to be run on a GPU with more than 13 GB of memory. Could you look into improving this? Since I only have multiple 12 GB GPUs, I can't train the ShAPO or CenterSnap models. Thanks!

zubair-irshad commented 1 year ago

Hi @dongho-Han,

Are you able to train with a batch size smaller than 32? We were able to fit CenterSnap and ShAPO in 13 GB of memory with a batch size of 32. Let us know if a smaller batch size works for you.

For multi-GPU training, apologies, we don't support it currently. I would start by looking into adding PyTorch Lightning's default distributed-training functionality via a flag here, but we are using a slightly outdated PL version, so this might break things. Please feel free to open a pull request if you get it working on your end.
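As a rough starting point, here is an untested sketch against an older PyTorch Lightning (~1.x) API; argument names changed across versions (e.g. `accelerator="ddp"` became `strategy="ddp"` in PL >= 1.5), so adapt it to the version pinned in this repo:

```python
import pytorch_lightning as pl

# Untested sketch: enabling built-in DistributedDataParallel (DDP) training.
# `max_epochs` is a placeholder; take the real value from configs/net_config.txt.
trainer = pl.Trainer(
    max_epochs=50,
    gpus=2,               # number of GPUs to train on
    accelerator="ddp",    # with DDP, the effective batch size is batch_size * num_gpus
)
```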

dongho-Han commented 1 year ago

Thank you for the answer. For multi-GPU training, I got your point. Actually, I have a question about the ShAPO project. The ShAPO model is trained on the pickle files output by prepare_data/distributed_generate_data.py, unlike the CenterSnap models. If I want to change the batch size, is it enough to change configs/net_config.txt and run net_train.py? I thought prepare_data/distributed_generate_data.py would have to be run again, so I hesitated to do that. For CenterSnap, as you mentioned, I think changing configs/net_config.txt and running net_train.py will be enough. Thanks!

zubair-irshad commented 1 year ago

You can change the batch size in configs/net_config.txt for both CenterSnap and ShAPO. Both models load individual pickle files, and irrespective of how the data is generated, you can train with any batch size by changing it in the respective config file.
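Conceptually, data loading looks something like the sketch below (class and path names are made up, not the repo's actual code). Since each pickle file holds a single datapoint and the DataLoader does the batching, the batch size is decoupled from data generation:

```python
import glob
import pickle
from torch.utils.data import Dataset, DataLoader

class PickleDataset(Dataset):
    """Hypothetical loader: one pickle file per datapoint (not the repo's actual class)."""
    def __init__(self, data_dir):
        self.paths = sorted(glob.glob(f"{data_dir}/*.pickle"))

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        with open(self.paths[idx], "rb") as f:
            return pickle.load(f)  # one datapoint, e.g. a dict of RGB, depth, masks, poses

# Batching happens here, independent of how the pickles were generated,
# so any batch size works without re-running data generation.
loader = DataLoader(PickleDataset("data/train"), batch_size=16, shuffle=True)
```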

dongho-Han commented 1 year ago

Wow! Thank you for the helpful advice! I have a few questions.

  1. As we can see in the training code, training is performed on 'pickle' files rather than the original RGB-D (RGB + depth) images. Is there a reason for that? Can you share the intuition?
  2. You mentioned that CenterSnap and ShAPO both load individual pickle files. Then, are CenterSnap's published datasets (e.g., the Real dataset link) the same as the pickle files output by ShAPO's prepare_data/distributed_generate_data.py? If they are different, can you share the differences?
  3. As you mention in https://github.com/zubair-irshad/shapo/issues/8, ShAPO's uploaded pre-trained model is not the optimized model for evaluation. Does that also apply to CenterSnap (i.e., is it only trained enough for visualization)?
  4. Continuing from question 3, how many epochs are needed to reproduce the performance (e.g., mAP) reported in the papers for CenterSnap and ShAPO? It is not mentioned on the GitHub project pages.
  5. Can I set the batch size to an arbitrary number (e.g., 7, 10, ...) rather than a divisor of 32 (e.g., 8, 16)?

Thank you!

zubair-irshad commented 1 year ago

Awesome, great to know that a lower batch size works for you.

  1. We do indeed store RGB and depth (as input) and poses and masks (for supervision only) in these datapoint pickle files. It is a compact way to store all the information we need for training; see the sketch after this list.

  2. The differences are described in the respective papers. In these datapoint pickles, we store SDF latent codes and texture latent codes for ShAPO, and pointcloud latent codes for CenterSnap; the rest of the information is the same.

  3. Correct: in CenterSnap we do not perform any post-optimization; test-time optimization is one of the contributions of ShAPO.

  5. Yes, please feel free to play around with the batch size and choose the number that works for you.
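To make answers 1 and 2 concrete, here is a rough, hypothetical sketch of what a single datapoint pickle holds; the actual keys and shapes in the repo's data-generation scripts differ:

```python
import pickle
import numpy as np

# Hypothetical datapoint layout (keys and shapes are illustrative, not the repo's).
datapoint = {
    "rgb":   np.zeros((480, 640, 3), dtype=np.uint8),   # input image
    "depth": np.zeros((480, 640), dtype=np.float32),    # input depth map
    "masks": np.zeros((480, 640), dtype=np.uint8),      # supervision only
    "poses": np.zeros((5, 4, 4), dtype=np.float32),     # per-object poses (supervision)
    # CenterSnap additionally stores pointcloud latent codes here, while ShAPO
    # instead stores SDF latent codes and texture latent codes:
    "latent_codes": np.zeros((5, 128), dtype=np.float32),
}
with open("datapoint_000000.pickle", "wb") as f:
    pickle.dump(datapoint, f)
```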