ylabbe / cosypose

Code for "CosyPose: Consistent multi-view multi-object 6D pose estimation", ECCV 2020.
MIT License

PartialSampler and MultiEpochDataLoader #33

Closed botanyldrm closed 3 years ago

botanyldrm commented 3 years ago

Firstly, thanks for this great repository. I want to understand a few points related to your training strategy.

The first one is the batch-count clipping done in the partial sampler with `self.epoch_size = min(epoch_size, len(ds))`. Why did you need something like that? I think it changes the number of training samples seen in an epoch when training with different numbers of GPUs: in your run_pose_training script you normalize epoch_size by the number of GPUs, so epoch_size will differ with the GPU count while len(ds) will differ with the batch size.
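For reference, the sampler logic in question looks roughly like this (a condensed sketch of the idea rather than a verbatim copy of the repository code):

```python
import torch
from torch.utils.data import Sampler

class PartialSampler(Sampler):
    """Draws at most `epoch_size` random indices from the dataset per epoch."""

    def __init__(self, ds, epoch_size):
        self.n_items = len(ds)
        # The clipping being asked about: one "epoch" can never be longer
        # than a single full pass over the dataset.
        self.epoch_size = min(epoch_size, len(ds))

    def __len__(self):
        return self.epoch_size

    def __iter__(self):
        # A fresh random subset of the dataset each epoch.
        return (i.item() for i in torch.randperm(self.n_items)[:self.epoch_size])
```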

The second one is MultiEpochDataLoader. I do not understand the exact purpose of this class. Why didn't you use the standard torch distributed dataloader directly?

ylabbe commented 3 years ago

Hi,

In my code, an "epoch" does not correspond to the standard definition of an epoch (a full pass over the entire dataset). The epoch_size is actually used to define the number of training iterations in one pseudo-epoch, which in turn sets the interval for logging, etc. I prefer thinking in terms of a number of iterations because it does not depend on the dataset size.
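Concretely, the loop structure looks something like this (a toy sketch with made-up numbers and data, not the repository's training script):

```python
import itertools
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy setup; all names and sizes here are illustrative.
ds = TensorDataset(torch.randn(10000, 3))
loader = DataLoader(ds, batch_size=32, shuffle=True)

iterations_per_epoch = 50  # fixed budget, independent of len(ds)
# Endless stream of batches: restart the loader whenever it runs out.
stream = itertools.chain.from_iterable(iter(loader) for _ in itertools.count())

for epoch in range(3):  # 3 pseudo-epochs
    for _ in range(iterations_per_epoch):
        (batch,) = next(stream)  # the training step would consume `batch`
    print(f"pseudo-epoch {epoch} done: log / evaluate / checkpoint here")
```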

I use MultiEpochDataLoader because torch dataloaders tear down all the worker processes in charge of loading data at the end of each epoch (when the iterator is exhausted). With my dataloader, the data queue keeps being filled and the workers are never deleted, which removes the overhead at the beginning of each epoch and makes training a bit faster.
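A minimal sketch of this pattern, assuming a hypothetical `InfiniteSampler` (this illustrates the idea, not the repository's exact implementation): keep the underlying iterator alive forever, and slice fixed-length pseudo-epochs out of it.

```python
import itertools
import torch
from torch.utils.data import DataLoader, Sampler, TensorDataset

class InfiniteSampler(Sampler):
    """Yields shuffled indices forever, so the DataLoader iterator never
    finishes and its worker processes are never torn down."""

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        while True:
            yield from torch.randperm(self.n).tolist()

class MultiEpochDataLoader:
    """Slices fixed-length pseudo-epochs out of one long-lived iterator."""

    def __init__(self, dataloader, batches_per_epoch):
        self.iterator = iter(dataloader)  # workers start once, here
        self.batches_per_epoch = batches_per_epoch

    def __len__(self):
        return self.batches_per_epoch

    def __iter__(self):
        # islice draws the next N batches without exhausting (or
        # recreating) the underlying iterator.
        return itertools.islice(self.iterator, self.batches_per_epoch)

# Usage (toy data; with num_workers > 0 is where the savings appear):
ds = TensorDataset(torch.randn(1000, 3))
base = DataLoader(ds, batch_size=32, sampler=InfiniteSampler(len(ds)))
loader = MultiEpochDataLoader(base, batches_per_epoch=20)
for epoch in range(3):
    for (batch,) in loader:
        pass  # training step would go here
```

Note that recent PyTorch releases (>= 1.7) expose `persistent_workers=True` on `DataLoader`, which removes the same worker-restart overhead without a custom wrapper.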

botanyldrm commented 3 years ago

Thank you for your quick reply. I have one final question after your reply. As I understand it, since you are not using a distributed dataloader, each GPU's process has its own dataloader, independent of the others, so in one epoch (with your epoch definition) the model sees `Gpu_Number * self.epoch_size` iterations and `Gpu_Number * self.epoch_size * batch_size` samples. Am I correct on this point?

ylabbe commented 3 years ago

Actually, there is a slight imprecision in my previous response: the epoch_size defines the total number of images, not iterations. What happens is that I define the total number of images the network sees in one pseudo-epoch (counting all processes) here. This epoch_size is then divided by the number of GPUs here and given to the current process (1 GPU).
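To make the bookkeeping concrete (hypothetical numbers, not the repository's defaults):

```python
epoch_size = 115200  # total images per pseudo-epoch, across all processes
n_gpus = 4
batch_size = 32      # per-GPU batch size

epoch_size_per_gpu = epoch_size // n_gpus         # 28800 images per process
iters_per_gpu = epoch_size_per_gpu // batch_size  # 900 iterations per process
```

So the total number of images seen per pseudo-epoch stays at epoch_size regardless of the GPU count; only the per-process share changes.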