The model output 'nan' when training on the 3rd dataset.

yewzijian / RPMNet

RPM-Net: Robust Point Matching using Learned Features (CVPR2020)

MIT License


The model output 'nan' when training on the 3rd dataset. #6

Open Gilgamesh666666 opened 3 years ago

Gilgamesh666666 commented 3 years ago

Hi yewzijian:

I am trying to train the model on my own datasets, but the model outputs 'nan' after several epochs of training. The 'nan' always comes from this line (https://github.com/yewzijian/RPMNet/blob/c37e68730ac3493f2954c67c16208e98d21547e2/src/models/feature_nets.py#L197): the 0.weight of the self.prepool module becomes a 'nan' tensor after several epochs. I tried clipping the gradients, but it does not seem to help. I would appreciate any advice you can give.

yewzijian commented 3 years ago

That's weird. prepool is just a simple MLP which shouldn't lead to nan's (unless the weights or inputs are nan's). I'm not sure what might be the cause since the provided code does train stably on the ModelNet40 dataset.
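One way to narrow this down is to check, at every step, whether any input or weight already holds a non-finite value before it reaches prepool. The sketch below is only an illustration in plain Python, not code from the repository: the helper `find_non_finite` and the names in the example dictionary are hypothetical, and in a real training loop the same test would be run on the actual tensors.

```python
import math

def find_non_finite(named_values):
    """Return the names whose values contain NaN or +/-inf."""
    bad = []
    for name, values in named_values.items():
        if any(not math.isfinite(v) for v in values):
            bad.append(name)
    return bad

# Hypothetical flattened values for one training step (names are assumptions).
batch = {
    "points_src": [0.1, -0.4, 0.7],
    "points_ref": [0.2, float("nan"), 0.5],
    "prepool.0.weight": [0.01, 0.02, float("inf")],
}

print(find_non_finite(batch))  # ['points_ref', 'prepool.0.weight']
```

If an input shows up in the list before any weight does, the problem is more likely in the data than in the MLP itself.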

Are your point clouds of similar spatial extents and density? The sampling and grouping layer, i.e. sample_and_group_multi() needs to sample a reasonable number of points to compute the features.
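A quick way to answer this for a custom dataset is to measure the bounding-box extent of each cloud and count how many neighbours fall within the grouping radius. The following is a rough, brute-force sketch under stated assumptions: the functions `extent` and `neighbour_counts` are hypothetical helpers rather than part of RPMNet, and the radius of 0.3 is only an example value that should be replaced with the one used in your configuration.

```python
import math

def extent(points):
    """Axis-aligned bounding-box size (dx, dy, dz) of a point cloud."""
    return tuple(
        max(p[i] for p in points) - min(p[i] for p in points) for i in range(3)
    )

def neighbour_counts(points, radius):
    """Number of other points within `radius` of each point (brute force, O(n^2))."""
    counts = []
    for i, p in enumerate(points):
        n = sum(
            1 for j, q in enumerate(points)
            if j != i and math.dist(p, q) <= radius
        )
        counts.append(n)
    return counts

# Toy cloud: three nearby points and one far-away outlier.
cloud = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0), (5.0, 5.0, 5.0)]

print(extent(cloud))                        # (5.0, 5.0, 5.0)
print(neighbour_counts(cloud, radius=0.3))  # [2, 2, 2, 0]
```

Points with zero or very few neighbours, or extents that differ a lot from those of ModelNet40 clouds, would suggest rescaling or resampling the data before training.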

zhulf0804 commented 3 years ago

Hi @Gilgamesh666666, I encountered the same nan problem. Reducing the learning rate may solve it.