Train on custom dataset pose metrics stuck at 0.0

matteobarato commented 1 year ago

Problem Description

I am encountering an issue while training PVNet on a custom dataset. I created the dataset with BlenderProc and adapted it to the required format. After approximately 20-30 epochs of training, I get an Average Precision (AP) score of 1.0, but all other metrics remain stuck at 0.0. I have tested this with both my own dataset and the custom dataset provided in the Readme, and the issue persists in both cases.

I have tried to address this by training for more epochs (e.g., 350), adjusting the learning rate, and experimenting with and without data augmentation, but the metrics remain unchanged. Strangely, when I train PVNet on the Linemod dataset, all the scores increase as expected after 20-30 epochs and do not stay at zero.

Furthermore, when visualizing the bounding boxes, I can see that the correct bounding boxes are drawn for my custom dataset and the dataset provided in the Readme. Similar visualizations are observed when I run visualize_train.

I suspect that there might be a bug in the training or evaluation code for custom datasets, or potentially in the dataset preprocessing step using python run.py --type custom.
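One check that may help narrow this down is whether the pose metrics are being compared against a threshold in the wrong units. This is only a hypothesis, not a confirmed cause. The sketch below assumes the common ADD criterion (mean distance between model points under the predicted and ground-truth poses, accepted when below 10% of the object diameter). The function names, the toy model, and the diameter values are all hypothetical and are not taken from the PVNet code.

```python
import math

def transform(points, R, t):
    """Apply rotation R (3x3 nested lists) and translation t (length 3) to each 3D point."""
    return [
        tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))
        for p in points
    ]

def add_error(points, R_pred, t_pred, R_gt, t_gt):
    """Mean distance between model points under the predicted and ground-truth poses."""
    pred = transform(points, R_pred, t_pred)
    gt = transform(points, R_gt, t_gt)
    return sum(math.dist(a, b) for a, b in zip(pred, gt)) / len(points)

def add_correct(error, diameter, fraction=0.1):
    """Assumed ADD acceptance rule: error below a fraction of the object diameter."""
    return error < fraction * diameter

IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Hypothetical toy model with coordinates in millimetres.
model_mm = [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0), (0.0, 100.0, 0.0), (0.0, 0.0, 100.0)]

# The predicted pose is off by 2 mm along x, which is a good prediction.
err = add_error(model_mm, IDENTITY, [2.0, 0.0, 0.0], IDENTITY, [0.0, 0.0, 0.0])

print(add_correct(err, diameter=141.4))   # diameter in mm, same units as the model
print(add_correct(err, diameter=0.1414))  # diameter in metres, mismatched units
```

With consistent units the 2 mm error passes; with the diameter given in metres the same pose is rejected, so every sample would count as a failure even though the drawn boxes look correct. Comparing the units of the model file, the pose translations, and the stored diameter would confirm or rule this out.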

Steps to Reproduce

To reproduce the issue:

1. Download the custom dataset provided in the Readme.
2. Start training PVNet on this custom dataset.
3. Monitor the metrics during training, specifically noting the AP score and the other metrics.
4. Observe that AP reaches 1.0 while the other metrics remain at 0.0 after 20-30 epochs.

Expected Behavior

I expect the metrics for all objects to increase and not remain stuck at 0.0, similar to the behavior observed when training on the Linemod dataset.

Please let me know if there is any additional information or logs required to diagnose and resolve this issue.

matteobarato commented 1 year ago

These are some example plots obtained with the visualize_train function on the custom dataset

assia855 commented 1 year ago

Hi @matteobarato, did you solve this issue? I have the same problem: after some epochs I start getting NaN losses ("vote_loss: nan seg_loss: nan loss: nan"), and the same happens when running on my own dataset. I tested the first saved weights and got a box, but not on my object; with the later weights I got no detection at all.
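To find the epoch and term where the losses first go bad, a small guard like the following could be called on each logged step. It is a minimal sketch: the key names come from the log line quoted above, but the helper name and the idea that the losses are available as a plain dict of floats are assumptions, not part of the PVNet training code.

```python
import math

def first_non_finite(losses):
    """Return the name of the first loss term that is NaN or infinite, else None."""
    for name, value in losses.items():
        if not math.isfinite(value):
            return name
    return None

# Hypothetical logged values for one training step.
step_losses = {"vote_loss": float("nan"), "seg_loss": 0.3, "loss": float("nan")}

bad = first_non_finite(step_losses)
if bad is not None:
    print(f"non-finite loss term: {bad}")
```

Stopping at the first non-finite step makes it possible to inspect the batch that triggered it (for example an empty mask or invalid keypoint annotations) before the NaN spreads into the saved weights.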

monajalal commented 12 months ago

@matteobarato how did you solve this issue?

zju3dv / clean-pvnet