Model doesn't learn #51

Open SilvesterYu opened 1 year ago

SilvesterYu commented 1 year ago

Hi! Thank you for your wonderful project! I was trying to train the model on the dataset provided in your README. I used the command:

python train.py --phi 0 --weights ../Linemod_and_Occlusion-001/COCO/efficientdet-d0.h5 linemod ../Linemod_and_Occlusion-001/Linemod_preprocessed/ --object-id 4

The evaluation metrics came out as all zeros and nan:

mAP: 0.0000
ADD: 0.0000
ADD-S: 0.0000
5cm_5degree: 0.0000
TranslationErrorMean_in_mm: nan
TranslationErrorStd_in_mm: nan
RotationErrorMean_in_degree: nan
RotationErrorStd_in_degree: nan
2D-Projection: 0.0000
Summed_Translation_Rotation_Error: nan
ADD(-S): 0.0000
AveragePointDistanceMean_in_mm: nan
AveragePointDistanceStd_in_mm: nan
AverageSymmetricPointDistanceMean_in_mm: nan
AverageSymmetricPointDistanceStd_in_mm: nan
MixedAveragePointDistanceMean_in_mm: nan
MixedAveragePointDistanceStd_in_mm: nan

Epoch 00001: ADD improved from -inf to 0.00000, saving model to checkpoints/20_09_2022_18_04_29/object_4/phi_0_linemod_best_ADD.h5
1790/1790 [==============================] - 1376s 769ms/step - loss: nan - classification_loss: 13689.5068 - regression_loss: nan - transformation_loss: 0.0000e+00

I followed issue #23 and the 2D bounding box is working fine. Am I missing something? I would really appreciate some help.
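For anyone triaging a similar run, a minimal sketch of how the nan entries in a progress line like the one above could be picked out automatically. It is not part of the project: the function name `nan_metrics` and the regular expression are assumptions made for illustration, and the sample line is copied from the log in this post.

```python
import math
import re

# Matches "name: value" pairs such as "loss: nan" or "classification_loss: 13689.5068".
_METRIC = re.compile(
    r"(\w+): (nan|-?inf|[-+]?\d+(?:\.\d+)?(?:e[-+]?\d+)?)", re.IGNORECASE
)


def nan_metrics(log_line):
    """Return the names of metrics in a progress line whose value is NaN."""
    bad = []
    for name, raw in _METRIC.findall(log_line):
        if math.isnan(float(raw)):
            bad.append(name)
    return bad


# Progress line taken from the training output in this issue.
line = (
    "1790/1790 [==============================] - 1376s 769ms/step - loss: nan"
    " - classification_loss: 13689.5068 - regression_loss: nan"
    " - transformation_loss: 0.0000e+00"
)
print(nan_metrics(line))
```

On this line the sketch flags loss and regression_loss, which matches what the log shows: the regression branch diverges while the transformation loss stays at zero.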

SilvesterYu commented 1 year ago

Hi, I figured out that this happens because TensorFlow 1.15 requires CUDA 10.0, but I am using a different CUDA version.

I switched to the CPU build of TensorFlow instead, and the problem is resolved. Training is much slower, though.
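A rough sketch of how the installed CUDA toolkit version could be checked before choosing between the GPU and CPU builds. It assumes `nvcc` is on the PATH and that its output contains a "release X.Y" string; the function names are made up for illustration, and `nvcc` reports the toolkit version, which may differ from what the driver supports.

```python
import re
import subprocess


def parse_cuda_release(nvcc_output):
    """Extract the 'release X.Y' version string from `nvcc --version` output."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def installed_cuda_release():
    """Run nvcc and return its release string, or None if nvcc is unavailable."""
    try:
        out = subprocess.check_output(["nvcc", "--version"], universal_newlines=True)
    except (OSError, subprocess.CalledProcessError):
        return None
    return parse_cuda_release(out)


REQUIRED = "10.0"  # the CUDA version named in this thread for TensorFlow 1.15

release = installed_cuda_release()
if release == REQUIRED:
    print("CUDA", release, "found: the GPU build should be usable")
else:
    print("CUDA", release, "found: consider tensorflow-cpu==1.15 instead")
```

If `nvcc` is missing, the sketch reports `None` and falls through to the CPU suggestion, which is the safe choice described in this comment.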

Here is what I did:

(1) Switch to the CPU build of TensorFlow if your CUDA version is not 10.0

pip install tensorflow-cpu==1.15

(2) Fix version errors by pinning h5py and numpy

pip install 'h5py==2.10.0' --force-reinstall
pip install numpy==1.19.5

(3) Passing pretrained weights with --weights also helps, according to issue #23. The COCO weights can be downloaded from here: https://drive.google.com/drive/folders/1n3opeOo2ko9GwC9G5_OkVF1sk0NVd3YY. The folder includes efficientdet-d0.h5, which I use in the command below.


Using object ID 4 as an example:

python3 train.py --phi 0 --weights <path_to_weight>/efficientdet-d0.h5 linemod <path_to_dataset>/Linemod_preprocessed/ --object-id 4

If .decode() errors occur in custom_load_weights.py, comment out the line that calls .decode() and uncomment the line without .decode().
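A common cause of such .decode() errors is that newer h5py releases return string attributes as str, while older ones return bytes, and str has no .decode() method. As an alternative to commenting lines in and out, a small helper could accept both types. This is only a sketch: `as_text` is a hypothetical name, and I have not checked how custom_load_weights.py would need to call it.

```python
def as_text(value, encoding="utf8"):
    """Return value as str, decoding only when it is a bytes object."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Both forms yield the same string, whichever type the HDF5 attribute has.
print(as_text(b"conv2d_1"))
print(as_text("conv2d_1"))
```

With h5py pinned to 2.10.0 as in step (2), the original .decode() lines should work unchanged, so this only matters if the pin cannot be applied.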

BhavinPrajapti commented 1 year ago

@SilvesterYu Hello, I am on Windows 11 and want to run this repository, but TensorFlow 1.15.0 does not support CUDA 11.7, and CUDA 10.2 is not available for Windows 11. Can you please help me with installing the requirements?