yanglei18 / Mix-Teaching

This is the official implementation of our manuscript "Mix-Teaching: A Simple, Unified and Effective Semi-supervised Learning Framework for Monocular 3D Object Detection".
https://arxiv.org/abs/2207.04448v1
Apache License 2.0

Reproducing your results #5

Open · johannes-tum opened this issue 1 year ago

johannes-tum commented 1 year ago

First of all, thank you very much for your great paper. You found a very good approach for monocular semi-supervised learning!

I am currently trying to build on your work, but at the moment I cannot reproduce your results. For example, I have done extensive experiments for the supervised case with 50% of the training data and started multiple training runs. Here are the results. I probably did at least 5-10 more runs that aren't even shown below (but in none of these additional runs did I exceed the bold line in terms of 3D score).

R11-3D Car 0.7: [results table attached as image]

R40-BEV Car 0.7: [results table attached as image]

R40-3D Car 0.7: [results table attached as image]

In your paper you mention for R40 Car 0.7: [quoted numbers attached as image]

That is why I am wondering: have you accidentally reported R11 results, at least for the 3D case?

Also, the results show that MonoFlex is a highly volatile model. The authors mention on their GitHub page: "Note: we observe an obvious variation of the performance for different runs and we are still investigating possible solutions to stablize the results, though it may inevitably due to the utilized uncertainties." (https://github.com/zhangyp15/MonoFlex). How would you deal with that in a follow-up paper? E.g. do you think it may be better to report mean and standard deviation instead of just reporting a single number in a table like your Table 1?
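(For concreteness, reporting mean and standard deviation over repeated runs could look like the sketch below; the AP values are placeholders, not measured numbers.)

```python
import statistics

# Placeholder AP_3D|R40 scores for Car at IoU 0.7 from repeated runs;
# these numbers are illustrative only, not measured results.
runs = [14.2, 15.1, 13.8, 14.9, 15.4]

mean = statistics.mean(runs)
std = statistics.stdev(runs)  # sample standard deviation over the runs
print(f"AP_3D = {mean:.2f} +/- {std:.2f} ({len(runs)} runs)")
```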

And thank you very much again for your paper.

yanglei18 commented 1 year ago

Regarding reproducing the performance with 50% of the training data: I guess how you select the 50% from the 3712 samples matters. You can try different random splits.

yanglei18 commented 1 year ago

We follow the same pattern as most related papers and just report a single number.

yanglei18 commented 1 year ago

It would be more meaningful to reproduce the results using 100% of the training data.

johannes-tum commented 1 year ago

So, from what I can see, there is no randomness involved when deciding on the subset of data: https://github.com/yanglei18/Mix-Teaching/blob/main/MonoFlex/data/datasets/aug_dataset.py#L58

johannes-tum commented 1 year ago

Can you share the kitti_infos_train.pkl that you have used?

Update: I just checked my kitti_infos_train.pkl. The images are not randomly distributed but fully ordered: 000000, 000003, 000007, ... This suggests to me that there is no randomness involved.
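(A quick way to check this ordering, as a sketch; the key layout of the info entries is an assumption and depends on how the pkl was generated.)

```python
import pickle

with open("kitti_infos_train.pkl", "rb") as f:
    infos = pickle.load(f)

# The key layout is an assumption; some KITTI toolchains store the index
# as info["image_idx"], others as info["image"]["image_idx"].
ids = [info["image"]["image_idx"] for info in infos]
print(ids[:5])
print("fully ordered:", ids == sorted(ids))  # True => no shuffling was applied
```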

yanglei18 commented 1 year ago

Sorry, we can no longer access the old kitti_infos_train.pkl. Besides, random.choice is necessary.

johannes-tum commented 1 year ago

I compared the current code of MonoFlex with your MonoFlex code. I've got a few questions from that:

johannes-tum commented 1 year ago

Two more questions:

destinyls commented 1 year ago
  • Que: You have increased the number of epochs from 100 to 300. From my training runs I don't see a lot of value from that in the first supervised training phase, so why are you doing that? Is it because in later training stages it is necessary to properly make use of all the pseudo labels? If so, could we also increase it in stages to save training time?
  • Rep: Exactly, 300 epochs are only necessary in the semi-supervised training stages, which is also required for the GUPNet model. (A staged-schedule sketch follows below.)
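If one did want to grow the budget stage by stage as suggested, the idea could be expressed as in this sketch (stage names and epoch counts are hypothetical, not the repo's actual configuration):

```python
# Hypothetical staged epoch budget: keep the initial supervised phase at the
# original MonoFlex schedule and reserve the long budget for the
# semi-supervised rounds. Names and values are illustrative only.
STAGE_EPOCHS = {
    "supervised_init": 100,
    "semi_round_1": 200,
    "semi_round_2": 300,
}

for stage, epochs in STAGE_EPOCHS.items():
    print(f"{stage}: train for {epochs} epochs")
```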
destinyls commented 1 year ago
  • Que 1: As far as I understand, you always take the best model weights of the previous run and reinitialize the next phase with those weights. In case you train on trainval and submit to the test server, how do you know which of the 5 models to use?

  • Rep 1: For the evaluation on the test set, we choose the best model on the val set, which may not be a good strategy.

  • Que 2: Do you use the 3D IoU during training?

  • Rep 2: Yes, we use the 3D IoU in the geometry uncertainty.

  • Que 3: If you used a single GPU only, then I wonder why you have integrated SyncBatchNorm. It wasn't there in the original implementation.

  • Rep 3: We use a single GPU only, which is enough for the small-scale KITTI dataset. (See the SyncBatchNorm note below.)
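For reference, converting a model to SyncBatchNorm in PyTorch is a one-liner, and the conversion only changes behavior when training under DistributedDataParallel on multiple GPUs; a minimal sketch with a toy model:

```python
import torch.nn as nn

# Toy model with a BatchNorm layer, just to demonstrate the conversion call.
model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU())

# Replaces every BatchNorm*d layer with SyncBatchNorm. Cross-GPU statistics
# are only synchronized under DistributedDataParallel with an initialized
# process group, so on a single GPU this is effectively ordinary BatchNorm.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
print(model)
```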

johannes-tum commented 1 year ago

"Sorry, we can no longer access the old kitti_infos_train.pkl. Besides, random.choice is necessary."

Where and how should this random.choice be integrated?

yanglei18 commented 1 year ago

First, load kitti_infos_train.pkl to get the list containing 3712 samples, then select 1856 samples from that list with random.choice.
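A minimal sketch of that procedure (file names are taken from the thread; note that random.choice draws a single element, so the sketch uses random.sample to draw 1856 distinct samples without replacement):

```python
import pickle
import random

# Load the full list of 3712 training-sample infos.
with open("kitti_infos_train.pkl", "rb") as f:
    infos = pickle.load(f)

# random.choice picks one element at a time; random.sample draws the whole
# 50% subset at once without replacement, which is presumably the intent.
subset = random.sample(infos, 1856)

# Save the 50% split under a hypothetical file name.
with open("kitti_infos_train_50.pkl", "wb") as f:
    pickle.dump(subset, f)
```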

johannes-tum commented 1 year ago

@yanglei18 Did you apply the random.choice here? https://github.com/yanglei18/Mix-Teaching/blob/main/MonoFlex/data/datasets/aug_dataset.py#L58

johannes-tum commented 1 year ago

Three more questions came up. It would be great if you could help me with them: