wzzheng / TPVFormer

[CVPR 2023] An academic alternative to Tesla's occupancy network for autonomous driving.
https://wzzheng.net/TPVFormer/
Apache License 2.0

Question about the pretrained model #18

Open promesse opened 1 year ago

promesse commented 1 year ago

Thanks for your great work! I want to test the performance with a ResNet34 backbone, but in your code there is no branch for the case where no checkpoint file is provided. Can I just modify the code like this? If that works, how many epochs of training do you recommend? [screenshot]

huang-yh commented 1 year ago

It is okay to comment out these lines directly, i.e., use random initialization. Alternatively, you can use the ImageNet-pretrained weights of ResNet34 from torchvision.
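
For illustration, here is a minimal sketch of the torchvision route, not taken from the TPVFormer codebase: it loads the ImageNet-pretrained ResNet34 weights and copies whatever matches into a backbone, leaving the rest randomly initialized. The variable `img_backbone` stands in for the model's image encoder and is hypothetical.

```python
import torchvision

# Stand-in for the model's image backbone (hypothetical; in TPVFormer the
# backbone sits inside the full model and may use mmcv-style parameter names).
img_backbone = torchvision.models.resnet34(weights=None)

# ImageNet-pretrained reference model (torchvision >= 0.13 weights API;
# older versions use pretrained=True instead).
pretrained = torchvision.models.resnet34(
    weights=torchvision.models.ResNet34_Weights.IMAGENET1K_V1
)

# strict=False leaves unmatched parameters (e.g. the classification head,
# or renamed keys) randomly initialized, in line with the suggestion above.
missing, unexpected = img_backbone.load_state_dict(
    pretrained.state_dict(), strict=False
)
print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")
```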

promesse commented 1 year ago

> It is okay to comment out these lines directly, i.e., use random initialization. Alternatively, you can use the ImageNet-pretrained weights of ResNet34 from torchvision.

Thanks for the reply. By the way, I used your code to train for 24 epochs on 6 GPUs, and the final results did not match the paper; the results are shown in the figure, and I would like to know what might have caused this. In the log, which indicators in the paper do pts and vox correspond to? [screenshot]

huang-yh commented 1 year ago

Hi, could you provide the config file and shell command you used to produce the result? Given the large gap between the pts and vox metrics, I think you might be trying to reproduce the result for the occupancy prediction task. In that case, these mIoUs are quite normal. Since the occupancy prediction task also has to account for the background (i.e. empty space) in addition to the foreground (i.e. semantic points), it cannot simply predict the empty space as semantic foreground classes to boost mIoU, as the lidar segmentation task does. In fact, we have not yet reported the mIoU of the model trained for the occupancy prediction task in the paper on arXiv; the numbers are 26.84% and 52.06% for the pts and vox metrics, respectively (quite close to yours).

About the performance, both the image backbone and the batch size can affect mIoU in your case. For the pts mIoU, we obtain point labels by interpolating the TPV planes at the continuous locations of the points. For the vox mIoU, we assign point labels according to which voxels the points fall into.

By the way, our model zoo is under construction, and we will release it as soon as possible.
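
As a rough illustration of the two label assignment schemes described above (not the repository's actual evaluation code, which samples the TPV planes rather than a dense volume), suppose `logits` is a class-score volume of shape (C, X, Y, Z) and `pts` holds point coordinates in voxel units with shape (N, 3); both names are hypothetical.

```python
import torch
import torch.nn.functional as F

def vox_labels(logits: torch.Tensor, pts: torch.Tensor) -> torch.Tensor:
    """'vox'-style: each point inherits the prediction of the voxel it falls into."""
    idx = pts.long()
    for d in range(3):  # clamp to valid voxel indices
        idx[:, d] = idx[:, d].clamp(0, logits.shape[d + 1] - 1)
    return logits[:, idx[:, 0], idx[:, 1], idx[:, 2]].argmax(0)

def pts_labels(logits: torch.Tensor, pts: torch.Tensor) -> torch.Tensor:
    """'pts'-style: interpolate the predictions at the continuous point locations."""
    size = torch.tensor(logits.shape[1:], dtype=pts.dtype) - 1
    grid = (pts / size * 2 - 1).flip(-1)   # normalize to [-1, 1]; (x, y, z) -> (z, y, x)
    grid = grid[None, None, None]          # shape (1, 1, 1, N, 3) for grid_sample
    sampled = F.grid_sample(logits[None], grid, align_corners=True)
    return sampled[0, :, 0, 0].argmax(0)   # (C, N) -> (N,)
```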

promesse commented 1 year ago

Yes, I trained the model on the occupancy prediction task branch; the config file is attached. So do I understand correctly that the current code does not support training for the occupancy prediction task, and you will update the code later? In addition, the current code does not support a batch size larger than one, because the number of input points can differ between samples in a batch, but this is not a big problem. ^_^ 20230225_201144.zip
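
On the variable point count issue mentioned above, one common workaround is a custom collate function that stacks the fixed-shape inputs but keeps the ragged point clouds as a plain list. This is a generic sketch, not code from the TPVFormer repository, and the field names are hypothetical.

```python
import torch
from torch.utils.data import DataLoader

def collate_variable_points(batch):
    """batch: list of dicts with 'imgs' (fixed shape) and 'points' (N_i, 4)."""
    return {
        "imgs": torch.stack([sample["imgs"] for sample in batch]),  # stackable tensors
        "points": [sample["points"] for sample in batch],           # ragged, keep as list
    }

# loader = DataLoader(dataset, batch_size=4, collate_fn=collate_variable_points)
```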

huang-yh commented 1 year ago

Sorry for the confusion. The current code does support training for the occupancy prediction task, but we have not reported the mIoU of the occupancy model in our paper. In our model zoo, we will try out more design choices, e.g. image resolution, image backbone, and TPV resolution, and report their memory consumption and latency in addition to mIoU. And thank you for pointing out the batch size problem. ^_^

promesse commented 1 year ago

What I'm curious about are the semantic scene completion results on the SemanticKITTI validation set mentioned in your paper. How can I reproduce them?

huang-yh commented 1 year ago

Hi, the code for semantic scene completion will be released soon.

vobecant commented 1 year ago

Hi @huang-yh, thank you very much for your work. I have two questions. 1) Similar to @promesse, I evaluated your trained network for voxel occupancy prediction and got the same results that you reported in this thread: 26.84% pts mIoU and 52.06% vox mIoU. However, I see that just passing the point coordinates as floats instead of integers, i.e. keeping the full precision of the point locations, increases the point mIoU from 26.84% to 54.34%. I wonder how this can boost the performance so much if the network was trained using voxels for both the cross-entropy and Lovasz losses. 2) Could this difference in performance be caused by the fact that the network was trained not only on voxel predictions but also on point predictions?

Thank you very much in advance for your reply.
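
For reference, since the discussion revolves around mIoU numbers, here is a minimal sketch of how a point-wise mIoU is typically computed from predicted and ground-truth labels. It is a generic illustration, not the evaluation code used by TPVFormer, and the ignore index is an assumption.

```python
import numpy as np

def miou(pred: np.ndarray, gt: np.ndarray, num_classes: int, ignore: int = 0) -> float:
    """pred, gt: (N,) integer label arrays; points labeled 'ignore' are excluded."""
    mask = gt != ignore
    pred, gt = pred[mask], gt[mask]
    # confusion matrix: rows = ground truth, columns = prediction
    cm = np.bincount(gt * num_classes + pred, minlength=num_classes ** 2)
    cm = cm.reshape(num_classes, num_classes)
    inter = np.diag(cm).astype(float)
    union = cm.sum(0) + cm.sum(1) - np.diag(cm)
    iou = inter / np.maximum(union, 1)
    valid = union > 0                     # average only over classes that appear
    return float(iou[valid].mean())
```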