weihaosky opened this issue 2 years ago (status: Open)
Hey, may I ask how long the training took with the 8 GPUs?
@MaximilianKummeth About 2 days
Hi, did you try the phase 1 as mentioned in the readme? How will it affect the training performance?
Yes, I trained phase 1 exactly as instructed in the README. I think phase 1 is important.
Thank you for your kind reply. Just wondering, did you run into the errors below during phase 1?
| WARNING | models.neucon_network:compute_loss:242 - target: no valid voxel when computing loss
...
/python3.7/site-packages/torch/nn/functional.py", line 2114, in _verify_batch_size
raise ValueError("Expected more than 1 value per channel when training, got input size {}".format(size))
ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 32])
I am using an RTX 3090, and it seems I can only get training running for phase 2, not phase 1...
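The ValueError in the traceback above is PyTorch's generic batch-norm check, not something specific to this repository: a BatchNorm layer in training mode cannot compute per-channel statistics from a single sample. A minimal sketch reproducing it (the layer width of 32 matches the torch.Size([1, 32]) in the log; the GroupNorm workaround is a general suggestion, not the repo's actual fix):

```python
import torch
import torch.nn as nn

# A BatchNorm1d layer in training mode raises the exact ValueError from
# the log when the batch contains a single sample, because per-channel
# batch statistics cannot be computed from one value.
bn = nn.BatchNorm1d(32)
bn.train()
try:
    bn(torch.randn(1, 32))  # batch of 1 -> torch.Size([1, 32]) as in the log
except ValueError as e:
    print(e)

# Common workarounds: increase the per-GPU batch size, skip such batches,
# or use a normalization that is independent of batch size, e.g. GroupNorm:
gn = nn.GroupNorm(num_groups=4, num_channels=32)
out = gn(torch.randn(1, 32))  # works fine with a single sample
print(out.shape)  # torch.Size([1, 32])
```

Which workaround is appropriate here depends on why phase 1 ends up with degenerate batches (the "no valid voxel" warning suggests some samples contribute no supervised voxels at all).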
I forgot. But it looks familiar.
I cannot get the reported results either. @weihaosky my results are similar to yours. Has anyone figured out the reason?
Hi, I tried to train the model from scratch, but I cannot reproduce the results in the paper. I trained the network on 8 GTX 2080 Ti GPUs with a batch size of 1 per GPU, then tested the models after 25, 30, 35, 40, 45, 50, 60, and 70 epochs. The best model only reaches abs_rel=0.068 and fscore=0.482, which is far from the results in your paper.
Also, the results from your released model (47 epochs) are:

AbsRel 0.065 | AbsDiff 0.099 | SqRel 0.038 | RMSE 0.197 | LogRMSE 0.113
r1 0.932 | r2 0.961 | r3 0.975
complete 0.892 | dist1 0.053 | dist2 0.135 | prec 0.687 | recall 0.471 | fscore 0.557

which matches what @ZuoJiaxing reported in #53, but differs from the paper.
May I ask about your training setup, e.g. the batch size, the number of GPUs, and the learning rate? Many thanks!
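One setup detail worth checking when comparing runs like the ones above: with a batch size of 1 per GPU, plain BatchNorm statistics are computed per device and are very noisy (or, for BatchNorm1d, fail outright as in the earlier traceback). A standard PyTorch remedy in multi-GPU DDP training is SyncBatchNorm, which pools statistics across all processes. This is a general sketch with a hypothetical toy model, not this repository's actual training code:

```python
import torch.nn as nn

# Hypothetical toy model standing in for the real network; the
# conversion call below is standard PyTorch.
model = nn.Sequential(
    nn.Linear(64, 32),
    nn.BatchNorm1d(32),
    nn.ReLU(),
)

# convert_sync_batchnorm replaces every BatchNorm layer in the module
# tree with SyncBatchNorm. Under DistributedDataParallel, SyncBatchNorm
# aggregates batch statistics across all processes, so 8 GPUs with a
# per-GPU batch size of 1 normalize with an effective batch of 8.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
print(type(model[1]).__name__)  # SyncBatchNorm
```

Whether the released model was trained with synchronized batch norm (and with what effective batch size and learning rate) is exactly the kind of detail that could explain the gap, so it would be useful if the authors confirmed it.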