Closed: tpzou closed this issue 2 years ago
Hi @tpzou,
thanks for catching this, it is indeed a bug that results from legacy issues (in earlier versions, ME (MinkowskiEngine) used the last column to store the batch indices).
This bug means that our ego motion loss was applied to points that always lie on a specific line (the line is different for each element in the batch, because the batch index was treated as a coordinate). Theoretically, our formulation of the ego loss would even allow us to use random 3D points, since the points are transformed with both the GT and the estimated transformation parameters. Therefore, this bug should not have a significant effect on performance and, if anything, fixing it should improve the results.
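For illustration, here is a minimal sketch of the slicing issue. The coordinate layout `[batch_idx, x, y, z]`, the variable names, and the toy `ego_loss` below are placeholders following the newer MinkowskiEngine convention, not the exact code in loss.py:

```python
import torch

# Sparse-tensor coordinates in the newer ME convention: [batch_idx, x, y, z].
coords = torch.tensor([[0, 1.0, 2.0, 3.0],   # batch 0, point (1, 2, 3)
                       [0, 4.0, 5.0, 6.0],
                       [1, 7.0, 8.0, 9.0]])  # batch 1, point (7, 8, 9)

# Buggy slice: keeps the batch index and drops z.
xyz_buggy = coords[:, :3]    # columns [batch_idx, x, y]

# Fixed slice: the actual spatial coordinates.
xyz_fixed = coords[:, 1:4]   # columns [x, y, z]

def ego_loss(xyz, T_est, T_gt):
    """Toy ego-motion consistency term: transform the same points with the
    estimated and GT rigid transforms (4x4 homogeneous) and compare them."""
    homo = torch.cat([xyz, torch.ones(len(xyz), 1)], dim=1)  # Nx4
    p_est = (T_est @ homo.T).T[:, :3]
    p_gt = (T_gt @ homo.T).T[:, :3]
    return (p_est - p_gt).norm(dim=1).mean()

print(ego_loss(xyz_fixed, torch.eye(4), torch.eye(4)))  # 0.0 for identical transforms
```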
I have uploaded an updated loss.py, but will leave this issue open and update it once I have the results of further analysis (retraining the model).
Thank you again,
Best,
Zan
Hi @tpzou,
the model is still training, but after fixing this bug, both the ego_loss and the total loss are lower on the training and validation data. I will update the results once the model has fully converged...
Hi @zgojcic, do you have a plan to support distributed multi-GPU training? I tried to use spawn and DistributedDataParallel, but something seems to be wrong with my code.
Hi @tpzou,
we have never tried that, but I can have a quick look in the coming days/weeks. Otherwise, you can simply increase the accumulation iterations parameter if you would like a larger effective batch size.
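As a rough sketch of what gradient accumulation looks like (the model, optimizer, loader, and the `acc_iter_size` name below are toy placeholders, not the repository's actual config keys):

```python
import torch

# Toy stand-ins so the sketch is self-contained; in the repository these
# would be the actual model, optimizer, and data loader.
model = torch.nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
train_loader = [(torch.randn(8, 3), torch.randn(8, 1)) for _ in range(16)]

acc_iter_size = 4  # effective batch size = loader batch size * acc_iter_size

optimizer.zero_grad()
for it, (x, y) in enumerate(train_loader):
    # Scale the loss so the accumulated gradient matches one large batch.
    loss = torch.nn.functional.mse_loss(model(x), y) / acc_iter_size
    loss.backward()                      # gradients accumulate across iterations
    if (it + 1) % acc_iter_size == 0:
        optimizer.step()                 # one update per acc_iter_size batches
        optimizer.zero_grad()
```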
Hi @zgojcic, I am building my code on top of yours and have another question: since the chamfer_distance of the foreground loss concatenates the foreground points of all batch elements, could that lead to wrong matches across the batch?
Sorry for the extremely late response, I somehow missed this post. Indeed, this will affect some cases, and one should iterate over the batch instead. I think the difference will be small, but I will update the model and post the results here.
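A minimal sketch of iterating the chamfer distance over the batch elements (the function and variable names are illustrative, not the actual ones in loss.py):

```python
import torch

def chamfer_distance(p, q):
    """Symmetric chamfer distance between two point sets (Nx3 and Mx3)."""
    d = torch.cdist(p, q)                          # NxM pairwise distances
    return d.min(dim=1)[0].mean() + d.min(dim=0)[0].mean()

def batched_chamfer(points_s, batch_s, points_t, batch_t):
    """Average chamfer distance computed per batch element, so points are
    only matched within the same element (no spurious cross-batch matches)."""
    losses = []
    for b in torch.unique(batch_s):
        p = points_s[batch_s == b]
        q = points_t[batch_t == b]
        if len(p) and len(q):
            losses.append(chamfer_distance(p, q))
    return torch.stack(losses).mean()
```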
Hi,
we have fixed the bug in the chamfer_distance computation to properly handle the batched data.
I have retrained a model after the fix, but the performance is basically the same (see the loss curves below), so to avoid confusion we will keep the old model as the default one (should you need the newly trained model, simply let me know).
Thank you again for spotting this bug.
(Figure: loss curves. Orange is the old run and blue is the run after fixing the chamfer_distance bug.)
Hi @zgojcic, is there a mistake at loss.py line 83? The first column of p_s_temp[mask_temp, :3] is the batch id... why should it be transformed?