Question about iterative object box prediction

Hi, thanks for sharing your work. I noticed that in each decoder layer you pass cluster_xyz as the base location instead of the updated base_xyz: https://github.com/zeliu98/Group-Free-3D/blob/ef8b7bb5c3bf5b49b957624595dc6a642b6d0036/models/detector.py#L221-L227

My question is: since each layer uses the box location predicted by the previous layer to produce its spatial encoding, why does each layer predict the offset to the gt box location relative to the initial cluster center rather than the updated center from the previous layer? In other words, why not:

base_xyz, base_size = self.prediction_heads[i](query,
                                               base_xyz=base_xyz,
                                               end_points=end_points,
                                               prefix=prefix)
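
To make the difference concrete, here is a toy sketch of the two schemes as I understand them. It is only an illustration: centers are scalars, and `toy_head`, `decode_fixed_anchor`, and `decode_iterative` are hypothetical names of mine, not functions from the repository.

```python
# Toy sketch: scalar "centers" and a hypothetical head that simply adds a
# residual to whatever base it is given. This is not the repository's code.

def toy_head(residual, base_xyz):
    """Stand-in for a prediction head: predicted center = base + residual."""
    return base_xyz + residual


def decode_fixed_anchor(residuals, cluster_xyz):
    """Every layer offsets the initial cluster center (my reading of the linked lines)."""
    centers = []
    for residual in residuals:
        base_xyz = toy_head(residual, cluster_xyz)
        # base_xyz would still feed the next layer's spatial encoding,
        # but the next head starts again from cluster_xyz.
        centers.append(base_xyz)
    return centers


def decode_iterative(residuals, cluster_xyz):
    """Every layer offsets the center predicted by the previous layer (my proposal)."""
    base_xyz = cluster_xyz
    centers = []
    for residual in residuals:
        base_xyz = toy_head(residual, base_xyz)
        centers.append(base_xyz)
    return centers


residuals = [1.0, 0.5, 0.25]
print(decode_fixed_anchor(residuals, 0.0))  # [1.0, 0.5, 0.25]
print(decode_iterative(residuals, 0.0))     # [1.0, 1.5, 1.75]
```

With a fixed anchor each layer has to predict the full offset from the cluster center, while in the iterative variant each layer only predicts a correction to the previous estimate.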

Also, under your setting, I think the "auxiliary loss" is necessary. Without it, the prediction heads of the first N-1 decoder layers would get no supervision for center_residual, so their updated box predictions, and the spatial encoding passed to the next decoder layer, would be meaningless. Am I correct?
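
Here is a small sketch of what I mean by supervision with and without the auxiliary loss. The L1 center loss and the plain sum over layers are assumptions for illustration, and `center_loss` and `total_center_loss` are hypothetical names, not the repository's loss functions.

```python
# Toy sketch: per-layer center predictions are compared with one gt center.
# The L1 loss and the unweighted sum over layers are illustrative assumptions.

def center_loss(pred_center, gt_center):
    """L1 distance between a predicted and a ground-truth center."""
    return sum(abs(p - g) for p, g in zip(pred_center, gt_center))


def total_center_loss(per_layer_centers, gt_center, use_aux_loss=True):
    """Sum the loss over all layers, or use only the last layer's prediction."""
    losses = [center_loss(c, gt_center) for c in per_layer_centers]
    if use_aux_loss:
        return sum(losses)   # every layer's head gets a direct loss term
    return losses[-1]        # only the final layer's head gets a direct loss term


gt = [0.0, 0.0, 0.0]
centers = [[1.0, 0.0, 0.0], [0.5, 0.0, 0.0], [0.0, 0.0, 0.0]]
print(total_center_loss(centers, gt, use_aux_loss=True))   # 1.5
print(total_center_loss(centers, gt, use_aux_loss=False))  # 0.0
```

In the second case the errors of the first two layers never appear in the loss, which is why I suspect their center_residual outputs would be unconstrained.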

Best, Xuyang

Question about iterative object box prediction #15