First of all, thank you for sharing your work. I've been working with your code recently, modifying a few sections, and I noticed a few things I don't quite understand.
In your paper you state that you used random scaling of 0.9 to 1.1 as augmentation on the ScanNet dataset; however, the augmentation code provided only applies random flipping and rotation. Did I miss the section where the scaling is applied?
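For context, this is roughly the kind of scaling augmentation I expected to find (just a sketch on my side; the function name, argument layouts, and `scale_range` default are my own assumptions, not taken from your repo):

```python
import numpy as np

def random_scale(point_cloud, bboxes=None, scale_range=(0.9, 1.1)):
    """Apply one uniform random scale factor to the whole scene.

    point_cloud: (N, 3+) array whose first three columns are XYZ.
    bboxes: optional (K, 6+) array with centers in columns 0-2 and sizes in 3-5.
    All names and layouts here are assumptions, not from the provided code.
    """
    scale = np.random.uniform(*scale_range)
    point_cloud = point_cloud.copy()
    point_cloud[:, :3] *= scale
    if bboxes is not None:
        bboxes = bboxes.copy()
        bboxes[:, :6] *= scale  # centers and sizes scale by the same factor
    return point_cloud, bboxes
```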
Secondly, I wasn't able to reproduce your results on the ScanNet dataset, coming about 1% short on mAP at both the 0.25 and 0.5 IoU thresholds. I'm wondering whether this may be because I'm training on a single GPU. As far as I understand, you do not sync the batch norm across GPUs, so the smaller per-GPU batch may actually be beneficial to training?
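To test whether the unsynced batch norm explains the gap, I was planning to convert the model to synchronized batch norm in a multi-GPU run, roughly like this (a minimal sketch; `MyDetector` is a stand-in and not your model class, and the synchronization only takes effect once the model runs under `DistributedDataParallel` with an initialized process group):

```python
import torch.nn as nn

class MyDetector(nn.Module):
    """Placeholder model with a BatchNorm layer, standing in for the real network."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv1d(3, 64, 1), nn.BatchNorm1d(64), nn.ReLU())

    def forward(self, x):
        return self.net(x)

model = MyDetector()
# Replaces every BatchNorm1d/2d/3d module with SyncBatchNorm so that
# statistics are shared across GPUs during distributed training.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
```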
And when I looked at the transformer code, I noticed that each attention layer uses 288-dimensional features. Is there a specific reason for choosing this value? It seems quite low to me, and I would have expected a power of 2 to be more in line with most architectures.
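For what it's worth, I did check that 288 divides evenly by common head counts, so my question is only about the choice of 288 over, say, 256 (the head count of 8 below is my assumption and not something I verified in your config):

```python
# 288 is a valid multi-head dimension for common head counts.
d_model, num_heads = 288, 8  # num_heads assumed, not taken from the repo
assert d_model % num_heads == 0
print(d_model // num_heads)  # -> 36 dimensions per head
```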
I would really appreciate your insights on these points.