Doubts about the hyperparameter args.box_xyxy

yangli18 / VLTVG

Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning, CVPR 2022


Doubts about the hyperparameter args.box_xyxy #5

Closed lmc8133 closed 2 years ago

lmc8133 commented 2 years ago

Thanks for your excellent work and for releasing your code! I noticed that if args.box_xyxy is not set to True, the code treats the boxes as cxcywh when calculating the GIoU loss and converts them to xyxy: https://github.com/yangli18/VLTVG/blob/e5be26b6154f333ebae41227304b23a1724a84d2/models/VLTVG.py#L119-L121 However, when the dataset is built, the target box is always in xyxy format: https://github.com/yangli18/VLTVG/blob/e5be26b6154f333ebae41227304b23a1724a84d2/datasets/dataset.py#L111-L116 Is this a problem?

yangli18 commented 2 years ago

@lmc8133 The dataset's preprocessing procedure will convert the target box from xyxy format to cxcywh format. https://github.com/yangli18/VLTVG/blob/e5be26b6154f333ebae41227304b23a1724a84d2/datasets/transforms.py#L276 The prediction results of the model are also in cxcywh format.
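The two conversions discussed here can be sketched in plain Python. This is a minimal illustration rather than the repository's code: the function names follow the DETR convention, the boxes are plain tuples instead of tensors, and any normalization by image size that the preprocessing may also apply is left out.

```python
def box_xyxy_to_cxcywh(box):
    """Convert corner format (x0, y0, x1, y1) to center format (cx, cy, w, h)."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2, x1 - x0, y1 - y0)


def box_cxcywh_to_xyxy(box):
    """Convert center format (cx, cy, w, h) back to corner format (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


# Illustrative values only: a target box as the dataset would build it (xyxy).
target_xyxy = (10.0, 20.0, 50.0, 80.0)

# Preprocessing step: xyxy -> cxcywh, the format the model predicts in.
target_cxcywh = box_xyxy_to_cxcywh(target_xyxy)

# Loss step: cxcywh -> xyxy again, since GIoU is computed on corners.
restored = box_cxcywh_to_xyxy(target_cxcywh)
```

Because the two functions are inverses, a target built in xyxy format passes through preprocessing as cxcywh and arrives at the GIoU computation as xyxy again.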

lmc8133 commented 2 years ago

Thank you for your quick reply. I have another question: is the target box format always cxcywh? If I set args.box_xyxy to True, then the target boxes and the predictions should both be in xyxy format, right? Where is the relevant code?

Finally, one more question: does the box format affect the model's performance? Are there any ablation studies? If it makes no difference, we could use xyxy target boxes directly and drop line 276 in VLTVG/datasets/transforms.py and lines 119-121 in VLTVG/models/VLTVG.py.

Thanks!

yangli18 commented 2 years ago

@lmc8133

We always set args.box_xyxy=False in our implementation. Actually, args.box_xyxy is an argument that we added very early on, and we haven't experimented much with args.box_xyxy=True😂. You can give it a try.
We do not have any related ablation studies, but we guess that using the xyxy format directly might not work well. Our visual feature extractor is taken from the pre-trained DETR, which uses the cxcywh format during training. To be consistent with it, we also use the cxcywh format.
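For anyone who wants to try the xyxy path, the sketch below shows how a flag like args.box_xyxy could gate the conversion before the GIoU loss. It is a hypothetical, pure-Python illustration and not the code in models/VLTVG.py: `giou_loss`, `generalized_iou` and the `box_xyxy` parameter are names chosen for this example, and boxes are assumed to have positive area.

```python
def box_cxcywh_to_xyxy(box):
    """Convert (cx, cy, w, h) to (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def generalized_iou(a, b):
    """GIoU of two xyxy boxes; both are assumed to have positive area."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    # Intersection rectangle (zero if the boxes do not overlap).
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    # Smallest axis-aligned box enclosing both inputs.
    hull = (max(ax1, bx1) - min(ax0, bx0)) * (max(ay1, by1) - min(ay0, by0))
    return inter / union - (hull - union) / hull


def giou_loss(pred, target, box_xyxy=False):
    """Hypothetical loss: convert to xyxy only when boxes arrive as cxcywh."""
    if not box_xyxy:
        pred = box_cxcywh_to_xyxy(pred)
        target = box_cxcywh_to_xyxy(target)
    return 1.0 - generalized_iou(pred, target)


# The same box, expressed in each format (illustrative values).
box_c = (4.0, 4.0, 4.0, 4.0)   # cxcywh
box_x = (2.0, 2.0, 6.0, 6.0)   # xyxy

loss_cxcywh = giou_loss(box_c, box_c, box_xyxy=False)
loss_xyxy = giou_loss(box_x, box_x, box_xyxy=True)
# Mixing formats: an xyxy target read as cxcywh no longer matches the prediction.
loss_mismatch = giou_loss(box_c, box_x, box_xyxy=False)
```

Whichever setting is used, the predictions and the targets have to agree on the format. With the flag on, the dataset-side conversion to cxcywh would need to be skipped as well, otherwise the loss compares boxes in two different formats.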

lmc8133 commented 2 years ago

@yangli18 Thank you so much for your answer, I get it.