Hello, I thoroughly enjoyed reading your paper, "Object-aware Adaptive-Positivity Learning for Audio-Visual Question Answering."
I am writing to ask about the code released with the paper. I am trying to replicate the reported experiments, but noticed that the visual features and bounding-box features extracted with DETR are not provided. I have attempted to extract these features myself using a pre-trained DETR model and integrated them into your code; however, I am observing a performance gap of approximately 4% relative to the results reported in the paper.
Could you possibly share the visual features and bounding box features you extracted for the experiments in your paper? Additionally, the paper does not specify the backbone used for DETR. Could you clarify whether it is based on ResNet50, ResNet101, or another backbone?
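For reference, here is a sketch of the extraction pipeline I used. It assumes the `facebook/detr-resnet-50` checkpoint from HuggingFace Transformers and takes the decoder's per-query embeddings as object features; the checkpoint, backbone, and choice of feature layer are my guesses, which may be where the discrepancy comes from.

```python
# Sketch of my DETR feature extraction (assumption: facebook/detr-resnet-50
# from HuggingFace Transformers; the paper's actual backbone may differ).
import torch
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection


def cxcywh_to_xyxy(boxes: torch.Tensor) -> torch.Tensor:
    """Convert DETR's normalized (cx, cy, w, h) boxes to (x1, y1, x2, y2)."""
    cx, cy, w, h = boxes.unbind(-1)
    return torch.stack((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2), dim=-1)


def extract_frame_features(image_path: str):
    processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
    model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")
    model.eval()

    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Per-query decoder embeddings, shape (1, num_queries, hidden_dim)
    object_features = outputs.last_hidden_state
    # Predicted boxes, shape (1, num_queries, 4), converted to corner format
    boxes = cxcywh_to_xyxy(outputs.pred_boxes)
    return object_features, boxes
```

If you could confirm whether this matches your setup (in particular the checkpoint and which layer's outputs you saved as object features), that would already help narrow down the gap.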