wjun0830 / CGDETR

Official pytorch repository for CG-DETR "Correlation-guided Query-Dependency Calibration in Video Representation Learning for Temporal Grounding"
https://arxiv.org/abs/2311.08835

Questions about some details? #5

Closed jianhua2022 closed 10 months ago

jianhua2022 commented 10 months ago

Hi, thank you for your great work. I have a question about the span_label normalization. In the training phase, the span_label seems to be normalized by the video feature length: windows = torch.Tensor(windows) / (ctx_l * self.clip_len) # normalized windows in xx; while in the inference phase: spans = span_cxw_to_xx(spans) * meta["duration"], spans = torch.clamp(spans, 0, meta["duration"]). I am confused by this implementation. In my experiments, when I normalize the span_label by the video duration instead, the performance drops. My other question is about self.clip_len: I can't understand its function. Could you explain it?

Thanks again!

wjun0830 commented 10 months ago

For normalization, it is a convention to normalize by the video feature length (we followed previous works).

clip_len exists to handle the different feature frame rates of other datasets. If clip_len is 1, each clip feature covers 1 second (1 FPS datasets); if clip_len is set to 2, each feature covers 2 seconds (datasets with 0.5 FPS).
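
To make the role of clip_len concrete, here is a minimal sketch of the two steps quoted in the question, written in plain Python rather than PyTorch. The helper names normalize_windows and denormalize_spans are hypothetical and are not taken from the repository; the center/width to start/end conversion (span_cxw_to_xx) is left out, and the sketch assumes ctx_l * clip_len is the number of seconds covered by the clip features.

```python
def normalize_windows(windows, ctx_l, clip_len):
    """Scale [start, end] windows (in seconds) to [0, 1] by the feature length.

    ctx_l    : number of clip features for the video
    clip_len : seconds covered by one clip feature (assumption for this sketch)
    """
    covered_seconds = ctx_l * clip_len
    return [[st / covered_seconds, ed / covered_seconds] for st, ed in windows]


def denormalize_spans(spans, duration):
    """Map normalized [start, end] spans back to seconds, clamped to [0, duration]."""
    def clamp(x):
        return min(max(x, 0.0), duration)
    return [[clamp(st * duration), clamp(ed * duration)] for st, ed in spans]


# Same 150-second video, same ground-truth window, two feature rates.
window = [[30.0, 60.0]]

# 1 FPS features: 150 features, each covering 1 second.
norm_1fps = normalize_windows(window, ctx_l=150, clip_len=1)

# 0.5 FPS features: 75 features, each covering 2 seconds.
norm_half_fps = normalize_windows(window, ctx_l=75, clip_len=2)

print(norm_1fps, norm_half_fps)

# At inference, normalized spans are scaled by the video duration and clamped.
print(denormalize_spans(norm_1fps, duration=150.0))
```

In this sketch both feature rates give the same normalized window, because ctx_l * clip_len is 150 seconds in both cases. When the duration in the metadata is not an exact multiple of clip_len, ctx_l * clip_len and the duration can differ slightly, which may be one reason normalizing by duration behaves differently from normalizing by feature length.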