Closed jianhua2022 closed 10 months ago
For normalization, it is conventional to normalize by the video feature length (we followed previous works).
clip_len exists to handle datasets whose features are extracted at different FPS. A clip length of 1 handles 1 FPS datasets, and a clip length of 2 handles datasets at 0.5 FPS.
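To make the role of clip_len concrete, here is a minimal sketch of the training-side normalization in plain Python. It assumes that ctx_l is the number of feature vectors for a video and that each feature covers clip_len seconds, so ctx_l * clip_len is the time span covered by the features. The function name normalize_windows is hypothetical and is not taken from the repository.

```python
def normalize_windows(windows, ctx_l, clip_len):
    """Scale [start, end] windows (in seconds) into [0, 1].

    Assumption: ctx_l feature vectors, each covering clip_len seconds,
    so the features span ctx_l * clip_len seconds in total.
    """
    covered_seconds = ctx_l * clip_len
    return [[start / covered_seconds, end / covered_seconds]
            for start, end in windows]


# 75 features at 0.5 FPS (clip_len = 2) cover 150 seconds.
normalized = normalize_windows([[30, 60]], ctx_l=75, clip_len=2)
print(normalized)
```

With clip_len = 1 the same 75 features would cover only 75 seconds, which is why the value has to match the rate at which the features were extracted.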
Hi, thank you for your great work. I have a question about the span_label normalization. In the training phase, the span_label seems to be normalized with the video feature length:
windows = torch.Tensor(windows) / (ctx_l * self.clip_len) # normalized windows in xx
; while in the inference phase: spans = span_cxw_to_xx(spans) * meta["duration"], spans = torch.clamp(spans, 0, meta["duration"])
. I am confused by this implementation. In my experiments, I tried normalizing the span_label with the video duration instead, and the performance dropped. Another question is about self.clip_len: I can't understand its function. Could you explain it? Thanks again!
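For reference, the inference-side step quoted above can be sketched in plain Python as follows. This is an illustration only: it assumes, from its name, that span_cxw_to_xx converts a (center, width) span into (start, end), and it works on a single tuple rather than the torch tensors used in the quoted lines. The helper denormalize_span is hypothetical.

```python
def span_cxw_to_xx(span):
    """Assumed behavior: convert a (center, width) span to (start, end)."""
    center, width = span
    return (center - 0.5 * width, center + 0.5 * width)


def denormalize_span(span_cxw, duration):
    """Map a normalized (center, width) span to seconds, clamped to the video."""
    start, end = span_cxw_to_xx(span_cxw)

    def to_seconds(t):
        # Scale by the duration, then clamp into [0, duration].
        return min(max(t * duration, 0.0), duration)

    return (to_seconds(start), to_seconds(end))


# A prediction centered at 0.3 with width 0.2 in a 150-second video.
print(denormalize_span((0.3, 0.2), duration=150.0))
```

Note that training divides by ctx_l * self.clip_len while inference multiplies by meta["duration"], so the two only agree exactly when the features cover the full duration.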