Next-generation video instance recognition framework built on Detectron2, supporting InstMove (CVPR 2023), SeqFormer (ECCV Oral), and IDOL (ECCV Oral).
Must the number of instances be the same in every frame of a video? If not, how should the boxes be padded for frames in which an instance does not appear?
`targets_for_clip_prediction.append({"labels": torch.stack(clip_classes, dim=0).max(0)[0],
"boxes": torch.stack(clip_boxes, dim=1),  # [num_inst, num_frame, 4]
'masks': torch.stack(clip_masks, dim=1),  # [num_inst, num_frame, H, W]
'size': torch.as_tensor([h, w], dtype=torch.long, device=self.device),
'inst_id': inst_ids,
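
For illustration, here is one possible way to build targets of the shapes noted in the comments above when an instance is missing from some frames. This is only a sketch under stated assumptions, not the repository's actual data pipeline: the helper `pad_clip_targets`, the per-frame dict layout, the `valid` flag, and the use of `-1` as the padding label are all hypothetical. The idea is to fix the instance list for the whole clip, then fill absent (instance, frame) slots with an all-zero box and an empty mask. Padding labels with `-1` would be compatible with taking a max over frames as in the snippet, though that is an inference.

```python
import torch


def pad_clip_targets(frames, inst_ids, h, w):
    """Hypothetical helper: build fixed-size clip targets.

    frames:   list with one dict per frame, mapping inst_id ->
              {"label": int, "box": Tensor[4], "mask": Tensor[h, w]}
    inst_ids: every instance id that appears anywhere in the clip
    """
    num_inst, num_frame = len(inst_ids), len(frames)

    # Start from padding everywhere: zero boxes, empty masks, label -1.
    boxes = torch.zeros(num_inst, num_frame, 4)
    masks = torch.zeros(num_inst, num_frame, h, w, dtype=torch.bool)
    labels = torch.full((num_inst, num_frame), -1, dtype=torch.long)
    valid = torch.zeros(num_inst, num_frame, dtype=torch.bool)

    for t, frame in enumerate(frames):
        for i, inst_id in enumerate(inst_ids):
            ann = frame.get(inst_id)
            if ann is None:
                continue  # instance absent in this frame: keep the padding
            boxes[i, t] = ann["box"]
            masks[i, t] = ann["mask"]
            labels[i, t] = ann["label"]
            valid[i, t] = True

    return {
        # Padded slots hold -1, so the max over frames picks the real class.
        "labels": labels.max(dim=1)[0],
        "boxes": boxes,    # [num_inst, num_frame, 4]
        "masks": masks,    # [num_inst, num_frame, H, W]
        "valid": valid,    # [num_inst, num_frame], False where padded
        "size": torch.as_tensor([h, w], dtype=torch.long),
        "inst_id": torch.as_tensor(inst_ids, dtype=torch.long),
    }


# Usage: instance 9 is visible in frame 0 but missing in frame 1.
h, w = 4, 6
full_mask = torch.ones(h, w, dtype=torch.bool)
frames = [
    {
        7: {"label": 2, "box": torch.tensor([0.5, 0.5, 0.2, 0.2]), "mask": full_mask},
        9: {"label": 5, "box": torch.tensor([0.1, 0.1, 0.1, 0.1]), "mask": full_mask},
    },
    {
        7: {"label": 2, "box": torch.tensor([0.6, 0.5, 0.2, 0.2]), "mask": full_mask},
    },
]
targets = pad_clip_targets(frames, inst_ids=[7, 9], h=h, w=w)
```

With this layout every tensor has the same instance dimension across frames, and a loss function can use `valid` to skip the padded slots instead of regressing toward the zero boxes.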