Hello, thanks for the new task and baseline. I have read some papers from the top ranks in the VIS competition, and I found that most of them either improved the tracking part or post-processed the outputs of an image instance segmentation model. In contrast, I'd like to try exploiting spatio-temporal features across frames, e.g. with a 3D CNN or feature aggregation. But I have encountered some problems in the implementation.
Is it necessary to first modify the baseline (mmdetection) to support training on multiple frames per video? Otherwise I don't see how to feed in multiple frames and aggregate their features simultaneously, or how to resolve the scale inconsistency across videos.
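To make the question concrete, here is a minimal sketch of the kind of multi-frame forward pass I have in mind. All names here are hypothetical (this is not mmdetection API); the point is only to show per-frame feature extraction with a shared 2D backbone followed by temporal aggregation with a 3D convolution:

```python
import torch
import torch.nn as nn

class MultiFrameAggregator(nn.Module):
    """Hypothetical sketch: shared 2D backbone per frame + 3D conv aggregation."""

    def __init__(self, in_ch=3, feat_ch=16, num_frames=3):
        super().__init__()
        # Stand-in for the detector backbone, applied to each frame independently.
        self.backbone = nn.Conv2d(in_ch, feat_ch, kernel_size=3, padding=1)
        # 3D convolution over the temporal axis collapses num_frames into one map.
        self.temporal = nn.Conv3d(feat_ch, feat_ch, kernel_size=(num_frames, 1, 1))

    def forward(self, clip):
        # clip: (batch, time, channels, height, width)
        b, t, c, h, w = clip.shape
        # Fold time into the batch dim so the 2D backbone is shared across frames.
        feats = self.backbone(clip.reshape(b * t, c, h, w))
        # Restore time and move it next to channels for the 3D conv: (b, F, t, h, w).
        feats = feats.reshape(b, t, -1, h, w).permute(0, 2, 1, 3, 4)
        # Aggregate across frames; temporal dim becomes 1, so squeeze it out.
        return self.temporal(feats).squeeze(2)

model = MultiFrameAggregator()
out = model(torch.randn(2, 3, 3, 32, 32))
print(out.shape)  # torch.Size([2, 16, 32, 32])
```

This is exactly the part I don't know how to wire into the current data pipeline: the dataloader would need to return clips of shape `(T, C, H, W)` per sample rather than single frames.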
Thanks for reading.