Great work! After reading the paper, I have some questions about HumanVid:

(1) When evaluating a method, why use the 12th frame as the reference image and only consider 72 frames with a stride of 3? DisCo, Animate Anyone, Magic Animate, and Champ usually use the first frame as the reference image and consider all frames. Is there a special purpose behind the HumanVid protocol, or does the previous protocol have a shortcoming?

(2) At what resolution does CamAnimate evaluate? The same as the training resolution (896x512)?

(3) In Table 3, what is the difference between 'landscape' and 'portrait'? Why is the table split into two sub-tables?

(4) Which A100 do you use, 80G or 40G?

By the way, I am waiting for the dataset and code release. Thank you.
Thanks for your interest.

(1) The 12th frame is the middle frame of a 24-frame sequence, not the 12th frame of 72 frames. We use the middle frame instead of the first frame because, although we originally used the first frame, the first frame of some test videos in the TikTok dataset contains no human. Besides, our metric results based on the first frame differ from the results reported in those papers, which suggests their protocols vary from one another. So we want to fix the protocol and use the simplest setting (the middle frame as reference) to compute scores; see the sketch below.

(2) Yes, 896x512 or 512x896. Video diffusion models commonly achieve the best performance when the same resolution is used for training and testing.

(3) 'landscape' and 'portrait' mean landscape videos (width > height) and portrait videos (height > width), respectively. We collect videos of both orientations and model them together with a single unified model. The test set also contains both orientations, hence the two sub-tables.

(4) A100 80G. The Internet video part will hopefully be released next week; the other data and the code will come later.
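To make (1) concrete, here is a minimal sketch of how the evaluation clip and reference frame could be constructed; the function name and the frame-list representation are illustrative, not the repository's actual API:

```python
# Hypothetical sketch of the evaluation protocol from (1): sample the first
# 72 frames with a stride of 3 (24 frames total) and use the middle (12th)
# frame of that 24-frame sequence as the reference image.

def build_eval_clip(video_frames):
    """video_frames: list of decoded frames for one test video."""
    clip = video_frames[:72:3]      # 72 frames, stride 3 -> 24 frames
    assert len(clip) == 24, "test videos are assumed to have >= 72 frames"
    ref_image = clip[11]            # 12th frame (1-based) of the 24-frame clip
    return ref_image, clip
```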
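Similarly for (2) and (3), the test resolution can be chosen to match the training resolution for the video's orientation. This is a sketch under my own naming, not the actual code:

```python
# Hypothetical helper for (2)/(3): pick the inference resolution that matches
# the training resolution for the video's orientation.

def pick_resolution(width, height):
    """Return (target_w, target_h) for inference."""
    if width > height:              # landscape video
        return 896, 512
    return 512, 896                 # portrait (or square) video

# e.g. a 1920x1080 video is landscape -> evaluated at 896x512
```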