Open Wangyupei opened 1 week ago
Hi, sorry for the late reply. The current code uses the GT image as the conditional frame and generates the subsequent video frames from it at inference time. Because those frames are highly correlated with the conditional frame, changing the text prompt has little effect on the attributes described by the text.
Thanks for your great work. However, in our experiments we tried different text prompts following your instructions (the dataset preparation and inference code), but the video generation results are almost the same. Is there anything wrong?