Open zhou431496 opened 9 months ago
Thanks for your interest! It would be interesting to test in the wild, but there may exist a domain gap. We use 4 speakers in BEAT instead of 30; we think that using the full data could further unleash the potential of the diffusion model. In this paper, we try to disentangle the semantics and rhythm of gesture and to rhyme the gestures with the zero-shot editing ability of the diffusion model, but our formulation for SAG is still rough.

Here are my superficial views: just as you said, there is noise in the data, while high-quality annotated data like BEAT is hard to acquire. For now, I am not inclined to rely entirely on a latent neural model to capture the semantics. We notice the excellent work QPGesture, which solves this task with motion matching in the phase space. With this technique, maybe we could first robustly retrieve the semantic gestures and then edit them. But how to distill the semantic gestures from data is still difficult. We tried to use anomaly detection to distill semantics, but did not explore it further.
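To make that last idea concrete, here is a minimal sketch (not what the paper implements) of distilling candidate semantic gestures with an off-the-shelf anomaly detector: rhythm-driven beat gestures are treated as the "normal" class, and clips that deviate from them are flagged as semantic candidates. The clip shapes, feature extraction, and `contamination` setting are all illustrative assumptions.

```python
# Hypothetical sketch: flag candidate semantic gestures as anomalies
# among mostly rhythm-driven (beat) gesture clips.
import numpy as np
from sklearn.ensemble import IsolationForest

def clip_features(motion_clips):
    """Per-clip statistics (mean/std of frame-to-frame joint velocities).

    motion_clips: list of arrays shaped (frames, joints * 3).
    """
    feats = []
    for clip in motion_clips:
        vel = np.diff(clip, axis=0)  # frame-to-frame velocity
        feats.append(np.concatenate([vel.mean(axis=0), vel.std(axis=0)]))
    return np.stack(feats)

def distill_semantic_candidates(motion_clips, contamination=0.1):
    """Return indices of clips flagged as anomalous w.r.t. typical beat gestures."""
    feats = clip_features(motion_clips)
    detector = IsolationForest(contamination=contamination, random_state=0)
    labels = detector.fit_predict(feats)  # -1 = anomaly, 1 = inlier
    return np.where(labels == -1)[0]

# Usage with random stand-in data (replace with real BEAT clips):
clips = [np.random.randn(120, 75) for _ in range(200)]
print(distill_semantic_candidates(clips)[:10])
```

A retrieve-then-edit pipeline in the QPGesture style would swap this detector for a nearest-neighbour search in the phase space, then apply diffusion-based editing to the retrieved clips.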
Thank you for your reply; I look forward to more of your great work.
Hello, this work is excellent. I would like to know how well it generalizes: if it is trained only on the BEAT dataset, does it still work well when transferred to the TED dataset? Are the learned semantics robust? As far as I know, people still rely on contrastive pre-training between transcripts and gestures to disentangle semantics, but there is noise between the two. And if semantic guidance is done in the latent space, it is still affected by word segmentation. How should semantic perception be done in the future? Looking forward to your reply.