z-x-yang / DoraemonGPT

Official repository of DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models
BSD 3-Clause "New" or "Revised" License

Timeline of code release #1

Closed: Ziyang412 closed this issue 5 months ago

Ziyang412 commented 10 months ago

Hi, thanks for your awesome work, and congrats on your ICLR acceptance! I am wondering whether you have clear plans for the code release; it would be awesome if we could try some examples of your work. Thanks a lot!

z-x-yang commented 10 months ago

Hi, thank you for your kind words and support! We appreciate it.

Unfortunately, our work was not accepted by ICLR this time, despite addressing the concerns raised by the reviewers and receiving a high average score. However, we are currently working on resubmitting an improved version. Once the resubmission is complete, we plan to release the code as open source.

Thank you for your interest, and we hope to share our work with you soon!

Ziyang412 commented 10 months ago

Oh no, I saw you guys had a great rebuttal and the scores looked decent. I am so sorry to hear that. Also, thanks a lot for the commitment to open-sourcing! I can't wait to try it out!

sumankwan commented 10 months ago

Hi, great work! I am wondering whether this can be used for "smart" video editing/generation. Once it understands that an AI-generated video isn't coherent, can it plan and execute models/APIs/tools (tied to a database of some sort) to improve it?

z-x-yang commented 9 months ago

> Hi, great work! I am wondering whether this can be used for "smart" video editing/generation. Once it understands that an AI-generated video isn't coherent, can it plan and execute models/APIs/tools (tied to a database of some sort) to improve it?

Thank you for your interest in our work and for posing an excellent question!

Based on our research objectives, our current implementation revolves around the planning flow for processing a single task-related video. However, from an engineering standpoint, it is feasible either to extract symbolic memory for a generated video during the execution process, or to feed the generated video into a new workflow. The former would increase the complexity of the whole system and require further prompt engineering of the LLMs; I believe the latter is simpler and more practical. A rough sketch of the latter idea follows below.
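
To make the second option concrete, here is a minimal, hypothetical sketch of an analyze-plan-regenerate loop. None of these function names come from DoraemonGPT's (unreleased) codebase: `analyze_video`, `plan_edit`, and `regenerate` are stand-ins for a video-understanding agent, an LLM planning step, and a video editing/generation tool, respectively.

```python
# Hypothetical sketch: feed each generated video back into a fresh
# analysis workflow instead of threading it through one long execution.

def analyze_video(path: str) -> dict:
    """Stand-in for a DoraemonGPT-style run: build symbolic memory for the
    video and query it for coherence issues. Returns a verdict and findings."""
    # e.g. prompt the agent with:
    # "List any temporal or spatial inconsistencies in this video."
    return {"coherent": False, "issues": ["object disappears at 00:03"]}

def plan_edit(issues: list[str]) -> str:
    """Stand-in for an LLM planning step that turns the findings into a
    concrete edit/generation instruction."""
    return "Regenerate the affected segment, keeping the object visible."

def regenerate(path: str, instruction: str) -> str:
    """Stand-in for a video generation/editing tool (e.g. a diffusion model)."""
    new_path = path.replace(".mp4", "_v2.mp4")
    # ... call the generator with `instruction` here ...
    return new_path

def refine(path: str, max_rounds: int = 3) -> str:
    """Loop: analyze the current video as a new workflow; if it is judged
    incoherent, plan an edit, regenerate, and re-enter the loop."""
    for _ in range(max_rounds):
        report = analyze_video(path)
        if report["coherent"]:
            break
        path = regenerate(path, plan_edit(report["issues"]))
    return path

if __name__ == "__main__":
    print(refine("generated.mp4"))
```

The key design point is that each round treats the newly generated video as an independent input, so the agent's planning flow for a single video can be reused unchanged, rather than extending its symbolic memory across generations.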