whwu95 / Cap4Video

【CVPR'2023 Highlight & TPAMI】Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?
https://arxiv.org/abs/2301.00184
MIT License
225 stars, 16 forks

Two branches or two losses? #4

HuazhangHu closed this issue 1 year ago.

HuazhangHu commented 1 year ago

In the paper, you mention that "To reduce conflict between the two branches, the query-video branch is trained first, followed by the query-caption branch." However, you also mention that "The total loss L is the sum of Query-Video loss L_{QV} and Query-Caption loss L_{QC}." Are the two branches trained separately? If so, which loss is used when the query-video branch is trained first, and which is used when the query-caption branch is trained afterwards? Also, for how many epochs is the query-video branch trained in the first stage?
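The two quoted statements can be set side by side as objectives. The first line is the total loss as quoted from the paper. The two staged lines are only one possible reading of "trained separately", namely that each stage minimizes its own branch's loss. That reading is an assumption for illustration and is not confirmed anywhere in this thread.

```latex
\begin{aligned}
\text{joint (as quoted):} \quad & \mathcal{L} = \mathcal{L}_{QV} + \mathcal{L}_{QC} \\
\text{staged (assumed), stage 1:} \quad & \min \ \mathcal{L}_{QV} \\
\text{staged (assumed), stage 2:} \quad & \min \ \mathcal{L}_{QC}
\end{aligned}
```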

whwu95 commented 1 year ago

In our paper, the query-video branch and the query-caption branch are trained separately. We first train the query-video branch for 5 epochs. Once this branch is trained, we proceed to train the query-caption branch.
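A minimal sketch of the schedule described in this reply, assuming the stages simply run back to back. `branch_schedule` is a hypothetical helper, not code from the Cap4Video repository. `qv_epochs=5` follows the reply. `qc_epochs` is a placeholder, because the thread does not say how long the query-caption stage runs.

```python
from typing import List


def branch_schedule(qv_epochs: int = 5, qc_epochs: int = 5) -> List[str]:
    """Return which branch is trained in each epoch of a two-stage schedule.

    Hypothetical helper: qv_epochs=5 follows the maintainer's reply, while
    the default qc_epochs is a placeholder (not stated in the thread).
    """
    if qv_epochs < 0 or qc_epochs < 0:
        raise ValueError("epoch counts must be non-negative")
    # Stage 1: the query-video branch is trained first.
    stage1 = ["query-video"] * qv_epochs
    # Stage 2: the query-caption branch is trained after stage 1 finishes.
    stage2 = ["query-caption"] * qc_epochs
    return stage1 + stage2


if __name__ == "__main__":
    for epoch, branch in enumerate(branch_schedule(qv_epochs=5, qc_epochs=3)):
        print(f"epoch {epoch}: train the {branch} branch")
```

This sketch only shows the ordering. It says nothing about which inputs, for example captions, the data loader supplies during each stage.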

shams2023 commented 9 months ago

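Quoting the reply above: "In our paper, the query-video branch and the query-caption branch are trained separately. We first train the query-video branch for 5 epochs. Once this branch is trained, we proceed to train the query-caption branch."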

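I read your code and found that captions are already used in train_video.py. How should I then understand your statement that the first 5 epochs train the query-video branch? (As I understand it, if the first 5 epochs train only the query-video branch, captions should not appear at all. If captions are present, the query encoder also processes caption information, so there would not really be a first stage of five epochs that trains only the query-video branch.) I am not sure whether my understanding is correct. I am quite confused about this part and hope to hear from you.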