Closed HuazhangHu closed 1 year ago
The caption encoder and the query encoder share parameters. When training the query-video branch, we optimize the query encoder, the video encoder, and the interaction module. Once that branch is trained, we freeze the query encoder and share its weights with the caption encoder. After that, the only trainable component is the multi-head attention (MHA) module placed after the caption embeddings, which produces the enhanced global caption embedding.
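To make the setup above concrete, here is a minimal PyTorch sketch. It is only an illustration of the described training scheme, not the repository's actual code: the encoder is a placeholder module, and the shapes (C captions, D-dim embeddings) and head count are assumptions. It shows that the frozen shared encoder produces both query and caption embeddings, and that only the MHA over the C caption embeddings carries trainable parameters for QC matching.

```python
import torch
import torch.nn as nn

C, D = 8, 256  # assumed: C captions per video, D-dim embeddings

# Placeholder for the shared query/caption text encoder.
# After the query-video branch is trained, it is frozen.
text_encoder = nn.Linear(D, D)
for p in text_encoder.parameters():
    p.requires_grad = False  # frozen; weights shared by both roles

# The only trainable part for QC matching: MHA over the caption embeddings,
# letting each caption attend to the others before matching.
mha = nn.MultiheadAttention(embed_dim=D, num_heads=4, batch_first=True)

caption_tokens = torch.randn(1, C, D)  # dummy caption features for 1 video
query_tokens = torch.randn(1, 1, D)    # dummy query feature

with torch.no_grad():
    cap_emb = text_encoder(caption_tokens)  # (1, C, D), frozen encoder
    q_emb = text_encoder(query_tokens)      # (1, 1, D)

# Self-attention across the C captions -> context-enhanced caption embeddings.
enhanced_cap, _ = mha(cap_emb, cap_emb, cap_emb)  # (1, C, D)

# QC matching score: similarity between the query and each enhanced caption.
scores = torch.einsum('bqd,bcd->bqc', q_emb, enhanced_cap)  # (1, 1, C)
```

Gradients from the QC matching loss therefore flow only into `mha`; the shared encoder stays fixed.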
I am still confused: if the caption encoder and query encoder share weights, which parameters are actually optimized when computing QC matching? And why do we need to pass the C×D caption embeddings through the MHA module before multiplying them with the query embedding?