piperino11 opened this issue 5 years ago
For each epoch the training caption will change. It samples 1 of the 20 captions every time an item is fetched from the video dataset; you can check the dataloader.py file.
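For illustration, here is a minimal sketch of what such random caption sampling in a PyTorch `Dataset.__getitem__` can look like (the class and field names are assumptions for the example, not the exact ones from dataloader.py):

```python
import random
import torch
from torch.utils.data import Dataset

class VideoDataset(Dataset):
    """Illustrative sketch: one item per video, with a randomly sampled caption."""

    def __init__(self, video_features, captions):
        # video_features: list of feature tensors, one per video (assumed in memory)
        # captions: list of lists; captions[i] holds the ~20 tokenized captions of video i
        self.video_features = video_features
        self.captions = captions

    def __len__(self):
        return len(self.video_features)

    def __getitem__(self, idx):
        feats = self.video_features[idx]
        # Each call picks one of the reference captions at random, so the
        # caption paired with a given video changes from epoch to epoch.
        cap = random.choice(self.captions[idx])
        return feats, torch.tensor(cap, dtype=torch.long)
```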
Hey @chongkewu, hope you are doing well. I have a query; hope you have an answer. For each video we have 20 reference captions, so from your answer above what I understand is that for every epoch it will randomly select one caption from the available 20. Isn't it?
Yes, that is correct
Thank you @chongkewu. Do you think the model will be trained sufficiently this way?
For the challenge I think it is enough. A video has many candidate captions, and the model just needs to output one sentence.
@chongkewu thank you so much for your instant replies. I will try some new approaches and will let you know about the performance.
@chongkewu After selecting the caption randomly, do we train the model in the following way:

| X1 | X2 (text sequence) | y (word) |
|-------|---------------------------------------------|----------|
| image | startseq | little |
| image | startseq, little | girl |
| image | startseq, little, girl | running |
| image | startseq, little, girl, running | in |
| image | startseq, little, girl, running, in | field |
| image | startseq, little, girl, running, in, field | endseq |

Or do we just pass the image and the whole caption to the model directly?
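For what it's worth, the two views usually coincide under teacher forcing: the whole caption is passed in at once, and the per-step losses against the caption shifted by one token reproduce exactly the expanded rows above. A minimal sketch, where `model`, its call signature, and the padding index are assumptions for the example:

```python
import torch
import torch.nn as nn

# Assumption: padding token id is 0.
criterion = nn.CrossEntropyLoss(ignore_index=0)

def training_step(model, video_feats, caption):
    # caption: (batch, T) token ids, starting with <startseq>, ending with <endseq>
    inputs = caption[:, :-1]    # startseq, little, girl, ..., field
    targets = caption[:, 1:]    # little, girl, ..., field, endseq
    # One forward pass over the whole caption; each output position is
    # trained to predict the next token, which is exactly the expanded table.
    logits = model(video_feats, inputs)  # (batch, T-1, vocab_size), assumed signature
    loss = criterion(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    return loss
```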
The MSR-VTT dataset has 10,000 videos and 20 captions per video, but in this implementation only one video-caption pair per video is used in the training phase, so there are at most 10,000 examples per epoch. Has anyone else noticed this? Has anyone changed the code?
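One common way to use all 20 references per epoch is to index the dataset over (video, caption) pairs instead of over videos, which yields roughly 200,000 training examples. A minimal sketch, assuming in-memory feature and caption lists (the class and field names are illustrative, not from this repo):

```python
from torch.utils.data import Dataset

class VideoCaptionPairDataset(Dataset):
    """Illustrative sketch: one item per (video, caption) pair."""

    def __init__(self, video_features, captions):
        self.video_features = video_features
        self.captions = captions
        # Flatten into explicit (video_idx, caption_idx) pairs:
        # 10000 videos x 20 captions -> ~200000 examples per epoch.
        self.pairs = [(v, c)
                      for v in range(len(captions))
                      for c in range(len(captions[v]))]

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        v, c = self.pairs[idx]
        return self.video_features[v], self.captions[v][c]
```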