zlai0 / MAST

MAST: A Memory-Augmented Self-supervised Tracker (CVPR 2020)
https://zlai0.github.io/MAST/

Question about aggregating labels in your paper #2

Closed HowieMa closed 4 years ago

HowieMa commented 4 years ago

Hi, thank you for sharing such an awesome project. I have some questions about the details of your paper.

In step 4 of Algorithm 1 (Section 3.3.1), you say "the output pixel's labels is determined by aggregating the labels of the ROI pixels". My question is how you handle the ROIs from several frames.

Do you use an algorithm similar to STM? For example, if the query has size H × W × C (C is the number of channels) and the size of your restricted attention in each of the T previous keys is P, do you calculate an affinity matrix of size (HW × TPP)? Or do you do the same thing as cycle-consistency, that is, a K-NN strategy that averages all previous predictions?

I would really appreciate it if you could address my questions. Looking forward to your official code. Thank you!

zlai0 commented 4 years ago

Hi, sorry for the late reply. That's exactly right. We use an algorithm similar to STM. All attention areas are treated as one single large area (instead of T areas), and the attention weights over that area should add up to 1 after the softmax.
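To make the "one single large area" idea concrete, here is a minimal sketch for a single query pixel. It assumes a dot-product affinity and plain Python lists; the function name `joint_softmax_aggregate` and its arguments are hypothetical and not taken from the MAST code.

```python
import math

def joint_softmax_aggregate(query, roi_keys, roi_labels):
    """Aggregate labels for one query pixel over T reference frames.

    query:      feature vector of length C
    roi_keys:   T lists, each holding the key vectors inside that frame's ROI
    roi_labels: T lists matching roi_keys, each entry a per-class label vector
    (All names here are illustrative assumptions.)
    """
    # Flatten the T ROIs into one candidate set instead of T separate ones.
    keys = [k for frame in roi_keys for k in frame]
    labels = [l for frame in roi_labels for l in frame]

    # Dot-product affinity between the query and every candidate key (assumed).
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]

    # One softmax over all candidates, so the weights sum to 1 across frames.
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # The output label is the weighted sum of the candidate labels.
    n_classes = len(labels[0])
    out = [sum(w * l[c] for w, l in zip(weights, labels)) for c in range(n_classes)]
    return out, weights

# Toy example: T = 2 frames, 2 ROI pixels per frame, C = 2, 2 classes.
query = [1.0, 0.0]
roi_keys = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
roi_labels = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
label, weights = joint_softmax_aggregate(query, roi_keys, roi_labels)
```

For a full frame this corresponds to one row of the HW × TPP affinity matrix per query pixel. The alternative of normalizing each frame's ROI separately and then averaging would give every frame equal total weight, which is not what the joint softmax does.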

HowieMa commented 4 years ago

Thank you for the detailed answer; your answer and code have inspired me a lot! May I ask another question? It seems that your training code only covers the first training step, described in your paper as "we first pretrain the network with a pair of input frames". Could you please share the code for "finetuning the model with multiple reference frames"? Looking forward to your reply, thanks!

zlai0 commented 4 years ago

Thanks for the interest. That's right - I did not add the code because we recently found that the finetuning step is not strictly necessary and the performance gain from it seems marginal. I was not sure whether I should overcomplicate the code. But yes, I could share the fine-tuning code. It should be pretty simple to implement as well.

houhouhouhou11 commented 3 years ago

@zlai0 Could you share the code for "finetuning the model with multiple reference frames"? When I train with multiple reference frames, the performance is similar to that of a single frame. Thanks