yahoo / maaf

Modality-Agnostic Attention Fusion for visual search with text feedback

Purpose of the token 'inadditiontothat' #2

Closed BrandonHanx closed 3 years ago

BrandonHanx commented 3 years ago

@emdodds Hi all. Thanks for your work.

I wonder what the purpose is of concatenating the two captions with 'inadditiontothat' (lines 159-162 of the file below). Why not generate two caption pairs instead (i.e. image with caption_1 as well as image with caption_2)?

https://github.com/yahoo/maaf/blob/e7a49f9b7f3bc2aae7f2fb394993522ccac966ab/datasets/fashioniq.py#L154-L171

Same question for test_queries.
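
To make the question concrete, here is a rough sketch of the two alternatives I have in mind (all names and data shapes here are illustrative, not taken from `datasets/fashioniq.py`):

```python
# Rough sketch of the two alternatives; function names and data shapes
# are illustrative only, not taken from datasets/fashioniq.py.

SPECIAL_TOKEN = "inadditiontothat"

def concatenated_query(image_id, captions):
    """What the linked code appears to do: join both captions with the
    special token, yielding a single query per image."""
    return [(image_id, f" {SPECIAL_TOKEN} ".join(captions))]

def separate_queries(image_id, captions):
    """The alternative I am asking about: one (image, caption) pair per
    caption, doubling the number of queries."""
    return [(image_id, c) for c in captions]

print(concatenated_query("B0001", ["is blue", "has longer sleeves"]))
# [('B0001', 'is blue inadditiontothat has longer sleeves')]
print(separate_queries("B0001", ["is blue", "has longer sleeves"]))
# [('B0001', 'is blue'), ('B0001', 'has longer sleeves')]
```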

Thanks in advance.

emdodds commented 3 years ago

Thanks for the question. At test time this dataset provides two captions per query, and we use both to get a better result. There are other ways to do this, but our approach was to concatenate the captions, separated by a special token (we inherited "inadditiontothat" from another codebase; it could just as well have been called anything you like). We then wanted to align the training data with the test data by also concatenating the training captions. We also tried sometimes using only one caption during training, but this seemed to hurt test set performance relative to always using both captions.
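
As a minimal illustration of why a single run-together token works as a separator (this assumes whitespace tokenization, which may not match the repo's actual tokenizer in every detail):

```python
# Minimal sketch, assuming whitespace tokenization; the repo's tokenizer
# may differ in detail.
caption_1 = "is shorter and more colorful"
caption_2 = "has thinner straps"

query = f"{caption_1} inadditiontothat {caption_2}"
print(query.split())
# ['is', 'shorter', 'and', 'more', 'colorful', 'inadditiontothat',
#  'has', 'thinner', 'straps']
# The separator survives as a single vocabulary item marking the
# boundary between the two captions.
```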

BrandonHanx commented 3 years ago

Thanks for your reply. That totally makes sense.

However, to the best of my knowledge, this experimental setting differs from previous work (e.g. VAL, and the ensemble models in the competition). Since you don't seem to mention this in your paper, isn't that unfair for the comparison experiments (i.e. Table 3)?

Please point out if I am wrong.

emdodds commented 3 years ago

Our evaluation protocol for the Fashion IQ dataset is consistent with the Fashion IQ competition; this repo and the associated paper extend our 4th-place submission to the 2019 ICCV workshop competition. I can't say whether other groups have handled the dual test captions the same way we did, since I haven't seen this detail addressed in papers. I suspect this approach has at least been common, though, since we built on the starter code for loading the data that the TIRG authors provided with their model.

I agree that our efforts to make fair comparisons with other methods are imperfect. From a quick look at the released code for VAL it looks like they may have handled the dual captions differently, and I agree that difference confounds the comparison between the aspects of our methods that the two papers emphasize. Chen et al. may also have used the dataset's attribute labels, which we did not; that would further confound the comparison. There are likely other differences which would require more thorough investigation to untangle.

I'm reasonably confident in our direct comparisons to TIRG, since we ran those experiments ourselves in the same codebase. The other comparisons should be taken with the same skepticism as most such tables in machine learning papers. If you are thinking of using our method or VAL for a real-world application I don't think the difference in performance between the two is likely to be significant.

Thanks again for your question, and please do follow up if there's anything else.

BrandonHanx commented 3 years ago

Thanks so much for your detailed reply.

I have recently become interested in the text-guided image retrieval task and find it understudied compared with similar tasks (like person ReID). I totally agree with your statement that MAAF and VAL are unlikely to differ much in performance in a real-world scenario. Purely from a research point of view, though, I feel this task needs more detailed and uniform evaluation criteria.

Thanks again for your reply, and I will follow your future work. :)