yanbeic / VAL

Tensorflow implementation of the paper [CVPR 2020] Image Search with Text Feedback by Visiolinguistic Attention Learning
Apache License 2.0

Is FashionIQ evaluation comparable? #11

Closed · helson73 closed this issue 3 years ago

helson73 commented 3 years ago

In the FashionIQ dataset, there are two captions per triplet. In the paper "The Fashion IQ Dataset: Retrieving Images by Combining Side Information and Relative Natural Language Feedback", the two captions are concatenated with the special symbol "\<and>" and treated as a single text input, so each ref-cap-tgt entry in the json file yields only one pair (see their released source code for details). However, in your evaluation, the file "fashion_iq-val-cap.txt" shows that you treat them as two individual pairs. If the results in your paper follow this evaluation process, I doubt they are comparable with earlier published work.
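To make the difference concrete, here is a rough Python sketch of the two conventions (the field names "candidate", "target", "captions" follow the released FashionIQ annotation json; the file name is only an example):

```python
import json

# Rough sketch only; field names ("candidate", "target", "captions") follow
# the released FashionIQ annotation json, and the file name is an example.
with open("cap.dress.val.json") as f:
    triplets = json.load(f)

# Convention A (FashionIQ paper / challenge / starter kit): the two captions
# are joined into a single text query, so each triplet gives ONE pair.
pairs_concat = [
    (t["candidate"], " <and> ".join(t["captions"]), t["target"])
    for t in triplets
]

# Convention B (this repo's "fashion_iq-val-cap.txt"): each caption is used
# as a separate text query, so each triplet gives TWO pairs.
pairs_split = [
    (t["candidate"], cap, t["target"])
    for t in triplets
    for cap in t["captions"]
]

print(len(pairs_concat), len(pairs_split))  # the second list is twice as long
```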

yanbeic commented 3 years ago

Hi, in this paper one caption is used as the text input in both training and evaluation. This follows the same spirit as the previous CVPR19 paper and the FashionIQ paper. Since training and evaluation are consistent, this setup should be reasonable.

Note that the original FashionIQ paper does not report results for the exact task considered in this paper. Our reported results are obtained by re-implementing prior methods under the same setup, so the evaluation in our paper is comparable.

helson73 commented 3 years ago

Sorry, I confused the task in their paper ("The fashionIQ dataset : ...") with their FashionIQ challenge.

If all evaluation results in your table were obtained by yourselves and evaluated under the same protocol, then they are indeed comparable; my mistake for confusing the two.

However, the team that released the FashionIQ dataset also organized a contest called the "FashionIQ challenge", and I believe you cite results from that contest in your paper as "unpublished SOTA". (Although those results were not published at major journals or conferences, their reports are available.)

The "FashionIQ challenge" is the exact same task you did in your work, except for one thing, the evaluation protocol.

The "FashionIQ challenge" treats two captions together as one text input, as well as all participants of FashionIQ challenge 2019, 2020 follow the same rule. On the official fashionIQ dataset, they released a "starter-kit" for evaluating baseline (TIRG) on fashionIQ dataset, in these "starter-kit", they follows the same rule, treat two captions together as one input.

From the perspective of someone approaching this task for the first time, they will usually visit the official website first and then most likely use the starter kit as their starting point. If they have also read your paper, they may well assume that the task in your paper is the same as the FashionIQ challenge task.

I understand your concern about consistency with other works (the CVPR19 paper, for instance), but in this case an official challenge for this specific dataset already exists. You did the same task and mentioned it in your paper (not in the table, but mentioned nonetheless), yet used a different evaluation protocol without stating this directly, which may cause confusion.

If possible, I suggest adding a notice about this difference, at least in the readme file :)

helson73 commented 3 years ago

I have another question about the evaluation result table in your paper.

The paper shows a large gap between the result reported for the original TIRG implementation and the result of your re-implementation on the FashionIQ dataset. I am wondering what causes this. More specifically, what is the difference between the original TIRG and your re-implemented version?

yanbeic commented 3 years ago

The difference is the backbone network. ResNet-50 is used on FashionIQ in this paper.
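For readers comparing numbers, here is a minimal tf.keras sketch of what a ResNet-50 image encoder looks like; this is illustrative only and not the model code of this repo, and the 512-dimensional output embedding and 224x224 input resolution are assumptions.

```python
import tensorflow as tf

def build_image_encoder(embed_dim=512):
    # ImageNet-pretrained ResNet-50 backbone with global average pooling.
    backbone = tf.keras.applications.ResNet50(
        include_top=False, weights="imagenet", pooling="avg")
    image = tf.keras.Input(shape=(224, 224, 3))
    features = backbone(image)                               # (batch, 2048)
    embedding = tf.keras.layers.Dense(embed_dim)(features)   # project to joint space
    return tf.keras.Model(image, embedding)

encoder = build_image_encoder()
print(encoder.output_shape)  # (None, 512)
```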