wiesenfa / challengeR

GNU General Public License v2.0

Multi-Task - Ranking consensus #32

Open ReubenDo opened 3 years ago

ReubenDo commented 3 years ago

Hello,

Thank you for your excellent work and nice implementation!

I am currently using the rank-then-aggregate scheme. The output of aggregateThenRank is the mean ranking for each task for each team. As far as I understood, the consensus function is used to merge the mean rankings by 1/ computing the ranking of the mean ranking score associated with each task, and 2/ averaging these new rankings. However, you could also directly average the mean ranking scores.

For example, let us consider two teams, A and B, and two tasks, T1 and T2. Let us assume that the mean rankings for A and B are respectively 1.2 and 1.8 on task T1, and 1.6 and 1.4 on task T2. In the current implementation, it seems that the consensus function uses the ranking of the mean ranking, i.e., 1 and 2 on T1 and 2 and 1 on T2, leading to a final ranking score of 1.5 for both teams. However, previous challenges (e.g., BraTS) seem to directly average the mean rankings.
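To make the difference concrete, here is a small Python sketch (not using challengeR itself; the team names and mean rankings are taken from the example above) that computes both aggregation routes:

```python
# Values taken from the worked example above: mean rankings per task.
mean_ranks = {
    "T1": {"A": 1.2, "B": 1.8},
    "T2": {"A": 1.6, "B": 1.4},
}
teams = ["A", "B"]

def rank_of(scores):
    """Competition ranks from scores (1 = best, i.e. lowest mean rank)."""
    ordered = sorted(scores, key=scores.get)
    return {team: i + 1 for i, team in enumerate(ordered)}

# Approach 1 (current implementation): rank the per-task mean ranks,
# then average those integer ranks across tasks.
per_task_ranks = {task: rank_of(s) for task, s in mean_ranks.items()}
rank_then_average = {
    team: sum(per_task_ranks[task][team] for task in mean_ranks) / len(mean_ranks)
    for team in teams
}

# Approach 2 (e.g. BraTS): average the mean ranking scores directly.
average_directly = {
    team: sum(mean_ranks[task][team] for task in mean_ranks) / len(mean_ranks)
    for team in teams
}

print({t: round(v, 3) for t, v in rank_then_average.items()})  # {'A': 1.5, 'B': 1.5}
print({t: round(v, 3) for t, v in average_directly.items()})   # {'A': 1.4, 'B': 1.6}
```

As described, the first route ties both teams at 1.5, while the second declares A the winner at 1.4.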

The ranking scheme followed during BraTS 2017 and 2018 comprised the ranking of each team relative to its competitors for each of the testing subjects, for each evaluated region (i.e., AT, TC, WT), and for each measure (i.e., Dice and Hausdorff (95%)). For example, in BraTS 2018, each team was ranked for 191 subjects, for 3 regions, and for 2 metrics, which resulted in 1146 individual rankings. The final ranking score (FRS) for each team was then calculated by first averaging across all these individual rankings for each patient (i.e., the cumulative rank), and then averaging these cumulative ranks across all patients for each participating team.
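That scheme can be sketched conceptually in Python. Everything here is fabricated for illustration (made-up team names, random scores, and far fewer subjects than the 191 of BraTS 2018); region labels follow the list quoted above:

```python
import random
random.seed(0)

teams = ["team1", "team2", "team3"]          # hypothetical teams
n_subjects = 5                               # BraTS 2018 used 191
regions = ["AT", "TC", "WT"]                 # labels as listed above
metrics = ["Dice", "HD95"]                   # higher Dice is better, lower HD95 is better

# Fabricated scores purely for illustration: scores[(subject, region, metric, team)]
scores = {
    (s, r, m, t): random.random()
    for s in range(n_subjects) for r in regions for m in metrics for t in teams
}

def ranks(values, higher_better):
    """Competition ranks (1 = best) for one subject/region/metric cell."""
    ordered = sorted(values, key=values.get, reverse=higher_better)
    return {t: i + 1 for i, t in enumerate(ordered)}

cumulative = {t: [] for t in teams}
for s in range(n_subjects):
    # One individual ranking per region x metric for this subject.
    case_ranks = []
    for r in regions:
        for m in metrics:
            vals = {t: scores[(s, r, m, t)] for t in teams}
            case_ranks.append(ranks(vals, higher_better=(m == "Dice")))
    # Cumulative rank for this patient: mean over the 6 individual rankings.
    for t in teams:
        cumulative[t].append(sum(cr[t] for cr in case_ranks) / len(case_ranks))

# Final ranking score (FRS): mean cumulative rank across patients.
frs = {t: sum(cumulative[t]) / n_subjects for t in teams}
print(frs)
```

So the BraTS route keeps the fractional mean ranks all the way through, rather than re-ranking at an intermediate step.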

This approach would give a different ranking: A and B would have ranking scores of 1.4 and 1.6, respectively, and thus A would be the winner.

I think that both approaches are valid, but I was wondering if there was a specific reason why you chose to average the rankings of the mean rankings instead of averaging the mean rankings directly.

Cheers, Reuben

wiesenfa commented 3 years ago

Dear @ReubenDo , Sorry for the late reply and thanks for your interest in our work.

The output of aggregateThenRank is the mean ranking for each task for each team.

To avoid confusion: you are referring to rankThenAggregate.

Yes, you are right: in the case of rankThenAggregate, both approaches are valid. The reason for the current implementation is mainly that it is more general: it works no matter whether rank-then-aggregate or aggregate-then-rank was used before, and it follows the theory of consensus rankings. Note that the consensus ranking is not necessarily obtained from the mean ranks across tasks; in general, you try to obtain the ranking that is most similar (smallest distance) to the ranking lists of all tasks (using the mean is a special case).
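To illustrate the "smallest distance" view, here is a conceptual Python sketch (not challengeR's actual consensus machinery; the per-task rankings are invented) of a brute-force Kemeny-style consensus, i.e., the ranking that minimizes the total Kendall distance to all task rankings:

```python
from itertools import permutations

# Invented per-task rankings for three algorithms (1 = best), for illustration only.
task_rankings = [
    {"A": 1, "B": 2, "C": 3},   # task 1
    {"A": 2, "B": 1, "C": 3},   # task 2
    {"A": 1, "B": 3, "C": 2},   # task 3
]
algorithms = ["A", "B", "C"]

def kendall_distance(r1, r2):
    """Number of algorithm pairs ordered differently in the two rankings."""
    d = 0
    for i, a in enumerate(algorithms):
        for b in algorithms[i + 1:]:
            if (r1[a] - r1[b]) * (r2[a] - r2[b]) < 0:
                d += 1
    return d

# Brute-force Kemeny consensus: among all candidate rankings, pick the one
# with the smallest total Kendall distance to the task rankings.
best = min(
    (dict(zip(perm, range(1, len(algorithms) + 1))) for perm in permutations(algorithms)),
    key=lambda cand: sum(kendall_distance(cand, r) for r in task_rankings),
)
print(best)  # {'A': 1, 'B': 2, 'C': 3}
```

With mean ranks as the aggregation, this search has a closed form; the distance-minimization formulation is the general case it specializes.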

With respect to your example, you might argue that the implementation loses precision by not using the aggregated ranks (but essentially rounding them to integers). On the other hand, you might also argue that each algorithm is best in one of the tasks, so each of them won one task. The goal of a consensus ranking would then be to summarize this information (each algorithm best in one task), and you don't want to over-interpret the small difference in mean ranks.

I guess this is also a question of whether you want to "force" a winner no matter how small the difference between the algorithms is, or whether you only want to distinguish algorithms if one is clearly better (a similar question arises in significance ranking, where there will often be multiple winners if performance is close and the number of cases is small). In the end, I would argue that we should not only look at the final (consensus) ranks, because they might be oversimplified, but also look at the distributions of performances and discuss them. And this is actually exactly what you did with your example: thinking about what has led to a particular ranking and being critical about the final results. Our tool intends to give some assistance for this.

Note that in the case of rankThenAggregate with median ranks instead of mean ranks (which might also be a popular choice), you will always (at least in the default handling of ties) get integer median ranks, and the described ambiguity in the consensus ranking would not arise.

So thank you for your comment and this clarification, you are absolutely right. I appreciate such discussions!

Best wishes, Manuel