yixinL7 / BRIO

ACL 2022: BRIO: Bringing Order to Abstractive Summarization
333 stars 43 forks

Ranking Loss Question #16

Closed griff4692 closed 2 years ago

griff4692 commented 2 years ago

Hi - Thanks for the great code. I've been trying to re-implement BRIO in my HuggingFace fork, but I haven't been able to get it to work.

I'm curious what this line in RankingLoss is doing:

TotalLoss = loss_func(score, score, ones)

One possibility is that I haven't yet included the gold reference as part of the ranking loss, which might explain why the contrastive loss is causing the gold-standard MLE loss to rise too high. I will add that, but I was also curious about the function above. Thank you!!
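For reference, here is a minimal sketch of how I currently understand the MatchSum-style ranking loss (candidate ordering, variable names, and the default margin are my assumptions, not necessarily the repo's exact code):

```python
import torch

def ranking_loss(score, margin=0.01):
    """Pairwise margin ranking loss over candidate scores.

    score: (batch, num_candidates), candidates assumed sorted best-first.
    """
    # Placeholder term: comparing `score` with itself always yields zero,
    # but guarantees TotalLoss is defined even if the loop below is empty.
    ones = torch.ones_like(score)
    total_loss = torch.nn.MarginRankingLoss(0.0)(score, score, ones)

    n = score.size(1)
    for i in range(1, n):
        pos = score[:, :-i]  # higher-ranked candidate in each pair
        neg = score[:, i:]   # candidate i positions lower in the ranking
        loss_func = torch.nn.MarginRankingLoss(margin * i)
        total_loss = total_loss + loss_func(pos, neg, torch.ones_like(pos))
    return total_loss
```

With correctly ordered scores and gaps larger than the margin, the loss is zero; with a reversed ordering it is positive.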

griff4692 commented 2 years ago

I also had a question about

loss_func = torch.nn.MarginRankingLoss(margin * i)

In the paper, it says:

"... is the margin multiplied by the difference in rank between the candidates"

It appears that the margin is based solely on the rank (index) of the higher-rated candidate. Is this correct?
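For reference, here is a toy illustration of how I understand the loop pairing (scores and margin are made up). Because each iteration pairs candidates exactly i positions apart, `margin * i` ends up scaling with the rank gap of the pair:

```python
import torch

# Hypothetical scores for 4 candidates, assumed sorted best-first.
score = torch.tensor([[4.0, 3.0, 2.0, 1.0]])
margin = 1.5

losses = []
for i in range(1, score.size(1)):
    # Every (pos, neg) pair here is exactly i ranks apart, so
    # MarginRankingLoss(margin * i) demands a margin proportional
    # to the rank difference between the two candidates in the pair.
    pos, neg = score[:, :-i], score[:, i:]
    loss_func = torch.nn.MarginRankingLoss(margin * i)
    losses.append(loss_func(pos, neg, torch.ones_like(pos)).item())

print(losses)  # loss grows with the rank gap i
```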

yixinL7 commented 2 years ago

Hi, thank you for your interest in our work.

I wanted to note that this loss function is adapted from MatchSum.

For TotalLoss, they explain it here: it's to avoid the case where some special samples never enter the following for loop. I always think of it as just a placeholder.

For your second question about the margin, please refer to this thread: https://github.com/yixinL7/SimCLS/issues/6.

Please let me know if you have more questions.

griff4692 commented 2 years ago

Ahh thanks Yixin -

Yes, I've noticed it's the same pairwise calculation from MatchSum. I see regarding TotalLoss -- I just wanted to make sure it was meant to be an empty (zero) calculation.

I'm curious if you have any data comparing this pairwise ranking with other objectives:

Contrastive Loss: align positives in the decoder latent space (CLIFF)
Unlikelihood Loss: ConSeq

I'm working on a comparison of methods / metrics / positive-negative selection strategies, though not for news summarization. It will be interesting to see whether adjusting the likelihood (as in unlikelihood training and BRIO) is more effective than simply aligning positive decoder states (as in the CLIFF paper and other non-summarization contrastive learning papers).

yixinL7 commented 2 years ago

Hi Griffin, I have also found this comparison very interesting! My guess is adjusting the likelihood has a more direct impact on the decoding output than adjusting the latent representation, but I haven't tried to compare them empirically myself. I'm looking forward to seeing your work on this!