yixinL7 / BRIO

ACL 2022: BRIO: Bringing Order to Abstractive Summarization

Apply BRIO to other generation tasks #9

Open HillZhang1999 opened 2 years ago

HillZhang1999 commented 2 years ago

Hi, thanks for this fantastic work. Here is my question: I tried to apply BRIO to another generation task and re-implemented it in fairseq. However, I find that the performance is relatively poor after incorporating BRIO. Looking further into the generation results, I find that many outputs are just a single period. Moreover, the candidate scores become nearly uniform after training with the contrastive loss (I set the hyperparameters following the CNNDM setting in your paper), as in the example shown below:

before training with the contrastive loss (16 candidates, sorted):
[-0.2314, -0.2862, -0.2660, -0.2471, -0.2442, -0.2796, -0.2611, -0.2617, -0.2608, -0.2984, -0.2622, -0.5395, -0.5655, -0.4688, -0.5250, -0.5317],

after:
[-1.1421, -1.1402, -1.1290, -1.1524, -1.1554, -1.1483, -1.1415, -1.1476, -1.1527, -1.1472, -1.1538, -1.1437, -1.1555, -1.1722, -1.1440, -1.1427]

Can you give me any advice?

yixinL7 commented 2 years ago

Hi, thank you for your interest in our work. I'd recommend several things:

  1. Following the CNNDM setting may not always be suitable, depending on the dataset you are working on. There are several hyperparameters that need to be tuned (among others):
    • margin of the contrastive loss
    • scale of the contrastive loss
    • length penalty for calculating the model-predicted probability
  These hyperparameters can be sensitive (e.g., they are very different for CNNDM and XSum); a sketch of how they fit together is given after this list.
  2. For the length penalty, you can start your search from the length penalty used in the original beam search for the MLE-trained baseline.
  3. For the others, you may need to try a few different values. A rule of thumb is to watch the MLE loss during training: if it becomes too large, you most likely haven't found the appropriate setting.
  4. You may also start with training BRIO as a re-ranker by setting the MLE loss weight to zero, which should give you some idea of how to set the hyperparameters and whether using BRIO on your dataset is going to work at all. But please note that the hyperparameters used for training BRIO as a re-ranker can be different from those for training it as a generation model - I found that training it as a generation model is more sensitive to the hyperparameters.
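For concreteness, here is a minimal PyTorch sketch of how these pieces typically fit together in a BRIO-style objective. It is written from the paper's description, so names like `length_penalty`, `scale`, and `margin` and the exact weighting may differ from this repo's main.py:

```python
# A minimal sketch of a BRIO-style objective in PyTorch (written from the paper's
# description; the exact names and weighting in this repo's main.py may differ).
import torch
import torch.nn.functional as F

def candidate_scores(log_probs, cand_mask, length_penalty=2.0, scale=1.0):
    """Length-normalized sequence log-probabilities for each candidate.

    log_probs: (batch, n_cand, seq_len) token log-probabilities of each candidate
    cand_mask: (batch, n_cand, seq_len) 1 for real tokens, 0 for padding
    """
    summed = (log_probs * cand_mask).sum(dim=-1)                     # (batch, n_cand)
    lengths = cand_mask.sum(dim=-1).clamp(min=1).float()
    return scale * summed / lengths.pow(length_penalty)              # higher = better

def ranking_loss(scores, margin=0.001):
    """Pairwise margin loss; assumes candidates are sorted best-to-worst along dim 1."""
    loss = torch.zeros((), device=scores.device)
    n_cand = scores.size(1)
    for gap in range(1, n_cand):
        better = scores[:, :-gap]    # higher-ranked candidates
        worse = scores[:, gap:]      # candidates `gap` positions lower in the ranking
        # the required separation grows with the rank gap
        loss = loss + F.relu(worse - better + gap * margin).mean()
    return loss

# Training step (sketch): the total loss mixes token-level MLE on the reference with
# the ranking loss on the candidates, e.g.
#   total = mle_weight * mle_loss + rank_weight * ranking_loss(candidate_scores(lp, mask))
```

The gap-dependent margin is what pushes the score ordering to follow the candidate quality ordering, and the length penalty keeps scores comparable across candidates of different lengths.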

Please let me know if you have more questions. Good luck!

Hannibal046 commented 2 years ago

Hi @yixinL7, could you please share some insights about the scale parameter? Why do we need it, and how should we set this hyperparameter? Thanks!

Hannibal046 commented 2 years ago

BTW, how do you come up with this eval function for a different dataset? Is there any criterion? Many thanks for your amazing work! https://github.com/yixinL7/BRIO/blob/135f0e5cc5671fe4faa45ff3e05969969686419a/main.py#L411-L416
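For readers without the repo open: the linked lines score each candidate against the reference (ROUGE in the summarization setting), and that score defines the target ranking. A rough stand-in using the `rouge_score` package is sketched below as an assumption for illustration, not the repo's own ROUGE utilities; for another task the general answer is to plug in whatever automatic metric defines candidate quality there, e.g., an edit-based F0.5 for GEC.

```python
# A stand-in candidate-quality metric (assumption for illustration; the repo ships its
# own ROUGE utilities). Any task-appropriate metric can take this role.
from rouge_score import rouge_scorer

_scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeLsum"], use_stemmer=True)

def candidate_quality(candidate: str, reference: str) -> float:
    """Mean ROUGE-1/2/L F-score of a candidate against the reference."""
    s = _scorer.score(reference, candidate)
    return (s["rouge1"].fmeasure + s["rouge2"].fmeasure + s["rougeLsum"].fmeasure) / 3.0
```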

HillZhang1999 commented 2 years ago

Thank you for your advice and kind words! Indeed, I have tried a few different hyperparameters but still couldn't get positive results. I think the reason may be a characteristic of my task, i.e., grammatical error correction (GEC). Since GEC is a local sequence transduction task, many candidates in the beam differ only minimally, which may make the contrastive loss hard to optimize. I also noticed that you used diverse beam search, but I found that this technique performs poorly on GEC. Can you provide any further advice for using BRIO in local sequence transduction tasks like GEC and text simplification?
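For reference, a minimal sketch of the candidate-generation step being discussed, using HuggingFace transformers' diverse beam search; the model name, input sentence, and settings are placeholders rather than taken from this repo. As noted above, for a local-transduction task like GEC the 16 candidates can still come out nearly identical.

```python
# Generating 16 candidates with diverse beam search in HuggingFace transformers.
# Model name and input are placeholders; the settings are typical, not this repo's.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large")   # placeholder model
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large")

inputs = tokenizer("She go to school yesterday .", return_tensors="pt")
candidates = model.generate(
    **inputs,
    num_beams=16,
    num_beam_groups=16,        # one beam per group, i.e. maximally grouped
    diversity_penalty=1.0,     # penalizes a group for repeating earlier groups' tokens
    num_return_sequences=16,
    max_length=64,
)
texts = tokenizer.batch_decode(candidates, skip_special_tokens=True)
```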

Hannibal046 commented 2 years ago

Hi @HillZhang1999, I think these may help: https://github.com/yixinL7/SimCLS/issues/14 and https://arxiv.org/pdf/1512.02433.pdf (since NMT is also a 1-to-n generation task where n is relatively small).

HillZhang1999 commented 2 years ago

@Hannibal046 Thanks a lot!

yixinL7 commented 2 years ago

Thanks @Hannibal046 for the comment.

Hi @HillZhang1999, I'm not very familiar with GEC, but I think your observation makes sense. It's very critical to have diverse candidates so that the model can learn something meaningful. I'd recommend trying some other decoding algorithms (see the sketch after this list for a simple sampling baseline); there are actually several new papers about this, for example:
  • Massive-scale Decoding for Text Generation using Lattices
  • A Well-Composed Text is Half Done! Composition Sampling for Diverse Conditional Generation
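Not one of the methods from these papers, but as a quick diversity check a plain sampling baseline is easy to try first; here is a minimal sketch with HuggingFace transformers (model name and input are placeholders, and exact duplicates should be filtered before training):

```python
# Nucleus sampling as a simple way to get more varied candidates than (diverse) beam
# search. This is only a baseline, not the lattice or composition-sampling methods
# from the papers above; model name and input are placeholders.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large")   # placeholder
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large")

inputs = tokenizer("She go to school yesterday .", return_tensors="pt")
samples = model.generate(
    **inputs,
    do_sample=True,            # sample instead of searching
    top_p=0.95,                # nucleus sampling
    num_return_sequences=16,   # 16 independent samples
    max_length=64,
)
texts = tokenizer.batch_decode(samples, skip_special_tokens=True)
texts = list(dict.fromkeys(texts))   # drop exact duplicates before building candidate sets
```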

HillZhang1999 commented 2 years ago

Dear Yixin, thank you for your help. I will check it out.