yitu-opensource / ConvBert


What's the essential difference between ConvBert and LSRA? #14

Closed yuanenming closed 3 years ago

yuanenming commented 3 years ago

LSRA: Lite Transformer with Long-Short Range Attention.

LSRA also integrates convolution operations into transformer blocks. I'm just wondering what makes ConvBERT different from LSRA.

yuanenming commented 3 years ago

Is it that LSRA combines multi-head attention and convolution in a multi-branch manner, whereas ConvBERT integrates convolution into the transformer block itself? If so, what are the pros and cons of the two approaches? Do you have experiments comparing them?

Thanks a lot!!!

zihangJiang commented 3 years ago

Hi @yuanenming , Thanks for your interest.

LSRA targets machine translation and abstractive summarization. It combines dynamic convolution and multi-head attention in a two-branch manner.
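
Roughly, that two-branch layout can be pictured like this (a toy PyTorch sketch only, with a plain depthwise conv standing in for dynamic conv; the class and parameter names are made up for illustration, not taken from the LSRA code):

```python
import torch
import torch.nn as nn

class TwoBranchBlock(nn.Module):
    """Toy LSRA-style block: half the channels go through self-attention
    (global branch), the other half through a depthwise convolution
    standing in for dynamic conv (local branch)."""
    def __init__(self, dim, num_heads=4, kernel_size=7):
        super().__init__()
        assert dim % 2 == 0
        self.half = dim // 2
        self.attn = nn.MultiheadAttention(self.half, num_heads, batch_first=True)
        self.conv = nn.Conv1d(self.half, self.half, kernel_size,
                              padding=kernel_size // 2, groups=self.half)

    def forward(self, x):                                  # x: (batch, seq_len, dim)
        left, right = x.split(self.half, dim=-1)
        attn_out, _ = self.attn(left, left, left)          # global context branch
        conv_out = self.conv(right.transpose(1, 2)).transpose(1, 2)  # local context branch
        return torch.cat([attn_out, conv_out], dim=-1)     # merge the two branches

x = torch.randn(2, 16, 64)
print(TwoBranchBlock(64)(x).shape)  # torch.Size([2, 16, 64])
```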

ConvBERT is a pre-training based model that can be fine-tuned on downstream tasks like sentence classification. We also propose a novel span-based dynamic convolution operator and combine it with self-attention to form a mixed-attention block.
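
A much simplified sketch of the span-based idea, for intuition only: the per-position kernel is predicted from a local span of tokens rather than from the single current token, then applied as a lightweight convolution. The kernel-generation path and names below are illustrative assumptions, not the implementation in this repo; see the paper and code for the actual operator.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpanDynamicConvSketch(nn.Module):
    """Simplified span-based dynamic convolution: kernel logits are computed
    from a window of neighbouring tokens (the "span"), softmax-normalised,
    and used to mix each position's k neighbours."""
    def __init__(self, dim, kernel_size=7):
        super().__init__()
        self.k = kernel_size
        # span encoder: a depthwise conv that looks at a window of tokens
        self.span = nn.Conv1d(dim, dim, kernel_size,
                              padding=kernel_size // 2, groups=dim)
        self.to_kernel = nn.Linear(dim, kernel_size)

    def forward(self, x):                                   # x: (B, T, D)
        span_feat = self.span(x.transpose(1, 2)).transpose(1, 2)   # (B, T, D)
        kernel = F.softmax(self.to_kernel(span_feat), dim=-1)      # (B, T, k)
        pad = self.k // 2
        x_pad = F.pad(x.transpose(1, 2), (pad, pad))                # (B, D, T + 2*pad)
        windows = x_pad.unfold(2, self.k, 1)                        # (B, D, T, k)
        # weighted sum of each position's neighbours with its own kernel
        return torch.einsum('bdtk,btk->btd', windows, kernel)

x = torch.randn(2, 16, 64)
print(SpanDynamicConvSketch(64)(x).shape)  # torch.Size([2, 16, 64])
```

In the mixed-attention block, an operator of this kind handles local dependencies while the remaining self-attention heads handle global ones.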

Experiments comparing span-based dynamic conv and plain dynamic conv can be found in Section 4.3, Table 2 of our paper.

There you can see that our span-based dynamic conv outperforms plain dynamic conv in this pre-training setting. A direct comparison between LSRA and ConvBERT is harder to make, since the tasks and training setups differ.

yuanenming commented 3 years ago

Thank you for your timely reply! I will close this issue.