Original text:
While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
Version: OS: Win10 21H2; Zotero version (Help -> About Zotero): 6.0.13; add-on version (Tools -> Add-ons): Release 0.8.23
Translation (add-on output): 尽管变压器体系结构已成为自然语言处理任务的事实上的标准,但其在计算机视觉上的应用仍然有限。在视觉上,注意要么与卷积网络一起应用,要么用于替代卷积网络的某些组成部分,同时保持其整体结构。我们表明,这种对CNN的依赖不是必需的,并且直接应用于图像贴片序列的纯变压器可以很好地在图像分类任务上执行。当对大量数据进行预训练并转移到多个中型或小图像识别基准(Imagenet,Cifar-100,VTAB等)时,视觉变压器(VIT)与最新的结果相比,取得了良好的结果艺术卷积网络,同时需要培训的计算资源少得多。
Translation (web version output): 虽然 Transformer 架构已成为自然语言处理任务的事实标准,但其在计算机视觉中的应用仍然有限。 在视觉上,注意力要么与卷积网络结合使用,要么用于替换卷积网络的某些组件,同时保持其整体结构不变。 我们表明,这种对 CNN 的依赖是不必要的,直接应用于图像块序列的纯变换器可以在图像分类任务上表现得非常好。 当对大量数据进行预训练并转移到多个中型或小型图像识别基准(ImageNet、CIFAR-100、VTAB 等)时,与 state-of-the- 相比,Vision Transformer (ViT) 获得了出色的结果 艺术卷积网络,同时需要更少的计算资源来训练。