youweiliang / evit

Python code for ICLR 2022 spotlight paper EViT: Expediting Vision Transformers via Token Reorganizations
Apache License 2.0

Comparison with Evo-ViT #1

Closed Longday0923 closed 2 years ago

Longday0923 commented 2 years ago

Thanks for your brilliant work on accelerating ViTs!

Months ago I read the paper Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer, and I have now found that your model structures are similar: preserving informative tokens while aggregating uninformative ones. Have you ever compared with this work? I found few competing works on faster inference in Figure 4; judging from recent surveys of ViTs, there are many more fast ViTs you could compare with.

Also, as far as I know, there are plenty of works in the NLP field, based on BERT, that prototype and aggregate tokens. Those could also make good comparisons.

Thanks for your consideration, and I look forward to your reply.

youweiliang commented 2 years ago

Hi, thanks for your interest in our work.

Since the work you mentioned, namely Evo-ViT, is concurrent with our ICLR submission, we were unable to include it in our paper. Purely for comparison purposes, the performance of Evo-ViT appears similar to that of EViT, as can be seen from experimental results such as Table 1 in Evo-ViT and Table 8 in EViT. Specifically, Evo-ViT-DeiT-B achieves 54.5% speedup with 81.3% top-1 accuracy, while EViT/0.7-DeiT-B achieves 59% speedup with 81.3% top-1 accuracy. On the other hand, EViT is conceptually simpler than Evo-ViT and simpler to implement. We would like to discuss the connections and differences between EViT and Evo-ViT in future work.

It should be noted that our work does not aim to produce a SOTA ViT model, but merely demonstrates an intriguing property of the attention map in ViTs and how a simple token selection method can benefit ViT models.
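For anyone landing on this issue, here is a minimal sketch of that token selection idea: rank the image tokens by the [CLS] attention, keep the top-k, and fuse the remaining tokens into a single token weighted by their attentiveness. This is only an illustrative sketch, not the repository's actual implementation; the function name, the `keep_rate` default, and the head-averaged `cls_attn` input are assumptions.

```python
import torch

def evit_token_select(x, cls_attn, keep_rate=0.7):
    """Illustrative EViT-style token reorganization (not the repo's exact code).

    x:        (B, 1 + N, C) token embeddings, index 0 is the [CLS] token
    cls_attn: (B, N) attention of [CLS] to the N image tokens (head-averaged)
    """
    B, N1, C = x.shape
    N = N1 - 1
    k = max(1, int(N * keep_rate))

    cls_tok, img_tok = x[:, :1], x[:, 1:]          # split off [CLS]
    idx = cls_attn.topk(k, dim=1).indices          # indices of the most attentive tokens
    keep = torch.gather(img_tok, 1, idx.unsqueeze(-1).expand(-1, -1, C))

    # fuse the remaining (inattentive) tokens, weighted by their attentiveness
    mask = torch.ones(B, N, dtype=torch.bool, device=x.device)
    mask.scatter_(1, idx, False)
    rest = img_tok[mask].view(B, N - k, C)
    w = cls_attn[mask].view(B, N - k, 1)
    fused = (rest * w).sum(1, keepdim=True) / w.sum(1, keepdim=True).clamp(min=1e-6)

    return torch.cat([cls_tok, keep, fused], dim=1)  # (B, 1 + k + 1, C)
```

Applied between transformer blocks, this shrinks the token sequence (and thus the attention cost) while retaining the tokens the [CLS] token attends to most.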

For the Transformer models in the NLP field, the performance cannot be compared directly with ours as they involve different tasks and different evaluation metrics.

Thanks and regards, EViT Authors