youweiliang / evit

Python code for ICLR 2022 spotlight paper EViT: Expediting Vision Transformers via Token Reorganizations
Apache License 2.0
162 stars, 19 forks

Why is normal ViT faster than EViT when the batch size is 1? #5

Open kritohyh opened 2 years ago

kritohyh commented 2 years ago

Your research is very meaningful. But when I turn the batch size down, why doesn't EViT perform as well? I hope you can clear up my doubts.

kritohyh commented 2 years ago

Maybe the delay is caused by the copy operation in torch.gather?

youweiliang commented 2 years ago

Hi, thanks for your interest in our work!

Yes. Compared to vanilla ViTs, EViT has additional gather and topk operations that require extra GPU kernel launches, and this overhead is non-negligible when the batch size is 1.
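To make the extra operations concrete, here is a minimal sketch of a top-k token selection step written with `torch.topk` and `torch.gather`. It is not the code from this repository: the function name `keep_topk_tokens`, the tensor shapes, and the way `scores` is produced are all assumptions for illustration.

```python
import torch

def keep_topk_tokens(tokens, scores, k):
    """Keep the k highest-scoring tokens of each image (hypothetical helper).

    tokens: [B, N, C] token embeddings
    scores: [B, N] per-token importance scores (assumed to be given)
    """
    # Extra op 1: indices of the k largest scores in each row, shape [B, k].
    idx = torch.topk(scores, k, dim=1).indices
    # Broadcast the indices over the channel dimension, shape [B, k, C].
    idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
    # Extra op 2: gather copies the selected tokens into a new tensor.
    return torch.gather(tokens, dim=1, index=idx)

tokens = torch.arange(2 * 4 * 3, dtype=torch.float32).reshape(2, 4, 3)
scores = torch.tensor([[0.1, 0.9, 0.3, 0.7],
                       [0.8, 0.2, 0.6, 0.4]])
kept = keep_topk_tokens(tokens, scores, k=2)  # shape [2, 2, 3]
```

A vanilla ViT block has neither of these two steps, so with a single image the work saved by dropping tokens may not be enough to pay for them.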

kritohyh commented 2 years ago

That said, deploying EViT for real-time inference on video streams may require overcoming the overhead of these additional operations. Is it possible to use TensorRT or a custom operator to solve this problem? I'd like your advice!

youweiliang commented 2 years ago

The overhead may be caused by the complement_idx in helpers.py. I will check it soon.
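For readers unfamiliar with the idea, the sketch below shows one way to compute a complement of indices, assuming (from the name only) that `complement_idx` returns the token indices that were not selected. It is an illustration, not the implementation in helpers.py, and the name `complement_indices` is hypothetical.

```python
import torch

def complement_indices(idx, n):
    """Per row, return the indices in range(n) that are NOT in idx.

    idx: [B, K] int64 tensor of distinct indices in [0, n)
    returns: [B, n - K] tensor, ascending within each row
    """
    B, K = idx.shape
    mask = torch.ones(B, n, dtype=torch.bool, device=idx.device)
    mask.scatter_(1, idx, False)  # mark the selected positions as False
    # nonzero() lists (row, col) pairs in row-major order. Every row has
    # exactly n - K True entries, so the column indices reshape cleanly.
    return mask.nonzero()[:, 1].reshape(B, n - K)

kept_idx = torch.tensor([[1, 3], [0, 2]])
dropped_idx = complement_indices(kept_idx, n=4)
```

Whichever way it is written, this step adds several small tensor operations on top of topk and gather, which is why it is a plausible source of overhead at batch size 1.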

For your use case in video streams, can't the video be viewed as a series of images so that the batch size is greater than 1?
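As a rough sketch of that suggestion, frames from a stream could be grouped into small batches before inference. The helper name `batch_frames` is made up for illustration, and the frames can be any objects (for example decoded images).

```python
from itertools import islice

def batch_frames(frames, batch_size):
    """Yield lists of up to batch_size consecutive frames from any iterable."""
    it = iter(frames)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return  # stream exhausted
        yield batch

# Stand-in for a video stream: five frame ids grouped two at a time.
batches = list(batch_frames(range(5), 2))
```

Note that waiting to fill a batch adds latency, so this only helps when a delay of a few frames is acceptable.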

kritohyh commented 2 years ago

Looking forward to your results! In my scenario, because frames are extracted sparsely, I need to test the inference speed on a single frame. However, my image resolution is large enough that each image is partitioned into a large batch, so EViT still delivers its high performance.