weijiawu / TransDETR

[IJCV 2024] TransDETR: End-to-end Video Text Spotting with Transformer

Some questions about your proposed Rotated RoI #13

Closed: mjq11302010044 closed this issue 2 years ago

mjq11302010044 commented 2 years ago

Could you please clarify how your Rotated RoI differs from the Rotated RoI Pooling and Alignment operations proposed in [1], [2], and [3]? Although your Rotated RoI differs from RoIRotate (you mention using bilinear interpolation to map the feature grid), it appears to be the same component already proposed in [1-3]. If there is no difference between the two, please cite these papers in your paper and remove the "propose" claim.

[1] Ma, J., Shao, W., Ye, H., Wang, L., Wang, H., Zheng, Y., & Xue, X. (2018). Arbitrary-oriented scene text detection via rotation proposals. IEEE Transactions on Multimedia, 20(11), 3111-3122.
[2] He, T., Tian, Z., Huang, W., Shen, C., Qiao, Y., & Sun, C. (2018). An end-to-end TextSpotter with explicit alignment and attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 5020-5029).
[3] Ma, J. (2020). RRPN++: Guidance towards more accurate scene text detection. arXiv preprint arXiv:2009.13118.

weijiawu commented 2 years ago

Thank you for your suggestion. I will cite these papers and add further discussion of how they relate to our work.

  1. RRPN [1] maps the features with max pooling, which differs from our approach.
  2. The method in [2] is not based on an affine transformation.
  3. RRPN++ [3] is the extended version of RRPN and replaces the max pooling with bilinear interpolation.

Therefore, our Rotated RoI appears most similar to the one in RRPN++. We will remove the "propose" claim and revise the sentence to: "To enable end-to-end training, similar to RRPN, RRPN++, we adopt the Rotated Region-of-Interest~(Rotated RoI) to extract the features of each text from the output feature map of upsampling."
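For readers following the thread, the operation under discussion (an affine mapping of a rotated box onto the feature map, read out with bilinear interpolation rather than max pooling) can be sketched roughly as below. This is an illustrative pure-Python sketch, not the TransDETR or RRPN++ implementation: the function names, the box parameterization `(cx, cy, box_w, box_h, angle)`, and the choice of one sample per output bin are all assumptions made for the example.

```python
import math


def bilinear_sample(feat, x, y):
    """Bilinearly interpolate a 2D feature map (list of rows) at real-valued (x, y)."""
    h, w = len(feat), len(feat[0])
    # Clamp so that samples falling outside the map reuse the border values.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    top = feat[y0][x0] * (1 - dx) + feat[y0][x1] * dx
    bottom = feat[y1][x0] * (1 - dx) + feat[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def rotated_roi_sample(feat, cx, cy, box_w, box_h, angle, out_h, out_w):
    """Hypothetical helper: sample an out_h x out_w grid from a rotated box.

    (cx, cy) is the box center in feature-map coordinates and angle is in
    radians. Each output bin is read at its center with bilinear interpolation.
    """
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Bin center in box-local coordinates (origin at the box center).
            u = ((j + 0.5) / out_w - 0.5) * box_w
            v = ((i + 0.5) / out_h - 0.5) * box_h
            # Affine map (rotation + translation) into feature-map coordinates.
            x = cx + u * cos_a - v * sin_a
            y = cy + u * sin_a + v * cos_a
            row.append(bilinear_sample(feat, x, y))
        out.append(row)
    return out


# Toy feature map whose value is x + 10 * y, so interpolated values are easy to check.
feat = [[x + 10 * y for x in range(5)] for y in range(5)]
print(rotated_roi_sample(feat, 2.0, 2.0, 2.0, 2.0, 0.0, 2, 2))
# -> [[16.5, 17.5], [26.5, 27.5]]
```

A max-pooling variant in the style attributed to RRPN above would instead take the maximum of the feature values falling inside each rotated bin; the interpolated version reads fractional positions directly, which is the difference points 1 and 3 refer to.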

mjq11302010044 commented 2 years ago

@weijiawu Thanks for resolving my doubts about the papers. Overall, it is still good work for ECCV 2022. Congrats.

weijiawu commented 2 years ago

Thanks. I have updated both the arXiv version and the camera-ready version accordingly. Please get in touch with me if you have any further questions.