[AAAI2024] Spatial Transform Decoupling for Oriented Object Detection

Abstract

Vision Transformers (ViTs) have achieved remarkable success in computer vision tasks. However, their potential in rotation-sensitive scenarios has not been fully explored, and this limitation may be inherently attributed to the lack of spatial invariance in the data-forwarding process. In this study, we present a novel approach, termed Spatial Transform Decoupling (STD), providing a simple-yet-effective solution for oriented object detection with ViTs. Built upon stacked ViT blocks, STD utilizes separate network branches to predict the position, size, and angle of bounding boxes, effectively harnessing the spatial transform potential of ViTs in a divide-and-conquer fashion. Moreover, by aggregating cascaded activation masks (CAMs) computed upon the regressed parameters, STD gradually enhances features within regions of interest (RoIs), which complements the self-attention mechanism. Without bells and whistles, STD achieves state-of-the-art performance on the benchmark datasets including DOTA-v1.0 (82.24\% mAP) and HRSC2016 (98.55\% mAP), which demonstrates the effectiveness of the proposed method. Source code is enclosed in the supplementary material. Source code is available at https://github.com/yuhongtian17/Spatial-Transform-Decoupling.

Published paper in AAAI2024 is available at https://ojs.aaai.org/index.php/AAAI/article/view/28502.

Full paper is available at https://arxiv.org/abs/2308.10561.

Results and models

All models, logs and submissions is available at pan.baidu.com.

Password of pan.baidu.com: STDC

All models can be downloaded in release mode now!

Imagenet MAE pre-trained ViT-S backbone: mae_vit_small_800e.pth

Imagenet MAE pre-trained ViT-B backbone: mae_pretrain_vit_base_full.pth or official MAE weight

Imagenet MAE pre-trained HiViT-B backbone: mae_hivit_base_dec512d8b_hifeat_p1600lr10.pth

DOTA-v1.0 (multi-scale)

Model	mAP	Angle	lr schd	Batch Size	Configs	Models	Logs	Submissions
STD with Oriented RCNN and ViT-B	81.66	le90	1x	1*8	cfg	model	log	submission
STD with Oriented RCNN and HiViT-B	82.24	le90	1x	1*8	cfg	model	log	submission

HRSC2016

Model	mAP(07)	mAP(12)	Angle	lr schd	Batch Size	Configs	Models	Logs
STD with Oriented RCNN and ViT-B	90.67	98.55	le90	3x	1*8	cfg	model	log
STD with Oriented RCNN and HiViT-B	90.63	98.20	le90	3x	1*8	cfg	model	log

Installation

MMRotate depends on PyTorch, MMCV and MMDetection. Please refer to Install Guide for more detailed instruction. Below are quick steps for installation.

conda create -n openmmlab python=3.7 -y
conda activate openmmlab
conda install pytorch=1.7.0 torchvision torchaudio cudatoolkit=10.2 -c pytorch
pip install openmim
mim install mmcv-full==1.6.1
mim install mmdet==2.25.1
git clone https://github.com/open-mmlab/mmrotate.git
cd mmrotate
pip install -r requirements/build.txt
pip install -v -e .
cd ../
# 
# pip install timm apex
# 
git clone https://github.com/yuhongtian17/Spatial-Transform-Decoupling.git
cp -r Spatial-Transform-Decoupling/mmrotate-main/* mmrotate/

If you want to conduct offline testing on the DOTA-v1.0 dataset (for example, our ablation study is trained on the train-set and tested on the val-set), we recommend using the official DOTA devkit. Here we modify the evaluation code for ease of use.

git clone https://github.com/CAPTAIN-WHU/DOTA_devkit.git
cd DOTA_devkit
sudo apt install swig
swig -c++ -python polyiou.i
python setup.py build_ext --inplace
cd ../
# 
git clone https://github.com/yuhongtian17/Spatial-Transform-Decoupling.git
cp Spatial-Transform-Decoupling/DOTA_devkit-master/dota_evaluation_task1.py DOTA_devkit/

Example usage:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ./tools/dist_train.sh ./configs/rotated_faster_rcnn/rotated_faster_rcnn_r50_fpn_1x_dota_le90.py 8
# CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 nohup ./tools/dist_train.sh ./configs/rotated_faster_rcnn/rotated_faster_rcnn_r50_fpn_1x_dota_le90.py 8 > nohup.log 2>&1 &
# CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ./tools/dist_test.sh ./configs/rotated_faster_rcnn/rotated_faster_rcnn_r50_fpn_1x_dota_le90.py ./work_dirs/rotated_faster_rcnn_r50_fpn_1x_dota_le90/epoch_12.pth 8 --eval mAP
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ./tools/dist_test.sh ./configs/rotated_faster_rcnn/rotated_faster_rcnn_r50_fpn_1x_dota_le90.py ./work_dirs/rotated_faster_rcnn_r50_fpn_1x_dota_le90/epoch_12.pth 8 --format-only --eval-options submission_dir="./work_dirs/Task1_rotated_faster_rcnn_r50_fpn_1x_dota_le90_epoch_12/"
python "../DOTA_devkit/dota_evaluation_task1.py" --mergedir "./work_dirs/Task1_rotated_faster_rcnn_r50_fpn_1x_dota_le90_epoch_12/" --imagesetdir "./data/DOTA/val/" --use_07_metric True

Acknowledgement

Please also support two representation learning works on which this work is based:

imTED: paper code

HiViT: paper code

Also thanks to Xue Yang for his inspiration in the field of Oriented Object Detection.

News

VMamba-DOTA is available at here! A brand new model!

Citation

@inproceedings{yu2024spatial,
  title={Spatial Transform Decoupling for Oriented Object Detection},
  author={Yu, Hongtian and Tian, Yunjie and Ye, Qixiang and Liu, Yunfan},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={38},
  number={7},
  pages={6782--6790},
  year={2024}
}

License

STD is released under the License.

yuhongtian17 / Spatial-Transform-Decoupling

readme