
# TransVTSpotter: End-to-end Video Text Spotter with Transformer

License: MIT

## Introduction

A Multilingual, Open World Video Text Dataset and End-to-end Video Text Spotter with Transformer

See also our dataset paper: MOVText: A Large-Scale, Multilingual Open World Dataset for Video Text Spotting.

## Updates

### ICDAR2015 (video) Tracking challenge

| Methods | MOTA | MOTP | IDF1 | Mostly Matched | Partially Matched | Mostly Lost |
| --- | --- | --- | --- | --- | --- | --- |
| TransVTSpotter | 45.75 | 73.58 | 57.56 | 658 | 611 | 647 |
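MOTA and MOTP above follow the standard CLEAR-MOT definitions. As a reminder of what the headline number means, here is a minimal sketch of the MOTA formula with purely illustrative counts (not the benchmark's actual error breakdown):

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """CLEAR-MOT Multiple Object Tracking Accuracy:
    MOTA = 1 - (FN + FP + IDSW) / total ground-truth objects."""
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt

# Illustrative counts only: 1000 ground-truth boxes across all frames.
print(round(mota(300, 200, 40, 1000), 2))  # -> 0.46
```

Note that MOTA can go negative when the tracker produces more errors than there are ground-truth objects.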

## Notes

## Demo

## Installation

The codebase is built on top of Deformable DETR and TransTrack.

### Requirements

### Steps

  1. Install and build libs

    git clone git@github.com:weijiawu/TransVTSpotter.git
    cd TransVTSpotter
    cd models/ops
    python setup.py build install
    cd ../..
    pip install -r requirements.txt
  2. Prepare datasets and annotations

The COCOTextV2 dataset is available from the COCOTextV2 project page; convert it to COCO format with:

    python3 track_tools/convert_COCOText_to_coco.py

The ICDAR2015 video dataset is available from the icdar2015 challenge page; convert it with:

    python3 track_tools/convert_ICDAR15video_to_coco.py
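Both conversion scripts target COCO-style annotation JSON. A minimal sketch of that structure is below; the top-level fields follow the standard COCO detection format, while any extra video/frame keys this repo adds on top are assumptions, not taken from the scripts:

```python
import json

# Standard COCO-style detection annotations. Video datasets typically
# also carry frame/video identifiers; the file_name here only hints at that.
coco = {
    "images": [
        {"id": 1, "file_name": "Video_1_frame_0001.jpg", "width": 1280, "height": 720},
    ],
    "annotations": [
        # bbox is [x, y, width, height] in pixels; one entry per text instance.
        {"id": 1, "image_id": 1, "category_id": 1,
         "bbox": [100, 50, 80, 30], "area": 2400, "iscrowd": 0},
    ],
    "categories": [{"id": 1, "name": "text"}],
}

print(sorted(coco))  # -> ['annotations', 'categories', 'images']
json.dumps(coco)     # the converters write this structure to disk
```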
  3. Pre-train on COCOTextV2
    
    python3 -m torch.distributed.launch --nproc_per_node=8 --use_env main_track.py  --output_dir ./output/Pretrain_COCOTextV2 --dataset_file pretrain --coco_path ./Data/COCOTextV2 --batch_size 2  --with_box_refine --num_queries 500 --epochs 300 --lr_drop 100 --resume ./output/Pretrain_COCOTextV2/checkpoint.pth

    python3 track_tools/Pretrain_model_to_mot.py
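`Pretrain_model_to_mot.py` adapts the pre-trained detection checkpoint so it can seed the tracking model. A rough sketch of the idea, with plain dicts standing in for a PyTorch checkpoint and hypothetical key names (the script's actual remapping may differ):

```python
# Plain dicts stand in for torch.load / torch.save; key names are hypothetical.
pretrain_ckpt = {
    "model": {"backbone.conv1.weight": [0.1, 0.2], "class_embed.weight": [0.3]},
    "optimizer": {"state": "..."},  # training state we do not want to carry over
    "epoch": 299,
}

# Keep only the weights, so fine-tuning for tracking starts from the
# detector's parameters with a fresh optimizer and epoch counter.
mot_ckpt = {"model": dict(pretrain_ckpt["model"])}

print(sorted(mot_ckpt))  # -> ['model']
```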

The pre-trained model is available on [Baidu Netdisk](https://pan.baidu.com/s/1E_srg-Qm8yMgmh6AVlw0Tg) (password: 59w8) and [Google Drive](https://drive.google.com/file/d/1CPqE9D46vlOeO41sWIEXBjAnlbe5hSmG/view?usp=sharing).

A trained model reaching 44% MOTA is available on [Baidu Netdisk](https://pan.baidu.com/s/1u3u_P775ReuafRZ4V2amDg) (password: xnlw) and [Google Drive](https://drive.google.com/file/d/1HO59jwzL33NYtHlhzKqwq7fYYuf9xCzH/view).

4. Train TransVTSpotter

    python3 -m torch.distributed.launch --nproc_per_node=8 --use_env main_track.py --output_dir ./output/ICDAR15 --dataset_file text --coco_path ./Data/ICDAR2015_video --batch_size 2 --with_box_refine --num_queries 300 --epochs 80 --lr_drop 40 --resume ./output/Pretrain_COCOTextV2/pretrain_coco.pth
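With the flags above, the effective batch size is `nproc_per_node × --batch_size`, and `--lr_drop 40` steps the learning rate halfway through the 80 epochs. A quick sanity check of that arithmetic (the base learning rate and the 10x decay factor are DETR-style assumptions, not values read from this repo):

```python
nproc_per_node, batch_size = 8, 2
epochs, lr_drop = 80, 40

# One process per GPU, each with its own mini-batch.
effective_batch = nproc_per_node * batch_size
print(effective_batch)  # -> 16

# DETR-style step schedule: decay once at --lr_drop.
def lr_at(epoch, base_lr=2e-4, drop=lr_drop, gamma=0.1):
    return base_lr * (gamma if epoch >= drop else 1.0)

print(lr_at(10), lr_at(50))  # full rate before epoch 40, decayed after
```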


5. Inference and Visualize TransVTSpotter

Inference

    python3 main_track.py --output_dir ./output/ICDAR15 --dataset_file text --coco_path ./Data/ICDAR2015_video --batch_size 1 --resume ./output/ICDAR15/checkpoint.pth --eval --with_box_refine --num_queries 300 --track_thresh 0.3
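`--track_thresh 0.3` discards low-confidence detections before they are associated with tracks. A minimal sketch of that filtering step (detection scores and boxes are illustrative, not the model's actual output format):

```python
track_thresh = 0.3

# Each detection: (score, [x, y, w, h]). Boxes below the threshold are
# dropped before being matched against existing text tracks.
detections = [
    (0.92, [10, 20, 50, 15]),
    (0.28, [300, 40, 20, 10]),   # filtered out
    (0.55, [120, 80, 60, 18]),
]
kept = [d for d in detections if d[0] >= track_thresh]

print(len(kept))  # -> 2
```

Raising the threshold trades recall for fewer false-positive tracks; 0.3 is the value the command above uses.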

Visualize

    python3 track_tools/Evaluation_ICDAR15_video/vis_tracking.py


## License

TransVTSpotter is released under MIT License.

## Citing

If you use TransVTSpotter in your research or wish to refer to the baseline results published here, please use the following BibTeX entry:

    @article{wu2021opentext,
      title={A Bilingual, OpenWorld Video Text Dataset and End-to-end Video Text Spotter with Transformer},
      author={Weijia Wu and Debing Zhang and Yuanqiang Cai and Sibo Wang and Jiahong Li and Zhuang Li and Yejun Tang and Hong Zhou},
      journal={35th Conference on Neural Information Processing Systems (NeurIPS 2021) Track on Datasets and Benchmarks},
      year={2021}
    }