A Multilingual, Open World Video Text Dataset and End-to-end Video Text Spotter with Transformer
Link to our MOVText: A Large-Scale, Multilingual Open World Dataset for Video Text Spotting
(11/05/2022) TransDETR, a better transformer-based video text spotting method, has been released.
(08/04/2021) Refactoring the code.
(10/20/2021) The complete code has been released.
| Methods | MOTA | MOTP | IDF1 | Mostly Matched | Partially Matched | Mostly Lost |
|---|---|---|---|---|---|---|
| TransVTSpotter | 45.75 | 73.58 | 57.56 | 658 | 611 | 647 |
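
For readers unfamiliar with the tracking metrics in the table, the sketch below computes MOTA and IDF1 from raw counts using their standard definitions. It is only an illustration: the function names and the example counts are made up for this sketch and are not taken from the repository or from the evaluation that produced the table.

```python
def mota(num_gt: int, false_negatives: int, false_positives: int, id_switches: int) -> float:
    """CLEAR-MOT accuracy: 1 - (FN + FP + IDSW) / GT, with counts summed over all frames."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


def idf1(idtp: int, idfp: int, idfn: int) -> float:
    """Identity F1: 2 * IDTP / (2 * IDTP + IDFP + IDFN)."""
    denom = 2 * idtp + idfp + idfn
    return 2 * idtp / denom if denom else 0.0


if __name__ == "__main__":
    # Hypothetical counts for illustration only; unrelated to the reported results.
    print(f"MOTA = {100 * mota(1000, 300, 200, 50):.2f}")
    print(f"IDF1 = {100 * idf1(600, 200, 400):.2f}")
```

MOTA can become negative when the number of errors exceeds the number of ground-truth objects, which is why it is not clamped here.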
The codebases are built on top of Deformable DETR and TransTrack.
1. Install and build libs
git clone git@github.com:weijiawu/TransVTSpotter.git
cd TransVTSpotter
cd models/ops
python setup.py build install
cd ../..
pip install -r requirements.txt
2. Prepare datasets and annotations
The COCOTextV2 dataset is available from COCOTextV2. Convert its annotations to COCO format with:
python3 track_tools/convert_COCOText_to_coco.py
The ICDAR2015 video dataset is available from icdar2015. Convert its annotations to COCO format with:
python3 track_tools/convert_ICDAR15video_to_coco.py
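The conversion scripts above produce COCO-style annotations. As a rough idea of what such a conversion involves, the sketch below turns one quadrilateral text region into a COCO `[x, y, width, height]` box and wraps it in an annotation record. This is an assumption about the general approach, not a copy of the repository's scripts: the helper names and the `track_id` field are hypothetical, and the actual scripts may store different fields.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]


def quad_to_coco_bbox(points: Sequence[Point]) -> List[float]:
    """Return the axis-aligned COCO box [x_min, y_min, width, height] enclosing the points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_min, y_min = min(xs), min(ys)
    return [x_min, y_min, max(xs) - x_min, max(ys) - y_min]


def make_annotation(ann_id: int, image_id: int, track_id: int, points: Sequence[Point]) -> Dict:
    """Build one COCO-style annotation; `track_id` is a hypothetical extra field for tracking."""
    bbox = quad_to_coco_bbox(points)
    return {
        "id": ann_id,
        "image_id": image_id,
        "category_id": 1,          # single "text" category (assumed)
        "track_id": track_id,      # hypothetical field name
        "bbox": bbox,
        "area": bbox[2] * bbox[3],
        "iscrowd": 0,
    }


if __name__ == "__main__":
    quad = [(10, 20), (50, 22), (49, 40), (9, 38)]
    print(make_annotation(ann_id=1, image_id=1, track_id=7, points=quad))
```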
3. Pre-train on COCOTextV2
python3 -m torch.distributed.launch --nproc_per_node=8 --use_env main_track.py --output_dir ./output/Pretrain_COCOTextV2 --dataset_file pretrain --coco_path ./Data/COCOTextV2 --batch_size 2 --with_box_refine --num_queries 500 --epochs 300 --lr_drop 100 --resume ./output/Pretrain_COCOTextV2/checkpoint.pth
python3 track_tools/Pretrain_model_to_mot.py
The pre-trained model is available on [Baidu Netdisk](https://pan.baidu.com/s/1E_srg-Qm8yMgmh6AVlw0Tg) (password: 59w8) or [Google Drive](https://drive.google.com/file/d/1CPqE9D46vlOeO41sWIEXBjAnlbe5hSmG/view?usp=sharing).
The model with 44% MOTA is available on [Baidu Netdisk](https://pan.baidu.com/s/1u3u_P775ReuafRZ4V2amDg) (password: xnlw) or [Google Drive](https://drive.google.com/file/d/1HO59jwzL33NYtHlhzKqwq7fYYuf9xCzH/view).
4. Train TransVTSpotter
python3 -m torch.distributed.launch --nproc_per_node=8 --use_env main_track.py --output_dir ./output/ICDAR15 --dataset_file text --coco_path ./Data/ICDAR2015_video --batch_size 2 --with_box_refine --num_queries 300 --epochs 80 --lr_drop 40 --resume ./output/Pretrain_COCOTextV2/pretrain_coco.pth
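The `--epochs` and `--lr_drop` flags in the training commands suggest a step learning-rate schedule. The sketch below assumes the behaviour of the upstream Deformable DETR training script, where `lr_drop` is the step size of a step scheduler with a decay factor of 0.1; whether this repository keeps exactly that behaviour is an assumption. The base learning rate of `2e-4` and the helper name `step_lr` are also assumptions made for illustration.

```python
def step_lr(base_lr: float, epoch: int, lr_drop: int, gamma: float = 0.1) -> float:
    """Learning rate at a given epoch when it is multiplied by `gamma` every `lr_drop` epochs."""
    return base_lr * gamma ** (epoch // lr_drop)


if __name__ == "__main__":
    base_lr = 2e-4  # assumed default, not stated in this README
    # Fine-tuning setting from the command above: --epochs 80 --lr_drop 40
    for epoch in (0, 39, 40, 79):
        print(epoch, step_lr(base_lr, epoch, lr_drop=40))
```

Under these assumptions, fine-tuning would use the base rate for the first 40 epochs and one tenth of it for the remaining 40, while pre-training with `--epochs 300 --lr_drop 100` would decay twice.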
5. Inference and Visualization
python3 main_track.py --output_dir ./output/ICDAR15 --dataset_file text --coco_path ./Data/ICDAR2015_video --batch_size 1 --resume ./output/ICDAR15/checkpoint.pth --eval --with_box_refine --num_queries 300 --track_thresh 0.3
python3 track_tools/Evaluation_ICDAR15_video/vis_tracking.py
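The inference command above passes `--track_thresh 0.3`. A plausible reading, based on how score thresholds are commonly used in tracking-by-detection pipelines, is that only predictions with a confidence of at least 0.3 are kept as tracks. The sketch below illustrates that idea only; the function name, the detection dictionary layout, and the use of `>=` rather than `>` are assumptions, and the repository's tracker may apply the threshold differently.

```python
from typing import Dict, List


def filter_by_score(detections: List[Dict], track_thresh: float = 0.3) -> List[Dict]:
    """Keep only detections whose confidence score reaches the threshold."""
    return [d for d in detections if d["score"] >= track_thresh]


if __name__ == "__main__":
    # Hypothetical detections in COCO [x, y, width, height] box format.
    detections = [
        {"bbox": [9, 20, 41, 20], "score": 0.92},
        {"bbox": [100, 50, 30, 12], "score": 0.30},
        {"bbox": [200, 80, 25, 10], "score": 0.10},
    ]
    for det in filter_by_score(detections, track_thresh=0.3):
        print(det)
```

Raising the threshold generally trades missed text instances for fewer false tracks, so it is worth tuning on a validation split.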
## License
TransVTSpotter is released under the MIT License.
## Citing
If you use TransVTSpotter in your research or wish to refer to the baseline results published here, please use the following BibTeX entry:
@article{wu2021opentext,
  title={A Bilingual, OpenWorld Video Text Dataset and End-to-end Video Text Spotter with Transformer},
  author={Weijia Wu and Debing Zhang and Yuanqiang Cai and Sibo Wang and Jiahong Li and Zhuang Li and Yejun Tang and Hong Zhou},
  journal={35th Conference on Neural Information Processing Systems (NeurIPS 2021) Track on Datasets and Benchmarks},
  year={2021}
}