ymy-k / Hi-SAM

[arXiv preprint] Hi-SAM: Marrying Segment Anything Model for Hierarchical Text Segmentation
Apache License 2.0

Hi-SAM: Marrying Segment Anything Model for Hierarchical Text Segmentation

This is the official repository for Hi-SAM, a unified hierarchical text segmentation model. Refer to our paper for more details.

:sparkles: Highlight

[Figure: overview]

[Figure: example]

:bulb: Overview of Hi-SAM

[Figure: Hi-SAM]

:fire: News

:hammer_and_wrench: Install

Recommended: Linux, Python 3.8, PyTorch 1.10, CUDA 11.1

conda create --name hi_sam python=3.8 -y
conda activate hi_sam
pip install torch==1.10.0+cu111 torchvision==0.11.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html
git clone https://github.com/ymy-k/Hi-SAM.git
cd Hi-SAM
pip install -r requirements.txt

:pushpin: Checkpoints

You can download the following model weights and put them in pretrained_checkpoint/.

| Model | Used Dataset | Weights | fgIOU | F-score |
|---|---|---|---|---|
| SAM-TSS-B | Total-Text | OneDrive | 80.93 | 86.25 |
| SAM-TSS-L | Total-Text | OneDrive | 84.59 | 88.69 |
| SAM-TSS-H | Total-Text | OneDrive | 84.86 | 89.68 |

| Model | Used Dataset | Weights | fgIOU | F-score |
|---|---|---|---|---|
| SAM-TSS-B | TextSeg | OneDrive | 87.15 | 92.81 |
| SAM-TSS-L | TextSeg | OneDrive | 88.77 | 93.79 |
| SAM-TSS-H | TextSeg | OneDrive | 88.96 | 93.87 |

| Model | Used Dataset | Weights | fgIOU | F-score |
|---|---|---|---|---|
| SAM-TSS-B | HierText | OneDrive | 73.39 | 81.34 |
| SAM-TSS-L | HierText | OneDrive | 78.37 | 84.99 |
| SAM-TSS-H | HierText | OneDrive | 79.27 | 85.63 |

| Model | Used Dataset | Weights | Stroke F-score | Word F-score | Text-Line F-score | Paragraph F-score |
|---|---|---|---|---|---|---|
| Efficient Hi-SAM-S | HierText | OneDrive | 75.60 | waiting results | | |
| Hi-SAM-B | HierText | OneDrive | 79.78 | 78.34 | 82.15 | 71.15 |
| Hi-SAM-L | HierText | OneDrive | 82.90 | 81.83 | 84.85 | 74.49 |
| Hi-SAM-H | HierText | OneDrive | 83.36 | 82.86 | 85.30 | 75.97 |

The results of Hi-SAM on the test set are reported here.

:star: Note:

  1. To speed up downloads and save storage, the checkpoints above do not contain the parameters of SAM's ViT image encoder. Follow segment-anything to obtain sam_vit_b_01ec64.pth, sam_vit_l_0b3195.pth, and sam_vit_h_4b8939.pth, and put them in pretrained_checkpoint/ so that the frozen parameters of the ViT image encoder can be loaded.
  2. To train Hi-SAM yourself, in addition to downloading the SAM weights, also download the isolated mask decoder weights and put them in pretrained_checkpoint/ to initialize the H-Decoder (or separate the mask decoder part from the SAM weights yourself): vit_b_maskdecoder.pth, vit_l_maskdecoder.pth, and vit_h_maskdecoder.pth from segment-anything, and vit_s_maskdecoder.pth from EfficientSAM. For example, to train Hi-SAM-L, pretrained_checkpoint/ should look like this:
|- pretrained_checkpoint
|  |- sam_vit_l_0b3195.pth
|  └  vit_l_maskdecoder.pth
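
Before launching a demo or a training run, it can help to confirm that the expected files are in place. The snippet below is a minimal sketch, not part of this repository: the helper name `missing_weights` and its arguments are hypothetical, and only the file names come from the notes above. `vit_s` is left out because the notes do not name an image encoder file for it.

```python
from pathlib import Path

# File names are taken from the notes above; the helper itself is hypothetical.
SAM_ENCODER_WEIGHTS = {
    "vit_b": "sam_vit_b_01ec64.pth",
    "vit_l": "sam_vit_l_0b3195.pth",
    "vit_h": "sam_vit_h_4b8939.pth",
}


def missing_weights(model_type, ckpt_dir="pretrained_checkpoint", training=False):
    """Return the expected weight files that are not present in ckpt_dir."""
    required = [SAM_ENCODER_WEIGHTS[model_type]]
    if training:
        # The isolated mask decoder weights are only needed to initialize H-Decoder.
        required.append(f"{model_type}_maskdecoder.pth")
    root = Path(ckpt_dir)
    return [name for name in required if not (root / name).is_file()]


if __name__ == "__main__":
    # An empty list means everything needed to train Hi-SAM-L was found.
    print(missing_weights("vit_l", training=True))
```

The released Hi-SAM or SAM-TSS checkpoint passed to `--checkpoint` is not checked here, since its file name depends on the model you downloaded.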

:arrow_forward: Usage

1. Visualization Demo

1.1 Text stroke segmentation (for SAM-TSS & Hi-SAM):

python demo_hisam.py --checkpoint pretrained_checkpoint/sam_tss_l_hiertext.pth --model-type vit_l --input demo/2e0cb33320757201.jpg --output demo/

For better quality on small text, use sliding-window inference by running the following script:

python demo_hisam.py --checkpoint pretrained_checkpoint/sam_tss_l_hiertext.pth --model-type vit_l --input demo/2e0cb33320757201.jpg --output demo/2e0cb33320757201_sliding.png --patch_mode

1.2 Word, Text-line, and Paragraph Segmentation (for Hi-SAM)

Run the following script for promptable segmentation on demo/img293.jpg:

python demo_hisam.py --checkpoint pretrained_checkpoint/hi_sam_l.pth --model-type vit_l --input demo/img293.jpg --output demo/ --hier_det

2. Evaluation

First, follow data_preparation.md to prepare the datasets.

2.1 Text Stroke Segmentation (for SAM-TSS & Hi-SAM)

To evaluate only the text stroke segmentation performance, run the following script:

python -m torch.distributed.launch --nproc_per_node=8 train.py --checkpoint <saved_model_path> --model-type <select_vit_type> --val_datasets hiertext_test --eval
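
For readers unfamiliar with the fgIOU and F-score columns in the Checkpoints tables, the sketch below shows how such pixel-level metrics are commonly defined for a single pair of binary masks. It is an illustration under assumptions, not the evaluation code in train.py: the function name `stroke_metrics` is hypothetical, masks are plain nested lists for simplicity, and the actual script may aggregate over the dataset differently.

```python
def stroke_metrics(pred, gt):
    """Compute foreground IoU and F-score for two binary masks.

    pred and gt are equally sized nested lists (rows of 0/1 values).
    """
    tp = fp = fn = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                tp += 1  # text pixel predicted as text
            elif p:
                fp += 1  # background pixel predicted as text
            elif g:
                fn += 1  # text pixel that was missed
    union = tp + fp + fn
    fg_iou = tp / union if union else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return fg_iou, f_score


if __name__ == "__main__":
    pred = [[1, 1, 0], [0, 1, 0]]
    gt = [[1, 0, 0], [0, 1, 1]]
    print(stroke_metrics(pred, gt))
```

The tables report these values as percentages, so a returned value of 0.5 would correspond to 50.00.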

If you want to evaluate the performance on HierText with sliding window inference, run the following scripts:

mkdir img_eval
python demo_hisam.py --checkpoint <saved_model_path> --model-type <select_vit_type> --input datasets/HierText/test/ --output img_eval/ --patch_mode
python eval_img.py

Sliding-window inference is relatively slow. To speed it up, you can split the test images into multiple folders and run inference on each folder with a separate GPU.
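
One way to split the test images into per-GPU folders is sketched below. This is a hypothetical helper, not a script shipped with the repository: the function names, the `shard_<i>` folder naming, and the choice to copy files (rather than move or symlink them) are all assumptions.

```python
import shutil
from pathlib import Path


def shard_paths(paths, num_shards):
    """Split paths into num_shards round-robin groups of near-equal size."""
    ordered = sorted(paths)
    return [ordered[i::num_shards] for i in range(num_shards)]


def make_shard_dirs(src_dir, dst_root, num_shards):
    """Copy the files of src_dir into dst_root/shard_0 ... shard_{n-1}."""
    files = [p for p in Path(src_dir).iterdir() if p.is_file()]
    shard_dirs = []
    for idx, shard in enumerate(shard_paths(files, num_shards)):
        shard_dir = Path(dst_root) / f"shard_{idx}"
        shard_dir.mkdir(parents=True, exist_ok=True)
        for path in shard:
            shutil.copy2(path, shard_dir / path.name)
        shard_dirs.append(shard_dir)
    return shard_dirs
```

Each resulting folder can then be passed as `--input` to a separate `demo_hisam.py` process, for example with `CUDA_VISIBLE_DEVICES` set to a different GPU index per process and the same `--output img_eval/` for all of them.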

2.2 Hierarchical Text Segmentation (for Hi-SAM)

For stroke-level performance, follow section 2.1. For word-, text-line-, and paragraph-level performance on HierText, follow the steps below.

Step 1: run the following scripts to get the required jsonl file:

python demo_amg.py --checkpoint <saved_model_path> --model-type <select_vit_type> --input datasets/HierText/test/ --total_points 1500 --batch_points 100 --eval
cd hiertext_eval
python collect_results.py --saved_name res_1500pts.jsonl

For faster inference, you can split the test or validation images into multiple folders and run inference on each folder with a separate GPU.

Step 2: if you run inference on the HierText test set, submit the final jsonl file to the official website to obtain the evaluation metrics. If you run inference on the validation set: (1) follow the HierText repo to download and obtain the validation ground truth validation.jsonl, and put it in hiertext_eval/gt/; (2) run the following script, borrowed from the HierText repo, to get the evaluation metrics:

python eval.py --gt=gt/validation.jsonl --result=res_1500pts.jsonl --output=score.txt --mask_stride=1 --eval_lines --eval_paragraphs
cd ..

The evaluation takes about 20 minutes. The evaluation metrics are saved in the file specified by --output.

3. Training

Follow data_preparation.md to prepare the datasets, and prepare the required pretrained weights described in the Checkpoints section.

3.1 Training Hi-SAM

For example, to train Hi-SAM-L on HierText:

python -m torch.distributed.launch --nproc_per_node=8 train.py --checkpoint ./pretrained_checkpoint/sam_vit_l_0b3195.pth --model-type vit_l --output work_dirs/hi_sam_l/ --batch_size_train 1 --lr_drop_epoch 130 --max_epoch_num 150 --train_datasets hiertext_train --val_datasets hiertext_val --hier_det --find_unused_params

The released models are trained on 8 V100 (32G) GPUs (Hi-SAM-L takes about 2 days). The saved models after the final epoch are used for evaluation.

3.2 Training SAM-TSS

For example, to train SAM-TSS-L on TextSeg:

python -m torch.distributed.launch --nproc_per_node=8 train.py --checkpoint ./pretrained_checkpoint/sam_vit_l_0b3195.pth --model-type vit_l --output work_dirs/sam_tss_l_textseg/ --batch_size_train 1 --max_epoch_num 70 --train_datasets textseg_train --val_datasets textseg_val

The released models are trained on 8 V100 (32G) GPUs (SAM-TSS only takes a few hours). The best models on the validation set are used for evaluation.

:eye: Applications

1. Promptable Multi-granularity Text Erasing and Inpainting

Hi-SAM can be combined with Stable-Diffusion-inpainting for interactive text erasing and inpainting (click a single point to erase and inpaint a word, text line, or paragraph). See this project for an implementation of the combination of Hi-SAM and Stable-Diffusion.

2. Text Detection

Hi-SAM can also perform word-level-only or text-line-level-only text detection. It directly segments the contact text instance region instead of the shrunk text kernel region.

[Figure: spotting]

Two demo models are provided here: word_detection_totaltext.pth (trained on Total-Text, only for word detection) and line_detection_ctw1500.pth (trained on CTW1500, only for text-line detection). Put them in pretrained_checkpoint/. Then, for example, run the following script for word detection (only for the detection demo on Total-Text):

python demo_text_detection.py --checkpoint pretrained_checkpoint/word_detection_totaltext.pth --model-type vit_h --input demo/img643.jpg --output demo/ --dataset totaltext

For text-line detection (only for the detection demo on CTW1500):

python demo_text_detection.py --checkpoint pretrained_checkpoint/line_detection_ctw1500.pth --model-type vit_h --input demo/1165.jpg --output demo/ --dataset ctw1500

3. Promptable Scene Text Spotting

Hi-SAM can be combined with SPTSv2, a single-point scene text spotter. SPTSv2 can recognize scene text but predicts only a single-point position for each instance. By providing that point position as a prompt to Hi-SAM, the intact text mask can be obtained. Some demo figures are provided below; the green stars indicate the point prompts. The masks are generated by the word detection model from section 2. Text Detection.

[Figure: spotting]

:label: TODO

💗 Acknowledgement

:black_nib: Citation

If you find Hi-SAM helpful in your research, please consider giving this repository a :star: and citing:

@article{ye2024hi-sam,
  title={Hi-SAM: Marrying Segment Anything Model for Hierarchical Text Segmentation},
  author={Ye, Maoyuan and Zhang, Jing and Liu, Juhua and Liu, Chenyu and Yin, Baocai and Liu, Cong and Du, Bo and Tao, Dacheng},
  journal={arXiv preprint arXiv:2401.17904},
  year={2024}
}