wzk1015 / CNMT

[AAAI 2021] Confidence-aware Non-repetitive Multimodal Transformers for TextCaps
https://arxiv.org/pdf/2012.03662.pdf

feature extraction #7

Caroline0728 opened this issue 2 years ago

Caroline0728 commented 2 years ago

Hello, thank you so much for sharing the code! Could you also share the feature-extraction code? Did you perform feature extraction on the original TextCaps dataset?

wzk1015 commented 2 years ago

Hi, for feature extraction we follow M4C. It is performed on the original dataset. Note that both TextVQA and TextCaps use images from OpenImages, so features extracted on TextVQA can be used directly on TextCaps.
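
As a toy illustration of that reuse, here is a minimal sketch that resolves a TextCaps sample to a feature file extracted on TextVQA — the directory layout and per-image file naming are my assumptions, not the repo's actual config:

```python
import os
import numpy as np

# Assumed layout: one feature file per OpenImages image id,
# produced by the TextVQA feature extraction (hypothetical path).
TEXTVQA_FEAT_DIR = "data/textvqa/features"

def load_obj_features(image_id):
    # TextCaps images also come from OpenImages, so the same image_id
    # points at the feature file already extracted for TextVQA.
    return np.load(os.path.join(TEXTVQA_FEAT_DIR, image_id + ".npy"))
```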

wzk1015 commented 2 years ago

For the OCR systems, you can refer to their official repos (CRAFT, ABCNet, and four-stage STR).

Caroline0728 commented 2 years ago

Thank you for your reply! Are you referring to this link: facebookresearch/mmf?

wzk1015 commented 2 years ago

Yes, extract_features_vmb.py and extract_ocr_frcn_feature.py, to be precise.
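
For reference, a hedged sketch of consuming those scripts' output — with mmf's default settings each image is expected to yield a `<image>.npy` of region features plus a `<image>_info.npy` with boxes, but the file names, shapes, and info fields below are assumptions to verify against the scripts' actual save logic:

```python
import numpy as np

# Assumed default outputs of extract_features_vmb.py for one image:
obj_feat = np.load("out/abc123.npy")  # region features, e.g. shape (100, 2048)
obj_info = np.load("out/abc123_info.npy",
                   allow_pickle=True).item()  # dict with box metadata

print(obj_feat.shape)
print(obj_info.keys())  # e.g. 'bbox', 'num_boxes', ... (assumed fields)
```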

Caroline0728 commented 2 years ago

Thank you very much! Let me try ~!

Caroline0728 commented 2 years ago

I'm sorry to bother you again. There are a lot of files in MMF, and the configuration file for feature extraction only provides URL links, which are now invalid. May I ask where to modify the loading path for the original dataset?

wzk1015 commented 2 years ago

I think you should probably open an issue in the MMF repo for this.

Caroline0728 commented 2 years ago

> Yes, extract_features_vmb.py and extract_ocr_frcn_feature.py, to be precise.

Hello, first of all, thank you for the code you provided. I still have some questions. After running the text detection and recognition algorithms on the images, how do I go on to generate the .npy file required by the model? Could you give me some ideas? I would like to try to reproduce your code myself. Thank you very much for replying at your convenience! ~

wzk1015 commented 2 years ago

> After running the text detection and recognition algorithms on the images, how do I go on to generate the .npy file required by the model?

You can read the .npy file with numpy:

```python
>>> import numpy as np
>>> a = np.load('imdb_val_filtered_by_image_id.npy', allow_pickle=True)
>>> a[1].keys()
dict_keys(['image_id', 'image_classes', 'flickr_original_url', 'flickr_300k_url', 'image_width', 'image_height', 'set_name', 'image_name', 'image_path', 'feature_path', 'ocr_tokens', 'ocr_info', 'ocr_normalized_boxes', 'obj_normalized_boxes', 'caption_id', 'caption_str', 'caption_tokens', 'reference_strs', 'reference_tokens', 'ocr_confidence'])
```

Most of them come from the imdb file in M4C-Captioner, except for `ocr_confidence`:

```
dict_keys(['image_id', 'image_classes', 'flickr_original_url', 'flickr_300k_url', 'image_width', 'image_height', 'set_name', 'image_name', 'image_path', 'feature_path', 'ocr_tokens', 'ocr_info', 'ocr_normalized_boxes', 'obj_normalized_boxes', 'caption_id', 'caption_str', 'caption_tokens', 'reference_strs', 'reference_tokens'])
```

'image_id', 'image_classes', 'flickr_original_url', 'flickr_300k_url', 'image_width', 'image_height', 'set_name', 'image_name', 'image_path', 'caption_id', 'caption_str', 'caption_tokens', 'reference_strs', and 'reference_tokens' come from the json file in the original TextCaps dataset (or need some processing).
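
If it helps, a minimal sketch of pulling those fields from the official annotation file — the top-level 'data' key matches the public TextCaps json, but treat the per-field handling as an assumption:

```python
import json

with open("TextCaps_0.1_val.json") as f:
    annotations = json.load(f)["data"]  # one entry per caption

JSON_KEYS = [
    "image_id", "image_classes", "flickr_original_url", "flickr_300k_url",
    "image_width", "image_height", "set_name", "image_name", "image_path",
    "caption_id", "caption_str", "reference_strs",
]

# Copy what the json already provides; tokenized fields such as
# 'caption_tokens' / 'reference_tokens' may need extra processing.
imdb_items = [{k: entry[k] for k in JSON_KEYS if k in entry}
              for entry in annotations]
```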

'feature_path', 'ocr_tokens', 'ocr_info', 'ocr_normalized_boxes', 'obj_normalized_boxes', and 'ocr_confidence' come from your detection and OCR results. Print out `a[1]` and you will understand them.
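
To make that last step concrete, a hedged sketch of filling the OCR-side fields and saving the imdb. The `(word, box, confidence)` tuple format is a placeholder for whatever your CRAFT/ABCNet + STR pipeline outputs, and the exact 'ocr_info' schema should be matched against a printed `a[1]` from the released imdb:

```python
import numpy as np

def add_ocr_fields(item, ocr_results, img_w, img_h):
    """ocr_results: list of (word, (x1, y1, x2, y2), confidence) tuples
    from your detection + recognition pipeline (placeholder format)."""
    item["ocr_tokens"] = [w for w, _, _ in ocr_results]
    # Schema is an assumption; compare with a[1]['ocr_info'] in the real imdb.
    item["ocr_info"] = [{"word": w, "bounding_box": box}
                        for w, box, _ in ocr_results]
    # Boxes normalized to [0, 1] by image size, as in the imdb above.
    item["ocr_normalized_boxes"] = np.array(
        [[x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h]
         for _, (x1, y1, x2, y2), _ in ocr_results], dtype=np.float32)
    item["ocr_confidence"] = [c for _, _, c in ocr_results]
    return item

# The imdb is just an object array of these dicts (imdb_items from the
# sketch above, after adding feature paths and the OCR fields).
np.save("imdb_val_filtered_by_image_id.npy",
        np.array(imdb_items, dtype=object))
```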