Caroline0728 opened 2 years ago
Hi, for feature extraction we follow M4C. It is performed on the original dataset; note that both TextVQA and TextCaps use images from OpenImages, so features extracted for TextVQA can be used directly for TextCaps.
For OCR systems you can refer to their official repos (CRAFT, ABCNet and four-stage STR).
Thank you for your reply! Are you referring to this link? facebookresearch/mmf
Yes, extract_features_vmb.py and extract_ocr_frcn_feature.py to be precise
Thank you very much! Let me try ~!
I'm sorry to bother you again. There are a lot of files in MMF, and the configuration file for feature extraction provides URL links, which are now invalid. May I ask where to modify the loading path for the original dataset?
I think you should probably open an issue in the MMF repo for this.
> Yes, extract_features_vmb.py and extract_ocr_frcn_feature.py to be precise
Hello, first of all, thank you for the code you provided. I still have some questions. Given the results obtained after running the text detection and recognition algorithms on the images, how do I further generate the .npy file required by the model? Could you give me some ideas? I would like to try to reproduce your code myself. Thank you very much for replying at your convenience! ~
You can read the .npy file with NumPy (recent NumPy versions require `allow_pickle=True` to load the dict entries):
>>> import numpy as np
>>> a = np.load('imdb_val_filtered_by_image_id.npy', allow_pickle=True)
>>> a[1].keys()
dict_keys(['image_id', 'image_classes', 'flickr_original_url', 'flickr_300k_url', 'image_width', 'image_height', 'set_name', 'image_name', 'image_path', 'feature_path', 'ocr_tokens', 'ocr_info', 'ocr_normalized_boxes', 'obj_normalized_boxes', 'caption_id', 'caption_str', 'caption_tokens', 'reference_strs', 'reference_tokens', 'ocr_confidence'])
Most of them come from the imdb file in M4C-Captioner, except for `ocr_confidence`. The M4C-Captioner imdb keys are:
dict_keys(['image_id', 'image_classes', 'flickr_original_url', 'flickr_300k_url', 'image_width', 'image_height', 'set_name', 'image_name', 'image_path', 'feature_path', 'ocr_tokens', 'ocr_info', 'ocr_normalized_boxes', 'obj_normalized_boxes', 'caption_id', 'caption_str', 'caption_tokens', 'reference_strs', 'reference_tokens'])
'image_id', 'image_classes', 'flickr_original_url', 'flickr_300k_url', 'image_width', 'image_height', 'set_name', 'image_name', 'image_path', 'caption_id', 'caption_str', 'caption_tokens', 'reference_strs', 'reference_tokens'
are from the json file in the original TextCaps dataset (or need some processing).
'feature_path', 'ocr_tokens', 'ocr_info', 'ocr_normalized_boxes', 'obj_normalized_boxes', and
`ocr_confidence` are from your detection and OCR results. Print out `a[1]` and you will understand them.
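To make the key origins concrete, here is a minimal sketch of assembling one imdb-style entry and saving it as a .npy file. All field values below are hypothetical placeholders, and the exact set of fields and box conventions should be checked against the real imdb file printed above; this only illustrates the mechanics of writing and reloading the file.

```python
import numpy as np

# Hypothetical entry: fields on the left mirror the keys listed above.
# Real values come from the TextCaps json and your own detector / OCR output.
entry = {
    "image_id": "0054c91397f2fe05",          # placeholder id, from TextCaps json
    "image_width": 1024,                      # from TextCaps json
    "image_height": 768,                      # from TextCaps json
    "set_name": "val",
    "feature_path": "0054c91397f2fe05.npy",  # object-feature file you extracted
    "ocr_tokens": ["stop", "ahead"],         # from your OCR system
    "ocr_confidence": [0.98, 0.87],          # per-token scores from your OCR system
    # Boxes scaled to [0, 1] as (x1, y1, x2, y2), one row per OCR token.
    "ocr_normalized_boxes": np.array(
        [[0.1, 0.2, 0.3, 0.4],
         [0.5, 0.2, 0.7, 0.4]], dtype=np.float32),
}

# Collect all entries into an object array and save it.
imdb = [entry]
np.save("imdb_val_custom.npy", np.array(imdb, dtype=object))

# Reading it back requires allow_pickle=True on recent NumPy versions.
loaded = np.load("imdb_val_custom.npy", allow_pickle=True)
```

Note that the first element of the real imdb file may be a metadata header rather than an image entry (which is why the snippet above indexes `a[1]`, not `a[0]`).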
Hello, thank you so much for sharing the code! Can the feature-extraction code be shared? Did you perform feature extraction on the original TextCaps dataset?