youngfly11 / LCMCG-PyTorch

AAAI2020-The official implementation of "Learning Cross-modal Context Graph for Visual Grounding"

flickr30k_res50_nms1e3_feat_pascal/4891383938.pkl #9

Closed JCZ404 closed 2 years ago

JCZ404 commented 2 years ago

Thank you very much for doing such an excellent job; your work has inspired me a lot. I wanted to learn by reproducing it, but I ran into some difficulties. When I ran your code, I got this error (see the attached screenshot). It seems that a precomputed feature map is missing. How can I generate this feature map myself? Could you provide this file? Thanks a lot.

youngfly11 commented 2 years ago

You need to follow step 2 to generate your own features: download the weights and generate the feature maps yourself. By the way, I have uploaded sg_anno.json to Google Drive.

JCZ404 commented 2 years ago

Thank you very much for your reply. In fact, I am fairly new to this field, and I found your paper excellent when I read it; it taught me a lot. I've been working on this project for quite some time, and I've generated the scene graph myself. Currently, the precomputed feature maps are my last and biggest problem ):, so I would like to ask if you could send me a copy of them. Thank you!!!

youngfly11 commented 2 years ago

Sorry, the feature maps total around 300 GB, so I cannot upload them. You can generate them yourself.

JCZ404 commented 2 years ago

That's all right, thank you anyway! But can I confirm some details about the feature map generation with you? These feature maps are generated by forwarding each image in the Flickr30k dataset through the already-trained Faster R-CNN model (the model in maskrcnn-benchmark with the provided weights), much like running inference, and then storing the C5 feature map and the post-NMS bounding boxes into a .pkl file holding a dict for each image. Is this understanding right?

youngfly11 commented 2 years ago

You need to use the original maskrcnn-benchmark repo to extract the features. There are two steps:

  1. Use maskrcnn-benchmark to extract the object bounding boxes with NMS = 0.3.
  2. Store the feature map at C4, not C5, because our method crops the RoI features itself.

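The NMS = 0.3 step mentioned above can be sketched as plain greedy non-maximum suppression. This is a minimal NumPy illustration, not the repo's actual code; it assumes boxes in [x1, y1, x2, y2] format:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.3):
    """Greedy NMS: keep the highest-scoring box, drop boxes whose IoU
    with it exceeds iou_thresh, and repeat. Returns kept indices."""
    order = scores.argsort()[::-1]          # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        rest = order[1:]
        # intersection rectangle between box i and the remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]     # keep only boxes below the threshold
    return keep
```

In practice maskrcnn-benchmark runs NMS per class inside its box head; the point of the sketch is just what the 0.3 threshold does.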
JCZ404 commented 2 years ago

Thank you so much for your reply! I generate the feature map as follows (see the attached screenshot):

1. res['feature'] comes from C4; its shape is [1, 1024, img_ori_H/16, img_ori_W/16].
2. res['box'] holds the boxes from the cls/reg branch after NMS: the RPN proposals after NMS are [1000, 4]; after the cls/reg branch, score thresholding, and per-class NMS, the final boxes are [n, 4], where n varies per image.
3. res['img_scale'] is the image size fed into maskrcnn-benchmark, e.g. [1200, 800] (max size 1333, min size 800).
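The per-image record described above can be written out roughly like this. A sketch only: the key names follow the comment, the feature tensor here is a zero-filled stand-in for the real C4 map, and the file name mirrors the .pkl from the issue title:

```python
import pickle
import numpy as np

# Hypothetical per-image record mirroring the structure described above.
img_w, img_h = 1200, 800  # resized input (min side 800, max side 1333)
res = {
    # C4 feature map, stride 16: [1, 1024, H/16, W/16]
    "feature": np.zeros((1, 1024, img_h // 16, img_w // 16), dtype=np.float32),
    # [n, 4] boxes surviving thresholding + per-class NMS (n varies per image)
    "box": np.array([[10.0, 20.0, 110.0, 220.0]], dtype=np.float32),
    # image size actually fed to the detector
    "img_scale": (img_w, img_h),
}

# One .pkl per image id, e.g. the file from the issue title.
with open("4891383938.pkl", "wb") as f:
    pickle.dump(res, f)

# Round-trip check.
with open("4891383938.pkl", "rb") as f:
    loaded = pickle.load(f)
print(loaded["feature"].shape)  # (1, 1024, 50, 75)
```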

JCZ404 commented 2 years ago

Besides, I also have a question about the split/train.txt file. In the code, each line seems to contain an image_id and a sentence_id, like [image_id \t sentence_id], but the train.txt in the Flickr30k dataset only has the image_id. So I added a sentence_id of 0 to each line; I'm not sure if this is right.

JCZ404 commented 2 years ago

Finally, when I run my code as above, I still get some strange errors :), so I want to ask whether you could upload a small random subset of the feature maps, say 50 or 100 training samples. Then I could make a train.txt containing only those samples and run your code for debugging. I don't know if this scheme can work. Thank you!

youngfly11 commented 2 years ago

> Besides, I also have a question about the split/train.txt file. In the code, each line seems to contain an image_id and a sentence_id, like [image_id \t sentence_id], but the train.txt in the Flickr30k dataset only has the image_id. So I added a sentence_id of 0 to each line; I'm not sure if this is right.

Yes, we manually pair each image with each of its sentences, so that we can sample by the number of sentences rather than the number of images. It is very simple; you can do it yourself: image_id 0, image_id 1, image_id 2, image_id 3, image_id 4.
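The pairing described above can be sketched in a few lines. This assumes the standard 5 sentences per Flickr30k image and uses made-up example ids:

```python
# Sketch: expand an image-only split file into (image_id, sentence_id) pairs,
# one line per sentence, tab-separated as the code expects.
image_ids = ["1000092795", "10002456"]  # e.g. read from the original train.txt

with open("train.txt", "w") as f:
    for img in image_ids:
        for sent in range(5):           # sentence ids 0..4
            f.write(f"{img}\t{sent}\n")
```

Sampling from this expanded file weights each image by its number of sentences, which is the point of the pairing.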

youngfly11 commented 2 years ago

> Finally, when I run my code as above, I still get some strange errors :), so I want to ask whether you could upload a small random subset of the feature maps, say 50 or 100 training samples. Then I could make a train.txt containing only those samples and run your code for debugging. I don't know if this scheme can work. Thank you!

Sorry, I cannot do that for you. I am off campus doing an internship now, so I cannot access the server at school.