zjukg / Structure-CLIP

[Paper] [AAAI 2024] Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-modal Structured Representations
https://arxiv.org/abs/2305.06152

The replication results do not match the reported results. #5

Closed ZHUXUHAN closed 5 months ago

ZHUXUHAN commented 6 months ago

I used your provided script as follows:

```bash
TRAIN_PATH=train_coco_aug_withneg_adjchange_merge.json
TEST_PATH=visual_genome_attribution_aug.json
model=openai-clip:ViT-B/32

CUDA_VISIBLE_DEVICES=0 python ./model/train.py \
    --project 504_neg1 \
    --name CocoAndVG \
    --model-name=$model \
    --train_path ${TRAIN_PATH} \
    --test_path ${TEST_PATH} \
    --manualSeed 120 \
    --batch_size 64 \
    --lr 5e-6 \
    --epoch 10 \
    --weight_decay 0.1 \
    --knowledge_weight 0.2 \
    --transformer_layer_num 6 \
    --neg_loss_weight 5 \
    --device=cuda
```

[Screenshot: training results, 2024-02-12]

The TextRank1 and ImageRank1 results appear to be inconsistent with the reported performance.

Moreover, the best attribution and relation metrics do not come from the same model checkpoint: the best acc_test_relation is from the 7th-epoch checkpoint, while the best acc_test_attribution is from the 1st-epoch checkpoint. These results seem to exhibit a strong degree of randomness.

[Screenshots: evaluation results, 2024-02-12]

What might be the problem?

BigHyf commented 6 months ago

There are a few possible reasons for the discrepancy between our results. First, there may be some variation between machines, which could lead to your results being lower than ours. Second, since it has been a year, it is no longer entirely clear whether any subtle adjustments were made to the training data at that time. However, in my recollection, the best result did not come from the first epoch.

ZHUXUHAN commented 6 months ago

Thanks for your answer.

  1. Your open-sourced code logic relies on using separate best checkpoints. Can you provide a model and a log to confirm that the reported results come from the same checkpoint? (A sketch of the single-checkpoint selection I have in mind follows this list.)
  2. I trained with the training set, code, and script you provided. I believe it is necessary for you to verify whether the currently provided datasets are the same ones used at the time, rather than asking me to confirm it.
  3. I don't believe the difference between machines would produce such a significant gap. The difference exceeds 2 points compared to the paper, which should not be considered minor.
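
A minimal sketch of what I mean by single-checkpoint selection in point 1; the metric names mirror the log fields, but the numbers and the averaging rule are purely illustrative, not taken from the repo:

```python
# Illustrative only: pick one checkpoint by a combined criterion instead of
# reporting two separate "best" epochs. The values below are hypothetical.
per_epoch_metrics = [
    {"epoch": 1, "acc_test_attribution": 0.82, "acc_test_relation": 0.79},
    {"epoch": 7, "acc_test_attribution": 0.80, "acc_test_relation": 0.84},
]

def combined_score(m):
    # Simple average of the two downstream accuracies; any agreed-upon criterion would do.
    return (m["acc_test_attribution"] + m["acc_test_relation"]) / 2

best = max(per_epoch_metrics, key=combined_score)
print(f"Report epoch {best['epoch']}: "
      f"attribution={best['acc_test_attribution']:.3f}, relation={best['acc_test_relation']:.3f}")
```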

By the way, why does the sum of the CLIP scores in Fig. 1 equal 1?

BigHyf commented 6 months ago

I am not sure which difference is causing the inconsistency in the current results; we will check again and get back to you. The reason the sum of the CLIP scores is 1 is that we normalize them after obtaining the raw CLIP scores.
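
A minimal sketch of this kind of normalization; the exact scheme used for the figure is not spelled out here, so both variants below are only illustrative:

```python
import torch

# Hypothetical raw CLIP similarity scores for the candidate captions of one image.
raw_scores = torch.tensor([28.4, 24.1])

# Variant 1 (assumption): divide by the sum so the displayed scores add up to 1.
sum_normalized = raw_scores / raw_scores.sum()

# Variant 2 (assumption): softmax over the candidates, which also sums to 1.
softmax_normalized = raw_scores.softmax(dim=0)

print(sum_normalized, softmax_normalized)
```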

ZHUXUHAN commented 6 months ago

Thanks for your answer!

BigHyf commented 6 months ago

This is because we were too busy and left the company in a hurry, so we have not yet sorted out the final code. We will check and reply to you as soon as possible after the Chinese New Year.

Thank you very much for your comments!

ZHUXUHAN commented 6 months ago

> This is because we were too busy and left the company in a hurry, so we have not yet sorted out the final code, which seems to be the code from the ablation study experiments. We will check and reply to you as soon as possible after the Chinese New Year.
>
> Thank you very much for your comments!

I understand. Happy New Year, and I wish you a pleasant holiday!

BigHyf commented 6 months ago

Thank you very much for your attention to our work and your valuable suggestions. As this work was completed at the company, I only took the code with me when I left and did not bring any other materials. Recently, I re-implemented our code and ran the following experiments:

(1) With the other parameters kept consistent (batch_size=128), our pre-training recall metrics are basically consistent with the reported results. At the same time, our model achieves 80.6 and 82.7 on the two downstream datasets, respectively, and compared with other models it is still SOTA.

(2) Based on this, we reduced the batch_size to 64. The performance on the downstream datasets is basically close to the reported results, at 82.0 and 83.6 respectively (when we used the same checkpoint at the company, the results were indeed 82.3 and 84.7), but the pre-training metrics are affected to some extent by the smaller batch_size.

In addition, I noticed that you validate once per epoch, while we validate once every 200 iterations as well as at the end of every epoch. In the next few days, we will provide two checkpoints: (1) general_struct_balanced.ckpt and (2) struct_best.ckpt.
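
For reference, a minimal sketch of that validation schedule (every 200 iterations plus once per epoch); `train_one_batch` and `evaluate` are placeholders rather than the actual functions in the released code:

```python
# Sketch of the validation schedule described above; train_one_batch() and evaluate()
# are placeholders, not functions from the Structure-CLIP codebase.
EVAL_EVERY = 200

def train(model, train_loader, num_epochs, train_one_batch, evaluate):
    step = 0
    for epoch in range(num_epochs):
        for batch in train_loader:
            train_one_batch(model, batch)
            step += 1
            if step % EVAL_EVERY == 0:
                evaluate(model, tag=f"step_{step}")   # validate every 200 iterations
        evaluate(model, tag=f"epoch_{epoch}")         # and once more at the end of each epoch
```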