Possible reasons for the discrepancy between our results are as follows. First, there may be some variation between machines, which could cause your results to be lower than mine. Second, since a year has passed, I cannot be certain whether any subtle adjustments were made to the training data at the time. As I recall, however, the best result did not come from the first epoch.
Thanks for your answer.
By the way, why does the sum of the CLIP scores in Fig. 1 equal 1?
I am not sure which difference caused the inconsistency in the current results; we will check again and get back to you. The sum of the CLIP scores is 1 because we normalize them after they are computed.
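For reference, a minimal sketch of what such a normalization could look like, assuming plain sum-normalization over the candidate scores (the exact scheme and the function name here are my assumptions, not taken from the released code):

```python
import torch

def normalize_clip_scores(clip_scores: torch.Tensor) -> torch.Tensor:
    # Rescale raw CLIP similarity scores so the resulting vector sums to 1.
    return clip_scores / clip_scores.sum()

# Example: three raw CLIP scores for the candidates of one image.
raw = torch.tensor([0.31, 0.27, 0.22])
print(normalize_clip_scores(raw))  # tensor([0.3875, 0.3375, 0.2750]); values sum to 1
```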
Thanks for your answer!
We were very busy before leaving the company and left in a hurry, so we have not yet sorted out the final code. We will check and reply to you as soon as possible after the Chinese New Year.
Thank you very much for your comments!
We were very busy before leaving the company and left in a hurry, so we have not yet sorted out the final code; what you are looking at appears to be the code for the ablation-study experiments. We will check and reply to you as soon as possible after the Chinese New Year.
Thank you very much for your comments!
I understand. Happy New Year, and I wish you a pleasant holiday!
Thank you very much for your attention to our work and your valuable suggestions. Since this work was done at the company, I only took the code with me when I left and nothing else. I have recently re-implemented our code and run the following experiments: (1) With all other parameters kept consistent (batch_size=128), our pre-training recall metrics are essentially consistent with the reported results. At the same time, our model reaches 80.6 and 82.7 on the two downstream datasets, respectively, and is still SOTA compared with other models. (2) Starting from that setup, we reduced the batch_size to 64; the downstream results are then close to the reported numbers, 82.0 and 83.6 respectively (with the same checkpoint inside the company, the results were indeed 82.3 and 84.7), but the pre-training metrics are affected to some extent by the smaller batch_size. In addition, I noticed that you validate once per epoch, whereas we validate every 200 iterations as well as at the end of every epoch. In the next few days, we will provide two checkpoints: (1) general_struct_balanced.ckpt and (2) struct_best.ckpt.
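To make the difference in validation frequency concrete, here is a rough sketch of validating every 200 iterations plus once at the end of each epoch; all names (train_with_periodic_validation, train_step, evaluate, save_best) are placeholders for illustration, not functions from the released code:

```python
from typing import Callable, Iterable

def train_with_periodic_validation(
    train_loader: Iterable,
    train_step: Callable,
    evaluate: Callable[[], float],
    save_best: Callable[[], None],
    num_epochs: int,
    val_every_n_iters: int = 200,
) -> float:
    """Validate every `val_every_n_iters` iterations and again at the end of
    each epoch, keeping track of the best validation score seen so far."""
    best_score = float("-inf")
    for _ in range(num_epochs):
        for it, batch in enumerate(train_loader, start=1):
            train_step(batch)
            if it % val_every_n_iters == 0:
                score = evaluate()  # mid-epoch validation
                if score > best_score:
                    best_score = score
                    save_best()
        score = evaluate()  # end-of-epoch validation
        if score > best_score:
            best_score = score
            save_best()
    return best_score
```

Validating only once per epoch can easily miss the mid-epoch point at which the best checkpoint was originally saved, which may partly explain the gap.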
I used the script you provided as follows:
TRAIN_PATH=train_coco_aug_withneg_adjchange_merge.json
TEST_PATH=visual_genome_attribution_aug.json
model=openai-clip:ViT-B/32

CUDA_VISIBLE_DEVICES=0 python ./model/train.py \
    --project 504_neg1 \
    --name CocoAndVG \
    --model-name=$model \
    --train_path ${TRAIN_PATH} \
    --test_path ${TEST_PATH} \
    --manualSeed 120 \
    --batch_size 64 \
    --lr 5e-6 \
    --epoch 10 \
    --weight_decay 0.1 \
    --knowledge_weight 0.2 \
    --transformer_layer_num 6 \
    --neg_loss_weight 5 \
    --device=cuda
The TextRank1 and ImageRank1 metrics appear to be inconsistent with the reported performance.
Moreover, the best attribution and relation metrics do not come from the same checkpoint: the best acc_test_relation comes from the 7th-epoch checkpoint, while the best acc_test_attribution comes from the 1st-epoch checkpoint. These results seem to show a strong degree of randomness.
What might be the problem?
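One generic way to report a single checkpoint when the two metrics peak at different epochs is to select by their average; a minimal sketch of that heuristic (a common practice, not the authors' protocol; acc_test_attribution and acc_test_relation are the metric names from the discussion above):

```python
def pick_checkpoint(results: dict) -> int:
    """Pick the epoch whose checkpoint has the highest average of the
    attribution and relation accuracies.

    `results` maps epoch -> (acc_test_attribution, acc_test_relation)."""
    return max(results, key=lambda epoch: sum(results[epoch]) / 2)

# Hypothetical per-epoch numbers, purely for illustration.
per_epoch = {1: (0.84, 0.79), 7: (0.81, 0.83)}
print(pick_checkpoint(per_epoch))  # 7 (average 0.820 vs. 0.815 for epoch 1)
```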