Zhang-Jing-Xuan closed this issue 1 year ago
Since we train from scratch, the detector is not fully trained at the beginning of training, which makes the captioning accuracy very low. You can use `--num_ground=150` to skip training the caption head for the first 150 epochs.
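For readers wondering what the `--num_ground` warm-up does in practice, here is a minimal sketch of the idea: during the first `num_ground` epochs only the grounding/detection losses are optimized, and the caption loss is added afterwards. The function and argument names below are illustrative, not the actual 3DJCG code.

```python
# Hypothetical sketch of the --num_ground warm-up schedule.
# Assumption: losses are combined additively; names are illustrative.

def compute_loss(epoch, num_ground, ground_loss, caption_loss):
    """Combine losses, skipping the caption term during warm-up."""
    if epoch < num_ground:
        return ground_loss          # detector not yet reliable
    return ground_loss + caption_loss

# e.g. with --num_ground=150:
assert compute_loss(10, 150, 1.0, 2.0) == 1.0   # caption head skipped
assert compute_loss(160, 150, 1.0, 2.0) == 3.0  # joint training
```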
I see.
Hi, I have another question. I have finished training, but how do I evaluate dense captioning, i.e., obtain C@0.25, B-4@0.25, ...? I tried running:
```
python scripts/joint_scripts/caption_eval.py --folder <folder_name> --use_multiview --use_normal --no_nms --force --lang_num_max 1 --eval_caption --use_topdown
```
Then the terminal shows: Does it mean C@0.25=56.5, B-4@0.25=37.8, M@0.25=26.9, and R@0.25=58.1? In addition, the output folder contains a best.txt file, which shows: Does it mean C@0.5=43.5, B-4@0.5=29.3, M@0.5=23.6, and R@0.5=49.7?
Do I understand correctly? If so, why are C@0.25, C@0.5, B-4@0.25, and B-4@0.5 much lower than in the original paper? If not, how do I evaluate dense captioning correctly?
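For context on what metrics like C@0.5 mean, here is a sketch of the common "m@kIoU" convention for dense captioning (as introduced in Scan2Cap, not necessarily the exact code in this repo): each ground-truth box contributes its caption score only when its matched predicted box has IoU at or above the threshold k, otherwise it contributes 0, and the result is averaged over all ground-truth boxes. The function below is an assumption-laden illustration, not the repo's implementation.

```python
# Sketch of an IoU-thresholded captioning metric (m@kIoU style).
# scores[i]: caption metric (e.g. CIDEr) for GT box i's matched prediction.
# ious[i]:   IoU between GT box i and its matched predicted box.

def metric_at_iou(scores, ious, k=0.5):
    """Average caption score over GT boxes, zeroing matches below IoU k."""
    kept = [s if iou >= k else 0.0 for s, iou in zip(scores, ious)]
    return sum(kept) / len(kept)

# Example: two matches above the threshold, one miss below it.
# (0.8 + 0.0 + 0.9) / 3 ≈ 0.567
print(metric_at_iou([0.8, 0.6, 0.9], [0.7, 0.3, 0.55], k=0.5))
```

This is why a detector that localizes poorly drags down C@0.5 even when the generated captions themselves are fine: low-IoU matches are counted as zeros.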
We retrained the model, and there was indeed a problem with the captioning accuracy. This appears to be a bug in the released codebase; we will compare the released code against our previous training files and fix it as soon as possible.
We have changed the data augmentation strategy of the language module for visual grounding, and now the captioning accuracy should be okay. We train our joint framework with `--num_ground=120`. This number can be set a bit larger.
Joint training script (the screenshot above uses `num_ground_epoch=120`):
```
python scripts/joint_scripts/train_3djcg.py --use_multiview --use_normal --use_topdown --num_graph_steps 0 --num_locals 20 --batch_size 10 --epoch 200 --tag joint_train-vg150 --gpu 4 --verbose 50 --val_step 1000 --lang_num_max 8 --coslr --lr 0.002 --num_ground_epoch 150
```
Training a captioning model using a pretrained model:
```
python scripts/captioning_scripts/train_3djcg_c.py --use_multiview --use_normal --use_topdown --num_graph_steps 0 --num_locals 20 --batch_size 8 --epoch 200 --tag github_c_pretrain --gpu 0 --verbose 50 --val_step 500 --lang_num_max 8 --coslr --lr 0.001 --use_pretrained outputs/exp_joint/2022-07-14_22-25-40_JOINT_TRAIN_GITHUB-VG160 --no_detection
```
Thank you for your quick reply. I will try again.
I evaluated 3DJCG on the test set for dense captioning here; the results may serve as a reference.
ok. Thank you for your reply.
Hi, I'm really interested in this work, but why is the captioning accuracy so low during training? Thanks.