Zhang-Jing-Xuan closed this issue 1 year ago
Since we train from scratch, the detector is not fully trained at the beginning of training, which makes the captioning accuracy very low. You can use `--num_ground=150` to skip training the caption head for the first 150 epochs.
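For readers wondering what the `--num_ground` warm-up does in practice, here is a minimal sketch of the idea: during the first `num_ground` epochs only the grounding/detection losses are optimized, and the caption loss is added afterwards. The function and argument names below are illustrative, not the actual 3DJCG code.

```python
# Hypothetical sketch of the --num_ground warm-up schedule.
# Assumption: losses are combined additively; names are illustrative.

def compute_loss(epoch, num_ground, ground_loss, caption_loss):
    """Combine losses, skipping the caption term during warm-up."""
    if epoch < num_ground:
        return ground_loss          # detector not yet reliable
    return ground_loss + caption_loss

# e.g. with --num_ground=150:
assert compute_loss(10, 150, 1.0, 2.0) == 1.0   # caption head skipped
assert compute_loss(160, 150, 1.0, 2.0) == 3.0  # joint training
```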
I see.
Hi, I have another question. I have finished training, but how do I evaluate dense captioning, i.e., obtain C@0.25, B-4@0.25, ...? I tried running:
```
python scripts/joint_scripts/caption_eval.py --folder <folder_name> --use_multiview --use_normal --no_nms --force --lang_num_max 1 --eval_caption --use_topdown
```
Then the terminal shows: Does it mean C@0.25=56.5, B-4@0.25=37.8, M@0.25=26.9, and R@0.25=58.1? In addition, the output folder contains a best.txt file, which shows: Does it mean C@0.5=43.5, B-4@0.5=29.3, M@0.5=23.6, and R@0.5=49.7?
Do I understand correctly? If so, why are C@0.25, C@0.5, B-4@0.25, and B-4@0.5 much lower than in the original paper? If not, how do I evaluate dense captioning correctly?
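For context on what metrics like C@0.5 mean, here is a sketch of the common "m@kIoU" convention for dense captioning (as introduced in Scan2Cap, not necessarily the exact code in this repo): each ground-truth box contributes its caption score only when its matched predicted box has IoU at or above the threshold k, otherwise it contributes 0, and the result is averaged over all ground-truth boxes. The function below is an assumption-laden illustration, not the repo's implementation.

```python
# Sketch of an IoU-thresholded captioning metric (m@kIoU style).
# scores[i]: caption metric (e.g. CIDEr) for GT box i's matched prediction.
# ious[i]:   IoU between GT box i and its matched predicted box.

def metric_at_iou(scores, ious, k=0.5):
    """Average caption score over GT boxes, zeroing matches below IoU k."""
    kept = [s if iou >= k else 0.0 for s, iou in zip(scores, ious)]
    return sum(kept) / len(kept)

# Example: two matches above the threshold, one miss below it.
# (0.8 + 0.0 + 0.9) / 3 ≈ 0.567
print(metric_at_iou([0.8, 0.6, 0.9], [0.7, 0.3, 0.55], k=0.5))
```

This is why a detector that localizes poorly drags down C@0.5 even when the generated captions themselves are fine: low-IoU matches are counted as zeros.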
We retrained the model, and there was indeed a problem with the captioning accuracy. This appears to be a bug in the released codebase; we will compare the released code against our previous training files and fix it as soon as possible.
We have changed the data augmentation strategy of the language module for visual grounding, and now the captioning accuracy should be okay. We train our joint framework with `--num_ground=120`. This number can be set a bit larger.
Joint training script (the screenshot above uses `num_ground_epoch=120`):
```
python scripts/joint_scripts/train_3djcg.py --use_multiview --use_normal --use_topdown --num_graph_steps 0 --num_locals 20 --batch_size 10 --epoch 200 --tag joint_train-vg150 --gpu 4 --verbose 50 --val_step 1000 --lang_num_max 8 --coslr --lr 0.002 --num_ground_epoch 150
```
Training a captioning model using a pretrained model:
```
python scripts/captioning_scripts/train_3djcg_c.py --use_multiview --use_normal --use_topdown --num_graph_steps 0 --num_locals 20 --batch_size 8 --epoch 200 --tag github_c_pretrain --gpu 0 --verbose 50 --val_step 500 --lang_num_max 8 --coslr --lr 0.001 --use_pretrained outputs/exp_joint/2022-07-14_22-25-40_JOINT_TRAIN_GITHUB-VG160 --no_detection
```
Thank you for your quick reply. I will try again.
I evaluated 3DJCG on the test set for dense captioning here; the results may serve as a reference.
ok. Thank you for your reply.
Hi, I'm really interested in this work, but why is the captioning accuracy so low during training? Thanks.