zlccccc / 3DVL_Codebase

[CVPR2022 Oral] 3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds

Caption accuracy is low #4

Closed Zhang-Jing-Xuan closed 1 year ago

Zhang-Jing-Xuan commented 2 years ago

(screenshot: caption accuracy curve during training)

Hi, I'm really interested in this work, but why is the caption accuracy so low during training? Thanks.

zlccccc commented 2 years ago

Since we train from scratch, the detector is not fully trained at the beginning of training, which makes the captioning accuracy very low. You can use `--num_ground_epoch 150` to skip training of the caption head for the first 150 epochs.
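For anyone hitting the same confusion: conceptually, this schedule just excludes the caption loss from the total loss until the detection/grounding branch has warmed up. A minimal runnable sketch of the idea, with toy modules and hypothetical names (not the actual 3DVL_Codebase training loop):

```python
# Sketch of the warm-up schedule described above: the caption loss only
# joins the total loss after the first `num_ground_epoch` epochs, so the
# detector can warm up first. Toy modules and losses; hypothetical names.
import torch

det_branch = torch.nn.Linear(8, 8)   # stands in for detection + grounding
cap_head = torch.nn.Linear(8, 8)     # stands in for the caption head
optimizer = torch.optim.Adam(
    list(det_branch.parameters()) + list(cap_head.parameters()), lr=1e-3
)

num_ground_epoch = 150  # grounding/detection-only warm-up
total_epochs = 200

for epoch in range(total_epochs):
    x = torch.randn(4, 8)                  # toy batch
    loss = det_branch(x).pow(2).mean()     # detection + grounding losses
    if epoch >= num_ground_epoch:
        loss = loss + cap_head(x).pow(2).mean()  # caption loss joins late
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```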

Zhang-Jing-Xuan commented 2 years ago

I see.

Zhang-Jing-Xuan commented 2 years ago

Hi, I have another question. I have finished training, but how do I evaluate dense captioning, i.e., obtain C@0.25, B-4@0.25, ...? I tried to run the command:

```bash
python scripts/joint_scripts/caption_eval.py --folder <folder_name> --use_multiview --use_normal --no_nms --force --lang_num_max 1 --eval_caption --use_topdown
```

Then, in the terminal, it shows: (screenshot of terminal output) Does it mean C@0.25=56.5, B-4@0.25=37.8, M@0.25=26.9, and R@0.25=58.1? In addition, in the output folder there is a best.txt file. It shows: (screenshot of best.txt) Does it mean C@0.5=43.5, B-4@0.5=29.3, M@0.5=23.6, and R@0.5=49.7?

Do I understand correctly? If so, why are C@0.25, C@0.5, B-4@0.25, and B-4@0.5 much lower than in the original paper? If not, how do I evaluate dense captioning correctly?
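(For context: the C@k / B-4@k / M@k / R@k numbers follow the m@kIoU convention introduced by Scan2Cap, where a caption's score (CIDEr, BLEU-4, METEOR, ROUGE-L) only counts when its predicted box overlaps the ground-truth box with IoU ≥ k, and contributes 0 otherwise, averaged over all ground-truth objects. A minimal sketch with toy inputs, not the repo's actual eval code:)

```python
# Sketch of the m@kIoU dense-captioning metric: a caption's score counts
# only when its predicted box has IoU >= k with the ground-truth box.
# Hypothetical inputs; not the repo's evaluation code.
import numpy as np

def aabb_iou(box_a, box_b):
    """IoU of two axis-aligned 3D boxes (xmin, ymin, zmin, xmax, ymax, zmax)."""
    lo = np.maximum(box_a[:3], box_b[:3])
    hi = np.minimum(box_a[3:], box_b[3:])
    inter = np.prod(np.clip(hi - lo, 0, None))
    vol_a = np.prod(box_a[3:] - box_a[:3])
    vol_b = np.prod(box_b[3:] - box_b[:3])
    return inter / (vol_a + vol_b - inter)

def m_at_k_iou(caption_scores, pred_boxes, gt_boxes, k=0.5):
    """Average caption metric over all GT objects, zeroing captions with IoU < k."""
    total = 0.0
    for score, pb, gb in zip(caption_scores, pred_boxes, gt_boxes):
        if aabb_iou(np.asarray(pb, float), np.asarray(gb, float)) >= k:
            total += score
    return total / len(gt_boxes)

# toy usage: two objects, one well-localized, one not
scores = [0.8, 0.6]                             # per-caption CIDEr, say
pred = [(0, 0, 0, 1, 1, 1), (2, 2, 2, 3, 3, 3)]
gt   = [(0, 0, 0, 1, 1, 1), (4, 4, 4, 5, 5, 5)]
print(m_at_k_iou(scores, pred, gt, k=0.5))      # 0.4 = (0.8 + 0) / 2
```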

zlccccc commented 2 years ago

We retrained the model and there was indeed a problem with the captioning accuracy. This appears to be a bug in the released codebase; we will compare the released code with our previous training files and fix it as soon as possible.

zlccccc commented 2 years ago

We have changed the data augmentation strategy of the language module for visual grounding, and the captioning accuracy should now be okay. (screenshot of results) We train our joint framework with `--num_ground_epoch 120`. This value can be set a bit larger.
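(For context on why augmentation interacts with the language module: geometric augmentations such as flips and rotations change spatial relations in the scene, so they can contradict spatial words in a grounding query. A hypothetical sketch of the kind of augmentation involved, not the repo's exact strategy:)

```python
# Sketch of why language-aware augmentation matters: mirroring a scene along
# the YZ plane flips the geometry, which can contradict spatial words like
# "left"/"right" in a grounding query. Hypothetical code, not the repo's.
import numpy as np

rng = np.random.default_rng(0)

def augment_scene(points, flip_prob=0.5):
    """Randomly mirror a point cloud (N, 3) along the YZ plane and rotate about z."""
    flipped = False
    if rng.random() < flip_prob:
        points = points * np.array([-1.0, 1.0, 1.0])  # mirror x
        flipped = True
    theta = rng.uniform(-np.pi / 18, np.pi / 18)      # small z-rotation
    rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                    [np.sin(theta),  np.cos(theta), 0.0],
                    [0.0, 0.0, 1.0]])
    return points @ rot.T, flipped

points = rng.standard_normal((1024, 3))
aug_points, flipped = augment_scene(points)
# If `flipped` is True, a query such as "the chair to the left of the desk"
# may no longer match the mirrored scene, so the grounding branch must
# either skip such samples or handle the language consistently.
```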

zlccccc commented 2 years ago

Joint training script (the screenshot above uses `--num_ground_epoch 120`):

```bash
python scripts/joint_scripts/train_3djcg.py --use_multiview --use_normal --use_topdown --num_graph_steps 0 --num_locals 20 --batch_size 10 --epoch 200 --tag joint_train-vg150 --gpu 4 --verbose 50 --val_step 1000 --lang_num_max 8 --coslr --lr 0.002 --num_ground_epoch 150
```

Training a captioning model from a pretrained model:

```bash
python scripts/captioning_scripts/train_3djcg_c.py --use_multiview --use_normal --use_topdown --num_graph_steps 0 --num_locals 20 --batch_size 8 --epoch 200 --tag github_c_pretrain --gpu 0 --verbose 50 --val_step 500 --lang_num_max 8 --coslr --lr 0.001 --use_pretrained outputs/exp_joint/2022-07-14_22-25-40_JOINT_TRAIN_GITHUB-VG160 --no_detection
```
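(A rough idea of what `--use_pretrained` together with `--no_detection` typically amounts to: warm-start from the joint checkpoint and freeze the detector so only the caption head keeps learning. A hypothetical sketch with toy module names, not the actual 3DVL_Codebase loader:)

```python
# Sketch of the usual "warm-start and freeze" pattern behind flags like
# --use_pretrained / --no_detection. Toy modules and an illustrative
# checkpoint path; not the actual 3DVL_Codebase code.
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.detector = nn.Linear(8, 8)      # stands in for the detector
        self.caption_head = nn.Linear(8, 8)  # stands in for the caption head

model = ToyModel()

# load weights from a previous joint-training run (path is illustrative)
ckpt_path = "outputs/exp_joint/<run_name>/model.pth"
# state = torch.load(ckpt_path, map_location="cpu")
# model.load_state_dict(state, strict=False)  # tolerate missing/extra keys

# freeze the detection branch so only the caption head is optimized
for p in model.detector.parameters():
    p.requires_grad = False

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
```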

Zhang-Jing-Xuan commented 2 years ago

Thank you for your quick reply. I will try again.

zlccccc commented 1 year ago

I reported the results of 3DJCG on the test set for dense captioning here; they may provide a useful reference.

Zhang-Jing-Xuan commented 1 year ago

OK. Thank you for your reply.