wjun0830 / QD-DETR

Official PyTorch repository for "QD-DETR: Query-Dependent Video Representation for Moment Retrieval and Highlight Detection" (CVPR 2023 paper)
https://arxiv.org/abs/2303.13874

About training on Charades. #17

Closed EdenGabriel closed 1 year ago

EdenGabriel commented 1 year ago

Excuse me, I couldn't reproduce the results reported in the paper on the Charades dataset, even after setting the parameters according to issue #1. The C3D features I used are from https://drive.google.com/file/d/1CcMwae55Tuve_Ksrp5kONycyR1bVcX8D/view. Furthermore, for the SlowFast & CLIP features, I modified the code as follows:

# start_end_dataset.py
if self.dset_name == 'charades':
    model_inputs["saliency_pos_labels"], model_inputs["saliency_neg_labels"], model_inputs["saliency_all_labels"] = \
        self.get_saliency_labels_sub_as_query(meta["relevant_windows"][0], ctx_l)  # Charades has only one GT window per query
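
For context, get_saliency_labels_sub_as_query in the Moment-DETR/QD-DETR codebase treats the single GT window as the query-relevant span: it samples positive clip indices inside the window, negatives outside, and builds a binary per-clip saliency array. The sketch below is a simplified reconstruction (written as a free function instead of a dataset method, with clip_len and max_n as assumed parameters, and edge cases such as windows covering the whole video ignored), not the repository's exact code:

import random
import numpy as np

def get_saliency_labels_sub_as_query(gt_window, ctx_l, clip_len=1, max_n=2):
    # Map the GT window (in seconds) to clip indices.
    gt_st = int(gt_window[0] / clip_len)
    gt_ed = max(0, min(int(gt_window[1] / clip_len), ctx_l) - 1)
    if gt_st > gt_ed:
        gt_st = gt_ed

    # Positives are sampled inside the GT window ...
    if gt_st != gt_ed:
        pos_clip_indices = random.sample(range(gt_st, gt_ed + 1), k=max_n)
    else:
        pos_clip_indices = [gt_st] * max_n

    # ... negatives from outside it.
    neg_pool = list(range(0, gt_st)) + list(range(gt_ed + 1, ctx_l))
    neg_clip_indices = random.sample(neg_pool, k=max_n)

    # Binary per-clip saliency labels: 1 inside the window, 0 elsewhere.
    score_array = np.zeros(ctx_l)
    score_array[gt_st:gt_ed + 1] = 1
    return pos_clip_indices, neg_clip_indices, score_array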

and in eval.py, I modified the mk_gt_scores function:

def mk_gt_scores(gt_data, clip_length=1):
    """Build per-clip GT saliency scores from the single relevant window."""
    num_clips = int(gt_data["duration"] / clip_length)
    saliency_scores_full_video = np.zeros((num_clips, 3))
    relevant_clip_ids = np.arange(int(gt_data["relevant_windows"][0][0]),
                                  int(gt_data["relevant_windows"][0][1]))
    # FIXME: guard against annotation windows that extend past the video duration.
    relevant_clip_ids = relevant_clip_ids[relevant_clip_ids < num_clips]
    # Charades has no annotator saliency scores, so relevant clips are simply set to 1.
    saliency_scores_relevant_clips = np.ones((relevant_clip_ids.shape[0], 3))  # (#relevant_clip_ids, 3)
    saliency_scores_full_video[relevant_clip_ids] = saliency_scores_relevant_clips
    return saliency_scores_full_video  # (#clips_in_video, 3), binary here (the original QVHighlights scores are in [0, 4])
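
A quick sanity check of the modified function with toy inputs (hypothetical values, just to show the label layout):

gt_data = {"duration": 10.0, "relevant_windows": [[2.0, 5.0]]}
scores = mk_gt_scores(gt_data, clip_length=1)
print(scores.shape)  # (10, 3)
print(scores[:, 0])  # [0. 0. 1. 1. 1. 0. 0. 0. 0. 0.] -> clips 2-4 are relevant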

Actually, I found that if I don't modify the mk_gt_scores code and simply comment out the line pred_saliency_scores=saliency_scores[idx] in inference.py, it produces the same result.

cur_query_pred = dict(
    qid=meta["qid"],
    query=meta["query"],
    vid=meta["vid"],
    pred_relevant_windows=cur_ranked_preds,
    # pred_saliency_scores=saliency_scores[idx]
)
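
Presumably this is because the evaluation only computes highlight-detection metrics when pred_saliency_scores is present in each prediction entry; roughly like this (function and variable names are assumptions, modeled on a Moment-DETR-style standalone_eval, not verified against this repo):

def eval_submission(submission, ground_truth):
    # MR metrics are always computed.
    metrics = {"MR": eval_moment_retrieval(submission, ground_truth)}
    # HD metrics are gated on the key, so dropping it skips them entirely.
    if "pred_saliency_scores" in submission[0]:
        metrics["HD"] = eval_highlight(submission, ground_truth)
    return metrics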

So, could you help me reproduce the results reported in the paper on the Charades dataset? Thanks.

EdenGabriel commented 1 year ago

Oh, I discovered that when I tried VGG features (obtained from UMT), the model also failed to learn effectively. It only worked with SlowFast + CLIP features.

EdenGabriel commented 1 year ago

The configuration is as follows:

dset_name=charades
ctx_mode=video_tef
v_feat_types=c3d
t_feat_type=clip
results_root=results
exp_id=exp
# feat_root, train_path, eval_path, eval_split_name, t_feat_dir, and
# t_feat_dim are assumed to be set elsewhere in the full script.
v_feat_dirs=()
v_feat_dim=0

if [[ ${v_feat_types} == *"c3d"* ]]; then
  v_feat_dirs+=(${feat_root}/charades_c3d_raw)
  (( v_feat_dim += 1024 ))
fi
if [[ ${v_feat_types} == *"rgb"* ]]; then
  v_feat_dirs+=(${feat_root}/charades_rgb_opt/rgb_features)
  (( v_feat_dim += 4096 ))
fi

bsz=32
n_epoch=100
lr_drop=40
lr=0.0002
lw_saliency=4.0
max_v_l=-1
max_q_l=32
clip_length=1

PYTHONPATH=$PYTHONPATH:. python qd_detr/train.py \
--dset_name ${dset_name} \
--ctx_mode ${ctx_mode} \
--train_path ${train_path} \
--eval_path ${eval_path} \
--eval_split_name ${eval_split_name} \
--v_feat_dirs ${v_feat_dirs[@]} \
--v_feat_dim ${v_feat_dim} \
--t_feat_dir ${t_feat_dir} \
--t_feat_dim ${t_feat_dim} \
--bsz ${bsz} \
--n_epoch ${n_epoch} \
--lr_drop ${lr_drop} \
--lr ${lr} \
--lw_saliency ${lw_saliency} \
--max_v_l ${max_v_l} \
--max_q_l ${max_q_l} \
--clip_length ${clip_length} \
--results_root ${results_root}_charades \
--exp_id ${exp_id} \
${@:1}

EdenGabriel commented 1 year ago

Sorry for bothering you. I have resolved the issue: I had overlooked the fact that different features correspond to different clip lengths.
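
For anyone who hits the same problem: the clip_length passed to train.py has to match the temporal stride of the pre-extracted features. A toy illustration (numbers are hypothetical, not the official settings):

# Hypothetical numbers; derive clip_length from the feature extraction stride.
fps = 24                  # frame rate the features were extracted at
stride_frames = 4         # frames between consecutive feature vectors
clip_length = stride_frames / fps  # seconds spanned by one feature "clip"

# Features at 1 vector per second (e.g. SlowFast+CLIP here) want clip_length=1,
# while denser features want a proportionally smaller value; a mismatch
# misaligns the saliency / window labels with the feature sequence.
print(clip_length)  # ~0.1667 for this hypothetical 4-frame stride at 24 fps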

wjun0830 commented 1 year ago

Great! Thanks.

wjun0830 commented 1 year ago

We are very sorry for the inconvenience. The Charades-STA experiments reported with C3D features were actually conducted with I3D features and compared against I3D benchmarking tables. The features are provided here, from VSLNet.