wjun0830 / QD-DETR

Official pytorch repository for "QD-DETR : Query-Dependent Video Representation for Moment Retrieval and Highlight Detection" (CVPR 2023 Paper)
https://arxiv.org/abs/2303.13874

run_on_video error #3

Closed wangzhilong closed 1 year ago

wangzhilong commented 1 year ago

In run_on_video/model_utils.py, the import statement for MomentDETR is incorrect: it reads `from qd_detr.model import build_transformer, build_position_encoding, MomentDETR`, but MomentDETR does not exist in qd_detr.model. The class defined there is QDDETR, which should be used instead.

But when I use QDDETR instead of MomentDETR, aliasing it so the rest of the file works unchanged (as shown below), I hit a new error. How can I fix it?
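The modified import in run_on_video/model_utils.py:

```python
from qd_detr.model import build_transformer, build_position_encoding, QDDETR as MomentDETR
```

Running run.py then fails with: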

RuntimeError: Error(s) in loading state_dict for QDDETR:
        Missing key(s) in state_dict: "global_rep_token", "global_rep_pos", "transformer.t2v_encoder.layers.0.self_attn.in_proj_weight", "transformer.t2v_encoder.layers.0.self_attn.in_proj_bias", "transformer.t2v_encoder.layers.0.self_attn.out_proj.weight", "transformer.t2v_encoder.layers.0.self_attn.out_proj.bias", "transformer.t2v_encoder.layers.0.linear1.weight", "transformer.t2v_encoder.layers.0.linear1.bias", "transformer.t2v_encoder.layers.0.linear2.weight", "transformer.t2v_encoder.layers.0.linear2.bias", "transformer.t2v_encoder.layers.0.norm1.weight", "transformer.t2v_encoder.layers.0.norm1.bias", "transformer.t2v_encoder.layers.0.norm2.weight", "transformer.t2v_encoder.layers.0.norm2.bias", "transformer.t2v_encoder.layers.0.activation.weight", "transformer.t2v_encoder.layers.1.self_attn.in_proj_weight", "transformer.t2v_encoder.layers.1.self_attn.in_proj_bias", "transformer.t2v_encoder.layers.1.self_attn.out_proj.weight", "transformer.t2v_encoder.layers.1.self_attn.out_proj.bias", "transformer.t2v_encoder.layers.1.linear1.weight", "transformer.t2v_encoder.layers.1.linear1.bias", "transformer.t2v_encoder.layers.1.linear2.weight", "transformer.t2v_encoder.layers.1.linear2.bias", "transformer.t2v_encoder.layers.1.norm1.weight", "transformer.t2v_encoder.layers.1.norm1.bias", "transformer.t2v_encoder.layers.1.norm2.weight", "transformer.t2v_encoder.layers.1.norm2.bias", "transformer.t2v_encoder.layers.1.activation.weight", "transformer.encoder.layers.0.activation.weight", "transformer.encoder.layers.1.activation.weight", "transformer.decoder.layers.0.sa_qcontent_proj.weight", "transformer.decoder.layers.0.sa_qcontent_proj.bias", "transformer.decoder.layers.0.sa_qpos_proj.weight", "transformer.decoder.layers.0.sa_qpos_proj.bias", "transformer.decoder.layers.0.sa_kcontent_proj.weight", "transformer.decoder.layers.0.sa_kcontent_proj.bias", "transformer.decoder.layers.0.sa_kpos_proj.weight", "transformer.decoder.layers.0.sa_kpos_proj.bias", "transformer.decoder.layers.0.sa_v_proj.weight", "transformer.decoder.layers.0.sa_v_proj.bias", "transformer.decoder.layers.0.ca_qcontent_proj.weight", "transformer.decoder.layers.0.ca_qcontent_proj.bias", "transformer.decoder.layers.0.ca_qpos_proj.weight", "transformer.decoder.layers.0.ca_qpos_proj.bias", "transformer.decoder.layers.0.ca_kcontent_proj.weight", "transformer.decoder.layers.0.ca_kcontent_proj.bias", "transformer.decoder.layers.0.ca_kpos_proj.weight", "transformer.decoder.layers.0.ca_kpos_proj.bias", "transformer.decoder.layers.0.ca_v_proj.weight", "transformer.decoder.layers.0.ca_v_proj.bias", "transformer.decoder.layers.0.ca_qpos_sine_proj.weight", "transformer.decoder.layers.0.ca_qpos_sine_proj.bias", "transformer.decoder.layers.0.cross_attn.out_proj.weight", "transformer.decoder.layers.0.cross_attn.out_proj.bias", "transformer.decoder.layers.0.activation.weight", "transformer.decoder.layers.1.sa_qcontent_proj.weight", "transformer.decoder.layers.1.sa_qcontent_proj.bias", "transformer.decoder.layers.1.sa_qpos_proj.weight", "transformer.decoder.layers.1.sa_qpos_proj.bias", "transformer.decoder.layers.1.sa_kcontent_proj.weight", "transformer.decoder.layers.1.sa_kcontent_proj.bias", "transformer.decoder.layers.1.sa_kpos_proj.weight", "transformer.decoder.layers.1.sa_kpos_proj.bias", "transformer.decoder.layers.1.sa_v_proj.weight", "transformer.decoder.layers.1.sa_v_proj.bias", "transformer.decoder.layers.1.ca_qcontent_proj.weight", "transformer.decoder.layers.1.ca_qcontent_proj.bias", 
"transformer.decoder.layers.1.ca_kcontent_proj.weight", "transformer.decoder.layers.1.ca_kcontent_proj.bias", "transformer.decoder.layers.1.ca_kpos_proj.weight", "transformer.decoder.layers.1.ca_kpos_proj.bias", "transformer.decoder.layers.1.ca_v_proj.weight", "transformer.decoder.layers.1.ca_v_proj.bias", "transformer.decoder.layers.1.ca_qpos_sine_proj.weight", "transformer.decoder.layers.1.ca_qpos_sine_proj.bias", "transformer.decoder.layers.1.cross_attn.out_proj.weight", "transformer.decoder.layers.1.cross_attn.out_proj.bias", "transformer.decoder.layers.1.activation.weight", "transformer.decoder.query_scale.layers.0.weight", "transformer.decoder.query_scale.layers.0.bias", "transformer.decoder.query_scale.layers.1.weight", "transformer.decoder.query_scale.layers.1.bias", "transformer.decoder.ref_point_head.layers.0.weight", "transformer.decoder.ref_point_head.layers.0.bias", "transformer.decoder.ref_point_head.layers.1.weight", "transformer.decoder.ref_point_head.layers.1.bias", "transformer.decoder.bbox_embed.layers.0.weight", "transformer.decoder.bbox_embed.layers.0.bias", "transformer.decoder.bbox_embed.layers.1.weight", "transformer.decoder.bbox_embed.layers.1.bias", "transformer.decoder.bbox_embed.layers.2.weight", "transformer.decoder.bbox_embed.layers.2.bias", "transformer.decoder.ref_anchor_head.layers.0.weight", "transformer.decoder.ref_anchor_head.layers.0.bias", "transformer.decoder.ref_anchor_head.layers.1.weight", "transformer.decoder.ref_anchor_head.layers.1.bias", "saliency_proj1.weight", "saliency_proj1.bias", "saliency_proj2.weight", "saliency_proj2.bias".
        Unexpected key(s) in state_dict: "saliency_proj.weight", "saliency_proj.bias", "transformer.decoder.layers.0.multihead_attn.in_proj_weight", "transformer.decoder.layers.0.multihead_attn.in_proj_bias", "transformer.decoder.layers.0.multihead_attn.out_proj.weight", "transformer.decoder.layers.0.multihead_attn.out_proj.bias", "transformer.decoder.layers.0.self_attn.in_proj_weight", "transformer.decoder.layers.0.self_attn.in_proj_bias", "transformer.decoder.layers.1.multihead_attn.in_proj_weight", "transformer.decoder.layers.1.multihead_attn.in_proj_bias", "transformer.decoder.layers.1.multihead_attn.out_proj.weight", "transformer.decoder.layers.1.multihead_attn.out_proj.bias", "transformer.decoder.layers.1.self_attn.in_proj_weight", "transformer.decoder.layers.1.self_attn.in_proj_bias".
        size mismatch for query_embed.weight: copying a param with shape torch.Size([10, 256]) from checkpoint, the shape in current model is torch.Size([10, 2]).
wjun0830 commented 1 year ago

Sorry for the inconvenience. I think the pretrained weights you are using are from Moment-DETR, not from our GitHub repository.

Can you try again with the weights provided in our repository?

Video-only weights: https://www.dropbox.com/s/yygwyljw8514d9r/videoonly.ckpt?dl=0
V + A weights: https://www.dropbox.com/s/hsc7jk21ppqasjt/videoaudio.ckpt?dl=0
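If it helps, here is a quick way to check what a downloaded checkpoint expects before loading it (the "model" key holding the state_dict is an assumption about the checkpoint layout; adjust if it differs):

```python
import torch

# Load the downloaded checkpoint on CPU and inspect its contents.
ckpt = torch.load("videoonly.ckpt", map_location="cpu")
print(ckpt.keys())

# Assuming the weights live under the "model" key, this prints the video
# feature dimension (LayerNorm size) the checkpoint was trained with.
print(ckpt["model"]["input_vid_proj.0.LayerNorm.weight"].shape)
```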

wangzhilong commented 1 year ago

Thank you for your reply. I used videoaudio.ckpt and got this error:

  File "/usr/local/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 1483, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for QDDETR:
        size mismatch for input_vid_proj.0.LayerNorm.weight: copying a param with shape torch.Size([4868]) from checkpoint, the shape in current model is torch.Size([2818]).
        size mismatch for input_vid_proj.0.LayerNorm.bias: copying a param with shape torch.Size([4868]) from checkpoint, the shape in current model is torch.Size([2818]).
        size mismatch for input_vid_proj.0.net.1.weight: copying a param with shape torch.Size([256, 4868]) from checkpoint, the shape in current model is torch.Size([256, 2818]).
wjun0830 commented 1 year ago

Can you try with the checkpoint trained only with video? To use the video+audio checkpoint, you may have to change some code and extract audio features for your dataset.
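For reference, the mismatch above means the video+audio checkpoint's input_vid_proj expects 4868-dim inputs, i.e. 4868 - 2818 = 2050 extra dimensions coming from audio, on top of the 2818-dim video features used in the video-only setup. A rough, purely illustrative sketch (only the dimensions are read off the errors in this thread; the audio extractor and exact layout are assumptions):

```python
import torch

num_clips = 75  # matches the clip count in the errors above

video_feat = torch.randn(1, num_clips, 2818)  # video features, as in the video-only setup
audio_feat = torch.randn(1, num_clips, 2050)  # assumed audio features (4868 - 2818 = 2050)

src = torch.cat([video_feat, audio_feat], dim=-1)
print(src.shape)  # torch.Size([1, 75, 4868]) -- the size videoaudio.ckpt expects
```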

wangzhilong commented 1 year ago

I have tried the checkpoint trained only with video (videoonly.ckpt), but the error still happens. The shapes of the model and the weights do not match.

  File "/usr/local/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib64/python3.6/site-packages/torch/nn/modules/normalization.py", line 190, in forward
    input, self.normalized_shape, self.weight, self.bias, self.eps)
  File "/usr/local/lib64/python3.6/site-packages/torch/nn/functional.py", line 2347, in layer_norm
    return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: Given normalized_shape=[2818], expected input with shape [*, 2818], but got input of size[1, 75, 514]
wjun0830 commented 1 year ago

If you look at the provided training script, the feature dimension should be 2304 (SlowFast) + 512 (CLIP). It looks like you only have CLIP features.
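To make the arithmetic explicit, here is a minimal sketch of the expected video input layout. It assumes (as in Moment-DETR's run_on_video pipeline) that 2-dim temporal endpoint features are appended to the per-clip features, which is consistent with the 2818 vs. 514 shapes in the error above:

```python
import torch

num_clips = 75  # from the error: input of size [1, 75, 514]

# Assumed per-clip features; the extraction pipeline itself is not shown.
slowfast_feat = torch.randn(1, num_clips, 2304)  # SlowFast video features
clip_feat = torch.randn(1, num_clips, 512)       # CLIP video features
tef = torch.randn(1, num_clips, 2)               # temporal endpoint features (assumed)

src_vid = torch.cat([slowfast_feat, clip_feat, tef], dim=-1)
print(src_vid.shape)  # torch.Size([1, 75, 2818]) -- what the checkpoint's LayerNorm expects
# With CLIP features only: 512 + 2 = 514, which is exactly the shape in the error.
```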

nguyenquyem99dt commented 1 year ago

I also get an error when running run_on_video/run.py. I have tried both videoonly.ckpt (https://www.dropbox.com/s/yygwyljw8514d9r/videoonly.ckpt?dl=0) and video_model_best.ckpt (from run_on_video/qd_detr_ckpt/).

Error logs are below:

File "run_on_video/run.py", line 126, in run_example() File "run_on_video/run.py", line 109, in run_example predictions = qd_detr_predictor.localize_moment( File "/home/ubuntu/projects/moment-retrieval/envs/moment-detr/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context return func(args, kwargs) File "run_on_video/run.py", line 57, in localize_moment outputs = self.model(model_inputs) File "/home/ubuntu/projects/moment-retrieval/envs/moment-detr/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl return forward_call(input, kwargs) File "/home/ubuntu/projects/moment-retrieval/QD-DETR/qd_detr/model.py", line 110, in forward src_vid = self.input_vid_proj(src_vid) File "/home/ubuntu/projects/moment-retrieval/envs/moment-detr/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl return forward_call(*input, *kwargs) File "/home/ubuntu/projects/moment-retrieval/envs/moment-detr/lib/python3.8/site-packages/torch/nn/modules/container.py", line 139, in forward input = module(input) File "/home/ubuntu/projects/moment-retrieval/envs/moment-detr/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl return forward_call(input, kwargs) File "/home/ubuntu/projects/moment-retrieval/QD-DETR/qd_detr/model.py", line 505, in forward x = self.LayerNorm(x) File "/home/ubuntu/projects/moment-retrieval/envs/moment-detr/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl return forward_call(*input, *kwargs) File "/home/ubuntu/projects/moment-retrieval/envs/moment-detr/lib/python3.8/site-packages/torch/nn/modules/normalization.py", line 189, in forward return F.layer_norm( File "/home/ubuntu/projects/moment-retrieval/envs/moment-detr/lib/python3.8/site-packages/torch/nn/functional.py", line 2503, in layer_norm return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled) RuntimeError: Given normalized_shape=[2818], expected input with shape [, 2818], but got input of size[1, 75, 514]

wjun0830 commented 1 year ago

It seems that your feature size is also 512, so you also need to extract SlowFast features.

dmenig commented 1 year ago

I have the same issue. I believe the script in the repo should not produce this error when used as-is.

wjun0830 commented 1 year ago

Hello. For all of you in this thread, thank you for your interest, and sorry for the inconvenience. I'll let you know through this thread when the model checkpoint trained only with CLIP features is ready.

Thanks.

wjun0830 commented 1 year ago

We've uploaded a pretrained model trained only with CLIP features to support run_on_video. You may try the example with it! Thank you.

dmenig commented 1 year ago

Which one is it?

wjun0830 commented 1 year ago

model_best.ckpt is the model trained with only CLIP features.

dmenig commented 1 year ago

It works now, thanks. I suggest changing the default model used on master.

wjun0830 commented 1 year ago

Thank you for the suggestion. Do you mean to change the default loaded model in run_on_video/run.py?
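If so, a minimal sketch of what that change might look like (the variable name is an assumption; the checkpoint filename and folder are the ones mentioned in this thread):

```python
# run_on_video/run.py (sketch; the exact variable name may differ)
# Default to the CLIP-only checkpoint so the example runs without SlowFast features:
ckpt_path = "run_on_video/qd_detr_ckpt/model_best.ckpt"
```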