yiskw713 / VideoCaptioning

video captioning using 3DCNN and LSTM (pytorch)

something confused me and looking for your help #2

Closed zhangguangxun closed 4 years ago

zhangguangxun commented 4 years ago

Hello, yiskw713:

I am rebuilding your repo and during the rebuilding, I was confused about some concepts in config.yaml.

First, does dataset_dir mean the directory of the features extracted by the other repo you published, i.e. the .pth files?

Second, does feature_dir mean the parameters used to initialize the neural network, so that the model in your repo starts from pretrained weights rather than random initialization?

Third, in the previous issue you mentioned that this repo was adapted from image captioning code. Is there a paper describing that method? Thanks.

yiskw713 commented 4 years ago

Hi zhangguangxun,

Thanks for visiting my repo.

> First, does dataset_dir mean the directory of the features extracted by the other repo you published, i.e. the .pth files? Second, does feature_dir mean the parameters used to initialize the neural network, so that the model in your repo starts from pretrained weights rather than random initialization?

Sorry for confusing you. My directory structure is like this:

dataset_dir/ ─── feature_dir/
              ├─ hdf5_dir/ (video dir) 
              └─ anno_file (.json)

dataset_dir is the path to a directory that contains videos and features. feature_dir is the relative path from dataset_dir.
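
The layout above can be sketched as a small path helper. This is only an illustration, assuming that hdf5_dir is also relative to dataset_dir and that each video has one `<video_id>.pth` feature file and one `<video_id>.hdf5` video file; the function name `video_files` and the example directory names are hypothetical and not taken from the repo.

```python
from pathlib import Path


def video_files(dataset_dir, feature_dir, hdf5_dir, video_id):
    """Return (feature file, video file) for one video.

    Assumption: both sub-directories are relative to dataset_dir and
    files are named <video_id>.pth and <video_id>.hdf5.
    """
    root = Path(dataset_dir)
    feature = root / feature_dir / f"{video_id}.pth"
    video = root / hdf5_dir / f"{video_id}.hdf5"
    return feature, video


# Hypothetical directory names, for illustration only.
feature, video = video_files("/path/to/dataset", "./features", "./videos", "video0")
print(feature.as_posix())  # /path/to/dataset/features/video0.pth
print(video.as_posix())    # /path/to/dataset/videos/video0.hdf5
```

Note that pathlib drops the leading `./`, so `./features` and `features` resolve to the same location under dataset_dir.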

> Third, in the previous issue you mentioned that this repo was adapted from image captioning code. Is there a paper describing that method?

I just referred to this page, but I think this paper is like the method I used.

I hope this will help you. Thanks

zhangguangxun commented 4 years ago

Thanks a lot~

zhangguangxun commented 4 years ago

During training I still hit the same error:

NotImplementedError: Input Error: Only 3D, 4D and 5D input Tensors supported (got 6D) for the modes: nearest | linear | bilinear | bicubic | trilinear (got trilinear)

However, when I opened a .pth file such as video0.pth, the feature was 5D, so I suspect the problem is still in my path settings. Could you do me a favor and check my config?

dataset: MSR-VTT
dataset_dir: /media/zgx_docker_data/video_feature_extractor/data/MSRVTT/
feature_dir: ./TrainValFeature
hdf5_dir: ./TrainValVideohdf5
ann_file: /media/zgx_docker_data/VideoCaptioning/data/vocal1/train_val_videodatainfo.json
vocab_path: ./data/vocal1/vocab.pkl

The TrainValFeature directory is

TrainValFeature/ ─── video0.pth
                  ├─ video1.pth
                  ├─ video2.pth
                  ...

and the TrainValVideohdf5 directory is

TrainValVideohdf5/ ─── video0.hdf5
                    ├─ video1.hdf5
                    ├─ video2.hdf5
                    ...

Maybe I misunderstood the meaning of the feature_dir value r50_k700_16f. Is that the path where the features are saved? By the way, how can I easily type the ├─ and └─ characters? I copied them from your comment.
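
One way to check a config like the one above is to resolve each entry and test whether it exists. The sketch below assumes, following yiskw713's reply, that feature_dir and hdf5_dir are relative to dataset_dir, and it further assumes that ann_file and vocab_path are used exactly as written (a relative vocab_path is then relative to the current working directory). The function names are hypothetical, and the repo itself may resolve these paths differently.

```python
from pathlib import Path


def resolve_config_paths(cfg: dict) -> dict:
    """Map each config key to the path it is assumed to point at."""
    root = Path(cfg["dataset_dir"])
    return {
        # Assumption: these two are relative to dataset_dir.
        "feature_dir": root / cfg["feature_dir"],
        "hdf5_dir": root / cfg["hdf5_dir"],
        # Assumption: these two are used as given.
        "ann_file": Path(cfg["ann_file"]),
        "vocab_path": Path(cfg["vocab_path"]),
    }


def missing_paths(cfg: dict) -> list:
    """Return the config keys whose resolved path does not exist."""
    resolved = resolve_config_paths(cfg)
    return [name for name, path in resolved.items() if not path.exists()]


cfg = {
    "dataset_dir": "/media/zgx_docker_data/video_feature_extractor/data/MSRVTT/",
    "feature_dir": "./TrainValFeature",
    "hdf5_dir": "./TrainValVideohdf5",
    "ann_file": "/media/zgx_docker_data/VideoCaptioning/data/vocal1/train_val_videodatainfo.json",
    "vocab_path": "./data/vocal1/vocab.pkl",
}
print(missing_paths(cfg))  # keys that need fixing on this machine
```

An empty list means every path resolves to something on disk, which rules out the path settings as the cause of a shape error.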

zhangguangxun commented 4 years ago

I tried again and am now sure that the feature dir is TrainValFeature, so you can ignore my earlier question ("maybe I misunderstood the meaning of feature_dir r50_k700_16f, is that the path to save the feature?"). But why was a 6D tensor detected? When I check the .pth files, they are exactly 5D:

>>> import torch
>>> n = '../video_feature_extractor/data/MSRVTT/TrainValFeature/video1.pth'
>>> net = torch.load(n)
>>> print(net.shape)
torch.Size([1, 2048, 35, 7, 7])
>>> n = '../video_feature_extractor/data/MSRVTT/TrainValFeature/video2.pth'
>>> net = torch.load(n)
>>> print(net.shape)
torch.Size([1, 2048, 19, 7, 7])
>>> n = '../video_feature_extractor/data/MSRVTT/TrainValFeature/video3.pth'
>>> net = torch.load(n)
>>> print(net.shape)
torch.Size([1, 2048, 15, 7, 7])

hayachiq commented 3 years ago

Hi, I am facing a similar issue: ValueError: size shape must match input shape. Input is 4D, size is 3

zhangguangxun commented 3 years ago

> Hi, I am facing a similar issue: ValueError: size shape must match input shape. Input is 4D, size is 3

What I did was edit dataset.py at line 71, changing ft.unsqueeze(0) to ft so that the shapes match. The problem in my case was that the code expected an input of shape [2048, m, 7, 7], but the features I got from the extraction code were [1, 2048, m, 7, 7]. I hope this helps, and good luck.
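
The shape mismatch described here can be reproduced in isolation. The following is a minimal sketch, not the repo's dataset.py: the helper `ensure_batched` is hypothetical, the channel count is reduced from 2048 to 8 to keep the demo light, and the target size (16, 7, 7) is an arbitrary assumption. It adds a batch dimension only when the saved feature lacks one, so both a 4D and a 5D feature end up as the 5D input that trilinear interpolation accepts.

```python
import torch
import torch.nn.functional as F


def ensure_batched(ft: torch.Tensor) -> torch.Tensor:
    """Return a 5D tensor [N, C, T, H, W], adding a batch dim only if missing."""
    if ft.dim() == 4:  # [C, T, H, W] -> [1, C, T, H, W]
        return ft.unsqueeze(0)
    if ft.dim() == 5:  # already batched, leave untouched
        return ft
    raise ValueError(f"expected a 4D or 5D feature, got {ft.dim()}D")


# Same layout as the saved features in this thread, with fewer channels.
saved = torch.randn(1, 8, 35, 7, 7)

# An unconditional unsqueeze(0) turns the 5D feature into a 6D tensor,
# which F.interpolate rejects in trilinear mode.
print(saved.unsqueeze(0).dim())  # 6

ft = ensure_batched(saved)
# Hypothetical target size; the repo's actual value may differ.
resized = F.interpolate(ft, size=(16, 7, 7), mode="trilinear", align_corners=False)
print(resized.shape)  # torch.Size([1, 8, 16, 7, 7])
```

Trilinear mode needs exactly three spatial dimensions after the batch and channel dimensions, so a 4D tensor passed with a three-element size fails as well; normalizing to 5D first covers both cases.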

hayachiq commented 3 years ago

Thank you for the help. Could you kindly suggest any enhancements or improvements for generating video captions?