xwen99 / temporal_context_aggregation

Temporal Context Aggregation for Video Retrieval with Contrastive Learning, WACV 2021
https://arxiv.org/abs/2008.01334
Apache License 2.0

Cannot achieve score written in paper #5

Closed · y2sman closed this issue 2 years ago

y2sman commented 2 years ago

Hello. I've been trying to reproduce the paper's scores, but I failed to achieve the same metrics reported in your paper.

Here are the differences between TCA and my attempts:

  1. My VCDB background dataset is slightly different from yours, so about 80 videos are missing from my extracted_vcdb_feature.
  2. There is no information about how frames are randomly sampled when training the PCA. Is the 10 frames per video written in the GitHub pre_processing code correct? (See the sketch after this list.)
  3. In the paper, the Transformer model's dropout rate is set to 0.5; however, it is set to 0.2 in train.py.
  4. In evaluation.py, the cosine similarity calculation does not work (the other metrics do), so I used my own cosine similarity code.
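To make question 2 concrete, this is roughly what I do now (a minimal sketch with sklearn as a stand-in and dummy features; the feature dimension, whitening, and 10-frames-per-video count are my assumptions, not the repo's exact pre_processing code):

```python
# Sketch of PCA pre-processing on randomly sampled frames (dummy data).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
feat_dim = 3840  # assumed iMAC feature dimension; treat as a placeholder
# stand-ins for per-video frame features, shape (n_frames, feat_dim) each
video_features = [rng.random((30, feat_dim), dtype=np.float32) for _ in range(150)]

sampled = []
for feats in video_features:
    # sample up to 10 frames per video at random, without replacement
    idx = rng.choice(len(feats), size=min(10, len(feats)), replace=False)
    sampled.append(feats[idx])

pca = PCA(n_components=1024, whiten=True)  # 1024 components as in the thread
pca.fit(np.concatenate(sampled))           # 150 videos x 10 frames = 1500 rows
```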

Below are the results of measuring performance based on the paper's settings.

python3 evaluation.py --dataset FIVR-5K --pca_components 1024 --num_clusters 256 --num_layers 1 --output_dim 1024 --padding_size 300 --metric sym_chamfer --model_path models/model_v5_with_all_bg.pth --feature_path pre_processing/fivr_imac_pca1024.hdf5

===== FIVR-5K Dataset =====
Queries: 50 videos
Database: 5049 videos
----------------
DSVR mAP: 0.8029
CSVR mAP: 0.7893
ISVR mAP: 0.7040
python evaluation_org.py --dataset FIVR-5K --pca_components 1024 --num_cluster 256  --num_layer 1 --output_dim 1024 --padding_size 300  --metric cosine --model_path models/model_v5_with_all_bg.pth  --feature_path pre_processing/fivr_imac_pca1024.hdf5 --random_sampling

========================== mAP ==========================

        mAP@1      mAP@10     mAP@100    mAP@200      mAP
----  ---------  ---------  ---------  ---------  ---------
DSVR     0.9400     0.9230     0.7731     0.7382     0.5761
CSVR     0.9400     0.9339     0.7828     0.7414     0.5618
ISVR     0.9800     0.9701     0.8087     0.7525     0.4970

I really want to reproduce the metrics in your paper. Please let me know what differs from your setup. Thanks a lot.

xwen99 commented 2 years ago

Hi y2sman, thank you for letting us know your concerns.

  1. Losing ~80 videos from the training set shouldn't be a problem for performance, given the large size of the VCDB dataset.
  2. As mentioned in the paper, we sample 997,090 frames from the VCDB dataset, i.e., 10 frames per video, so this is correct.
  3. The dropout rate is not crucial; please follow the paper.
  4. To better locate the problem, may I ask why the cosine similarity isn't working, how the first table was obtained (the performance seems good), and what the difference is between evaluation.py and evaluation_org.py?
y2sman commented 2 years ago

> Hi y2sman, thank you for letting us know your concerns.
>
>   1. Losing ~80 videos from the training set shouldn't be a problem for performance, given the large size of the VCDB dataset.
>   2. As mentioned in the paper, we sample 997,090 frames from the VCDB dataset, i.e., 10 frames per video, so this is correct.
>   3. The dropout rate is not crucial; please follow the paper.
>   4. To better locate the problem, may I ask why the cosine similarity isn't working, how the first table was obtained (the performance seems good), and what the difference is between evaluation.py and evaluation_org.py?

Thanks for the reply. Before I start, it's good to hear that the first table's performance looks good.

The difference between evaluation.py and evaluation_org.py is not big: I just wrote my own cosine similarity code for evaluation, because the original code didn't work.

Here is the error message I get from evaluation.py when calculating cosine similarity:

python3 evaluation.py --dataset FIVR-5K --pca_components 1024 --num_clusters 256 --num_layers 1 --output_dim 1024 --padding_size 64 --metric cosine --model_path models/model_v5_with_all_bg.pth --feature_path pre_processing/fivr_imac_pca1024.hdf5 --random_sampling 
Comparator is ...  False
loading features...
...features loaded
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:00<00:00, 156.28it/s]
  0%|                                                                                                                                                                                                                                                                    | 0/5000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "evaluation.py", line 331, in <module>
    main()
  File "evaluation.py", line 327, in main
    eval_function(model, dataset, args)
  File "evaluation.py", line 258, in query_vs_database
    sims = calculate_similarities(queries, embedding, qr_video_dict, args.metric, comparator)
  File "evaluation.py", line 50, in calculate_similarities
    cdist(query_features, target_feature, metric='cosine'))
  File "/usr/local/envs/etri/lib/python3.7/site-packages/scipy/spatial/distance.py", line 2717, in cdist
    raise ValueError('XA must be a 2-dimensional array.')
ValueError: XA must be a 2-dimensional array.
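
For reference, a minimal standalone repro of the shape requirement (my own sketch with dummy arrays, not the repo's evaluation code): scipy's cdist needs both inputs to be 2-D (n_samples x n_features), so stacked frame-level features (3-D) trigger exactly this ValueError.

```python
# Standalone repro (dummy shapes, my assumption): cdist requires 2-D inputs.
import numpy as np
from scipy.spatial.distance import cdist

target = np.random.rand(1, 1024)               # one video-level embedding, 2-D

queries = np.random.rand(50, 1024)             # (num_queries, dim): 2-D, OK
print(cdist(queries, target, metric='cosine').shape)  # (50, 1)

frame_queries = np.random.rand(50, 300, 1024)  # frame-level stack, 3-D: fails
# cdist(frame_queries, target, metric='cosine')
#   -> ValueError: XA must be a 2-dimensional array.
```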

The other metrics (euclidean, chamfer, sym_chamfer) work perfectly. I also added ViSiL's pre-trained weights for the video_comparator, but it stopped while calculating:

python3 evaluation.py --dataset FIVR-5K --pca_components 1024 --num_clusters 256 --num_layers 1 --output_dim 1024 --padding_size 64 --metric chamfer --model_path models/model_v5_with_all_bg.pth --feature_path pre_processing/fivr_imac_pca1024.hdf5 --random_sampling --use_comparator
Comparator is ...  True
loading features...
...features loaded
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:00<00:00, 505.95it/s]
 48%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                                                                                                                                  | 2383/5000 [01:45<01:56, 22.49it/s]
Traceback (most recent call last):
  File "evaluation.py", line 331, in <module>
    main()
  File "evaluation.py", line 327, in main
    eval_function(model, dataset, args)
  File "evaluation.py", line 258, in query_vs_database
    sims = calculate_similarities(queries, embedding, qr_video_dict, args.metric, comparator)
  File "evaluation.py", line 56, in calculate_similarities
    sim = chamfer(query, target_feature, comparator)
  File "evaluation.py", line 71, in chamfer
    simmatrix = comparator(simmatrix).detach()
  File "/usr/local/envs/etri/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/kjlee/workspace/temporal_context_aggregation/model.py", line 620, in forward
    sim = self.mpool2(sim)
  File "/usr/local/envs/etri/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/envs/etri/lib/python3.7/site-packages/torch/nn/modules/pooling.py", line 164, in forward
    self.return_indices)
  File "/usr/local/envs/etri/lib/python3.7/site-packages/torch/_jit_internal.py", line 405, in fn
    return if_false(*args, **kwargs)
  File "/usr/local/envs/etri/lib/python3.7/site-packages/torch/nn/functional.py", line 718, in _max_pool2d
    return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
RuntimeError: Given input size: (64x22x1). Calculated output size: (64x11x0). Output size is too small
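
For reference, this looks like the comparator's pooling stage rejecting a too-short similarity matrix. A minimal repro with a single max-pool layer (my own sketch, not the actual model code): each 2x2 max-pool halves both temporal dimensions, so a side of length 1 pools down to size 0 and fails.

```python
# Minimal repro (dummy tensors): max pooling a similarity matrix that is
# too short along one temporal dimension produces a zero-size output.
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)

ok = torch.rand(1, 64, 22, 4)   # (batch, channels, T_query, T_target)
print(pool(ok).shape)           # torch.Size([1, 64, 11, 2])

bad = torch.rand(1, 64, 22, 1)  # a target video with one usable frame
# pool(bad)  # RuntimeError: Given input size: (64x22x1).
#            # Calculated output size: (64x11x0). Output size is too small
```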

These two problems are what I'm facing now. I believe my setup follows the paper, and as I said above, the remaining differences shouldn't affect performance. In this situation, could you provide a pre-trained model or the exact parameter values? And please check whether the cosine similarity calculation code works.

xwen99 commented 2 years ago

Hi @y2sman,

Just noticed that you are trying to match our FIVR-5K results reported in the ablation study section. However, for one thing, this subset is quite small and may produce unstable results; for another, since that table is only for the ablation study, we only ensure that one hyper-parameter is ablated per subtable, and not all hyper-parameters are perfectly aligned with our final run on FIVR-200K (so it is OK to get results that differ from the ablation study section). I recommend experimenting with FIVR-200K, or running FIVR-5K multiple times for stable results.

About your questions: I just use scipy to calculate the cosine similarities, so please check their documentation for the error message; it seems your tensor shape is not suitable. As for the ViSiL video comparator, note that it requires each video to have at least 4 frames, so your error message may indicate a too-short video. They recently released their official PyTorch code, which may be helpful: https://github.com/MKLab-ITI/visil/tree/pytorch

BTW, I wonder whether the problem is only with the cosine similarity metric; are the other metrics fine?

zcgeqian commented 1 year ago


I think you have evaluated the frame-level features with cosine similarity, which, according to Section 4.2 (Similarity Measure) of the paper, is meant for video-level features.
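
To illustrate the distinction (a minimal sketch of my reading of Section 4.2, not the repo's code): cosine similarity compares one embedding per video, while chamfer similarity compares sets of frame-level features.

```python
# Sketch (dummy data): video-level cosine vs. frame-level chamfer similarity.
import numpy as np

def video_level_cosine(q, t):
    # q, t: (dim,) video-level embeddings
    return float(q @ t / (np.linalg.norm(q) * np.linalg.norm(t)))

def chamfer_similarity(Q, T):
    # Q: (m, dim), T: (n, dim) frame-level features; L2-normalize rows first
    Qn = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    Tn = T / np.linalg.norm(T, axis=1, keepdims=True)
    # for each query frame, take its best-matching target frame, then average
    return float((Qn @ Tn.T).max(axis=1).mean())

q, t = np.random.rand(1024), np.random.rand(1024)            # video-level
Q, T = np.random.rand(120, 1024), np.random.rand(80, 1024)   # frame-level
print(video_level_cosine(q, t), chamfer_similarity(Q, T))
```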