yeliudev / R2-Tuning

🌀 R^2-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding (ECCV 2024)
http://arxiv.org/abs/2404.00801
BSD 3-Clause "New" or "Revised" License

inference demo: negative saliency value #15

Closed (lexilii closed 6 days ago)

lexilii commented 1 week ago

I'm testing the inference demo code with the default setting (r2_tuning_qvhighlights). When I print the value of output['_out']['saliency'], I see that it contains negative values. Is this expected?

Furthermore, if I want to use this model for a video retrieval task, should I use the average saliency as the clip score?

output['_out']['saliency']

tensor([-0.0343, -0.0360, -0.0387, -0.0397, -0.0320, -0.0345, -0.0370, -0.0371, -0.0331, -0.0292, -0.0297, -0.0313, -0.0316, -0.0310, -0.0315, -0.0331, -0.0314, -0.0298, -0.0303, -0.0318, -0.0339, -0.0335, -0.0323, -0.0289, -0.0274, -0.0262, -0.0271, -0.0275, -0.0330, -0.0315, -0.0349, -0.0313, -0.0492, -0.0390, -0.0345, -0.0321, -0.0230, -0.0220, -0.0190, -0.0137, -0.0236, -0.0219, 0.0140, 0.0164, 0.0221, 0.0272, 0.0253, 0.0258, -0.0045, 0.0361, 0.0355, 0.0158, 0.0122, 0.0146, 0.0366, 0.0380, 0.0433, 0.0294, 0.0306, 0.0315, 0.0392, 0.0407, 0.0425, 0.0346, 0.0381, 0.0396, 0.0455, 0.0410, 0.0397, 0.0432, 0.0465, 0.0310, 0.0261, 0.0397, 0.0332, 0.0335, 0.0299, 0.0208, 0.0195, 0.0373, 0.0143, 0.0156, 0.0156, 0.0274, 0.0172, 0.0033, 0.0055, -0.0062, -0.0117, -0.0114, -0.0148, -0.0246, -0.0299, -0.0326, -0.0399, -0.0376, -0.0376, -0.0347, -0.0523, -0.0589, -0.0879, -0.0853, -0.0893, -0.1123, -0.0485, -0.0520, -0.0483, -0.0528, -0.0507, -0.0532, -0.0555, -0.0542, -0.0322, -0.0324, -0.0365, -0.0354, -0.0324, -0.0307, -0.0342, -0.0301, -0.0334, -0.0333, -0.0315, -0.0329, -0.0342, -0.0314, -0.0303, -0.0315, -0.0285, -0.0311, -0.0360, -0.0343, -0.0298, -0.0315, -0.0357, -0.0285, -0.0302, -0.0330, -0.0343, -0.0321, -0.0286, -0.0274, -0.0325, -0.0319, -0.0391, -0.0388, -0.0364, -0.0412], device='cuda:0')

yeliudev commented 1 week ago

Yes, the negative scores are normal, as the raw saliency values haven't been normalized. You may refer to our demo code for the normalization.
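For readers who want to see what such a normalization could look like, here is a minimal sketch. It assumes min-max scaling to [0, 1], which is one common choice and not necessarily what the repository's demo code does. The helper name `min_max_normalize` and the `eps` parameter are hypothetical, and the sample values are a handful of entries copied from the tensor above.

```python
def min_max_normalize(scores, eps=1e-12):
    """Rescale raw saliency scores to [0, 1] (assumed min-max scheme)."""
    lo, hi = min(scores), max(scores)
    span = hi - lo
    if span < eps:
        # All scores are (nearly) identical, so there is nothing to rank.
        return [0.0 for _ in scores]
    return [(s - lo) / span for s in scores]


# A few raw values taken from the tensor posted in this issue.
raw = [-0.0343, -0.0137, 0.0140, 0.0465, -0.1123]
normalized = min_max_normalize(raw)
print(normalized)
```

With a PyTorch tensor, the same helper can be fed `output['_out']['saliency'].tolist()`. Only the relative ordering of the clips is preserved; the absolute values depend on the minimum and maximum within each video.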

For video retrieval, you may consider using either averaged or max saliency scores.
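The two pooling options mentioned above can be sketched as follows. This is an illustration under stated assumptions, not code from the repository: `video_score` and `rank_videos` are hypothetical helpers, and the per-video saliency lists are made-up placeholders standing in for `output['_out']['saliency'].tolist()` computed for one query against each candidate video.

```python
from statistics import fmean


def video_score(saliency, mode="mean"):
    """Pool per-clip saliency scores into one video-level score."""
    if not saliency:
        raise ValueError("saliency must contain at least one score")
    if mode == "mean":
        return fmean(saliency)
    if mode == "max":
        return max(saliency)
    raise ValueError(f"unknown mode: {mode!r}")


def rank_videos(saliency_by_video, mode="mean"):
    """Return video ids sorted from highest to lowest pooled score."""
    return sorted(
        saliency_by_video,
        key=lambda vid: video_score(saliency_by_video[vid], mode),
        reverse=True,
    )


# Placeholder scores for one query against three hypothetical videos.
candidates = {
    "video_a": [-0.03, -0.02, 0.04, 0.05],
    "video_b": [-0.05, -0.04, -0.03, 0.09],
    "video_c": [-0.06, -0.05, -0.05, -0.04],
}
print(rank_videos(candidates, mode="mean"))
print(rank_videos(candidates, mode="max"))
```

Mean pooling favors videos that are relevant throughout, while max pooling favors videos with a single strongly matching moment. If scores are compared across videos, pool the raw values, since a per-video min-max normalization would make every video's maximum equal to 1.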