Hi, 'duration_second' is the duration of the video in seconds, and 'duration_frame' is the original number of video frames. During feature extraction I adopt 16-frame snippets, so actually 16n frames are used during feature extraction. The 'feature_frame' is 16 * feature_len.
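For illustration, the relationship might be sketched like this (the array shape and variable names are made up for the example, not taken from the repo):

```python
import numpy as np

# Illustrative only: pretend this video yielded 39 snippet features of dim 400.
features = np.zeros((39, 400))

snippet_size = 16
feature_len = len(features)                 # number of 16-frame snippets
feature_frame = feature_len * snippet_size  # = 624 frames covered by features
```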
In THUMOS-14, I adopt a sliding-window fashion to prepare the data. You can leave your email here and I can send you the corresponding code.
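A sliding-window pass over snippet features might look roughly like the following sketch; the window size and stride here are guesses, not the author's THUMOS-14 settings:

```python
def sliding_windows(features, window_size=100, stride=50):
    """Yield (start_index, window) pairs over a snippet-feature sequence.

    window_size and stride are illustrative assumptions only.
    """
    last_start = max(len(features) - window_size, 0)
    for start in range(0, last_start + 1, stride):
        yield start, features[start:start + window_size]
```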
No, in _get_base_data the features of each video are loaded separately, not the full data.
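A minimal sketch of what per-video loading could look like, assuming one CSV of snippet features per video (the path scheme is an assumption, not the repo's actual layout):

```python
import pandas as pd

def load_video_feature(video_name, feature_dir):
    # Assumed layout: one CSV of snippet features per video, so only this
    # video's features are read into memory at a time.
    return pd.read_csv(f"{feature_dir}/{video_name}.csv").values
```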
Apologies for the delay, but I don't quite understand everything. It appears that:
I am fine with those.
For feature_frame, though, I don't get it. Here's an example:
```python
('v_--6bJUbfpnQ',
 {'duration_second': 26.75,
  'duration_frame': 647,
  'annotations': [{'segment': [2.578755070202808, 24.914101404056165],
                   'label': 'Drinking beer'}],
  'feature_frame': 624})
```
In that one, why is feature_frame 624?
I found that `featureFrame = len(readData(videoName))*16` in `data_process.py`, but `readData` references two CSVs that are not otherwise referenced or in the directory. Are the `temporal` / `spatial` directories that it is pulling from supposed to be `flow` and `rgb`? If so, then why are these concatenated before applying the 16x multiplier?
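One guess at what readData might be doing, which would explain the multiplier: if the two streams are concatenated along the feature axis, the snippet count is unchanged. This is a hypothetical reconstruction, not the repo's code:

```python
import numpy as np
import pandas as pd

def read_data(video_name, spatial_dir, temporal_dir):
    # Hypothetical: load RGB (spatial) and flow (temporal) features for the
    # same snippets and concatenate along the feature axis, so the length of
    # the result is still the number of snippets -- which would be why the
    # 16x multiplier is applied after concatenation.
    rgb = pd.read_csv(f"{spatial_dir}/{video_name}.csv").values
    flow = pd.read_csv(f"{temporal_dir}/{video_name}.csv").values
    return np.concatenate([rgb, flow], axis=1)
```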
Overall, I just don't quite get how feature_frame works, but it's clearly important for computing the corrected_second in training. If you could clarify this, I'd really appreciate it.
Lastly, I just want to verify: do I need to change anything in order to run the code AND the trained models on a dataset that has an arbitrary number of frames for each video?
@wzmsltw, friendly bump in case this got lost in the shuffle.
@cinjon Since I extract a feature for each 16-frame snippet, the corresponding number of frames used for feature extraction is len(feature)*16 = feature_frame. corrected_second is adopted for result alignment. But actually, this alignment has little impact on the final result; you can directly use corrected_second = duration_second.
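Worked through with the example above (the corrected_second formula is an assumption based on this description; check dataset.py for the exact expression):

```python
duration_second = 26.75
duration_frame = 647
feature_len = 39                   # 16-frame snippets extracted for this video
feature_frame = feature_len * 16   # = 624, matching the annotation entry above

# Assumed alignment: rescale the duration to the frames actually covered.
corrected_second = float(feature_frame) / duration_frame * duration_second
# ~= 25.80 seconds; per the author, using duration_second directly is fine too.
```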
Another question: is it right that this codebase is set up to work only with feature vectors that cover the entire video? As far as I can tell, dataset.py needs to be adjusted for the scenario where the model does not get the entire interpolated video at once, as is done with the 100 vectors for the ActivityNet videos in the paper. For example, it appears that the gt_bbox computation in _get_train_label should be changed so that the model predicts only over the time duration given (say 120 seconds) rather than assuming that the time spans the full video_second.
Is that right or am I misunderstanding something?
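A sketch of the kind of change being described here, with all names hypothetical rather than taken from the repo:

```python
def segment_to_gt_bbox(seg_start, seg_end, window_start, window_second):
    # Normalize an annotation to [0, 1] within the feature window rather than
    # within the full video duration (video_second).
    start = (seg_start - window_start) / window_second
    end = (seg_end - window_start) / window_second
    # Clip to the window; segments falling entirely outside it would be
    # dropped upstream.
    return max(0.0, start), min(1.0, end)
```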
(Ok, in that case I am going to ignore feature_frame and just treat it as the same as duration_frame.)
@cinjon how is your progress trying this code on THUMOS?
If I understand this correctly, I have to extract the snippet-level features using TSN (https://github.com/yjxiong/anet2016-cuhk). But anet2016-cuhk is pretrained on ActivityNet, so you first have to fine-tune the network on THUMOS, then extract the snippet-level features from THUMOS, and then do the TEM, PGM, and finally the PEM training? Is this correct?
Hi there, thanks for releasing your code. I've gone through it with the intention of adding a new dataset and, as far as I can tell, the main thing that needs to be done is to generate the video_anno file, which is a large JSON whose per-video entries consist of duration_second, duration_frame, feature_frame, and annotations.
I understand that the annotations field is meant to be a list of {'label':, 'segment': [start, end]}, but can you verify what the other three are meant to be? It's not clear if duration_second is according to a normalized FPS or if it's just the timestamp in the video. It's also unclear what the difference is between duration_frame and feature_frame.
In what units are the start and end of segment, i.e., are they relative to the actual time in the video or a normalized time?
Additionally, I will not be modifying the videos to be 100 frames each. It seems like you did that for ActivityNet, but the paper doesn't mention anything similar for THUMOS. What was your strategy for THUMOS?
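For context, the ActivityNet preprocessing referred to here rescales each video's snippet features to a fixed length of 100; a rough sketch via linear interpolation follows (the repo's actual interpolation may differ in detail):

```python
import numpy as np

def rescale_feature(features, new_len=100):
    """Linearly interpolate a (T, D) snippet-feature sequence to (new_len, D).

    A sketch of the fixed-length rescaling used for ActivityNet; linear
    interpolation is an assumption here.
    """
    T, D = features.shape
    old_x = np.linspace(0.0, 1.0, T)
    new_x = np.linspace(0.0, 1.0, new_len)
    return np.stack([np.interp(new_x, old_x, features[:, d]) for d in range(D)],
                    axis=1)
```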
Finally, what's the story with video_df in _get_base_data? It seems like it loads in the full data every time. That's 11G uncompressed. Is this right?