yabufarha / ms-tcn

Online prediction #24

Closed kkirtac closed 2 years ago

kkirtac commented 4 years ago

Hi @yabufarha ,

I would like to get your suggestions for training/evaluating the model for online prediction.

I have per-frame labels in my dataset, similar to the datasets you used in the paper, but my videos are quite long, i.e., from 30 minutes to 2 hours.

For the specific case of online prediction, I am interested in predicting the label only for the current time step, i.e., the current frame. However, the model predicts labels for all time steps at once. The naive solution would be to forward frames[0:current], keep only the prediction at the last index, and repeat for the next current index. But this brings a huge computational burden: every frame from the first time step up to the current one has to be forwarded to obtain a single prediction, and it gets computationally worse as you move forward in time. A sketch of this naive loop is shown below.
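For illustration, a hypothetical sketch of that naive loop (all names and shapes below are made up, not from the repo): at every new time step t, the whole prefix frames[0:t] is forwarded and only the prediction for the current frame is kept.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the naive online loop (stand-in model and shapes):
# forward the whole prefix at every new frame, keep only the last prediction.
model = nn.Conv1d(2048, 5, kernel_size=1)        # stand-in for the real model
features = torch.randn(1, 2048, 300)             # (batch, feat_dim, num_frames)

online_preds = []
for t in range(1, features.shape[-1] + 1):
    logits = model(features[:, :, :t])           # forward frames[0:t]
    online_preds.append(logits[:, :, -1].argmax(dim=1))  # current frame only
```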

Do you have any suggestions for online prediction, including the training step? During training, I am simply performing random sampling (a random offset and length for each sample) to get a fixed number of samples from each video.

Best,

yabufarha commented 4 years ago

Hi @kkirtac ,

You can still predict the labels for all frames at the same time for online prediction. You only need to modify the dilated residual layer by doubling the padding in the dilated convolution and then taking the first T frames of the output, where T is the number of frames in your input. Everything else remains the same.
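A minimal sketch of that idea, assuming the PyTorch implementation of MS-TCN: with kernel size 3, doubling the padding to 2*dilation produces 2*dilation extra frames at the output, and keeping only the first T frames makes every output step depend on the current and past inputs only.

```python
import torch
import torch.nn as nn

# Sketch: doubled padding + cropping to the first T frames gives a causal
# dilated convolution (each output step sees only current and past frames).
T, dilation, channels = 100, 4, 64
x = torch.randn(1, channels, T)                  # (batch, channels, time)

conv = nn.Conv1d(channels, channels, kernel_size=3,
                 padding=2 * dilation, dilation=dilation)
out = conv(x)                                    # shape (1, channels, T + 2*dilation)
out = out[:, :, :T]                              # keep the first T frames -> causal
```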

kkirtac commented 4 years ago

Thanks @yabufarha .

Suppose that only frames [0:current] are available to me, and I want to predict only for the current time step and then repeat the same for the next frame as soon as it is available. Does it make sense to forward frames[0:current] from the beginning for every new time step? Previously, I tried using a fixed window size, e.g., 30 seconds, during evaluation and slid it along as each new frame arrived in the online setting. The accuracy dropped compared to the offline performance, where I used the full-length video as a sample during both training and evaluation.

So I thought it would be better to use all the past information at every new time step, but that might not be the optimal way. I haven't modified anything in your model architecture so far, though.

Maybe I didn't fully understand what doubling the padding size would give me here. Should I do it only in evaluation mode, i.e., set it just before starting evaluation and change it back before starting a new training epoch?

yabufarha commented 4 years ago

What I meant is that you need to convert the dilated residual layer from an acausal layer to a causal one. Of course, this change applies to both training and evaluation time. One way to do that is to zero-pad the input of the convolution layer such that, at each time step, the output is computed based on the past frames only, as I mentioned previously.
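An equivalent way to get the same causal behaviour (my own sketch, not the repo's code) is to zero-pad the input only on the left, i.e., the past side, by 2*dilation and use no padding inside the convolution, so no cropping is needed afterwards.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of explicit left (past-only) zero-padding for a causal dilated conv.
T, dilation, channels = 100, 4, 64
x = torch.randn(1, channels, T)

conv = nn.Conv1d(channels, channels, kernel_size=3, padding=0, dilation=dilation)
x_padded = F.pad(x, (2 * dilation, 0))           # zero-pad the past side only
out = conv(x_padded)                             # shape (1, channels, T), causal
```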

kkirtac commented 4 years ago

I got the point, thank you. But maybe this is just one part of what would help me. Do you have any other suggestions for changing the way I train the model, or the model architecture? Maybe just use a single label per sequence (the label of the last frame), since I am only interested in predicting the label of the last frame given a sequence?

jszgz commented 4 years ago

@kkirtac Hello, I am interested in online prediction too. Did you try to do that? Do you know how to extract features from frames? Do I need to use the I3D network, or can I use another, shallower network or even Fisher vectors? Are the accuracy, fps, and time delay at acceptable levels?

xjsxujingsong commented 2 years ago

Hi @kkirtac @jszgz, any suggestions? I am working on a similar task.

kkirtac commented 2 years ago

> @kkirtac Hello, I am interested in online prediction too. Did you try to do that? Do you know how to extract features from frames? Do I need to use the I3D network, or can I use another, shallower network or even Fisher vectors? Are the accuracy, fps, and time delay at acceptable levels?

I was able to make both per-frame prediction and per-sequence prediction work. You can use any backbone network or feature extractor that gives you a feature vector per frame. Accuracy and fps were fine. I tried both 1 fps and 5 fps and it works.

> Hi @kkirtac @jszgz, any suggestions? I am working on a similar task.

Online prediction is possible as yabufarha explained. You just need to double the padding amount here, e.g., use padding=dilation*2, and in the forward pass, after the ReLU step here, keep only the first T frames of the output (i.e., drop the last 2*dilation frames): out = out[:, :, :-(self.dilation * 2)]. The rest remains the same. A rough sketch of the full modified layer is below.
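Putting the two changes together, a causal variant of the DilatedResidualLayer from model.py might look roughly like this sketch. The overall structure (dilated conv, 1x1 conv, dropout, residual connection and mask) follows the repo's layer as I remember it; treat the exact signatures as an approximation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalDilatedResidualLayer(nn.Module):
    """Sketch of a causal variant of the DilatedResidualLayer in model.py."""

    def __init__(self, dilation, in_channels, out_channels):
        super().__init__()
        self.dilation = dilation
        # padding doubled compared to the original layer (dilation -> 2*dilation)
        self.conv_dilated = nn.Conv1d(in_channels, out_channels, 3,
                                      padding=2 * dilation, dilation=dilation)
        self.conv_1x1 = nn.Conv1d(out_channels, out_channels, 1)
        self.dropout = nn.Dropout()

    def forward(self, x, mask):
        out = F.relu(self.conv_dilated(x))
        # drop the last 2*dilation frames so only the first T remain (causal)
        out = out[:, :, :-(self.dilation * 2)]
        out = self.conv_1x1(out)
        out = self.dropout(out)
        return (x + out) * mask[:, 0:1, :]
```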