wilson1yan / VideoGPT


Could this be used as a next frame predictor? #16

Closed · radi-cho closed this issue 3 years ago

radi-cho commented 3 years ago

Do you have a minimal example of feeding the model N frames and then reconstructing N + M frames, or only the M frames that start right after the last of the N input frames? I would like to use it as a next-frame predictor.

wilson1yan commented 3 years ago

Yes, it should support that. The relevant argument for training VideoGPT is --n_cond_frames. In your case, you want to train the VQ-VAE with --sequence_length N + M, and then train VideoGPT with --sequence_length N + M --n_cond_frames N. Note that there is a constraint that N + M must be a power of 2, since the VQ-VAE downsamples/upsamples by powers of 2.

You can see the pretrained BAIR model as an example, which is trained on sequences of length 16, conditioning on 1 frame and predicting 15.
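
A quick sanity check for the N / M split before launching training (plain Python, nothing specific to this repo):

```python
def valid_split(n_cond: int, n_pred: int) -> bool:
    """Return True if the total clip length N + M is a power of 2,
    which the VQ-VAE's downsampling/upsampling requires."""
    total = n_cond + n_pred
    return total > 0 and (total & (total - 1)) == 0

assert valid_split(1, 15)       # BAIR setup: condition on 1 frame, predict 15 (total 16)
assert not valid_split(4, 8)    # 12 is not a power of 2
```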

radi-cho commented 3 years ago

Thanks for the quick response. What advice would you give about optimizing the training? What is the cheapest possible way (in terms of computing power, GPUs, etc.) to train or fine-tune a model? I have tried on my work laptop using torch on the CPU, but it eats all the memory and then crashes. I've also tried with its GPU, but the training seems stuck at 0% for hours. Am I doing something wrong, or is my hardware just too weak? And finally, about the videos in the dataset: should they be at a specific frame rate (or is a particular one beneficial), and should I pre-process them in any way before training?

wilson1yan commented 3 years ago

The pretrained BAIR model was trained on 2 GPUs with 24GB of memory. It can maybe be trained on 1 GPU but will be slower. You definitely should not train on a CPU.

I recommend first training a few VQ-VAEs to see how much downsampling you can get away with without a substantial loss in reconstruction accuracy. The more downsampling you have, the smaller the latent codes will be when fed into the transformer, which will greatly reduce memory usage. You can also reduce the latent code size by reducing the sequence length of the original video input.
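
To make that concrete, here is a rough back-of-the-envelope calculation; the clip size and downsampling factors are purely illustrative, not values from this repo:

```python
def latent_tokens(seq_len: int, height: int, width: int, ds_t: int, ds_s: int) -> int:
    """Number of latent codes the transformer has to model after the VQ-VAE
    downsamples time by ds_t and each spatial dimension by ds_s."""
    return (seq_len // ds_t) * (height // ds_s) * (width // ds_s)

# Hypothetical 16 x 64 x 64 clips:
print(latent_tokens(16, 64, 64, ds_t=2, ds_s=4))   # 8 * 16 * 16 = 2048 codes
print(latent_tokens(16, 64, 64, ds_t=4, ds_s=8))   # 4 * 8 * 8 = 256 codes
print(latent_tokens(8, 64, 64, ds_t=2, ds_s=4))    # shorter input: 4 * 16 * 16 = 1024 codes
```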

For the VQ-VAE, you can also probably comment out the Axial Attention part of the network, since it's not strictly necessary and consumes a large share of the VQ-VAE's overall memory.

For training the transformer, it may help to use sparse attention if you have sufficient hardware.

The frame rate of the videos shouldn't really matter; it just increases or decreases the complexity of the data distribution, since 16 frames of a higher-frame-rate video will probably contain less movement (easier to model) than 16 frames of a lower-frame-rate video (harder to model, since more is going on).

FrancoisPgm commented 3 years ago

When using the VQ-VAE with --sequence_length N + M, wouldn't that mean that the embeddings of the first N frames are contaminated with information about the last M? Couldn't that allow the VideoGPT model to "cheat" by having part of its desired output embedded in its input?

wilson1yan commented 3 years ago

Yes, each encoding would have information from all N + M input frames. When training the VideoGPT model, however, we condition on the original N video frames (i.e. pixels) to generate the encodings of the N + M frames, so the conditioning information does not contain the other M frames.

The current code doesn't support it, but alternatively, you could train something like a frame-wise VQ-VAE and condition directly using the latents themselves.

FrancoisPgm commented 3 years ago

Oh, I see, the conditioning is not on the embeddings but directly on the frames. Thank you for the clarification, and thank you for this great repo.

As another alternative, I was thinking of training the VQ-VAE with --sequence_length N and, when training VideoGPT, encoding the first N frames and the last N frames (of the total N + M sequence) separately, so that I can condition on the embeddings of the first N but try to predict the embeddings of the last N. Do you think that would be a valid strategy?

wilson1yan commented 3 years ago

Yes, that could also work

radi-cho commented 3 years ago

And where does the randomness of the model actually come from? I mean, in the paper you describe that the model can generate different trajectories from the same input image (e.g. in the BAIR Robot Pushing part), so is there some random noise or something, and could you possibly point out where it is in the code?

wilson1yan commented 3 years ago

The randomness comes from the autoregressive sampling, found here
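
Conceptually, each next latent code is drawn from the predicted distribution rather than taken greedily, which is why repeated generations from the same conditioning frame differ. An illustrative sketch of that idea (not the repo's actual sampling code):

```python
import torch
import torch.nn.functional as F

def sample_next_code(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Draw the next latent code index from the model's predicted distribution.

    logits: (batch, vocab_size) scores for the next code. Sampling from the
    softmax instead of taking the argmax is what makes repeated generations
    from the same conditioning frame diverge.
    """
    probs = F.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)   # (batch, 1), stochastic

logits = torch.randn(2, 1024)      # hypothetical batch of 2, codebook size 1024
print(sample_next_code(logits))    # different indices on different calls
```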

Mu-Yanchen commented 10 months ago

> Yes, it should support that. The relevant argument for training VideoGPT is --n_cond_frames. In your case, you want to train the VQ-VAE with --sequence_length N + M, and then train VideoGPT with --sequence_length N + M --n_cond_frames N. Note that there is a constraint that N + M must be a power of 2, since the VQ-VAE downsamples/upsamples by powers of 2.
>
> You can see the pretrained BAIR model as an example, which is trained on sequences of length 16, conditioning on 1 frame and predicting 15.

Hello! Thanks for your work. I have a question about prediction: in VideoGPT (BAIR), how should I predict 15 frames from 1 frame?

I don't know whether I need to use the sample method (def sample(self, n, batch=None)) to complete the prediction. I see that the sample method needs to be passed a "batch" dictionary, which has "video" and "label" keys. May I ask how to construct this batch dictionary for a single conditioning frame?
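
For reference, this is roughly what I imagine the batch should look like; the (B, C, T, H, W) layout, the 64x64 resolution, and the placeholder label are guesses on my part rather than anything I've checked against the code:

```python
import torch

# Guessing at the layout: batch of 1, RGB, 16 time steps, 64x64 pixels.
first_frame = torch.rand(3, 64, 64)     # stand-in for a frame preprocessed like the training data
video = torch.zeros(1, 3, 16, 64, 64)
video[:, :, 0] = first_frame            # only the first n_cond_frames (1 for BAIR) should matter

batch = {
    "video": video,
    "label": torch.zeros(1, dtype=torch.long),  # placeholder; presumably only used for class-conditional models
}

# samples = model.sample(1, batch)      # following the sample(self, n, batch=None) signature above
```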