wilson1yan / VideoGPT

Extending your code for poses #29

Open · AmitMY opened this issue 2 years ago

AmitMY commented 2 years ago

This is a "support" request rather than a bug report or feature request.


Idea

I have a "video" sequence that is represented as skeletal poses rather than video frames. Each pose was extracted from a video frame, and the full sequence is a tensor of shape [frames, keypoints, dimensions], such that a [100, 137, 2] tensor represents a 2D pose with 137 keypoints over 100 frames.

As there is no consistent spatial information for a strided kernel to exploit across keypoints, we can treat the coordinate dimensions as channels and apply a full-size convolution with in_channels=2, out_channels=C, kernel_size=(F, 137), stride=S (where C, F, and S are hyperparameters, and multiple such layers can be stacked).

After multiple convolutional layers, these representations would be quantized, then decoded in the reverse manner.
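For concreteness, here is a minimal sketch of that encoder in PyTorch; the values of C, F, and S are placeholder hyperparameters, not anything taken from VideoGPT:

```python
import torch
import torch.nn as nn

# A batch of pose sequences: [batch, frames, keypoints, dims], e.g. 2D poses
poses = torch.randn(8, 100, 137, 2)

# Treat the coordinate dimensions as channels: [batch, 2, frames, keypoints]
x = poses.permute(0, 3, 1, 2)

# Placeholder hyperparameters: C output channels, temporal kernel F, temporal stride S
C, F, S = 256, 4, 2

# Full-size kernel over the keypoint axis (137), sliding only along time
encoder = nn.Conv2d(in_channels=2, out_channels=C,
                    kernel_size=(F, 137), stride=(S, 1))

z = encoder(x)
print(z.shape)  # [8, 256, 49, 1] -- keypoint axis collapsed, time downsampled
```

A decoder could mirror this with nn.ConvTranspose2d, and deeper encoder layers would use kernel_size=(F, 1), since the keypoint axis is already collapsed after the first layer.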

Why fork from your library?

Support request:

While I can write the data loading module to load these tensors and perform the necessary data augmentation, etc., I'm having some trouble understanding how to properly implement the convolutional encoder and decoder. (This differs from VideoGPT's, as it is not downsampling over a spatially/temporally consistent input.)

Could you please offer some guidance?

Thanks!

wilson1yan commented 2 years ago

A few possible options off the top of my head:

1) You could try just treating the keypoints as "spatially consistent" and running a CNN autoencoder / decoder to see how well it reconstructs your keypoints.

2) One option may be in the area of graph convolutions and graph autoencoders, treating your keypoint structure as a graph. Though I'm not familiar enough with that area to tell you exactly how you'd do it or how those architectures work.

3) If you aren't that bottlenecked by compute, you could just directly quantize your keypoint data into fine-grained enough bins and directly model that using a transformer, e.g. your input to the transformer would be of shape (100, 137, 2, n_quantization_bins). (A rough quantization sketch follows after this list.)

4) Alternatively, there are a few papers (e.g. this one) that train video prediction models over keypoints and work pretty well. They essentially train a VAE over keypoint data, and do have an encoder / decoder architecture, so that may also help.
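For option 3, a rough sketch of per-coordinate quantization, assuming the poses are already normalized to [-1, 1]; the bin count, value range, and flattening order are placeholder choices, not part of VideoGPT:

```python
import torch

def quantize(coords, n_bins=256, lo=-1.0, hi=1.0):
    """Map continuous coordinates (assumed in [lo, hi]) to bin indices in [0, n_bins - 1]."""
    x = (coords.clamp(lo, hi) - lo) / (hi - lo)   # rescale to [0, 1]
    return (x * (n_bins - 1)).round().long()

def dequantize(tokens, n_bins=256, lo=-1.0, hi=1.0):
    """Map bin indices back to approximate coordinate values."""
    return tokens.float() / (n_bins - 1) * (hi - lo) + lo

poses = torch.rand(100, 137, 2) * 2 - 1   # a normalized [frames, keypoints, dims] sequence
tokens = quantize(poses)                  # [100, 137, 2] integer token ids
                                          # (equivalently, one-hot over n_quantization_bins)
sequence = tokens.flatten()               # 100 * 137 * 2 = 27,400 tokens for an autoregressive transformer
recon = dequantize(tokens)
print(sequence.shape, (recon - poses).abs().max())  # error is at most half a bin width
```

Note that the flattened sequence is long (27,400 tokens per 100-frame clip), which is where the compute cost mentioned above comes in.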