wshilton / andrew

GNU General Public License v3.0

Variational autoencoder design #7

Open wshilton opened 1 year ago

wshilton commented 1 year ago

For purposes of utterance encoding, gesture encoding, and facial expression encoding, we shall first appeal to a nonlinear solver (NLS) technique for extracting the relevant features, owing to the convergence issues expected from sampling-based routines such as naive Monte Carlo estimation. The current paper of interest, which identifies a candidate NLS technique, is https://arxiv.org/pdf/1312.6114.pdf
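As a concrete reference point, the reparameterization trick and analytic KL term from the cited paper can be sketched in a few lines of numpy. The shapes and zero-initialized toy inputs are illustrative assumptions, not part of any of our pipelines:

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, logvar, rng):
    """Sample z ~ N(mu, sigma^2) as a differentiable function of (mu, logvar)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_to_standard_normal(mu, logvar):
    """Analytic KL(q(z|x) || N(0, I)), summed over latent dims, per sample."""
    return -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar), axis=-1)

# Toy batch: 4 samples, 2 latent dims, with q(z|x) = N(0, I).
mu = np.zeros((4, 2))
logvar = np.zeros((4, 2))  # log sigma^2 = 0, i.e. sigma = 1
z = reparameterize(mu, logvar, rng)
kl = kl_to_standard_normal(mu, logvar)
print(z.shape, kl)  # KL is 0 when q already matches the prior
```

Because the noise is an exogenous input, gradients flow through `mu` and `logvar`, which is what makes this estimator low-variance relative to naive sampling.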

Next steps should involve determining the suitability of the proposed scheme for adaptation to our domains of interest. Additional literature review of advances in VAE-based NLS schemes across various domains should continue.

wshilton commented 1 year ago

A more recent method involving a VAE technique is the so-called VQ-MAE-S in https://arxiv.org/pdf/2304.11117.pdf, with a demonstrated application to utterance encoding. VQ-MAE-S can be directly integrated with an LLM for subtext analysis. The current proposal is to also consider integrating a VQ-MAE-S variant into a preexisting VITS framework for subtext generation. Generalizations of these schemes beyond acoustic data are now to be considered. The initial conceptual question here is: should the generalization be coherent across the larger latent space?
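For orientation, the "VQ" step these models share is just a nearest-neighbor codebook lookup. A minimal numpy sketch, where the 2-D codebook and latents are made-up toy values:

```python
import numpy as np

def vector_quantize(z, codebook):
    """Map each continuous latent vector to its nearest codebook entry.

    z: (N, D) encoder outputs; codebook: (K, D) learned code vectors.
    Returns (code indices, quantized vectors).
    """
    # Pairwise squared distances between latents and codes: (N, K).
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)
    return idx, codebook[idx]

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
z = np.array([[0.1, -0.1], [0.9, 1.2]])
idx, zq = vector_quantize(z, codebook)
print(idx)  # [0 1]
```

In training, the argmin is non-differentiable, which is exactly why VQ models resort to straight-through gradient estimates; this bears on the differentiability question raised above.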

wshilton commented 1 year ago

If differentiability is preserved in the larger latent space, the concern is that the NLS approaches, ceteris paribus, will suffer from variance problems leading to convergence issues. Alternatively, if differentiability is sacrificed, independent downstream processing would be required to resolve the disintegrated features. The latter case would likely lead to untenably complex system-level architectures.
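The variance concern can be made concrete with a toy comparison between a score-function (REINFORCE) gradient estimator, which remains applicable when differentiability is lost, and the reparameterized (pathwise) estimator that differentiable VAEs use. The objective, the gradient of E over z ~ N(mu, 1) of z^2 with respect to mu (true value 2*mu), is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n = 2.0, 100_000
eps = rng.standard_normal(n)
z = mu + eps

# Score-function (REINFORCE) estimator: f(z) * d/dmu log N(z; mu, 1).
score_est = z**2 * (z - mu)
# Reparameterized (pathwise) estimator: d/dmu f(mu + eps) = 2 * (mu + eps).
reparam_est = 2.0 * z

# Both are unbiased for 2*mu = 4, but the pathwise variance is far lower.
print(score_est.mean(), score_est.var())
print(reparam_est.mean(), reparam_est.var())
```

Both estimators average to the true gradient, but the score-function variance is roughly an order of magnitude larger here, which is the convergence risk alluded to above.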

wshilton commented 1 year ago

A tailored VAE that preserves differentiability is the most judicious choice. The resulting problem now involves equipping the model architecture with a cross-modal capability. Some related work has been done in this direction with the so-called VQ-MDVAE in https://arxiv.org/pdf/2305.03582.pdf.

wshilton commented 1 year ago

Concerning encoding, VQ-VAE's functionality in VQ-MDVAE is partially subsumed by our pre-trained face and pose landmark algorithms. For our purposes, MDVAE works to specialize VQ-VAE by producing latent representations corresponding to features that exist in product spaces, namely the time-audio-visual, audio-visual, time-audio, and time-visual domains. Relative to an intermediate model referred to as DSAE, the essence of the VQ-MDVAE is a demonstration that nontemporal entangled features, together with a causal restriction, can also be disentangled.
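A rough numpy sketch of what such a product-space partition might look like. The factor names and dimensions below are hypothetical placeholders, not the actual VQ-MDVAE configuration:

```python
import numpy as np

# Hypothetical factor sizes: one static audio-visual factor shared across
# time, plus per-frame dynamic factors for the joint audio-visual, audio,
# and visual streams.
SIZES = {"av_static": 8, "av_dynamic": 4, "audio_dynamic": 4, "visual_dynamic": 4}

def split_latent(z_static, z_dynamic):
    """Split encoder outputs into named product-space factors.

    z_static: (D_static,) time-invariant latent.
    z_dynamic: (T, D_dynamic) per-frame latents, partitioned by factor.
    """
    factors = {"av_static": z_static}
    offset = 0
    for name in ("av_dynamic", "audio_dynamic", "visual_dynamic"):
        factors[name] = z_dynamic[:, offset:offset + SIZES[name]]
        offset += SIZES[name]
    return factors

T = 5
f = split_latent(np.zeros(SIZES["av_static"]), np.zeros((T, 12)))
print({k: v.shape for k, v in f.items()})
```

The real model enforces this factorization through its inference network and priors rather than by slicing one vector, but the bookkeeping of which factor lives in which (product) domain is the same.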

wshilton commented 1 year ago

As a place to start for an encoder, the plan is to implement a VAE in Andrew’s post-processed sensory domain. Since the dimensionality of the data is already significantly reduced, the aim in this effort is to have facilities for unsupervised learning of relatively complicated gestures, particularly those which are entangled across domains. As was done in VQ-MDVAE, the plan is to adopt a dynamical VAE (DSAE) core as the starting point. Our work here will go one step beyond VQ-MDVAE by performing multi-scale learning on each of the domains and product domains.
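As a minimal sketch of the DSAE-style core described above — one time-invariant latent plus a temporally correlated dynamic latent — with an AR(1) process standing in for the learned (e.g. LSTM) prior; all dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_dsae_prior(T, d_static=8, d_dyn=4, rho=0.9, rng=rng):
    """Sample from a DSAE-style prior: a single static latent for the whole
    sequence, plus a dynamic latent evolving as a stationary AR(1) process
    (a stand-in for the recurrent prior used in practice)."""
    z_static = rng.standard_normal(d_static)
    z_dyn = np.zeros((T, d_dyn))
    for t in range(1, T):
        z_dyn[t] = rho * z_dyn[t - 1] + np.sqrt(1 - rho**2) * rng.standard_normal(d_dyn)
    return z_static, z_dyn

z_s, z_d = sample_dsae_prior(T=10)
print(z_s.shape, z_d.shape)  # (8,) (10, 4)
```

The multi-scale extension contemplated above would amount to maintaining several such dynamic chains with different correlation lengths, one per domain or product domain.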

wshilton commented 1 year ago

At issue with the DSAE model is that its ability to temporally disentangle is limited by prior predictives that are merely backward-looking in time. As a consequence, DSAE can disentangle static from dynamic temporal features, but provides no mechanism for treating additional time scales. The alternative, which actually precedes DSAE's development, is the FHVAE approach, which models at a given scale with prior predictives defined by fixed-length segments. Such a treatment is readily generalizable to multiple scales. For these reasons, some additional review of FHVAE and LSTMs is warranted.
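The fixed-length-segment treatment that FHVAE relies on can be sketched as a simple reshape; the segment length and feature dimensions below are illustrative:

```python
import numpy as np

def segment(x, seg_len):
    """Split a (T, D) feature sequence into fixed-length segments, as in
    FHVAE, dropping any trailing frames that do not fill a segment."""
    n_seg = x.shape[0] // seg_len
    return x[: n_seg * seg_len].reshape(n_seg, seg_len, x.shape[1])

x = np.arange(22 * 3, dtype=float).reshape(22, 3)  # 22 frames, 3 features
segs = segment(x, seg_len=5)
print(segs.shape)  # (4, 5, 3) -- last 2 frames dropped
```

Generalizing to multiple scales would then mean segmenting the same sequence at several lengths and attaching a latent to each segment at each scale.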