andrewliu2001 opened this issue 2 months ago
Sorry, could you clarify what you mean by the reconstruction loss?
The attention mechanism doesn't give very good gradients on the input -- though in a U-Net block it can probably still be used in an EBM.
Reconstruction loss as in the MSE between ground-truth and Langevin-sampled actions. On top of the loss proposed in this paper, I also augmented it with another loss that is just ||a_true - a_langevin||, but it doesn't seem to help much.
In that case, would a fully transformer-based architecture be problematic?
1) I think a reconstruction loss wouldn't make sense unless the action distribution is unimodal -- otherwise it would just blur the prediction across all modes. Instead, you should use the approximate maximum likelihood objective, which minimizes the energy of ground-truth samples while increasing the energy of other samples.
2) I haven't had success stably training a transformer-based architecture, so it could be difficult.
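For concreteness, the objective in 1) amounts to a contrastive loss: push the energy of ground-truth actions down and the energy of sampled negatives up. A toy NumPy sketch (the quadratic energy and all values here are made up for illustration -- in practice the energy is a neural network conditioned on the observation, and the negatives come from Langevin sampling):

```python
import numpy as np

# Hypothetical 1-D energy model E_theta(a) = theta * a**2 (illustrative only).
def energy(theta, a):
    return theta * a**2

def contrastive_loss(theta, a_true, a_neg):
    # Minimize the energy of ground-truth actions while increasing the
    # energy of negative samples (e.g. actions from Langevin sampling).
    return energy(theta, a_true).mean() - energy(theta, a_neg).mean()

a_true = np.array([0.1, -0.2])   # ground-truth actions (made up)
a_neg = np.array([1.0, -1.5])    # negative / Langevin-sampled actions (made up)
loss = contrastive_loss(2.0, a_true, a_neg)  # negative here: negatives have higher energy
```

A negative loss value indicates the model already assigns lower energy to the ground truth than to the negatives; gradient descent on this loss widens that gap.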
That makes sense. I need to double-check the multi-modality of my dataset. Also, in that case, there's no way to assess the quality of sampling during training, right? The maximum likelihood objective can be optimized well, but sampling might still fail to reach any mode due to the flatness of the energy landscape.
You can probably just execute the actions sampled from the model in the environment.
Thanks for the great work! Are there any tips for training with the improved contrastive divergence objective? I'm trying to build a multi-modal robotic manipulation model that takes in videos and text prompts and Langevin-samples action trajectories (think of it as Implicit Behavioral Cloning, but multi-modal and denoising more than one time step). However, it is very difficult to push down the reconstruction loss during Langevin sampling.
Also, is there a specific reason why the attention mechanism is incompatible with EBMs? There doesn't seem to be any literature on this.
Thanks!
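For reference, the Langevin sampling discussed in this thread is essentially noisy gradient descent on the energy. A minimal NumPy sketch with a made-up quadratic energy (the step size and noise scale are arbitrary here; full Langevin dynamics scales the noise as sqrt(2 * step)):

```python
import numpy as np

def langevin_sample(grad_energy, a_init, step=0.1, n_steps=100, noise=0.01, seed=0):
    # Simplified Langevin update: a gradient step on the energy plus a small
    # Gaussian perturbation to escape flat regions and explore modes.
    rng = np.random.default_rng(seed)
    a = a_init.copy()
    for _ in range(n_steps):
        a = a - step * grad_energy(a) + noise * rng.standard_normal(a.shape)
    return a

# Toy energy E(a) = (a - 1)^2 with gradient 2*(a - 1); samples settle near a = 1.
samples = langevin_sample(lambda a: 2.0 * (a - 1.0), np.zeros(4))
```

This also illustrates the point about flat landscapes above: if `grad_energy` is near zero away from the modes, the chain drifts randomly instead of descending, so rolling the sampled actions out in the environment is a more direct quality check than the training loss.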