andrewliu2001 opened this issue 2 months ago
Sorry, could you clarify what you mean by the reconstruction loss?
The attention mechanism doesn't give very good gradients on the input -- though in a U-Net block it can probably still be used in an EBM.
Reconstruction loss as in the MSE between ground-truth and Langevin-sampled actions. On top of the loss proposed in this paper, I also augmented it with another loss that is just ||a_true - a_langevin||, but it doesn't seem to help much.
In that case, would a fully transformer-based architecture be problematic?
1) I think a reconstruction loss wouldn't make sense unless the action distribution is unimodal -- otherwise it would just blur the prediction across all modes. Instead, you should use the approximate maximum likelihood objective, which minimizes the energy of ground-truth samples while increasing the energy of other samples.
2) I haven't had success stably training a transformer-based architecture, so it could be difficult.
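For concreteness, the objective in 1) amounts to a contrastive loss: push the energy of ground-truth actions down and the energy of sampled negatives up. A toy NumPy sketch (the quadratic energy and all values here are made up for illustration -- in practice the energy is a neural network conditioned on the observation, and the negatives come from Langevin sampling):

```python
import numpy as np

# Hypothetical 1-D energy model E_theta(a) = theta * a**2 (illustrative only).
def energy(theta, a):
    return theta * a**2

def contrastive_loss(theta, a_true, a_neg):
    # Minimize the energy of ground-truth actions while increasing the
    # energy of negative samples (e.g. actions from Langevin sampling).
    return energy(theta, a_true).mean() - energy(theta, a_neg).mean()

a_true = np.array([0.1, -0.2])   # ground-truth actions (made up)
a_neg = np.array([1.0, -1.5])    # negative / Langevin-sampled actions (made up)
loss = contrastive_loss(2.0, a_true, a_neg)  # negative here: negatives have higher energy
```

A negative loss value indicates the model already assigns lower energy to the ground truth than to the negatives; gradient descent on this loss widens that gap.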
That makes sense. I need to double-check the multi-modality of my dataset. Also, in that case, there's no way to assess the quality of sampling during training, right? The maximum likelihood objective can be optimized well, but sampling might still fail to reach any mode due to the flatness of the energy landscape.
You can probably just execute the actions sampled from the model in the environment.
Thanks for the great work! Are there any tips for training with the improved contrastive divergence objective? I'm trying to build a multi-modal robotic manipulation model that takes in videos and text prompts and Langevin-samples action trajectories (think of it as Implicit Behavioral Cloning, but multi-modal and denoising more than one time step). However, it is very difficult to push down the reconstruction loss during Langevin sampling.
Also, is there a specific reason why the attention mechanism is incompatible with EBMs? There doesn't seem to be any literature on this.
Thanks!
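For reference, the Langevin sampling discussed in this thread is essentially noisy gradient descent on the energy. A minimal NumPy sketch with a made-up quadratic energy (the step size and noise scale are arbitrary here; full Langevin dynamics scales the noise as sqrt(2 * step)):

```python
import numpy as np

def langevin_sample(grad_energy, a_init, step=0.1, n_steps=100, noise=0.01, seed=0):
    # Simplified Langevin update: a gradient step on the energy plus a small
    # Gaussian perturbation to escape flat regions and explore modes.
    rng = np.random.default_rng(seed)
    a = a_init.copy()
    for _ in range(n_steps):
        a = a - step * grad_energy(a) + noise * rng.standard_normal(a.shape)
    return a

# Toy energy E(a) = (a - 1)^2 with gradient 2*(a - 1); samples settle near a = 1.
samples = langevin_sample(lambda a: 2.0 * (a - 1.0), np.zeros(4))
```

This also illustrates the point about flat landscapes above: if `grad_energy` is near zero away from the modes, the chain drifts randomly instead of descending, so rolling the sampled actions out in the environment is a more direct quality check than the training loss.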