weihaosky / mogents


👎👎Without Any Technical Novelty, Far Below NeurIPS 2024 Standards.👎👎 #1

Closed · 3DGenAI closed 1 day ago

3DGenAI commented 1 day ago

How is it possible for the paper to be accepted to NeurIPS 2024? Unbelievably bad paper.

MoGenTS is yet another A + B paper, where A is a temporal-spatial transformer and B is causal-mask autoregressive generation.

1. The idea of temporal-spatial modeling in human motion generation has been widely explored by FineMoGen, Motion Mamba, and De+Com, yet MoGenTS does not cite any of these works.

FineMoGen has a novel Spatio-Temporal Mixture Attention design, rather than simply reshaping the temporal tensor and feeding it into a spatial transformer as MoGenTS does. Similarly, the spatial-temporal modeling in Motion Mamba is only a small contribution, but it is set off by that paper's novel hierarchical temporal and bidirectional spatial Mamba design. Moreover, De+Com extended spatial-temporal decomposition to long motion generation.

2. The causal masking is simply a replication of MoMask, MMM, and BAMM, lacking any significant contribution or novelty. The performance (<0.1 FID) is likely due to the strong codebase of masked autoregressive generation rather than the spatial-temporal modeling. And of course spatial-temporal modeling can provide a 0.01 FID improvement over MoMask, since this has already been verified by FineMoGen, Motion Mamba, and De+Com!

Can't believe that a paper of this level could still be accepted by NeurIPS in 2024!!! It really makes us rethink the quality of submissions and reviews at NeurIPS 2024.

weihaosky commented 1 day ago
  1. You do not understand what this paper is talking about. The core of this paper is modeling each joint rather than the whole pose with a joint VQVAE. The temporal-spatial modeling is built on the joint token. This is not studied in FineMoGen, MotionMamba, or De+Com.
  2. You do not understand what MoMask or MMM is talking about. These papers are not based on causal masking. They are not autoregressive generation.
  3. The paper gets unanimous acceptance from all reviewers. I do not think you are better than all of them.
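To make point 1 concrete, here is a minimal, hypothetical NumPy sketch of the joint-level quantization idea being claimed: each joint (rather than the whole pose) is mapped to its nearest codebook entry, producing a 2D (time x joint) token map that a temporal-spatial transformer can attend over. All names, shapes, and the random codebook are illustrative stand-ins, not the paper's actual code.

```python
# Illustrative sketch only: joint-level VQ quantization, not MoGenTS's real code.
import numpy as np

T, J, D = 16, 22, 8      # frames, joints, per-joint feature dim (hypothetical)
K = 64                   # hypothetical codebook size

motion = np.random.randn(T, J, D)   # raw per-joint features for a clip
codebook = np.random.randn(K, D)    # stand-in for a learned VQ codebook

# Nearest-codebook-entry quantization, applied per joint per frame.
dists = ((motion[:, :, None, :] - codebook[None, None, :, :]) ** 2).sum(-1)
tokens = dists.argmin(-1)           # (T, J): a 2D map of discrete joint tokens

print(tokens.shape)  # (16, 22): one token per joint per frame
```

The point of the 2D token map is that masking and attention can then operate along both the temporal axis (rows) and the spatial joint axis (columns), which is the structure the rebuttal above appeals to.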

You are such a joker. Do you dare say your real name?

3DGenAI commented 22 hours ago

Your words are really contradictory and quite hilarious. Let me refer to your own words.

  1. You do not understand what this paper is talking about. The core of this paper is modeling each joint rather than the whole pose with a joint VQVAE. The temporal-spatial modeling is built on the joint token. This is not studied in FineMoGen, MotionMamba, or De+Com.

No matter how much you claim your novelty is joint-level quantization, it doesn't hide the fact that most of your paper is just another A + B approach, where A is masking and B is a spatio-temporal transformer with tensor rearrangement. It's understandable that papers like FineMoGen and Motion Mamba (and De+Com, which actually does spatial-temporal decomposition) did this back then, but doing the same thing now without mentioning any of these works is quite ugly.

  2. You do not understand what MoMask or MMM is talking about. These papers are not based on causal masking. They are not autoregressive generation.

That's unbelievably hilarious. Let me refer to their original papers:

It's not just that you don't understand; you haven't even read their papers, yet you accuse others of not understanding... No wonder you wrote such a substance-free paper.

  3. The paper gets unanimous acceptance from all reviewers. I do not think you are better than all of them.

That is the saddest part. Congratulations! As I said, I can't believe that a paper of this level could still be accepted by NeurIPS in 2024. It really makes us rethink the quality of submissions and reviews at NeurIPS 2024.

You are such a joker. Do you dare say your real name?

Thank you for calling me a joker; I will take that as a compliment. :) But you've certainly made yourself and your team well-known: now we all know who wrote this poor paper and contributed it to the community.

You are right, I'm the joker but you and your MoGenTS are real clowns.

weihaosky commented 20 hours ago

I did not read BAMM so I did not say anything about BAMM.

But MoMask and MMM are both non-causal masking models. Let me teach you: autoregressive models are causal, while most masking models (including MoMask and MMM) are non-causal.

This is how MMM introduced their method:

"Inspired by the success of autoregressive models in language and image generations, such as GPT [2], DALL-E [27] and VQ-GAN [7, 39, 42], autoregressive motion models, T2M-GPT [43], AttT2M [47] and MotionGPT [15], have been recently developed to further improve motion generation quality. However, these autoregressive models utilize the causal attention for unidirectional and sequential motion token prediction, limiting its ability to model bidirectional dependency in motion data, increasing the training and inference time, and hindering the motion editability. To address these limitations, we aim to exploit masked motion modeling for real-time, editable and high-fidelity motion generation, drawing inspiration from the success of BERT-like masked language and image modeling"

What you referred to is just MMM introducing other methods:

To address this challenge, three predominant methods have been proposed, including (1) language-motion latent space alignment, (2) conditional diffusion model, and (3) conditional autoregressive model.
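For readers following this exchange, the distinction under dispute can be sketched with plain NumPy attention masks (an illustrative sketch, not code from either paper): an autoregressive model uses a causal mask so each token attends only to its left context, while BERT-style masked modeling keeps attention bidirectional and instead hides a random subset of tokens to be predicted from two-sided context.

```python
# Illustrative sketch of causal vs. non-causal (BERT-style) masking.
import numpy as np

T = 5  # number of motion tokens (hypothetical)

# Autoregressive (causal): token i may attend only to tokens j <= i,
# so generation must proceed left-to-right.
causal_mask = np.tril(np.ones((T, T), dtype=bool))

# BERT-style masked modeling (the non-causal setting MoMask/MMM describe):
# attention is fully bidirectional; a random subset of tokens is hidden
# and predicted from context on both sides.
rng = np.random.default_rng(0)
masked_positions = rng.random(T) < 0.4            # which tokens are hidden
bidirectional_mask = np.ones((T, T), dtype=bool)  # every token sees every token

print(causal_mask.astype(int))      # lower-triangular 0/1 matrix
print(masked_positions.astype(int)) # 0/1 vector of hidden positions
```

The debate above is essentially about which of these two masks a method uses, not about whether masking is present at all.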

I have not only read their papers; I have read every detail of MoMask's code.

It is you who hasn't even read their papers. How ridiculous.

As for FineMoGen, Motion Mamba, and De+Com, I had not read them before; I cannot read every paper. I will certainly add them to the related work, and you are welcome to suggest more papers. Spatio-temporal attention is widely used in computer vision, but how it is used makes the difference. The spatial-temporal masking and attention in our paper are totally different from theirs, as acknowledged by every reviewer and the AC. It's not your place to criticize.

I do not know what you have experienced to make you such a sad person, barking like a dog in someone else's repo, afraid to reveal your real name.