Hi Yutong, first of all, very impressive work!
I have a question about the number of tokens generated at each inference step. Does LVM (a) autoregressively produce tokens one by one like a normal LLM, with every 256 generated tokens then partitioned and grouped to decode an image, or (b) generate all 256 tokens of an image in a single inference step?
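To make option (a) concrete, here is a rough sketch of what I mean. The helpers `generate_next_token` and `vqgan_decode` are placeholders standing in for the real next-token sampling and image decoding, not the actual LVM code:

```python
import random

TOKENS_PER_IMAGE = 256  # each image corresponds to 256 visual tokens

# Hypothetical placeholders (not the LVM API): a real run would sample from the
# transformer's next-token distribution and decode pixels with the VQGAN decoder.
def generate_next_token(model, tokens):
    return random.randrange(8192)  # stand-in for sampling one token from the model

def vqgan_decode(image_tokens):
    return image_tokens  # stand-in for the decoder turning 256 tokens into an image

def generate_images_autoregressively(model, prompt_tokens, num_images):
    """Option (a): emit tokens one per step, then group every 256 into an image."""
    tokens = list(prompt_tokens)
    for _ in range(num_images * TOKENS_PER_IMAGE):
        tokens.append(generate_next_token(model, tokens))  # one token per inference step
    new_tokens = tokens[len(prompt_tokens):]
    return [vqgan_decode(new_tokens[i:i + TOKENS_PER_IMAGE])
            for i in range(0, len(new_tokens), TOKENS_PER_IMAGE)]
```

Option (b), by contrast, would produce all 256 tokens of an image in one forward pass rather than one token at a time. Which of these matches the actual LVM inference procedure?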