Hi Yutong, first of all, very impressive work!
I have a question about the number of tokens generated at each inference step. Does LVM (a) autoregressively produce tokens one by one like a normal LLM, with every 256 generated tokens then partitioned and grouped to decode an image, or (b) generate all 256 tokens of an image in a single inference step?
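To make option (a) concrete, here is a rough sketch of what I mean. The helpers `generate_next_token` and `vqgan_decode` are placeholders standing in for the real next-token sampling and image decoding, not the actual LVM code:

```python
import random

TOKENS_PER_IMAGE = 256  # each image corresponds to 256 visual tokens

# Hypothetical placeholders (not the LVM API): a real run would sample from the
# transformer's next-token distribution and decode pixels with the VQGAN decoder.
def generate_next_token(model, tokens):
    return random.randrange(8192)  # stand-in for sampling one token from the model

def vqgan_decode(image_tokens):
    return image_tokens  # stand-in for the decoder turning 256 tokens into an image

def generate_images_autoregressively(model, prompt_tokens, num_images):
    """Option (a): emit tokens one per step, then group every 256 into an image."""
    tokens = list(prompt_tokens)
    for _ in range(num_images * TOKENS_PER_IMAGE):
        tokens.append(generate_next_token(model, tokens))  # one token per inference step
    new_tokens = tokens[len(prompt_tokens):]
    return [vqgan_decode(new_tokens[i:i + TOKENS_PER_IMAGE])
            for i in range(0, len(new_tokens), TOKENS_PER_IMAGE)]
```

Option (b), by contrast, would produce all 256 tokens of an image in one forward pass rather than one token at a time. Which of these matches the actual LVM inference procedure?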