Closed PanXiebit closed 3 years ago
The generation of the next token does depend on all previous tokens. The sampling code uses caching to make sampling faster: as each token is generated, its intermediate hidden states are stored and reused at later timesteps, so they don't have to be recalculated. That's why only the newest token's embedding is fed in at each step. The relevant caching code can be found here
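To illustrate the idea (this is a minimal NumPy sketch of key/value caching, not the repo's actual code): at each step only the newest token's query is computed, but it attends over all cached keys and values, so the output still conditions on every previous token. Producing the same result without the cache would recompute attention over the full prefix at every step.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, K, V):
    # q: [1, d] query for the newest token; K, V: [t, d] cached keys/values.
    scores = q @ K.T / np.sqrt(q.shape[-1])   # [1, t] — one row, all past steps
    return softmax(scores) @ V                # [1, d]

rng = np.random.default_rng(0)
T, d = 5, 8
x = rng.normal(size=(T, d))  # stand-in for per-step token embeddings

# Incremental decoding: feed one token at a time, grow the KV cache.
cache_K, cache_V = [], []
incremental = []
for t in range(T):
    cache_K.append(x[t])                      # cache this step's key/value
    cache_V.append(x[t])
    q = x[t : t + 1]                          # only the newest token's query
    out = attend(q, np.stack(cache_K), np.stack(cache_V))
    incremental.append(out[0])
incremental = np.stack(incremental)

# Reference: causal attention recomputed from scratch over the full prefix.
full = np.stack([attend(x[t : t + 1], x[: t + 1], x[: t + 1])[0]
                 for t in range(T)])

# The cached, one-token-at-a-time version matches the full recomputation.
print(np.allclose(incremental, full))
```

So even though the per-step input (like `embeddings_slice` in the question) has a singleton time dimension, the attention output at step `t` is a function of all `t` cached entries.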
Thank you @wilson1yan! You are right.
https://github.com/wilson1yan/VideoGPT/blob/d157da51b3b9766648eb1e54a1008ff965e26b65/videogpt/gpt.py#L97-L107
Hi @wilson1yan! In these lines, it seems that the iterative generation of the next code depends only on the single previous time step? The shape of `embeddings_slice` is always `[bs, 1, 1, 1, embed_dim]`.