Closed — AmberzzZZ closed this issue 3 years ago
Yes. In our implementation (the one used in our training process), we cut and mix the tokens with those of the batch-flipped samples: https://github.com/zihangJiang/TokenLabeling/blob/2e221d24fef15e14f467ba02fd800f81ed9ef5df/models/lvvit.py#L192-L199 After the tokens pass through the transformer layers, we paste the mixed regions back, which is equivalent to cutting and mixing the corresponding dense label maps as described in our paper: https://github.com/zihangJiang/TokenLabeling/blob/2e221d24fef15e14f467ba02fd800f81ed9ef5df/models/lvvit.py#L213-L219
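To illustrate the idea behind those two code pointers, here is a minimal NumPy sketch (not the repo's actual code; the helper names `rand_bbox`, `mix_tokens`, and `unmix_tokens` are hypothetical, and CutMix-style box sampling on the token grid is assumed). Mixing swaps a rectangular block of patch tokens with the batch-flipped sample; applying the same swap again after the transformer restores the token-to-label alignment, so the hard labels need no change:

```python
import numpy as np

def rand_bbox(H, W, lam, rng):
    # CutMix-style box on the H x W token grid; box area ratio is about (1 - lam)
    cut_rat = np.sqrt(1.0 - lam)
    cut_h, cut_w = int(H * cut_rat), int(W * cut_rat)
    cy, cx = rng.integers(H), rng.integers(W)
    y1, y2 = np.clip(cy - cut_h // 2, 0, H), np.clip(cy + cut_h // 2, 0, H)
    x1, x2 = np.clip(cx - cut_w // 2, 0, W), np.clip(cx + cut_w // 2, 0, W)
    return y1, y2, x1, x2

def mix_tokens(tokens, H, W, lam, rng):
    # tokens: (B, H*W, C) patch tokens (class token excluded)
    B, N, C = tokens.shape
    grid = tokens.reshape(B, H, W, C)
    box = rand_bbox(H, W, lam, rng)
    y1, y2, x1, x2 = box
    mixed = grid.copy()
    # paste the box region from the batch-flipped samples (grid[::-1])
    mixed[:, y1:y2, x1:x2] = grid[::-1][:, y1:y2, x1:x2]
    return mixed.reshape(B, N, C), box

def unmix_tokens(tokens, H, W, box):
    # applying the same batch-flip swap again is an involution:
    # it pastes each sample's own tokens back into place
    B, N, C = tokens.shape
    y1, y2, x1, x2 = box
    grid = tokens.reshape(B, H, W, C)
    out = grid.copy()
    out[:, y1:y2, x1:x2] = grid[::-1][:, y1:y2, x1:x2]
    return out.reshape(B, N, C)
```

Because the batch flip is its own inverse, `unmix_tokens(mix_tokens(x, ...))` recovers `x` exactly; in training, the un-mixing instead happens after the transformer layers, which is what makes it equivalent to mixing the dense label maps.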
Hope this answers your question.
In your released lvvit.py code, MixToken is implemented by cutting and mixing the original grid map with the flipped one, with no change needed to the labels, which is not what the paper describes. Is this what you actually did during the training process?