zhiwei-liang / MAXFormer


local attention and global attention #1

Open AmariJane opened 11 months ago

AmariJane commented 11 months ago

Hi, first of all, thank you for your work. I have a question:

    global_x = Rearrange('b d (x w1) (y w2) -> b x y w1 w2 d', w1=w, w2=w)(x)
    global_x = self.grid_attn(global_x)
    global_x = Rearrange('b x y w1 w2 d -> b d (w1 x) (w2 y)')(global_x)
    res.append(global_x)

    local_x = Rearrange('b d (x w1) (y w2) -> b x y w1 w2 d', w1=w, w2=w)(x)
    local_x = self.block_attn(local_x)
    local_x = Rearrange('b x y w1 w2 d -> b d (x w1) (y w2)')(local_x)
    res.append(local_x)

In the above code, I'm having a hard time seeing the difference between local attention and global attention. I'd be grateful if you could answer my question.

zhiwei-liang commented 11 months ago

I am very sorry: I made a mistake when organizing the original code into modules, and I apologize for the confusion. Before computing the global attention, the tensor must be partitioned into windows; here the partitioning strategy is to reshape the input to (G×G, (H×W)/(G×G), C). So the correct code is global_x = Rearrange('b d (w1 x) (w2 y) -> b x y w1 w2 d', w1=w, w2=w)(x). Fortunately, the released model parameters come from runs of the original (correct) code, so they are still usable. Thank you for your interest and timely feedback on our project. If you have any other questions or find any other issues, please feel free to keep asking.
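
For concreteness, here is a minimal, self-contained sketch of the two corrected branches. It is only illustrative: nn.Identity stands in for the real grid_attn/block_attn attention modules, and the input shape is made up for the example.

    import torch
    import torch.nn as nn
    from einops.layers.torch import Rearrange

    w = 4                                  # window / grid size
    x = torch.randn(1, 32, 8, 8)           # (b, d, H, W) -- illustrative shape
    grid_attn = nn.Identity()              # stand-in for the real attention
    block_attn = nn.Identity()             # stand-in for the real attention
    res = []

    # Global (grid) branch: '(w1 x)' puts the G x G grid on the outer stride,
    # so each window gathers spatially distant positions.
    global_x = Rearrange('b d (w1 x) (w2 y) -> b x y w1 w2 d', w1=w, w2=w)(x)
    global_x = grid_attn(global_x)
    global_x = Rearrange('b x y w1 w2 d -> b d (w1 x) (w2 y)')(global_x)
    res.append(global_x)

    # Local (block) branch: '(x w1)' keeps contiguous K x K windows,
    # as in Swin-style window attention.
    local_x = Rearrange('b d (x w1) (y w2) -> b x y w1 w2 d', w1=w, w2=w)(x)
    local_x = block_attn(local_x)
    local_x = Rearrange('b x y w1 w2 d -> b d (x w1) (y w2)')(local_x)
    res.append(local_x)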

AmariJane commented 11 months ago

But your fix looks the same as the source code. Can you explain how dividing the window into G×G is reflected in the code? Thank you!

zhiwei-liang commented 11 months ago

Hi, this is efficient partitioning of the tensor data via einops. For local attention, the tensor is divided into K×K windows, and the partitioning is similar to the Swin Transformer's. For global attention, we instead divide the tensor into a G×G grid, similar to a dilated convolution. You can use the following example to understand how einops partitions the G×G grid.

1. First, create a tensor. Fill the first 4×4 block with 1s, the second 4×4 block with 2s, and so on:
    
    import torch

    tensor = torch.zeros(8, 8)
    tensor[:4, :4] = 1
    tensor[:4, 4:] = 2
    tensor[4:, :4] = 3
    tensor[4:, 4:] = 4

The resulting tensor is:

    tensor([[1., 1., 1., 1., 2., 2., 2., 2.],
            [1., 1., 1., 1., 2., 2., 2., 2.],
            [1., 1., 1., 1., 2., 2., 2., 2.],
            [1., 1., 1., 1., 2., 2., 2., 2.],
            [3., 3., 3., 3., 4., 4., 4., 4.],
            [3., 3., 3., 3., 4., 4., 4., 4.],
            [3., 3., 3., 3., 4., 4., 4., 4.],
            [3., 3., 3., 3., 4., 4., 4., 4.]])


2. Then, use einops to partition the tensor the way the global grid attention does:

    from einops import rearrange

    w = 4
    rearrange(tensor, '(w1 x) (w2 y) -> x y w1 w2', w1=w, w2=w)

The resulting tensor is:

    tensor([[[[1., 1., 2., 2.],
              [1., 1., 2., 2.],
              [3., 3., 4., 4.],
              [3., 3., 4., 4.]],

             [[1., 1., 2., 2.],
              [1., 1., 2., 2.],
              [3., 3., 4., 4.],
              [3., 3., 4., 4.]]],


            [[[1., 1., 2., 2.],
              [1., 1., 2., 2.],
              [3., 3., 4., 4.],
              [3., 3., 4., 4.]],

             [[1., 1., 2., 2.],
              [1., 1., 2., 2.],
              [3., 3., 4., 4.],
              [3., 3., 4., 4.]]]])
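
As a quick sanity check (continuing the example above), each grid window mixes values from all four original blocks, which is exactly the dilated, global behaviour:

    # Each G x G grid window samples one position from every original block.
    grid = rearrange(tensor, '(w1 x) (w2 y) -> x y w1 w2', w1=w, w2=w)
    print(torch.unique(grid[0, 0]))   # tensor([1., 2., 3., 4.])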

3. Then, use einops to partition the tensor the way the local window attention does:

    w = 4
    rearrange(tensor, '(x w1) (y w2) -> x y w1 w2', w1=w, w2=w)

The resulting tensor is:

    tensor([[[[1., 1., 1., 1.],
              [1., 1., 1., 1.],
              [1., 1., 1., 1.],
              [1., 1., 1., 1.]],

             [[2., 2., 2., 2.],
              [2., 2., 2., 2.],
              [2., 2., 2., 2.],
              [2., 2., 2., 2.]]],


            [[[3., 3., 3., 3.],
              [3., 3., 3., 3.],
              [3., 3., 3., 3.],
              [3., 3., 3., 3.]],

             [[4., 4., 4., 4.],
              [4., 4., 4., 4.],
              [4., 4., 4., 4.],
              [4., 4., 4., 4.]]]])
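
By contrast (again continuing the example), each local window is constant, because it covers exactly one contiguous block of the original tensor:

    # Each K x K local window covers a single contiguous block.
    local = rearrange(tensor, '(x w1) (y w2) -> x y w1 w2', w1=w, w2=w)
    print(torch.unique(local[0, 0]))  # tensor([1.])
    print(torch.unique(local[1, 1]))  # tensor([4.])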

Hope this helps.

AmariJane commented 11 months ago

Understood! Thank you very much.