xlang-ai / UnifiedSKG

[EMNLP 2022] Unifying and multi-tasking structured knowledge grounding with language models
https://arxiv.org/abs/2201.05966
Apache License 2.0
546 stars 58 forks source link

Getting the multi-task prefix weights #30

Closed base-y closed 2 years ago

base-y commented 2 years ago

I downloaded the hkunlp/from_all_T5_large_prefix_sql2text2 model but did not find the prefix weights used inside the model.

Where can I get the prefix weights obtained from multi-task prefix tuning and also the finetuned prefix weights on individual tasks?

Timothyxxx commented 2 years ago

Hi, Thanks for asking!

I printed the model architecture again, and got:

Model(
  (pretrain_model): T5ForConditionalGeneration(
    (shared): Embedding(32102, 1024)
    (encoder): T5Stack(
      (embed_tokens): Embedding(32102, 1024)
      (block): ModuleList(
        (0): T5Block(
          (layer): ModuleList(
            (0): T5LayerSelfAttention(
              (SelfAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
                (relative_attention_bias): Embedding(32, 16)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (1): T5LayerFF(
              (DenseReluDense): T5DenseReluDense(
                (wi): Linear(in_features=1024, out_features=4096, bias=False)
                (wo): Linear(in_features=4096, out_features=1024, bias=False)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (1): T5Block(
          (layer): ModuleList(
            (0): T5LayerSelfAttention(
              (SelfAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (1): T5LayerFF(
              (DenseReluDense): T5DenseReluDense(
                (wi): Linear(in_features=1024, out_features=4096, bias=False)
                (wo): Linear(in_features=4096, out_features=1024, bias=False)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (2): T5Block(
          (layer): ModuleList(
            (0): T5LayerSelfAttention(
              (SelfAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (1): T5LayerFF(
              (DenseReluDense): T5DenseReluDense(
                (wi): Linear(in_features=1024, out_features=4096, bias=False)
                (wo): Linear(in_features=4096, out_features=1024, bias=False)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (3): T5Block(
          (layer): ModuleList(
            (0): T5LayerSelfAttention(
              (SelfAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (1): T5LayerFF(
              (DenseReluDense): T5DenseReluDense(
                (wi): Linear(in_features=1024, out_features=4096, bias=False)
                (wo): Linear(in_features=4096, out_features=1024, bias=False)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (4): T5Block(
          (layer): ModuleList(
            (0): T5LayerSelfAttention(
              (SelfAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (1): T5LayerFF(
              (DenseReluDense): T5DenseReluDense(
                (wi): Linear(in_features=1024, out_features=4096, bias=False)
                (wo): Linear(in_features=4096, out_features=1024, bias=False)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (5): T5Block(
          (layer): ModuleList(
            (0): T5LayerSelfAttention(
              (SelfAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (1): T5LayerFF(
              (DenseReluDense): T5DenseReluDense(
                (wi): Linear(in_features=1024, out_features=4096, bias=False)
                (wo): Linear(in_features=4096, out_features=1024, bias=False)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (6): T5Block(
          (layer): ModuleList(
            (0): T5LayerSelfAttention(
              (SelfAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (1): T5LayerFF(
              (DenseReluDense): T5DenseReluDense(
                (wi): Linear(in_features=1024, out_features=4096, bias=False)
                (wo): Linear(in_features=4096, out_features=1024, bias=False)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (7): T5Block(
          (layer): ModuleList(
            (0): T5LayerSelfAttention(
              (SelfAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (1): T5LayerFF(
              (DenseReluDense): T5DenseReluDense(
                (wi): Linear(in_features=1024, out_features=4096, bias=False)
                (wo): Linear(in_features=4096, out_features=1024, bias=False)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (8): T5Block(
          (layer): ModuleList(
            (0): T5LayerSelfAttention(
              (SelfAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (1): T5LayerFF(
              (DenseReluDense): T5DenseReluDense(
                (wi): Linear(in_features=1024, out_features=4096, bias=False)
                (wo): Linear(in_features=4096, out_features=1024, bias=False)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (9): T5Block(
          (layer): ModuleList(
            (0): T5LayerSelfAttention(
              (SelfAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (1): T5LayerFF(
              (DenseReluDense): T5DenseReluDense(
                (wi): Linear(in_features=1024, out_features=4096, bias=False)
                (wo): Linear(in_features=4096, out_features=1024, bias=False)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (10): T5Block(
          (layer): ModuleList(
            (0): T5LayerSelfAttention(
              (SelfAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (1): T5LayerFF(
              (DenseReluDense): T5DenseReluDense(
                (wi): Linear(in_features=1024, out_features=4096, bias=False)
                (wo): Linear(in_features=4096, out_features=1024, bias=False)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (11): T5Block(
          (layer): ModuleList(
            (0): T5LayerSelfAttention(
              (SelfAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (1): T5LayerFF(
              (DenseReluDense): T5DenseReluDense(
                (wi): Linear(in_features=1024, out_features=4096, bias=False)
                (wo): Linear(in_features=4096, out_features=1024, bias=False)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (12): T5Block(
          (layer): ModuleList(
            (0): T5LayerSelfAttention(
              (SelfAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (1): T5LayerFF(
              (DenseReluDense): T5DenseReluDense(
                (wi): Linear(in_features=1024, out_features=4096, bias=False)
                (wo): Linear(in_features=4096, out_features=1024, bias=False)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (13): T5Block(
          (layer): ModuleList(
            (0): T5LayerSelfAttention(
              (SelfAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (1): T5LayerFF(
              (DenseReluDense): T5DenseReluDense(
                (wi): Linear(in_features=1024, out_features=4096, bias=False)
                (wo): Linear(in_features=4096, out_features=1024, bias=False)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (14): T5Block(
          (layer): ModuleList(
            (0): T5LayerSelfAttention(
              (SelfAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (1): T5LayerFF(
              (DenseReluDense): T5DenseReluDense(
                (wi): Linear(in_features=1024, out_features=4096, bias=False)
                (wo): Linear(in_features=4096, out_features=1024, bias=False)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (15): T5Block(
          (layer): ModuleList(
            (0): T5LayerSelfAttention(
              (SelfAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (1): T5LayerFF(
              (DenseReluDense): T5DenseReluDense(
                (wi): Linear(in_features=1024, out_features=4096, bias=False)
                (wo): Linear(in_features=4096, out_features=1024, bias=False)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (16): T5Block(
          (layer): ModuleList(
            (0): T5LayerSelfAttention(
              (SelfAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (1): T5LayerFF(
              (DenseReluDense): T5DenseReluDense(
                (wi): Linear(in_features=1024, out_features=4096, bias=False)
                (wo): Linear(in_features=4096, out_features=1024, bias=False)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (17): T5Block(
          (layer): ModuleList(
            (0): T5LayerSelfAttention(
              (SelfAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (1): T5LayerFF(
              (DenseReluDense): T5DenseReluDense(
                (wi): Linear(in_features=1024, out_features=4096, bias=False)
                (wo): Linear(in_features=4096, out_features=1024, bias=False)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (18): T5Block(
          (layer): ModuleList(
            (0): T5LayerSelfAttention(
              (SelfAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (1): T5LayerFF(
              (DenseReluDense): T5DenseReluDense(
                (wi): Linear(in_features=1024, out_features=4096, bias=False)
                (wo): Linear(in_features=4096, out_features=1024, bias=False)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (19): T5Block(
          (layer): ModuleList(
            (0): T5LayerSelfAttention(
              (SelfAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (1): T5LayerFF(
              (DenseReluDense): T5DenseReluDense(
                (wi): Linear(in_features=1024, out_features=4096, bias=False)
                (wo): Linear(in_features=4096, out_features=1024, bias=False)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (20): T5Block(
          (layer): ModuleList(
            (0): T5LayerSelfAttention(
              (SelfAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (1): T5LayerFF(
              (DenseReluDense): T5DenseReluDense(
                (wi): Linear(in_features=1024, out_features=4096, bias=False)
                (wo): Linear(in_features=4096, out_features=1024, bias=False)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (21): T5Block(
          (layer): ModuleList(
            (0): T5LayerSelfAttention(
              (SelfAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (1): T5LayerFF(
              (DenseReluDense): T5DenseReluDense(
                (wi): Linear(in_features=1024, out_features=4096, bias=False)
                (wo): Linear(in_features=4096, out_features=1024, bias=False)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (22): T5Block(
          (layer): ModuleList(
            (0): T5LayerSelfAttention(
              (SelfAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (1): T5LayerFF(
              (DenseReluDense): T5DenseReluDense(
                (wi): Linear(in_features=1024, out_features=4096, bias=False)
                (wo): Linear(in_features=4096, out_features=1024, bias=False)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (23): T5Block(
          (layer): ModuleList(
            (0): T5LayerSelfAttention(
              (SelfAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (1): T5LayerFF(
              (DenseReluDense): T5DenseReluDense(
                (wi): Linear(in_features=1024, out_features=4096, bias=False)
                (wo): Linear(in_features=4096, out_features=1024, bias=False)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
      )
      (final_layer_norm): T5LayerNorm()
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (decoder): T5Stack(
      (embed_tokens): Embedding(32102, 1024)
      (block): ModuleList(
        (0): T5Block(
          (layer): ModuleList(
            (0): T5LayerSelfAttention(
              (SelfAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
                (relative_attention_bias): Embedding(32, 16)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (1): T5LayerCrossAttention(
              (EncDecAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (2): T5LayerFF(
              (DenseReluDense): T5DenseReluDense(
                (wi): Linear(in_features=1024, out_features=4096, bias=False)
                (wo): Linear(in_features=4096, out_features=1024, bias=False)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (1): T5Block(
          (layer): ModuleList(
            (0): T5LayerSelfAttention(
              (SelfAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (1): T5LayerCrossAttention(
              (EncDecAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (2): T5LayerFF(
              (DenseReluDense): T5DenseReluDense(
                (wi): Linear(in_features=1024, out_features=4096, bias=False)
                (wo): Linear(in_features=4096, out_features=1024, bias=False)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (2): T5Block(
          (layer): ModuleList(
            (0): T5LayerSelfAttention(
              (SelfAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (1): T5LayerCrossAttention(
              (EncDecAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (2): T5LayerFF(
              (DenseReluDense): T5DenseReluDense(
                (wi): Linear(in_features=1024, out_features=4096, bias=False)
                (wo): Linear(in_features=4096, out_features=1024, bias=False)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (3): T5Block(
          (layer): ModuleList(
            (0): T5LayerSelfAttention(
              (SelfAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (1): T5LayerCrossAttention(
              (EncDecAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (2): T5LayerFF(
              (DenseReluDense): T5DenseReluDense(
                (wi): Linear(in_features=1024, out_features=4096, bias=False)
                (wo): Linear(in_features=4096, out_features=1024, bias=False)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (4): T5Block(
          (layer): ModuleList(
            (0): T5LayerSelfAttention(
              (SelfAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (1): T5LayerCrossAttention(
              (EncDecAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (2): T5LayerFF(
              (DenseReluDense): T5DenseReluDense(
                (wi): Linear(in_features=1024, out_features=4096, bias=False)
                (wo): Linear(in_features=4096, out_features=1024, bias=False)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (5): T5Block(
          (layer): ModuleList(
            (0): T5LayerSelfAttention(
              (SelfAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (1): T5LayerCrossAttention(
              (EncDecAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (2): T5LayerFF(
              (DenseReluDense): T5DenseReluDense(
                (wi): Linear(in_features=1024, out_features=4096, bias=False)
                (wo): Linear(in_features=4096, out_features=1024, bias=False)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (6): T5Block(
          (layer): ModuleList(
            (0): T5LayerSelfAttention(
              (SelfAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (1): T5LayerCrossAttention(
              (EncDecAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (2): T5LayerFF(
              (DenseReluDense): T5DenseReluDense(
                (wi): Linear(in_features=1024, out_features=4096, bias=False)
                (wo): Linear(in_features=4096, out_features=1024, bias=False)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (7): T5Block(
          (layer): ModuleList(
            (0): T5LayerSelfAttention(
              (SelfAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (1): T5LayerCrossAttention(
              (EncDecAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (2): T5LayerFF(
              (DenseReluDense): T5DenseReluDense(
                (wi): Linear(in_features=1024, out_features=4096, bias=False)
                (wo): Linear(in_features=4096, out_features=1024, bias=False)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (8): T5Block(
          (layer): ModuleList(
            (0): T5LayerSelfAttention(
              (SelfAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (1): T5LayerCrossAttention(
              (EncDecAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (2): T5LayerFF(
              (DenseReluDense): T5DenseReluDense(
                (wi): Linear(in_features=1024, out_features=4096, bias=False)
                (wo): Linear(in_features=4096, out_features=1024, bias=False)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (9): T5Block(
          (layer): ModuleList(
            (0): T5LayerSelfAttention(
              (SelfAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (1): T5LayerCrossAttention(
              (EncDecAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (2): T5LayerFF(
              (DenseReluDense): T5DenseReluDense(
                (wi): Linear(in_features=1024, out_features=4096, bias=False)
                (wo): Linear(in_features=4096, out_features=1024, bias=False)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (10): T5Block(
          (layer): ModuleList(
            (0): T5LayerSelfAttention(
              (SelfAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (1): T5LayerCrossAttention(
              (EncDecAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (2): T5LayerFF(
              (DenseReluDense): T5DenseReluDense(
                (wi): Linear(in_features=1024, out_features=4096, bias=False)
                (wo): Linear(in_features=4096, out_features=1024, bias=False)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (11): T5Block(
          (layer): ModuleList(
            (0): T5LayerSelfAttention(
              (SelfAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (1): T5LayerCrossAttention(
              (EncDecAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (2): T5LayerFF(
              (DenseReluDense): T5DenseReluDense(
                (wi): Linear(in_features=1024, out_features=4096, bias=False)
                (wo): Linear(in_features=4096, out_features=1024, bias=False)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (12): T5Block(
          (layer): ModuleList(
            (0): T5LayerSelfAttention(
              (SelfAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (1): T5LayerCrossAttention(
              (EncDecAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (2): T5LayerFF(
              (DenseReluDense): T5DenseReluDense(
                (wi): Linear(in_features=1024, out_features=4096, bias=False)
                (wo): Linear(in_features=4096, out_features=1024, bias=False)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (13): T5Block(
          (layer): ModuleList(
            (0): T5LayerSelfAttention(
              (SelfAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (1): T5LayerCrossAttention(
              (EncDecAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (2): T5LayerFF(
              (DenseReluDense): T5DenseReluDense(
                (wi): Linear(in_features=1024, out_features=4096, bias=False)
                (wo): Linear(in_features=4096, out_features=1024, bias=False)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (14): T5Block(
          (layer): ModuleList(
            (0): T5LayerSelfAttention(
              (SelfAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (1): T5LayerCrossAttention(
              (EncDecAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (2): T5LayerFF(
              (DenseReluDense): T5DenseReluDense(
                (wi): Linear(in_features=1024, out_features=4096, bias=False)
                (wo): Linear(in_features=4096, out_features=1024, bias=False)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (15): T5Block(
          (layer): ModuleList(
            (0): T5LayerSelfAttention(
              (SelfAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (1): T5LayerCrossAttention(
              (EncDecAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (2): T5LayerFF(
              (DenseReluDense): T5DenseReluDense(
                (wi): Linear(in_features=1024, out_features=4096, bias=False)
                (wo): Linear(in_features=4096, out_features=1024, bias=False)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (16): T5Block(
          (layer): ModuleList(
            (0): T5LayerSelfAttention(
              (SelfAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (1): T5LayerCrossAttention(
              (EncDecAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (2): T5LayerFF(
              (DenseReluDense): T5DenseReluDense(
                (wi): Linear(in_features=1024, out_features=4096, bias=False)
                (wo): Linear(in_features=4096, out_features=1024, bias=False)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (17): T5Block(
          (layer): ModuleList(
            (0): T5LayerSelfAttention(
              (SelfAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (1): T5LayerCrossAttention(
              (EncDecAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (2): T5LayerFF(
              (DenseReluDense): T5DenseReluDense(
                (wi): Linear(in_features=1024, out_features=4096, bias=False)
                (wo): Linear(in_features=4096, out_features=1024, bias=False)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (18): T5Block(
          (layer): ModuleList(
            (0): T5LayerSelfAttention(
              (SelfAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (1): T5LayerCrossAttention(
              (EncDecAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (2): T5LayerFF(
              (DenseReluDense): T5DenseReluDense(
                (wi): Linear(in_features=1024, out_features=4096, bias=False)
                (wo): Linear(in_features=4096, out_features=1024, bias=False)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (19): T5Block(
          (layer): ModuleList(
            (0): T5LayerSelfAttention(
              (SelfAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (1): T5LayerCrossAttention(
              (EncDecAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (2): T5LayerFF(
              (DenseReluDense): T5DenseReluDense(
                (wi): Linear(in_features=1024, out_features=4096, bias=False)
                (wo): Linear(in_features=4096, out_features=1024, bias=False)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (20): T5Block(
          (layer): ModuleList(
            (0): T5LayerSelfAttention(
              (SelfAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (1): T5LayerCrossAttention(
              (EncDecAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (2): T5LayerFF(
              (DenseReluDense): T5DenseReluDense(
                (wi): Linear(in_features=1024, out_features=4096, bias=False)
                (wo): Linear(in_features=4096, out_features=1024, bias=False)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (21): T5Block(
          (layer): ModuleList(
            (0): T5LayerSelfAttention(
              (SelfAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (1): T5LayerCrossAttention(
              (EncDecAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (2): T5LayerFF(
              (DenseReluDense): T5DenseReluDense(
                (wi): Linear(in_features=1024, out_features=4096, bias=False)
                (wo): Linear(in_features=4096, out_features=1024, bias=False)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (22): T5Block(
          (layer): ModuleList(
            (0): T5LayerSelfAttention(
              (SelfAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (1): T5LayerCrossAttention(
              (EncDecAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (2): T5LayerFF(
              (DenseReluDense): T5DenseReluDense(
                (wi): Linear(in_features=1024, out_features=4096, bias=False)
                (wo): Linear(in_features=4096, out_features=1024, bias=False)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (23): T5Block(
          (layer): ModuleList(
            (0): T5LayerSelfAttention(
              (SelfAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (1): T5LayerCrossAttention(
              (EncDecAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (2): T5LayerFF(
              (DenseReluDense): T5DenseReluDense(
                (wi): Linear(in_features=1024, out_features=4096, bias=False)
                (wo): Linear(in_features=4096, out_features=1024, bias=False)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
      )
      (final_layer_norm): T5LayerNorm()
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (lm_head): Linear(in_features=1024, out_features=32102, bias=False)
  )
  (wte): Embedding(10, 1024)
  (control_trans): Sequential(
    (0): Linear(in_features=1024, out_features=512, bias=True)
    (1): Tanh()
    (2): Linear(in_features=512, out_features=49152, bias=True)
  )
  (wte_enc): Embedding(10, 1024)
  (control_trans_enc): Sequential(
    (0): Linear(in_features=1024, out_features=512, bias=True)
    (1): Tanh()
    (2): Linear(in_features=512, out_features=49152, bias=True)
  )
  (wte_dec): Embedding(10, 1024)
  (control_trans_dec): Sequential(
    (0): Linear(in_features=1024, out_features=512, bias=True)
    (1): Tanh()
    (2): Linear(in_features=512, out_features=49152, bias=True)
  )
  (dropout): Dropout(p=0.0, inplace=False)
)

And the

(wte): Embedding(10, 1024)
  (control_trans): Sequential(
    (0): Linear(in_features=1024, out_features=512, bias=True)
    (1): Tanh()
    (2): Linear(in_features=512, out_features=49152, bias=True)
  )
  (wte_enc): Embedding(10, 1024)
  (control_trans_enc): Sequential(
    (0): Linear(in_features=1024, out_features=512, bias=True)
    (1): Tanh()
    (2): Linear(in_features=512, out_features=49152, bias=True)
  )
  (wte_dec): Embedding(10, 1024)
  (control_trans_dec): Sequential(
    (0): Linear(in_features=1024, out_features=512, bias=True)
    (1): Tanh()
    (2): Linear(in_features=512, out_features=49152, bias=True)
  )

part is the prefix weight.

About why we release the whole model(T5 weight and prefix-weight) instead of just prefix-weight is because we want user can directly download and use every single. (And i guess you draw the conclusion we don't have the prefix weight is because the size you saw? IMO).

Hope these information can help!

Thanks!

base-y commented 2 years ago

Hey, thanks for replying back. I loaded the sql2text model using the "AutoModel.from_pretrained" method in transformers. I looked at the model state dict and I just got encoder and Decoder layer weights (didn't find the prefix stuff). I will check again tomorrow. Thank you.

-------- Original Message -------- On Jun 11, 2022, 12:03 AM, Tianbao Xie wrote:

Hi, Thanks for asking!

I printed the model architecture again, and got:

Model( (pretrain_model): T5ForConditionalGeneration( (shared): Embedding(32102, 1024) (encoder): T5Stack( (embed_tokens): Embedding(32102, 1024) (block): ModuleList( (0): T5Block( (layer): ModuleList( (0): T5LayerSelfAttention( (SelfAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) (relative_attention_bias): Embedding(32, 16) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (1): T5LayerFF( (DenseReluDense): T5DenseReluDense( (wi): Linear(in_features=1024, out_features=4096, bias=False) (wo): Linear(in_features=4096, out_features=1024, bias=False) (dropout): Dropout(p=0.1, inplace=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) ) (1): T5Block( (layer): ModuleList( (0): T5LayerSelfAttention( (SelfAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (1): T5LayerFF( (DenseReluDense): T5DenseReluDense( (wi): Linear(in_features=1024, out_features=4096, bias=False) (wo): Linear(in_features=4096, out_features=1024, bias=False) (dropout): Dropout(p=0.1, inplace=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) ) (2): T5Block( (layer): ModuleList( (0): T5LayerSelfAttention( (SelfAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (1): T5LayerFF( (DenseReluDense): T5DenseReluDense( (wi): Linear(in_features=1024, out_features=4096, bias=False) (wo): Linear(in_features=4096, out_features=1024, bias=False) (dropout): Dropout(p=0.1, inplace=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) ) (3): T5Block( (layer): ModuleList( (0): T5LayerSelfAttention( (SelfAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (1): T5LayerFF( (DenseReluDense): T5DenseReluDense( (wi): Linear(in_features=1024, out_features=4096, bias=False) (wo): Linear(in_features=4096, out_features=1024, bias=False) (dropout): Dropout(p=0.1, inplace=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) ) (4): T5Block( (layer): ModuleList( (0): T5LayerSelfAttention( (SelfAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (1): T5LayerFF( (DenseReluDense): T5DenseReluDense( (wi): Linear(in_features=1024, out_features=4096, bias=False) (wo): Linear(in_features=4096, out_features=1024, bias=False) (dropout): Dropout(p=0.1, inplace=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) ) (5): T5Block( (layer): ModuleList( (0): T5LayerSelfAttention( (SelfAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (1): T5LayerFF( (DenseReluDense): T5DenseReluDense( (wi): Linear(in_features=1024, out_features=4096, bias=False) (wo): Linear(in_features=4096, out_features=1024, bias=False) (dropout): Dropout(p=0.1, inplace=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) ) (6): T5Block( (layer): ModuleList( (0): T5LayerSelfAttention( (SelfAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (1): T5LayerFF( (DenseReluDense): T5DenseReluDense( (wi): Linear(in_features=1024, out_features=4096, bias=False) (wo): Linear(in_features=4096, out_features=1024, bias=False) (dropout): Dropout(p=0.1, inplace=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) ) (7): T5Block( (layer): ModuleList( (0): T5LayerSelfAttention( (SelfAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (1): T5LayerFF( (DenseReluDense): T5DenseReluDense( (wi): Linear(in_features=1024, out_features=4096, bias=False) (wo): Linear(in_features=4096, out_features=1024, bias=False) (dropout): Dropout(p=0.1, inplace=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) ) (8): T5Block( (layer): ModuleList( (0): T5LayerSelfAttention( (SelfAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (1): T5LayerFF( (DenseReluDense): T5DenseReluDense( (wi): Linear(in_features=1024, out_features=4096, bias=False) (wo): Linear(in_features=4096, out_features=1024, bias=False) (dropout): Dropout(p=0.1, inplace=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) ) (9): T5Block( (layer): ModuleList( (0): T5LayerSelfAttention( (SelfAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (1): T5LayerFF( (DenseReluDense): T5DenseReluDense( (wi): Linear(in_features=1024, out_features=4096, bias=False) (wo): Linear(in_features=4096, out_features=1024, bias=False) (dropout): Dropout(p=0.1, inplace=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) ) (10): T5Block( (layer): ModuleList( (0): T5LayerSelfAttention( (SelfAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (1): T5LayerFF( (DenseReluDense): T5DenseReluDense( (wi): Linear(in_features=1024, out_features=4096, bias=False) (wo): Linear(in_features=4096, out_features=1024, bias=False) (dropout): Dropout(p=0.1, inplace=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) ) (11): T5Block( (layer): ModuleList( (0): T5LayerSelfAttention( (SelfAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (1): T5LayerFF( (DenseReluDense): T5DenseReluDense( (wi): Linear(in_features=1024, out_features=4096, bias=False) (wo): Linear(in_features=4096, out_features=1024, bias=False) (dropout): Dropout(p=0.1, inplace=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) ) (12): T5Block( (layer): ModuleList( (0): T5LayerSelfAttention( (SelfAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (1): T5LayerFF( (DenseReluDense): T5DenseReluDense( (wi): Linear(in_features=1024, out_features=4096, bias=False) (wo): Linear(in_features=4096, out_features=1024, bias=False) (dropout): Dropout(p=0.1, inplace=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) ) (13): T5Block( (layer): ModuleList( (0): T5LayerSelfAttention( (SelfAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (1): T5LayerFF( (DenseReluDense): T5DenseReluDense( (wi): Linear(in_features=1024, out_features=4096, bias=False) (wo): Linear(in_features=4096, out_features=1024, bias=False) (dropout): Dropout(p=0.1, inplace=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) ) (14): T5Block( (layer): ModuleList( (0): T5LayerSelfAttention( (SelfAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (1): T5LayerFF( (DenseReluDense): T5DenseReluDense( (wi): Linear(in_features=1024, out_features=4096, bias=False) (wo): Linear(in_features=4096, out_features=1024, bias=False) (dropout): Dropout(p=0.1, inplace=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) ) (15): T5Block( (layer): ModuleList( (0): T5LayerSelfAttention( (SelfAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (1): T5LayerFF( (DenseReluDense): T5DenseReluDense( (wi): Linear(in_features=1024, out_features=4096, bias=False) (wo): Linear(in_features=4096, out_features=1024, bias=False) (dropout): Dropout(p=0.1, inplace=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) ) (16): T5Block( (layer): ModuleList( (0): T5LayerSelfAttention( (SelfAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (1): T5LayerFF( (DenseReluDense): T5DenseReluDense( (wi): Linear(in_features=1024, out_features=4096, bias=False) (wo): Linear(in_features=4096, out_features=1024, bias=False) (dropout): Dropout(p=0.1, inplace=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) ) (17): T5Block( (layer): ModuleList( (0): T5LayerSelfAttention( (SelfAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (1): T5LayerFF( (DenseReluDense): T5DenseReluDense( (wi): Linear(in_features=1024, out_features=4096, bias=False) (wo): Linear(in_features=4096, out_features=1024, bias=False) (dropout): Dropout(p=0.1, inplace=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) ) (18): T5Block( (layer): ModuleList( (0): T5LayerSelfAttention( (SelfAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (1): T5LayerFF( (DenseReluDense): T5DenseReluDense( (wi): Linear(in_features=1024, out_features=4096, bias=False) (wo): Linear(in_features=4096, out_features=1024, bias=False) (dropout): Dropout(p=0.1, inplace=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) ) (19): T5Block( (layer): ModuleList( (0): T5LayerSelfAttention( (SelfAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (1): T5LayerFF( (DenseReluDense): T5DenseReluDense( (wi): Linear(in_features=1024, out_features=4096, bias=False) (wo): Linear(in_features=4096, out_features=1024, bias=False) (dropout): Dropout(p=0.1, inplace=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) ) (20): T5Block( (layer): ModuleList( (0): T5LayerSelfAttention( (SelfAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (1): T5LayerFF( (DenseReluDense): T5DenseReluDense( (wi): Linear(in_features=1024, out_features=4096, bias=False) (wo): Linear(in_features=4096, out_features=1024, bias=False) (dropout): Dropout(p=0.1, inplace=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) ) (21): T5Block( (layer): ModuleList( (0): T5LayerSelfAttention( (SelfAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (1): T5LayerFF( (DenseReluDense): T5DenseReluDense( (wi): Linear(in_features=1024, out_features=4096, bias=False) (wo): Linear(in_features=4096, out_features=1024, bias=False) (dropout): Dropout(p=0.1, inplace=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) ) (22): T5Block( (layer): ModuleList( (0): T5LayerSelfAttention( (SelfAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (1): T5LayerFF( (DenseReluDense): T5DenseReluDense( (wi): Linear(in_features=1024, out_features=4096, bias=False) (wo): Linear(in_features=4096, out_features=1024, bias=False) (dropout): Dropout(p=0.1, inplace=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) ) (23): T5Block( (layer): ModuleList( (0): T5LayerSelfAttention( (SelfAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (1): T5LayerFF( (DenseReluDense): T5DenseReluDense( (wi): Linear(in_features=1024, out_features=4096, bias=False) (wo): Linear(in_features=4096, out_features=1024, bias=False) (dropout): Dropout(p=0.1, inplace=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) ) ) (final_layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (decoder): T5Stack( (embed_tokens): Embedding(32102, 1024) (block): ModuleList( (0): T5Block( (layer): ModuleList( (0): T5LayerSelfAttention( (SelfAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) (relative_attention_bias): Embedding(32, 16) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (1): T5LayerCrossAttention( (EncDecAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (2): T5LayerFF( (DenseReluDense): T5DenseReluDense( (wi): Linear(in_features=1024, out_features=4096, bias=False) (wo): Linear(in_features=4096, out_features=1024, bias=False) (dropout): Dropout(p=0.1, inplace=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) ) (1): T5Block( (layer): ModuleList( (0): T5LayerSelfAttention( (SelfAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (1): T5LayerCrossAttention( (EncDecAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (2): T5LayerFF( (DenseReluDense): T5DenseReluDense( (wi): Linear(in_features=1024, out_features=4096, bias=False) (wo): Linear(in_features=4096, out_features=1024, bias=False) (dropout): Dropout(p=0.1, inplace=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) ) (2): T5Block( (layer): ModuleList( (0): T5LayerSelfAttention( (SelfAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (1): T5LayerCrossAttention( (EncDecAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (2): T5LayerFF( (DenseReluDense): T5DenseReluDense( (wi): Linear(in_features=1024, out_features=4096, bias=False) (wo): Linear(in_features=4096, out_features=1024, bias=False) (dropout): Dropout(p=0.1, inplace=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) ) (3): T5Block( (layer): ModuleList( (0): T5LayerSelfAttention( (SelfAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (1): T5LayerCrossAttention( (EncDecAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (2): T5LayerFF( (DenseReluDense): T5DenseReluDense( (wi): Linear(in_features=1024, out_features=4096, bias=False) (wo): Linear(in_features=4096, out_features=1024, bias=False) (dropout): Dropout(p=0.1, inplace=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) ) (4): T5Block( (layer): ModuleList( (0): T5LayerSelfAttention( (SelfAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (1): T5LayerCrossAttention( (EncDecAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (2): T5LayerFF( (DenseReluDense): T5DenseReluDense( (wi): Linear(in_features=1024, out_features=4096, bias=False) (wo): Linear(in_features=4096, out_features=1024, bias=False) (dropout): Dropout(p=0.1, inplace=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) ) (5): T5Block( (layer): ModuleList( (0): T5LayerSelfAttention( (SelfAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (1): T5LayerCrossAttention( (EncDecAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (2): T5LayerFF( (DenseReluDense): T5DenseReluDense( (wi): Linear(in_features=1024, out_features=4096, bias=False) (wo): Linear(in_features=4096, out_features=1024, bias=False) (dropout): Dropout(p=0.1, inplace=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) ) (6): T5Block( (layer): ModuleList( (0): T5LayerSelfAttention( (SelfAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (1): T5LayerCrossAttention( (EncDecAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (2): T5LayerFF( (DenseReluDense): T5DenseReluDense( (wi): Linear(in_features=1024, out_features=4096, bias=False) (wo): Linear(in_features=4096, out_features=1024, bias=False) (dropout): Dropout(p=0.1, inplace=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) ) (7): T5Block( (layer): ModuleList( (0): T5LayerSelfAttention( (SelfAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (1): T5LayerCrossAttention( (EncDecAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (2): T5LayerFF( (DenseReluDense): T5DenseReluDense( (wi): Linear(in_features=1024, out_features=4096, bias=False) (wo): Linear(in_features=4096, out_features=1024, bias=False) (dropout): Dropout(p=0.1, inplace=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) ) (8): T5Block( (layer): ModuleList( (0): T5LayerSelfAttention( (SelfAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (1): T5LayerCrossAttention( (EncDecAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (2): T5LayerFF( (DenseReluDense): T5DenseReluDense( (wi): Linear(in_features=1024, out_features=4096, bias=False) (wo): Linear(in_features=4096, out_features=1024, bias=False) (dropout): Dropout(p=0.1, inplace=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) ) (9): T5Block( (layer): ModuleList( (0): T5LayerSelfAttention( (SelfAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (1): T5LayerCrossAttention( (EncDecAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (2): T5LayerFF( (DenseReluDense): T5DenseReluDense( (wi): Linear(in_features=1024, out_features=4096, bias=False) (wo): Linear(in_features=4096, out_features=1024, bias=False) (dropout): Dropout(p=0.1, inplace=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) ) (10): T5Block( (layer): ModuleList( (0): T5LayerSelfAttention( (SelfAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (1): T5LayerCrossAttention( (EncDecAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (2): T5LayerFF( (DenseReluDense): T5DenseReluDense( (wi): Linear(in_features=1024, out_features=4096, bias=False) (wo): Linear(in_features=4096, out_features=1024, bias=False) (dropout): Dropout(p=0.1, inplace=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) ) (11): T5Block( (layer): ModuleList( (0): T5LayerSelfAttention( (SelfAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (1): T5LayerCrossAttention( (EncDecAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (2): T5LayerFF( (DenseReluDense): T5DenseReluDense( (wi): Linear(in_features=1024, out_features=4096, bias=False) (wo): Linear(in_features=4096, out_features=1024, bias=False) (dropout): Dropout(p=0.1, inplace=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) ) (12): T5Block( (layer): ModuleList( (0): T5LayerSelfAttention( (SelfAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (1): T5LayerCrossAttention( (EncDecAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (2): T5LayerFF( (DenseReluDense): T5DenseReluDense( (wi): Linear(in_features=1024, out_features=4096, bias=False) (wo): Linear(in_features=4096, out_features=1024, bias=False) (dropout): Dropout(p=0.1, inplace=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) ) (13): T5Block( (layer): ModuleList( (0): T5LayerSelfAttention( (SelfAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (1): T5LayerCrossAttention( (EncDecAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (2): T5LayerFF( (DenseReluDense): T5DenseReluDense( (wi): Linear(in_features=1024, out_features=4096, bias=False) (wo): Linear(in_features=4096, out_features=1024, bias=False) (dropout): Dropout(p=0.1, inplace=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) ) (14): T5Block( (layer): ModuleList( (0): T5LayerSelfAttention( (SelfAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (1): T5LayerCrossAttention( (EncDecAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (2): T5LayerFF( (DenseReluDense): T5DenseReluDense( (wi): Linear(in_features=1024, out_features=4096, bias=False) (wo): Linear(in_features=4096, out_features=1024, bias=False) (dropout): Dropout(p=0.1, inplace=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) ) (15): T5Block( (layer): ModuleList( (0): T5LayerSelfAttention( (SelfAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (1): T5LayerCrossAttention( (EncDecAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (2): T5LayerFF( (DenseReluDense): T5DenseReluDense( (wi): Linear(in_features=1024, out_features=4096, bias=False) (wo): Linear(in_features=4096, out_features=1024, bias=False) (dropout): Dropout(p=0.1, inplace=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) ) (16): T5Block( (layer): ModuleList( (0): T5LayerSelfAttention( (SelfAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (1): T5LayerCrossAttention( (EncDecAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (2): T5LayerFF( (DenseReluDense): T5DenseReluDense( (wi): Linear(in_features=1024, out_features=4096, bias=False) (wo): Linear(in_features=4096, out_features=1024, bias=False) (dropout): Dropout(p=0.1, inplace=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) ) (17): T5Block( (layer): ModuleList( (0): T5LayerSelfAttention( (SelfAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (1): T5LayerCrossAttention( (EncDecAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (2): T5LayerFF( (DenseReluDense): T5DenseReluDense( (wi): Linear(in_features=1024, out_features=4096, bias=False) (wo): Linear(in_features=4096, out_features=1024, bias=False) (dropout): Dropout(p=0.1, inplace=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) ) (18): T5Block( (layer): ModuleList( (0): T5LayerSelfAttention( (SelfAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (1): T5LayerCrossAttention( (EncDecAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (2): T5LayerFF( (DenseReluDense): T5DenseReluDense( (wi): Linear(in_features=1024, out_features=4096, bias=False) (wo): Linear(in_features=4096, out_features=1024, bias=False) (dropout): Dropout(p=0.1, inplace=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) ) (19): T5Block( (layer): ModuleList( (0): T5LayerSelfAttention( (SelfAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (1): T5LayerCrossAttention( (EncDecAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (2): T5LayerFF( (DenseReluDense): T5DenseReluDense( (wi): Linear(in_features=1024, out_features=4096, bias=False) (wo): Linear(in_features=4096, out_features=1024, bias=False) (dropout): Dropout(p=0.1, inplace=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) ) (20): T5Block( (layer): ModuleList( (0): T5LayerSelfAttention( (SelfAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (1): T5LayerCrossAttention( (EncDecAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (2): T5LayerFF( (DenseReluDense): T5DenseReluDense( (wi): Linear(in_features=1024, out_features=4096, bias=False) (wo): Linear(in_features=4096, out_features=1024, bias=False) (dropout): Dropout(p=0.1, inplace=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) ) (21): T5Block( (layer): ModuleList( (0): T5LayerSelfAttention( (SelfAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (1): T5LayerCrossAttention( (EncDecAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (2): T5LayerFF( (DenseReluDense): T5DenseReluDense( (wi): Linear(in_features=1024, out_features=4096, bias=False) (wo): Linear(in_features=4096, out_features=1024, bias=False) (dropout): Dropout(p=0.1, inplace=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) ) (22): T5Block( (layer): ModuleList( (0): T5LayerSelfAttention( (SelfAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (1): T5LayerCrossAttention( (EncDecAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (2): T5LayerFF( (DenseReluDense): T5DenseReluDense( (wi): Linear(in_features=1024, out_features=4096, bias=False) (wo): Linear(in_features=4096, out_features=1024, bias=False) (dropout): Dropout(p=0.1, inplace=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) ) (23): T5Block( (layer): ModuleList( (0): T5LayerSelfAttention( (SelfAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (1): T5LayerCrossAttention( (EncDecAttention): T5Attention( (q): Linear(in_features=1024, out_features=1024, bias=False) (k): Linear(in_features=1024, out_features=1024, bias=False) (v): Linear(in_features=1024, out_features=1024, bias=False) (o): Linear(in_features=1024, out_features=1024, bias=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (2): T5LayerFF( (DenseReluDense): T5DenseReluDense( (wi): Linear(in_features=1024, out_features=4096, bias=False) (wo): Linear(in_features=4096, out_features=1024, bias=False) (dropout): Dropout(p=0.1, inplace=False) ) (layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) ) ) (final_layer_norm): T5LayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (lm_head): Linear(in_features=1024, out_features=32102, bias=False) ) (wte): Embedding(10, 1024) (control_trans): Sequential( (0): Linear(in_features=1024, out_features=512, bias=True) (1): Tanh() (2): Linear(in_features=512, out_features=49152, bias=True) ) (wte_enc): Embedding(10, 1024) (control_trans_enc): Sequential( (0): Linear(in_features=1024, out_features=512, bias=True) (1): Tanh() (2): Linear(in_features=512, out_features=49152, bias=True) ) (wte_dec): Embedding(10, 1024) (control_trans_dec): Sequential( (0): Linear(in_features=1024, out_features=512, bias=True) (1): Tanh() (2): Linear(

Timothyxxx commented 2 years ago

Oh I see. Please don't use the from_pretrained to load the model, since the huggingface don't have the prefix-tuning model in their package. Please check out our colab to know how use our code to load weight!

Timothyxxx commented 2 years ago

Contact us if you have further questions!