worldbank / REaLTabFormer

A suite of auto-regressive and Seq2Seq (sequence-to-sequence) transformer models for tabular and relational synthetic data generation.
https://worldbank.github.io/REaLTabFormer/
MIT License

Is there a way to overcome output max length limitation? #11

Closed · echatzikyriakidis closed this issue 1 year ago

echatzikyriakidis commented 1 year ago

Hi @avsolatorio,

I see there is a limit on the max output length, and whenever a training example exceeds it, the example is skipped. Can we somehow overcome this limitation so the model can learn to generate examples with many child rows?

https://github.com/avsolatorio/REaLTabFormer/blob/bf1a38ef8f202372956ac57a363289c505967982/src/realtabformer/data_utils.py#L755

echatzikyriakidis commented 1 year ago

Also, this seems to be limited to 512 by default. Can we increase it to 1024 or more?

avsolatorio commented 1 year ago

Hello @echatzikyriakidis, you can set the output_max_length argument to None, since the code checks for it here: https://github.com/avsolatorio/REaLTabFormer/blob/bf1a38ef8f202372956ac57a363289c505967982/src/realtabformer/data_utils.py#L745

This will allow the data to contain arbitrary sequence lengths. The code is also implemented to adjust automatically to the longest sequence and set the relational_max_length variable, see: https://github.com/avsolatorio/REaLTabFormer/blob/bf1a38ef8f202372956ac57a363289c505967982/src/realtabformer/realtabformer.py#L965

and to update the encoder/decoder model's positional encoding parameters to support the change:

https://github.com/avsolatorio/REaLTabFormer/blob/bf1a38ef8f202372956ac57a363289c505967982/src/realtabformer/realtabformer.py#L862

This means the model is able to handle any sequence length. Setting output_max_length=None is reasonable if your expected sequence lengths are fairly homogeneous. Otherwise, training time might suffer heavily, since training batches may be padded unnecessarily when one observation is much longer than the rest. In any case, it's a trade-off you can test if the information in the long-sequence observations is important for your use case. 😀
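For reference, a minimal sketch of how that might look, following the relational-model pattern from the project README (the data frames, CSV paths, and the `unique_id` key column are placeholders):

```python
import os
import pandas as pd
from pathlib import Path
from realtabformer import REaLTabFormer

# Placeholder inputs: two tables sharing a foreign-key column.
parent_df = pd.read_csv("parent.csv")
child_df = pd.read_csv("child.csv")
join_on = "unique_id"

# Fit the parent (tabular) model first, without the key column.
parent_model = REaLTabFormer(model_type="tabular")
parent_model.fit(parent_df.drop(join_on, axis=1))

pdir = Path("rtf_parent/")
parent_model.save(pdir)

# Pick the most recently saved parent model directory.
parent_model_path = sorted(
    [p for p in pdir.glob("id*") if p.is_dir()],
    key=os.path.getmtime)[-1]

# output_max_length=None lifts the cap: the relational model
# sizes its positional encodings to the longest child sequence
# observed in the training data.
child_model = REaLTabFormer(
    model_type="relational",
    parent_realtabformer_path=parent_model_path,
    output_max_length=None)
child_model.fit(df=child_df, in_df=parent_df, join_on=join_on)
```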

echatzikyriakidis commented 1 year ago

Hi @avsolatorio,

It sounds great. By exploring the distribution of cardinalities in the table relationships, one could specify output_max_length separately for each relational model to get the best of both worlds (training speed vs. loss of information); a rough sketch follows.
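As a sketch of that exploration (the child table and the `parent_id` key column are hypothetical; note that `output_max_length` counts encoded tokens rather than rows, so the row counts are only a proxy):

```python
import pandas as pd

# child_df: the child table; "parent_id" is a hypothetical
# foreign-key column linking each child row to its parent.
counts = child_df.groupby("parent_id").size()
print(counts.describe())

# Covering, say, 99% of parents; the few parents with more
# child rows would be truncated or skipped during training.
print("99th percentile of child-row counts:", counts.quantile(0.99))
```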

However, I still have a question about this. How does this functionality work, given that GPT2 has a limit? I think it has a maximum input/output length of 1024 tokens.

avsolatorio commented 1 year ago

@echatzikyriakidis, the pre-trained GPT2 has a max token limit. However, REaLTabFormer only uses the GPT2 architecture and does not resume from the pre-trained GPT2 weights, so we can customize the context length as needed when training the model. 😀
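For context, this is the standard pattern with Hugging Face `transformers`: build a fresh GPT2 configuration with a larger `n_positions` and initialize the weights from scratch (a minimal sketch, not REaLTabFormer's actual internals):

```python
from transformers import GPT2Config, GPT2LMHeadModel

# GPT2 architecture with a 2048-token context window,
# randomly initialized (no pre-trained weights loaded).
config = GPT2Config(n_positions=2048)
model = GPT2LMHeadModel(config)

# By contrast, GPT2LMHeadModel.from_pretrained("gpt2") loads
# weights whose positional embeddings are fixed at 1024 positions.
```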

echatzikyriakidis commented 1 year ago

Thank you!