Closed echatzikyriakidis closed 1 year ago
Also, this seems to have a limit of 512 by default. Can we increase it to 1024 or more?
Hello @echatzikyriakidis, you can set the output_max_length argument to None, since the code checks for it here: https://github.com/avsolatorio/REaLTabFormer/blob/bf1a38ef8f202372956ac57a363289c505967982/src/realtabformer/data_utils.py#L745
This allows the data to contain arbitrary sequence lengths. Also, the code is implemented to adjust to the longest sequence automatically and set the relational_max_length variable, see: https://github.com/avsolatorio/REaLTabFormer/blob/bf1a38ef8f202372956ac57a363289c505967982/src/realtabformer/realtabformer.py#L965
and update the encoder/decoder model's positional-encoding parameters to support the change.
This means the model can handle sequences of any length. Setting output_max_length=None is reasonable if your expected sequence lengths are fairly homogeneous. Otherwise, training time may suffer heavily, since training batches can be unnecessarily padded when one observation is much longer than the rest. In any case, it's a trade-off you can test if the information from the long-sequence observations is important for your use case. 😀
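The padding cost described above can be illustrated with a quick back-of-the-envelope calculation (the sequence lengths below are hypothetical, not from the library):

```python
# Hypothetical tokenized sequence lengths for one training batch,
# with a single long outlier observation.
seq_lens = [40, 45, 38, 42, 1000]

# With output_max_length=None, the batch is padded to the longest sequence.
padded_total = len(seq_lens) * max(seq_lens)  # token slots actually processed
useful_total = sum(seq_lens)                  # tokens carrying real data
waste = 1 - useful_total / padded_total

print(f"{waste:.0%} of the batch is padding")  # ~77% wasted computation
```

A single long observation can dominate the batch shape, which is why capping output_max_length can speed up training considerably when lengths are skewed.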
Hi @avsolatorio,
It sounds great. By exploring the distribution of the cardinality of the table relationships, one could specify output_max_length separately for each relational model to get the best of both worlds (training speed vs. loss of information).
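A minimal sketch of that cardinality exploration, using only the standard library (the parent-key list is made up for illustration; in practice you would take the foreign-key column of each child table):

```python
from collections import Counter

# Hypothetical foreign-key column of a child table: one parent id per child row.
child_parent_ids = [1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 4]

# Distribution of children per parent.
counts = sorted(Counter(child_parent_ids).values())
print(counts)  # [1, 2, 3, 8]

# Pick, e.g., a high-quantile cardinality as a cap instead of the maximum,
# trading a few truncated/skipped parents for shorter padded batches.
q = 0.95
cap = counts[min(int(q * len(counts)), len(counts) - 1)]
print(cap)
```

Multiplying the chosen cardinality cap by the token length of one encoded child row would then give a per-model estimate for output_max_length.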
However, I still have a question related to it. How does this functionality work, given that GPT2 has a limit? I think it has a maximum input/output length of 1024 tokens.
@echatzikyriakidis, the pre-trained GPT2 has a max token limit. However, REaLTabFormer only uses the architecture of GPT2 and does not resume from the pre-trained GPT2 model. So we can customize the context length as we need to when training the model. 😀
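To illustrate the point about training from scratch: with Hugging Face transformers, the 1024-token limit only applies to the pre-trained checkpoint's learned positional embeddings; a freshly initialized GPT2 can use any context length via n_positions. The sizes below are arbitrary, purely for illustration, and are not REaLTabFormer's actual settings:

```python
from transformers import GPT2Config, GPT2LMHeadModel

longest_seq = 2048  # hypothetical longest tokenized observation in the data

config = GPT2Config(
    vocab_size=5000,          # assumed vocabulary built from the table values
    n_positions=longest_seq,  # context length is free when not loading pre-trained weights
    n_embd=256,               # small illustrative architecture
    n_layer=4,
    n_head=4,
)

# Fresh weights (no from_pretrained), so no 1024-token constraint applies.
model = GPT2LMHeadModel(config)
print(model.config.n_positions)  # 2048
```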
Thank you!
Hi @avsolatorio,
I see there is a limit on the max output length, and whenever we exceed it the training example is skipped. Can we somehow overcome this problem and be able to learn to generate examples with many child rows?
https://github.com/avsolatorio/REaLTabFormer/blob/bf1a38ef8f202372956ac57a363289c505967982/src/realtabformer/data_utils.py#L755