worldbank / REaLTabFormer

A suite of auto-regressive and Seq2Seq (sequence-to-sequence) transformer models for tabular and relational synthetic data generation.
https://worldbank.github.io/REaLTabFormer/
MIT License

Relational Training: CUDA OOM Out of Memory Error #22

Closed solitaryangler closed 1 year ago

solitaryangler commented 1 year ago

Hi,

Thanks for developing and releasing this codebase. I'm using it to train on tabular data. I tried both the tabular and the relational formats, but in the relational format I'm getting a CUDA OOM (out of memory) error.

Original Table (Raw data): The original table has only 8 cols x 10,000 rows (which I have subsampled for testing). The model in Tabular mode trains perfectly fine and I am able to generate synthetic samples.

Relational Table Format (Parent / Child): In the relational format the tables have the following statistics:

However, in this case:

I have tried this on GCP with

I suspect the Relational format Child model fails because it requires both the Parent & Child tables to be loaded into GPU memory. But the dataset is tiny. How can I overcome the OOM error?

Do you have any suggestions?

avsolatorio commented 1 year ago

Hi @solitaryangler, you are correct; I think the issue is with the relational model. In particular, it may be due to the large number of child rows that a single parent row has. You can try setting output_max_length to a value such as 1024 or 2048 to see whether that prevents the OOM. Note, however, that this limits the number of children used in training the model.
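To check whether a few parents with very many children are the likely cause, it can help to look at how child rows are distributed across parents before training. The following is a minimal sketch using only the standard library; the function name `children_per_parent` and the sample keys are hypothetical and not part of REaLTabFormer.

```python
from collections import Counter
from statistics import median


def children_per_parent(child_keys):
    """Summarize how many child rows each parent key has.

    `child_keys` is an iterable with one parent key per child row
    (assumed to be non-empty). Returns (largest, median, counts).
    """
    counts = Counter(child_keys)
    sizes = list(counts.values())
    return max(sizes), median(sizes), counts


# Hypothetical join-key values of a child table, one entry per child row.
child_keys = ["p1", "p1", "p1", "p2", "p3", "p3"]

largest, typical, counts = children_per_parent(child_keys)
print("largest group:", largest)
print("median group:", typical)
print("heaviest parent:", counts.most_common(1))
```

With a pandas DataFrame, the keys would be the join-key column of the child table, for example `child_df["parent_id"].tolist()`, where both names are placeholders. A largest group far above the median suggests that a handful of parents dominate the sequence length.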

Also, you can try experimenting with not using the trained parent model and instead training the full relational model from scratch. Just set parent_realtabformer_path=None. We got better results training the full relational model from scratch in one of our experiments. :)

So you can have something like:

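from realtabformer import REaLTabFormer

child_model = REaLTabFormer(
    model_type="relational",
    parent_realtabformer_path=None,  # train the full relational model from scratch
    output_max_length=2048,  # limits the number of children used in training
    train_size=0.8)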

Hope this helps!

solitaryangler commented 1 year ago

Hi @avsolatorio

Thanks for your comments. You were absolutely right. Once I limited the child entries per parent row, then the model trained fine. I am closing this issue as resolved.
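For anyone who wants to limit the child entries per parent in the data itself, separately from output_max_length, one possible approach is to truncate the child table before training. This is a minimal sketch under stated assumptions: the helper `cap_children`, the column name `parent_id`, and the sample rows are all hypothetical, and it is not necessarily what was done in this issue.

```python
from collections import defaultdict


def cap_children(child_rows, key, max_children):
    """Keep at most `max_children` rows per parent key, preserving row order."""
    seen = defaultdict(int)  # number of rows already kept for each parent
    kept = []
    for row in child_rows:
        parent = row[key]
        if seen[parent] < max_children:
            kept.append(row)
            seen[parent] += 1
    return kept


# Hypothetical child table as a list of records.
rows = [
    {"parent_id": "p1", "amount": 10},
    {"parent_id": "p1", "amount": 20},
    {"parent_id": "p1", "amount": 30},
    {"parent_id": "p2", "amount": 40},
]

capped = cap_children(rows, key="parent_id", max_children=2)
print(capped)
```

With pandas, `child_df.groupby("parent_id").head(2)` does the same thing, again with placeholder names. Keeping the first rows of each parent discards the rest, so the synthetic data will not reflect parents with more children than the cap.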

My apologies for the late reply; I was traveling.

avsolatorio commented 1 year ago

Hello @solitaryangler, no worries! I am so glad that it's working now. 😀