zhao-zilong / Tabula

Official git for "TabuLa: Harnessing Language Models for Tabular Data Synthesis"

Generated Data Shape always 0 #9

Open iamamiramine opened 3 months ago

iamamiramine commented 3 months ago

I am facing an issue when generating data using Tabula.

I trained Tabula on the following datasets:

  1. Census
  2. Fake Hotel Guests
  3. Adult
  4. Health
  5. News

However, during generation the sampling loop never terminates because the generated data always has 0 rows (gen_data.shape[0] stays at 0, so num_samples is always greater than it).
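For anyone debugging the same symptom, here is a minimal sketch of a sampling loop with an attempt cap, so that an empty result raises an error instead of spinning forever. It is not Tabula's actual code: `sample_with_cap`, `sample_batch`, and `max_attempts` are hypothetical names, and `sample_batch` stands in for whatever call returns the rows that survived filtering.

```python
from typing import Callable, List

import pandas as pd


def sample_with_cap(
    sample_batch: Callable[[int], pd.DataFrame],
    num_samples: int,
    max_attempts: int = 10,
) -> pd.DataFrame:
    """Collect rows until num_samples is reached or max_attempts is exhausted.

    sample_batch is a hypothetical stand-in: it takes the number of rows still
    needed and returns a DataFrame of rows that passed validation (possibly empty).
    """
    parts: List[pd.DataFrame] = []
    collected = 0
    for _ in range(max_attempts):
        batch = sample_batch(num_samples - collected)
        parts.append(batch)
        collected += len(batch)
        if collected >= num_samples:
            return pd.concat(parts, ignore_index=True).head(num_samples)
    raise RuntimeError(
        f"only {collected} of {num_samples} valid rows after {max_attempts} attempts"
    )


# A sampler that always yields valid rows finishes on the first attempt.
ok = sample_with_cap(lambda n: pd.DataFrame({"x": range(n)}), num_samples=5)

# A sampler whose rows are all rejected fails fast instead of hanging.
try:
    sample_with_cap(lambda n: pd.DataFrame({"x": []}), num_samples=5, max_attempts=3)
except RuntimeError as err:
    failure = str(err)
```

Failing fast like this at least makes it possible to inspect the rejected rows rather than waiting on a loop that cannot finish.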

I tried re-training, and tried changing the max_length parameter in the sampling function, but it was of no help.

Can you please help me figure out how to fix this issue?

zhao-zilong commented 1 month ago

Hi @iamamiramine, sorry, I only just saw your message. Did you solve it? One possible cause is that max_length is too small, so the model cannot generate one complete row of data.

iamamiramine commented 1 month ago

Hello, I tried changing the max_length parameter and it did not work. Another thing to note is that the Fake Hotel Guests dataset has only 9 columns, so one row from this dataset is relatively short.

omaralvarez commented 3 days ago

I am also having this problem. I am using max_length=1024, the maximum; if I use more I get a CUDA error. With this dataset I cannot get a single sample:

from imblearn.datasets import fetch_datasets

sick = fetch_datasets()['sick']
sick.data.shape

zhao-zilong commented 3 days ago

Hi @omaralvarez @iamamiramine

You do not need to set max_length as high as 1024. You can uncomment this part of the code to see the length of your encoded rows:

https://github.com/zhao-zilong/Tabula/blob/3869567c681b6c7cb1051bed75b5cb9ccfd2fa3a/tabula/tabula_dataset.py#L64

Let me know if that helps.
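If editing the library is inconvenient, a similar check can be sketched outside it. The snippet below is only an illustration: `encode_row` uses a made-up serialization (not necessarily the one Tabula uses), `max_encoded_length` is a hypothetical helper, and whitespace splitting stands in for the real tokenizer, so it will undercount subword tokens.

```python
from typing import Callable, Dict, Iterable, List


def encode_row(row: Dict[str, object]) -> str:
    # Hypothetical serialization, assumed for illustration only.
    return ", ".join(f"{name} {value}" for name, value in row.items())


def max_encoded_length(
    rows: Iterable[Dict[str, object]],
    tokenize: Callable[[str], List],
) -> int:
    """Return the token count of the longest encoded row."""
    return max(len(tokenize(encode_row(row))) for row in rows)


rows = [
    {"age": 39, "workclass": "State-gov"},
    {"age": 50, "workclass": "Self emp not inc"},
]

# str.split is a stand-in tokenizer. With a real model you would pass the
# tokenizer's own encode function instead, for example the `encode` method of
# a Hugging Face tokenizer matching the language model you trained.
longest = max_encoded_length(rows, str.split)
print(longest)
```

Whatever number the real tokenizer reports, max_length only needs to sit comfortably above it.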

omaralvarez commented 1 day ago

Yes, I don't think it has to do with max_length. The issue in this case is that some numbers in the predicted dataframe are always outside the requested ranges, so those rows are always filtered out. I have tried changing the temperature, k, and the number of training epochs, to no avail.
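To see which columns are responsible, something like the following pandas sketch can help. It assumes the filter compares each numeric column against the training data's min and max, which may not match the library's exact rule, and `out_of_range_report` is a hypothetical helper name. It also assumes the unfiltered generated rows are available as a DataFrame with the same column names as the training data.

```python
import pandas as pd


def out_of_range_report(train: pd.DataFrame, generated: pd.DataFrame) -> pd.DataFrame:
    """Per numeric training column, count generated values outside [min, max]."""
    records = []
    for col in train.select_dtypes(include="number").columns:
        lo, hi = train[col].min(), train[col].max()
        # Coerce to numeric so unparsable strings become NaN instead of raising.
        values = pd.to_numeric(generated[col], errors="coerce")
        records.append(
            {
                "column": col,
                "train_min": lo,
                "train_max": hi,
                "n_unparsable": int(values.isna().sum()),
                # Comparisons with NaN are False, so NaN is not double-counted.
                "n_out_of_range": int(((values < lo) | (values > hi)).sum()),
            }
        )
    return pd.DataFrame(records).set_index("column")


# Toy data standing in for the real training set and raw model output.
train = pd.DataFrame({"age": [20, 35, 50], "hours": [10.0, 40.0, 60.0]})
generated = pd.DataFrame({"age": ["25", "70", "abc"], "hours": ["40", "41", "5"]})

report = out_of_range_report(train, generated)
print(report)
```

A column where nearly every value is unparsable or out of range is the one to look at first.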