worldbank / REaLTabFormer

A suite of auto-regressive and Seq2Seq (sequence-to-sequence) transformer models for tabular and relational synthetic data generation.
https://worldbank.github.io/REaLTabFormer/
MIT License

Bug in model.sample() when column contains integer values while column type is string. #36

Open echatzikyriakidis opened 1 year ago

echatzikyriakidis commented 1 year ago

Hi @avsolatorio,

I had to recreate this issue because, for some reason, I couldn't reopen the original one.

I tested the fix on the main branch, but it does not seem to work as expected: the model still generates new values when a column has string type but contains numeric values.

I have added a zip with a notebook that demonstrates the case.

What do you think?

Originally posted by @echatzikyriakidis in https://github.com/worldbank/REaLTabFormer/issues/31#issuecomment-1635683412

echatzikyriakidis commented 11 months ago

Hi @avsolatorio !

Is there any news on this? The fix from the PR does not seem to work. The correct behavior is to not parse columns containing strings as ints/floats/datetimes, even when that is possible. If a column contains strings, it is a string column. We need this refactoring so that REaLTabFormer handles string/text columns as categorical and does not generate new values as a result of parsing them to int/float/datetime.
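To illustrate why "can be parsed" is a poor criterion, here is a minimal pandas-only sketch. The `zip_code` column and its values are made up for illustration, and the sketch does not call REaLTabFormer at all; it only shows that a string column of digits parses cleanly to integers and loses information on the way.

```python
import pandas as pd

# Hypothetical example data: codes stored as strings. They look numeric,
# but they are labels, not quantities.
df = pd.DataFrame({"zip_code": ["00501", "10001", "94105"]})

# Every value in the column is a Python str.
all_strings = df["zip_code"].map(lambda v: isinstance(v, str)).all()

# Parsing succeeds, which is the trap: the result is integer-typed
# and the leading zeros are gone.
parsed = pd.to_numeric(df["zip_code"])

print(bool(all_strings))  # True
print(parsed.tolist())    # [501, 10001, 94105]
```

Once a column like this is treated as numeric, a generative model is free to produce integers that never appeared in the training data, which matches the behavior reported above.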

Maybe we could use the following functions in the library to identify if a pd.Series column is text, integer, float, etc. and only then behave accordingly.

import numpy as np

def is_first_non_na_value_text(series_values):
    # .iloc[0] is positional, so it still works when the first row is missing
    return isinstance(series_values.dropna().iloc[0], str)

def is_first_non_na_value_integer(series_values):
    return isinstance(series_values.dropna().iloc[0], (int, np.integer))

def is_first_non_na_value_numerical(series_values):
    # np.floating is used because the np.float alias was removed in NumPy 1.24
    return isinstance(series_values.dropna().iloc[0], (float, np.floating))

When data is loaded from a database with pandas' SQL functions (instead of from CSV files), the values are sometimes NumPy ints/floats rather than Python's int/float. That is why the functions above also check np.integer and np.floating. np.integer matches both np.int32 and np.int64, and np.floating similarly matches np.float16, np.float32 and np.float64. The functions check the first non-null value because some columns have missing values, possibly in the first row.
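As a rough sketch of how such checks could be combined into a single dispatch step, the snippet below classifies each column by its first non-null value. The name `infer_column_kind`, the returned labels, and the sample DataFrame are hypothetical and are not part of REaLTabFormer.

```python
import numpy as np
import pandas as pd


def infer_column_kind(series: pd.Series) -> str:
    """Classify a column by the type of its first non-null value."""
    non_null = series.dropna()
    if non_null.empty:
        return "empty"
    first = non_null.iloc[0]  # positional, independent of the index labels
    if isinstance(first, str):
        return "text"
    # bool is a subclass of int, so it has to be checked before integers.
    if isinstance(first, (bool, np.bool_)):
        return "boolean"
    if isinstance(first, (int, np.integer)):
        return "integer"
    if isinstance(first, (float, np.floating)):
        return "float"
    return "other"


# Hypothetical data: "code" holds numeric-looking strings and starts with a
# missing value; "price" also starts with a missing value.
df = pd.DataFrame(
    {
        "code": [None, "101", "202"],
        "count": [1, 2, 3],
        "price": [np.nan, 2.5, 3.0],
    }
)

kinds = {name: infer_column_kind(df[name]) for name in df.columns}
print(kinds)  # {'code': 'text', 'count': 'integer', 'price': 'float'}
```

Checking `str` first means a string column is reported as text no matter what its values look like, so it could then be routed to categorical handling instead of being parsed.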

Would it be possible to make this refactoring? Could you please help us with this?

Thanks!