worldbank / REaLTabFormer

A suite of auto-regressive and Seq2Seq (sequence-to-sequence) transformer models for tabular and relational synthetic data generation.
https://worldbank.github.io/REaLTabFormer/
MIT License

Python datetime.date data type is handled as str and datatype handling in general #64

Open efstathios-chatzikyriakidis opened 9 months ago

efstathios-chatzikyriakidis commented 9 months ago

Hi @avsolatorio,

In one of my recent tests, I utilized a dataframe containing Python's datetime.date within a column. As expected, Pandas identifies this column's data type as object, a common outcome for anything beyond int/float/datetime.

I believe it's important not to automatically convert object types like this into strings, as it can lead to issues. For instance, in my case, the model learns to treat the date column as a string, resulting in string outputs rather than preserving the original datatype or generating new values. Converting Python's dates to timestamps could offer a solution, treating them as numerical data and enabling the generation of new values for synthetic data.
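As a rough illustration of the workaround, here is a minimal sketch that uses only pandas, outside the library. The column names and sample values are made up for the example. It shows that a column of datetime.date objects gets the object dtype, and that it can be converted to a datetime column before fitting and back to datetime.date afterwards.

```python
import datetime as dt

import pandas as pd

# Hypothetical sample data: a column holding plain datetime.date objects.
df = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "signup": [dt.date(2023, 1, 5), dt.date(2023, 2, 10), dt.date(2023, 3, 15)],
    }
)

# pandas stores datetime.date values in an object column.
print(df["signup"].dtype)

# Convert to a real datetime column so it is no longer seen as object/str.
encoded = pd.to_datetime(df["signup"])
print(pd.api.types.is_datetime64_any_dtype(encoded))

# After sampling, the same accessor turns timestamps back into datetime.date.
restored = encoded.dt.date
print(type(restored.iloc[0]))
```

This only addresses the manual conversion on the user side; whether the library should do it automatically is the question raised here.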

Past issues (https://github.com/worldbank/REaLTabFormer/issues/33, https://github.com/worldbank/REaLTabFormer/issues/31, https://github.com/worldbank/REaLTabFormer/issues/36) have shown various problems related to datatype handling. To address these comprehensively, I propose adding a dictionary parameter to the fit method or the REaLTabFormer constructor. This parameter would let us specify which columns are categorical, regardless of their datatype. If a column is marked as categorical, the library would treat it as a string internally (regardless of its true datatype), generate new values, and return them to the user in the original datatype rather than as strings. If a column is not marked as categorical, it has to be a string column in order to be treated as categorical. If the column is not categorical (e.g. int, float, timestamp, date, or anything else), we need to handle it as non-string, generate new values, and preserve the original datatype in the output.
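To make the proposal more concrete, here is a minimal sketch of how such a dictionary could drive encoding and decoding around the model. Everything in it is hypothetical: the `is_categorical` dictionary and the `encode_frame` and `decode_frame` helpers are names invented for this example and are not part of the REaLTabFormer API. The model itself is left out, and the encoded frame stands in for the sampled output.

```python
import datetime as dt

import pandas as pd


def encode_frame(df, is_categorical):
    """Hypothetical helper: encode columns and record how to restore them."""
    encoded = df.copy()
    schema = {}
    for col in df.columns:
        if is_categorical.get(col, False):
            # Treat as string, but remember string -> original value.
            lookup = {str(v): v for v in df[col].unique()}
            schema[col] = ("categorical", lookup)
            encoded[col] = df[col].map(str)
        elif df[col].map(lambda v: type(v) is dt.date).all():
            # datetime.date objects become a proper datetime column.
            schema[col] = ("date", None)
            encoded[col] = pd.to_datetime(df[col])
        else:
            # Leave other columns alone and remember their dtype.
            schema[col] = ("passthrough", df[col].dtype)
    return encoded, schema


def decode_frame(synthetic, schema):
    """Hypothetical helper: restore the original datatypes after sampling."""
    decoded = synthetic.copy()
    for col, (kind, info) in schema.items():
        if kind == "categorical":
            # Strings not seen during encoding would map to NaN here.
            decoded[col] = synthetic[col].map(info)
        elif kind == "date":
            decoded[col] = pd.to_datetime(synthetic[col]).dt.date
        else:
            decoded[col] = synthetic[col].astype(info)
    return decoded


df = pd.DataFrame(
    {
        "zip_code": [10115, 20095, 10115],
        "signup": [dt.date(2023, 1, 5), dt.date(2023, 2, 10), dt.date(2023, 3, 15)],
        "amount": [9.5, 12.0, 7.25],
    }
)

# Integer column that should be modeled as categorical, not numeric.
is_categorical = {"zip_code": True}

encoded, schema = encode_frame(df, is_categorical)
restored = decode_frame(encoded, schema)  # encoded stands in for model samples
```

In this sketch a categorical column can only return values it has already seen, while non-categorical columns such as dates can take new values and are still converted back to their original type.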

The following requirements aim to address many of the issues we've encountered:

What do you think of this new functionality? I don't know how difficult it would be to implement, as it might affect many things.