Hi @avsolatorio,

In one of my recent tests, I used a dataframe with a column of Python `datetime.date` values. As expected, Pandas reports this column's dtype as `object`, which is the usual outcome for anything other than int/float/datetime.

I think object columns like this should not be automatically converted to strings, because it causes problems. In my case, the model learns to treat the `date` column as a string, so the output contains strings instead of preserving the original datatype, and no new values are generated. Converting Python dates to timestamps could be a solution: they would be treated as numerical data, which allows new values to be generated for the synthetic data.
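
To illustrate the timestamp idea, here is a minimal sketch using plain Pandas. It is not REaLTabFormer code; the helper names `dates_to_timestamps` and `timestamps_to_dates` are made up for this example.

```python
import datetime as dt

import pandas as pd


def dates_to_timestamps(col: pd.Series) -> pd.Series:
    """Convert an object column of datetime.date values to datetime64."""
    return pd.to_datetime(col)


def timestamps_to_dates(col: pd.Series) -> pd.Series:
    """Convert a datetime64 column back to datetime.date objects."""
    return col.dt.date


df = pd.DataFrame({"date": [dt.date(2024, 1, 1), dt.date(2024, 3, 15)]})

# Pandas stores datetime.date values in an object column.
assert df["date"].dtype == object

timestamps = dates_to_timestamps(df["date"])  # datetime64, can be modeled as a timestamp
restored = timestamps_to_dates(timestamps)    # datetime.date objects again
```

The round trip gives back `datetime.date` objects, so the same conversion could in principle be applied to the generated column before returning it to the user.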
Past issues (https://github.com/worldbank/REaLTabFormer/issues/33, https://github.com/worldbank/REaLTabFormer/issues/31, https://github.com/worldbank/REaLTabFormer/issues/36) have shown various problems related to datatype handling. To address them comprehensively, I propose adding a dictionary parameter to the fit method or to the REaLTabFormer constructor. This parameter would let the user specify which columns are categorical, regardless of their datatype.

If a column is marked as categorical, the library would treat it as a string internally (regardless of its true datatype), generate values, and return them to the user in the original datatype rather than as strings. A column that is not marked as categorical would be treated as categorical only if it is a string column. If the column is not categorical (e.g. int, float, timestamp, date, whatever), the library would handle it as non-string, generate new values, and preserve the original datatype in the output.
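
A rough sketch of the rule I have in mind, written as a standalone function. Everything here is hypothetical: `plan_column_handling` and the shape of the `categorical` dictionary are only an illustration of the proposal, not an existing REaLTabFormer API.

```python
import datetime as dt

import pandas as pd


def plan_column_handling(df: pd.DataFrame, categorical: dict) -> dict:
    """Decide per column whether to model it as categorical or as a typed value.

    `categorical` is the hypothetical user-supplied dictionary that maps a
    column name to True when the column must be treated as categorical,
    regardless of its datatype.
    """
    plan = {}
    for name in df.columns:
        values = df[name].dropna()
        # A column counts as a string column only if every value is a str.
        is_string = len(values) > 0 and all(isinstance(v, str) for v in values)
        if categorical.get(name, False) or is_string:
            plan[name] = "categorical"
        else:
            plan[name] = "typed"
    return plan


df = pd.DataFrame(
    {
        "zip_code": [10115, 20095],  # int, but explicitly marked as categorical
        "city": ["Berlin", "Hamburg"],  # string column, categorical by default
        "joined": [dt.date(2024, 1, 1), dt.date(2024, 3, 15)],
        "score": [0.5, 0.7],
    }
)

plan = plan_column_handling(df, {"zip_code": True})
```

Here `zip_code` and `city` end up as categorical, while `joined` and `score` are handled as typed, non-string columns.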
The following requirements aim to address many of the issues we've encountered:
- Preserve the original Python datatype in the synthetic data, rather than relying solely on Pandas' dtypes.
- Allow columns to be designated as categorical explicitly through the dictionary parameter.
- Do not attempt to parse string columns into other datatypes such as int; treat them as strings. A string is a string.
- Trust the datatype information carried by the data itself over Pandas' interpretation of it.
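
As a sketch of how the original Python types could be recorded before training and restored afterwards, assuming the model emits string values: the helpers `record_python_types` and `restore_python_types` and the `PARSERS` table are hypothetical and cover only a few types for illustration.

```python
import datetime as dt

import pandas as pd

# Hypothetical parser table: how to rebuild each Python type from a string.
PARSERS = {
    dt.date: dt.date.fromisoformat,
    int: int,
    float: float,
    str: str,
}


def record_python_types(df: pd.DataFrame) -> dict:
    """Record the Python type of the first non-null value of each column."""
    types = {}
    for name in df.columns:
        values = df[name].dropna()
        # tolist() yields native Python objects (int, float, datetime.date, ...).
        types[name] = type(values.iloc[:1].tolist()[0]) if len(values) else None
    return types


def restore_python_types(generated: pd.DataFrame, types: dict) -> pd.DataFrame:
    """Parse generated string values back into the recorded Python types."""
    restored = {}
    for name in generated.columns:
        # Unknown or missing types fall back to plain strings.
        parser = PARSERS.get(types.get(name), str)
        restored[name] = pd.Series(
            [parser(v) for v in generated[name]], index=generated.index
        )
    return pd.DataFrame(restored)


original = pd.DataFrame(
    {
        "joined": [dt.date(2024, 1, 1), dt.date(2024, 3, 15)],
        "visits": [3, 7],
        "city": ["Berlin", "Hamburg"],
    }
)

types = record_python_types(original)

# Stand-in for model output: every value comes back as a string.
generated = original.astype(str)
restored = restore_python_types(generated, types)
```

The point is that the type information comes from the values themselves, not from the `object` dtype that Pandas reports for the column.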
What do you think of this proposed functionality? I'm not sure how difficult it would be to implement, since it might affect many parts of the library.