Bug in model.sample() when column contains integer values while column type is string.

Hi @avsolatorio !

Is there any news on this? The solution in the PR does not seem to work. The correct behavior is to not parse columns containing strings as ints/floats/datetimes, even when that is possible. If a column contains strings, it is a string column. We need this refactoring so that REaLTabFormer handles string/text columns as categorical and does not generate new values, which happens today because the values get parsed to int/float/datetime.

Maybe we could use the following functions in the library to identify if a pd.Series column is text, integer, float, etc. and only then behave accordingly.

import numpy as np

def is_first_non_na_value_text(series_values):
    # .iloc[0] is positional; [0] is a label lookup and fails if row 0 was NaN
    return isinstance(series_values.dropna().iloc[0], str)

def is_first_non_na_value_integer(series_values):
    return isinstance(series_values.dropna().iloc[0], (int, np.integer))

def is_first_non_na_value_numerical(series_values):
    # np.float was removed in NumPy 1.24; np.floating covers the NumPy float types
    return isinstance(series_values.dropna().iloc[0], (float, np.floating))

When data is loaded from a database with pandas' SQL functions (instead of from CSVs), the values are sometimes NumPy ints/floats rather than Python's int/float. That is why the functions above also check np.integer/np.floating. np.integer matches both np.int32 and np.int64, and np.floating similarly matches np.float16, np.float32 and np.float64. The functions check the first non-null value because some columns have missing values, so the first row may be empty.
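To make the idea a bit more concrete, here is a rough sketch of how the three checks could be folded into one helper. The names `first_non_na_value` and `infer_column_kind` are made up for illustration and are not part of REaLTabFormer. The sketch also assumes that a column with no non-null values, or with booleans, should be reported as "unknown" rather than raise an error.

```python
import numpy as np
import pandas as pd


def first_non_na_value(series):
    """Return the first non-null value of a Series, or None if there is none."""
    non_na = series.dropna()
    # .iloc[0] is positional, so it works even when the original row 0 was null.
    return None if non_na.empty else non_na.iloc[0]


def infer_column_kind(series):
    """Classify a column as 'text', 'integer', 'float', or 'unknown' (hypothetical helper)."""
    value = first_non_na_value(series)
    if isinstance(value, str):
        return "text"
    # bool is a subclass of int, so rule it out before the integer check.
    if isinstance(value, (bool, np.bool_)):
        return "unknown"
    if isinstance(value, (int, np.integer)):
        return "integer"
    if isinstance(value, (float, np.floating)):
        return "float"
    return "unknown"
```

With this, a column like pd.Series([None, "1", "2"]) would be reported as "text" even though every value could be parsed as an integer, which is the behavior we are asking for.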

Is it possible to make this refactoring? Could you please help us with this?

Thanks!

worldbank / REaLTabFormer

Bug in model.sample() when column contains integer values while column type is string. #36