echatzikyriakidis closed this issue 1 year ago
Hello @echatzikyriakidis, there is an implicit transformation of values into pd.Int64Dtype, intended to optimize the generation of numeric values. Indeed, this can cause a bug when categorical data is encoded as numerical values, as in your case. A patch can be implemented, but a quick fix is to first transform your column so that its values cannot be cast into numeric data. For example:
df["integer_as_str"] = "s_" + df["integer_as_str"]
This will ensure that the model will treat the data as an object type and not perform the implicit casting. You just need to perform the reverse transformation on the generated sample data.
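A minimal sketch of the full round trip described above, assuming the column holds plain strings. The helper names `add_prefix` and `remove_prefix` and the toy dataframe are hypothetical, and the "sampled" frame is only a stand-in for whatever the model's sample() call returns:

```python
import re

import pandas as pd

PREFIX = "s_"  # hypothetical marker that keeps values from looking numeric


def add_prefix(df: pd.DataFrame, column: str, prefix: str = PREFIX) -> pd.DataFrame:
    """Return a copy where every value in `column` is prefixed with `prefix`."""
    out = df.copy()
    out[column] = prefix + out[column].astype(str)
    return out


def remove_prefix(df: pd.DataFrame, column: str, prefix: str = PREFIX) -> pd.DataFrame:
    """Return a copy where a leading `prefix` is stripped from `column`."""
    out = df.copy()
    pattern = "^" + re.escape(prefix)
    out[column] = out[column].str.replace(pattern, "", regex=True)
    return out


# Toy training data (illustrative only).
train = pd.DataFrame({"integer_as_str": ["1", "2", "3"]})

# Encode before fitting so the values can no longer be cast to numbers.
encoded = add_prefix(train, "integer_as_str")

# Stand-in for the generated data; in practice this would come from the model.
sampled = encoded.copy()

# Reverse the transformation on the generated data.
decoded = remove_prefix(sampled, "integer_as_str")
```

Anchoring the pattern with `^` strips the marker only from the start of each value, so a generated value that happens to contain `s_` elsewhere is left untouched.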
Hi @avsolatorio,
Thank you for your fast reply!
This is exactly what I am currently doing in my implementation to solve it. However, it would be very convenient if the library could handle it so that we can remove the hack from the code.
Is it easy to fix it? Will this affect performance? Thanks!
Hello @echatzikyriakidis, I just pushed a PR to resolve this. Hopefully, it solves this problem. 😀
Thank you @avsolatorio ! I will test it and let you know.
Hi @avsolatorio,
I have tested the fix from the main branch, but it does not seem to work as expected. It continues to generate novel values when the column is a string column containing numeric values.
I have added a zip with a notebook that demonstrates the case.
Hi @avsolatorio,
I think I have found a possible bug in tabular.sample() which might also be present in relational.sample().
I have trained a tabular model with a dataframe containing the columns below. When I sample from the model, the sampled data contains new, out-of-sample values (values that cannot be found in the training data) in the integer_as_str column. I expected to see no new values because the column dtype is object and the underlying type is Python str. For the integer, float, and datetime columns, I can see that new values are generated, which is fine for me.
Below, you will find a sample of the train dataframe:
What do you think of this? Is it a bug? Could you help us fix this?
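One way to check for out-of-sample values like those described above is a set difference between the training and sampled columns. This is only a sketch: the helper name `novel_values` and both toy dataframes are hypothetical, and the sampled frame stands in for the output of tabular.sample():

```python
import pandas as pd


def novel_values(train: pd.DataFrame, sampled: pd.DataFrame, column: str) -> set:
    """Return values of `column` that appear in `sampled` but not in `train`."""
    seen = set(train[column].dropna().unique())
    generated = set(sampled[column].dropna().unique())
    return generated - seen


# Toy data (illustrative only): "25" never occurs in the training column.
train_df = pd.DataFrame({"integer_as_str": ["10", "20", "30"]})
sampled_df = pd.DataFrame({"integer_as_str": ["10", "25", "30"]})

unexpected = novel_values(train_df, sampled_df, "integer_as_str")
print(sorted(unexpected))  # prints ['25']
```

For a column that should behave as categorical, an empty result is the expected outcome.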