Open ChristinaChr opened 1 year ago
Hello @ChristinaChr, this is interesting! Would you mind sharing a simple colab notebook that can reproduce this? Thank you!
Hello @avsolatorio,
Thanks for the quick response! I am attaching a zip containing a Colab notebook with a working example that reproduces the issue. The last section of the notebook lets you check whether new values have been generated.
Hello @avsolatorio,
There might be a bug when running tabular.fit() and tabular.sample() with device='cpu' (this might also be the case for relational models, but I haven't tested that).
I trained a tabular model on CPU using a dataframe containing the columns shown in the example below. Their original data types were {integer_as_str: object[str], integer: int64, float: float64, boolean: bool, datetime: datetime64[ns], string: object[str]}.
In my case, I want to generate only values that are present in the training data, independently of their type. In other words, I don't want the model to generate new values that do not exist in the training data.
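To make the requirement concrete, here is a minimal sketch of the kind of check I mean. It is plain Python and independent of the library: columns are represented as a dict of lists, and the helper name `unseen_values` and the sample data are made up for illustration. With a real dataframe you would pass the training columns and the sampled columns in the same shape.

```python
def unseen_values(train, sampled):
    """Per column, return sampled values that never occur in the training data.

    Both arguments map column name -> list of values. Values are compared
    as strings so that the check does not depend on the column's dtype.
    """
    report = {}
    for col, values in sampled.items():
        seen = {str(v) for v in train[col]}
        novel = sorted({str(v) for v in values} - seen)
        if novel:
            report[col] = novel
    return report


# Illustrative data only (not taken from the attached notebook).
train = {"integer": [1, 2, 3], "boolean": [True, False]}
sampled = {"integer": [1, 3, 7], "boolean": [False, True]}

print(unseen_values(train, sampled))
```

An empty result means every sampled value already exists in the training data for its column, which is the behaviour I am after.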
To achieve that, I experimented with adding a column-specific letter prefix to each value (see transformation example below). I expected to see no new values in any of the columns. Instead, I got values that belong to another column's data type (ignoring the a, b, etc. prefixes). For example, the datetime column contained b_2997 (a valid value, but for another column!), and the float column contained e_1999-02-02 05:48:06 (again a valid value, but for another column!).
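For reference, the prefixing and the cross-column check can be sketched roughly as follows. This is a simplified plain-Python version, not the exact notebook code: the helper names `add_prefixes` and `misplaced_values` are hypothetical, the prefixes are assumed to be assigned alphabetically in column order (so at most 26 columns), and all values other than the two leaked ones quoted above are made up.

```python
from string import ascii_lowercase


def add_prefixes(columns):
    """Prefix every value with a letter identifying its column: a_, b_, c_, ...

    `columns` maps column name -> list of values (insertion order is used).
    Returns the prefixed columns and the column -> letter mapping.
    """
    prefixes = {col: ascii_lowercase[i] for i, col in enumerate(columns)}
    prefixed = {
        col: [f"{prefixes[col]}_{v}" for v in values]
        for col, values in columns.items()
    }
    return prefixed, prefixes


def misplaced_values(sampled, prefixes):
    """List (column, value, owning_column) for values carrying another column's prefix."""
    owner = {letter: col for col, letter in prefixes.items()}
    hits = []
    for col, values in sampled.items():
        for v in values:
            letter = str(v).split("_", 1)[0]
            if letter != prefixes[col]:
                hits.append((col, v, owner.get(letter)))
    return hits


# Column order follows the dtypes listed earlier; the values are illustrative.
train = {
    "integer_as_str": ["17"],
    "integer": [2997],
    "float": [0.5],
    "boolean": [True],
    "datetime": ["1999-02-02 05:48:06"],
    "string": ["x"],
}
prefixed, prefixes = add_prefixes(train)

# Sampled output containing the two kinds of leaked values described above.
sampled = {
    "datetime": ["b_2997"],
    "float": ["e_1999-02-02 05:48:06", "c_0.5"],
}
for col, value, owner in misplaced_values(sampled, prefixes):
    print(f"{col}: {value!r} carries the prefix of column {owner!r}")
```

With this prefix assignment, b_ corresponds to the integer column and e_ to the datetime column, which matches the leaked values I observed.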
Note that everything works as expected when both tabular.fit() and tabular.sample() run with device='cuda'. What do you think? Could this be a bug that only occurs on CPU?