Bug when running tabular.fit() and tabular.sample() with CPU

ChristinaChr commented 1 year ago

Hello @avsolatorio,

There might be a bug when running tabular.fit() and tabular.sample() with device='cpu' (this might also be the case for relational models; I haven't tested).

I trained a tabular model on CPU with a dataframe containing the columns in the following example. Their original data types were {integer_as_str: object[str], integer: int64, float: float64, boolean: bool, datetime: datetime64[ns], string: object[str]}.

integer_as_str	integer	float	boolean	datetime	string
03	6214	54.09	false	2002-10-15 03:07:53	qyjib
31	2997	39.15	false	1999-05-18 01:09:18	mjuvv
38	3362	52.91	true	1999-08-27 10:44:03	ffskd
47	2286	50.68	false	1999-02-02 05:48:06	evqml
24	14482	77.8	true	2001-09-08 13:56:20	wieai

In my case, I want to generate only values that are present in the training data, independently of their type. In other words, I don't want to generate new values that do not exist in the training data.

To achieve that, I experimented with adding a letter prefix to each value, one letter per column (see the transformed example below). I expected to see no new values in any of the columns. Instead, I got values that belong to a different column, as the prefix shows. For example, in the datetime column I got b_2997 (a valid value, but for the integer column!), and in the float column I got e_1999-02-02 05:48:06 (again a valid value, but for the datetime column!).

integer_as_str	integer	float	boolean	datetime	string
a_03	b_6214	c_54.09	d_false	e_2002-10-15 03:07:53	f_qyjib
a_31	b_2997	c_39.15	d_false	e_1999-05-18 01:09:18	f_mjuvv
a_38	b_3362	c_52.91	d_true	e_1999-08-27 10:44:03	f_ffskd
a_47	b_2286	c_50.68	d_false	e_1999-02-02 05:48:06	f_evqml
a_24	b_14482	c_77.8	d_true	e_2001-09-08 13:56:20	f_wieai
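
For reference, the prefixing step described above can be sketched roughly as follows. This is a minimal illustration using pandas only; `add_column_prefixes` is a hypothetical helper name, not part of REaLTabFormer, and it assumes at most 26 columns so that a single letter per column is enough.

```python
import pandas as pd


def add_column_prefixes(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which every value is a string prefixed with a per-column letter.

    Hypothetical helper for illustration; assumes at most 26 columns.
    """
    out = pd.DataFrame(index=df.index)
    for i, col in enumerate(df.columns):
        prefix = chr(ord("a") + i)  # first column -> "a", second -> "b", ...
        # Cast to str first so numeric columns become text as well.
        out[col] = prefix + "_" + df[col].astype(str)
    return out


# Small sample built from the first rows of the table above.
df = pd.DataFrame(
    {
        "integer_as_str": ["03", "31"],
        "integer": [6214, 2997],
        "string": ["qyjib", "mjuvv"],
    }
)
prefixed = add_column_prefixes(df)
print(prefixed)
```

Because each column gets its own prefix, any generated value can be traced back to the column it was learned from.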

Note that everything works as expected when both tabular.fit() and tabular.sample() run with device='cuda'. What do you think? Could this be a bug that happens only on CPU?

avsolatorio commented 1 year ago

Hello @ChristinaChr, this is interesting! Would you mind sharing a simple colab notebook that can reproduce this? Thank you!

ChristinaChr commented 1 year ago

Hello @avsolatorio,

Thanks for the quick response! I am attaching a zip with the colab notebook, which contains a working example you can use to reproduce the issue. There is a section at the end where you can check whether new values have been generated.
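
A check of this kind could look roughly like the sketch below. It is an illustration with pandas, not the exact code from the notebook; `find_unseen_values` is a hypothetical helper, and the two small dataframes are made-up stand-ins for the training data and the sampled output.

```python
import pandas as pd


def find_unseen_values(train: pd.DataFrame, synthetic: pd.DataFrame) -> dict:
    """Map each column to the synthetic values that never occur in that column of `train`.

    Hypothetical helper for illustration; values are compared as strings.
    """
    unseen = {}
    for col in train.columns:
        seen = set(train[col].astype(str))
        new_values = set(synthetic[col].astype(str)) - seen
        if new_values:
            unseen[col] = new_values
    return unseen


# Stand-in training data, already prefixed per column.
train = pd.DataFrame(
    {
        "integer": ["b_6214", "b_2997"],
        "datetime": ["e_2002-10-15 03:07:53", "e_1999-05-18 01:09:18"],
    }
)
# Stand-in samples: the second datetime value leaked from the integer column.
synthetic = pd.DataFrame(
    {
        "integer": ["b_2997", "b_6214"],
        "datetime": ["e_1999-05-18 01:09:18", "b_2997"],
    }
)

unseen = find_unseen_values(train, synthetic)
print(unseen)
```

An empty result would mean every sampled value already exists in the same column of the training data; a non-empty one lists the offending values per column.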

worldbank / REaLTabFormer

Bug when running tabular.fit() and tabular.sample() with CPU #33