worldbank / REaLTabFormer

A suite of auto-regressive and Seq2Seq (sequence-to-sequence) transformer models for tabular and relational synthetic data generation.
https://worldbank.github.io/REaLTabFormer/
MIT License
193 stars, 22 forks

Bug when running tabular.fit() and tabular.sample() with CPU #33

Open ChristinaChr opened 1 year ago

ChristinaChr commented 1 year ago

Hello @avsolatorio,

There might be a bug when running tabular.fit() and tabular.sample() with device='cpu' (this might also be the case for relational models; I haven't tested).

I have trained a tabular model on CPU using a dataframe containing the columns shown in the following example. Their original data types were {integer_as_str: object[str], integer: int64, float: float64, boolean: bool, datetime: datetime64[ns], string: object[str]}.

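| integer_as_str | integer | float | boolean | datetime            | string |
|----------------|---------|-------|---------|---------------------|--------|
| 03             | 6214    | 54.09 | false   | 2002-10-15 03:07:53 | qyjib  |
| 31             | 2997    | 39.15 | false   | 1999-05-18 01:09:18 | mjuvv  |
| 38             | 3362    | 52.91 | true    | 1999-08-27 10:44:03 | ffskd  |
| 47             | 2286    | 50.68 | false   | 1999-02-02 05:48:06 | evqml  |
| 24             | 14482   | 77.8  | true    | 2001-09-08 13:56:20 | wieai  |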
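For anyone who wants to reproduce the setup, the example rows above can be rebuilt as a pandas DataFrame along these lines. This is only a sketch: the variable name `df` is an assumption, and the exact string and datetime dtypes reported by pandas may differ slightly between pandas versions.

```python
import pandas as pd

# Rows copied from the example table in the issue.
df = pd.DataFrame(
    {
        # Kept as strings so the leading zero in "03" is preserved.
        "integer_as_str": ["03", "31", "38", "47", "24"],
        "integer": [6214, 2997, 3362, 2286, 14482],
        "float": [54.09, 39.15, 52.91, 50.68, 77.8],
        "boolean": [False, False, True, False, True],
        "datetime": pd.to_datetime(
            [
                "2002-10-15 03:07:53",
                "1999-05-18 01:09:18",
                "1999-08-27 10:44:03",
                "1999-02-02 05:48:06",
                "2001-09-08 13:56:20",
            ]
        ),
        "string": ["qyjib", "mjuvv", "ffskd", "evqml", "wieai"],
    }
)

# Inspect the inferred column types.
print(df.dtypes)
```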

In my case, I want to generate only values that are present in the training data, independently of their type. In other words, I don't want to generate new values that do not exist in the training data.
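One way to state this requirement as a check is sketched below. The helper name `find_unseen_values` and the toy frames are hypothetical (the value 7000 is made up for illustration), and comparing values as strings is a simplifying assumption rather than anything the library does.

```python
import pandas as pd


def find_unseen_values(train, synthetic):
    """Return, per column, the synthetic values that never occur in the training data."""
    unseen = {}
    for col in train.columns:
        # Compare as strings so that every column type is handled the same way.
        seen = set(train[col].astype(str))
        new = sorted(set(synthetic[col].astype(str)) - seen)
        if new:
            unseen[col] = new
    return unseen


# Toy data: 7000 is an invented value that is absent from the training column.
train = pd.DataFrame({"integer": [6214, 2997], "string": ["qyjib", "mjuvv"]})
synthetic = pd.DataFrame({"integer": [2997, 7000], "string": ["mjuvv", "qyjib"]})

print(find_unseen_values(train, synthetic))
```

An empty dictionary means every generated value was already present in the corresponding training column.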

To achieve that, I experimented with adding a letter prefix to the beginning of each value (see the transformation example below). I expected to see no new values in any of the columns. Instead, I got values that belong to another column's data type (ignoring the a, b, etc. prefixes). For example, the datetime column contained the value b_2997 (a valid value, but for another column), and the float column contained the value e_1999-02-02 05:48:06 (again a valid value, but for another column).

| integer_as_str | integer | float   | boolean | datetime              | string  |
|----------------|---------|---------|---------|-----------------------|---------|
| a_03           | b_6214  | c_54.09 | d_false | e_2002-10-15 03:07:53 | f_qyjib |
| a_31           | b_2997  | c_39.15 | d_false | e_1999-05-18 01:09:18 | f_mjuvv |
| a_38           | b_3362  | c_52.91 | d_true  | e_1999-08-27 10:44:03 | f_ffskd |
| a_47           | b_2286  | c_50.68 | d_false | e_1999-02-02 05:48:06 | f_evqml |
| a_24           | b_14482 | c_77.8  | d_true  | e_2001-09-08 13:56:20 | f_wieai |
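The prefixing described above, together with a check for values that ended up in the wrong column, could look roughly like this. Both helper names are hypothetical and this is not necessarily the code used in the notebook. The toy frame has only two columns, so its prefixes are a_ and b_ rather than the b_ and e_ of the full example, and the misplaced value is inserted by hand to mimic the reported symptom.

```python
import string

import pandas as pd


def add_column_prefixes(df):
    """Cast every value to str and prepend a per-column letter: a_, b_, c_, ..."""
    out = pd.DataFrame(index=df.index)
    # zip stops after 26 columns; wider frames would need a different scheme.
    for letter, col in zip(string.ascii_lowercase, df.columns):
        out[col] = letter + "_" + df[col].astype(str)
    return out


def find_misplaced_values(df):
    """Return, per column, the values whose prefix does not match that column's letter."""
    misplaced = {}
    for letter, col in zip(string.ascii_lowercase, df.columns):
        wrong = ~df[col].astype(str).str.startswith(letter + "_")
        bad = df.loc[wrong, col].tolist()
        if bad:
            misplaced[col] = bad
    return misplaced


train = pd.DataFrame(
    {
        "integer": [6214, 2997],
        "datetime": ["2002-10-15 03:07:53", "1999-05-18 01:09:18"],
    }
)
prefixed = add_column_prefixes(train)

# Simulate the symptom by hand: an integer-column token in the datetime column.
generated = prefixed.copy()
generated.loc[1, "datetime"] = "a_2997"

print(find_misplaced_values(generated))
```

Note that `astype(str)` uses pandas' default formatting, so booleans would render as True/False rather than the lowercase true/false shown in the table.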

Let me note here that everything works as expected when both tabular.fit() and tabular.sample() run with device='cuda'. What do you think of this? Could this be a bug that occurs only on CPU?

avsolatorio commented 1 year ago

Hello @ChristinaChr, this is interesting! Would you mind sharing a simple colab notebook that can reproduce this? Thank you!

ChristinaChr commented 1 year ago

Hello @avsolatorio,

Thanks for the quick response! I am attaching a zip with the Colab notebook, which contains a working example so that you can reproduce the issue. There is a section at the end where you can check whether new values have been generated.