worldbank / REaLTabFormer

A suite of auto-regressive and Seq2Seq (sequence-to-sequence) transformer models for tabular and relational synthetic data generation.
https://worldbank.github.io/REaLTabFormer/
MIT License

Bug in model.sample() when column contains integer values while column type is string. #31

Closed (echatzikyriakidis closed this issue 1 year ago)

echatzikyriakidis commented 1 year ago

Hi @avsolatorio,

I think I have found a possible bug in tabular.sample() which might also be present in relational.sample().

I trained a tabular model on a dataframe with the columns below. When I sample from the model, the integer_as_str column of the sampled data contains new values that are out of sample (they do not appear in the training data). I expected no new values in that column, because its dtype is object and the underlying values are Python str. For the integer, float, and datetime columns, new values are also generated, which is fine for me.

Below is a sample of the training dataframe:

| integer_as_str | integer | float | boolean | datetime            | string |
|----------------|---------|-------|---------|---------------------|--------|
| 3              | 6214    | 54.09 | false   | 2002-10-15 03:07:53 | qyjib  |
| 31             | 2997    | 39.15 | false   | 1999-05-18 01:09:18 | mjuvv  |
| 38             | 3362    | 52.91 | true    | 1999-08-27 10:44:03 | ffskd  |
| 47             | 2286    | 50.68 | false   | 1999-02-02 05:48:06 | evqml  |
| 24             | 14482   | 77.8  | true    | 2001-09-08 13:56:20 | wieai  |
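
For anyone who wants to reproduce the setup, the sample rows above can be rebuilt roughly as follows. This is a minimal sketch: the variable name train_df is only illustrative, and the object dtype is set explicitly so the column holds plain Python str values on any pandas version.

```python
import pandas as pd

# Rebuild the five sample rows shown in the table above.
train_df = pd.DataFrame(
    {
        # Digits stored as Python str, with an explicit object dtype.
        "integer_as_str": pd.Series(["3", "31", "38", "47", "24"], dtype=object),
        "integer": [6214, 2997, 3362, 2286, 14482],
        "float": [54.09, 39.15, 52.91, 50.68, 77.8],
        "boolean": [False, False, True, False, True],
        "datetime": pd.to_datetime(
            [
                "2002-10-15 03:07:53",
                "1999-05-18 01:09:18",
                "1999-08-27 10:44:03",
                "1999-02-02 05:48:06",
                "2001-09-08 13:56:20",
            ]
        ),
        "string": ["qyjib", "mjuvv", "ffskd", "evqml", "wieai"],
    }
)

# Every value is a str, yet every value is made of digits only.
all_str = all(isinstance(v, str) for v in train_df["integer_as_str"])
all_digits = bool(train_df["integer_as_str"].str.isdigit().all())
print(train_df.dtypes)
print(all_str, all_digits)
```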

What do you think of this? Is it a bug? Could you help us fix this?

avsolatorio commented 1 year ago

Hello @echatzikyriakidis, there is an implicit conversion of values to pd.Int64Dtype, intended to optimize the generation of numeric values. Indeed, this can cause a bug when categorical data is encoded as numeric values, as in your case. A patch can be implemented, but as a quick fix you can first transform the column so that its values cannot be cast to numeric data. For example:

df["integer_as_str"] = "s_" + df["integer_as_str"]

This ensures that the model treats the data as an object type and does not perform the implicit cast. You then just need to apply the reverse transformation to the generated sample data.
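
The round trip could look roughly like the sketch below. The helper names protect_column and restore_column and the "s_" prefix constant are illustrative and not part of REaLTabFormer. Training and sampling are left as a comment, and the prefixed frame stands in for sampled output in the last step.

```python
import re

import pandas as pd

PREFIX = "s_"  # illustrative marker; any non-numeric prefix should do


def protect_column(df: pd.DataFrame, column: str, prefix: str = PREFIX) -> pd.DataFrame:
    """Return a copy whose `column` values can no longer be parsed as numbers."""
    out = df.copy()
    out[column] = prefix + out[column].astype(str)
    return out


def restore_column(df: pd.DataFrame, column: str, prefix: str = PREFIX) -> pd.DataFrame:
    """Return a copy with the leading prefix stripped from `column`."""
    out = df.copy()
    pattern = "^" + re.escape(prefix)  # only strip the prefix at the start
    out[column] = out[column].astype(str).str.replace(pattern, "", regex=True)
    return out


train = pd.DataFrame({"integer_as_str": ["3", "31", "38", "47", "24"]})
protected = protect_column(train, "integer_as_str")

# The prefixed values are no longer castable to numeric data.
not_numeric = bool(
    pd.to_numeric(protected["integer_as_str"], errors="coerce").isna().all()
)

# ... train the model on `protected` and draw samples here ...

# Reverse the transformation (shown on `protected` as a stand-in for samples).
restored = restore_column(protected, "integer_as_str")
```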

echatzikyriakidis commented 1 year ago

Hi @avsolatorio,

Thank you for your fast reply!

This is exactly what I am currently doing in my implementation to solve it. However, it would be very convenient if the library could handle it so that we can remove the hack from the code.

Is this easy to fix? Would the fix affect performance? Thanks!

avsolatorio commented 1 year ago

Hello @echatzikyriakidis, I just pushed a PR to resolve this. Hopefully, it solves this problem. 😀

echatzikyriakidis commented 1 year ago

Thank you @avsolatorio ! I will test it and let you know.

echatzikyriakidis commented 1 year ago

Hi @avsolatorio,

I have tested the fix from the main branch, but it does not seem to work as expected. The model still generates novel values when the column is a string column that contains numeric values.

I have attached a zip file with a notebook that demonstrates the case.
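
A check along the following lines can show whether a sampled column contains values that never appear in the training data. This is a minimal sketch rather than the contents of the attached notebook: the function name novel_values is illustrative, and the sampled frame is made-up data standing in for real model output.

```python
import pandas as pd


def novel_values(train: pd.DataFrame, sampled: pd.DataFrame, column: str) -> set:
    """Return the values of `column` in `sampled` that never occur in `train`."""
    seen = set(train[column].astype(str))
    generated = set(sampled[column].astype(str))
    return generated - seen


train = pd.DataFrame({"integer_as_str": ["3", "31", "38", "47", "24"]})

# Made-up stand-in for sampled output; "52" and "7" are not in the training data.
sampled = pd.DataFrame({"integer_as_str": ["3", "52", "47", "7", "31"]})

unexpected = novel_values(train, sampled, "integer_as_str")
print(sorted(unexpected))
```

For a column that is meant to be categorical, an empty result is the expected outcome.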