worldbank / REaLTabFormer

A suite of auto-regressive and Seq2Seq (sequence-to-sequence) transformer models for tabular and relational synthetic data generation.
https://worldbank.github.io/REaLTabFormer/
MIT License
200 stars, 23 forks

Transaction datetime in the child table is not sequential #45

Closed liu305 closed 9 months ago

liu305 commented 1 year ago

The transactions in the generated child table that belong to the same parent join key do not seem to be sorted correctly, even though the original training data is sorted. Does this mean the model is not capturing the sequential time structure of each user's transactions in the raw dataset?

pan  transactionDateTime
0    2020-07-11 03:05:47
0    2020-06-27 14:31:24
0    2020-06-07 01:06:45
0    2020-06-05 19:23:52
0    2020-02-25 12:20:27
0    2020-05-22 18:58:40
0    2020-06-06 10:15:20
0    2020-04-14 17:35:39
0    2020-03-03 08:51:58
0    2020-05-23 13:47:57
0    2020-04-12 05:12:54
0    2020-04-19 02:23:20

avsolatorio commented 1 year ago

Hello @liu305, the model should generally be able to learn that. However, it is possible that the model had not yet learned that aspect of the data when training terminated.

There is also a stochastic factor. In my other comment, I mentioned that the datetime is converted into a timestamp and modeled digit by digit. Because inference is stochastic, there is a chance that the sampled values for the year, month, day, etc., will not follow the ordering present in the data.

In my view, these "quirks" in the generated data can actually be used as self-consistency filters. If the sorting issue affects only a small share of the generated samples rather than all of them, you can use it to identify those samples and throw them away.
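As a rough illustration of such a filter (not part of REaLTabFormer itself), the sketch below groups generated child rows by the parent key and keeps only the groups whose timestamps are in order. The function name `split_by_time_order` is hypothetical, the column names are taken from the sample above, and ascending order is an assumption; flip the comparison if your training data is sorted the other way.

```python
from collections import defaultdict
from datetime import datetime


def split_by_time_order(rows, key="pan", time_col="transactionDateTime"):
    """Split generated child rows into (kept, dropped) dicts keyed by parent.

    A parent's group is kept only if its timestamps are non-decreasing
    in the order the rows were generated (ascending order is an assumption).
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    kept, dropped = {}, {}
    for parent, group in groups.items():
        times = [datetime.strptime(r[time_col], "%Y-%m-%d %H:%M:%S") for r in group]
        in_order = all(a <= b for a, b in zip(times, times[1:]))
        (kept if in_order else dropped)[parent] = group
    return kept, dropped


# Toy generated child table: parent 0 is out of order, parent 1 is sorted.
generated = [
    {"pan": 0, "transactionDateTime": "2020-07-11 03:05:47"},
    {"pan": 0, "transactionDateTime": "2020-06-27 14:31:24"},
    {"pan": 1, "transactionDateTime": "2020-02-25 12:20:27"},
    {"pan": 1, "transactionDateTime": "2020-03-03 08:51:58"},
]
kept, dropped = split_by_time_order(generated)
print("kept parents:", list(kept), "dropped parents:", list(dropped))
```

If a large share of parents ends up in `dropped`, that points to undertraining rather than sampling noise, and filtering alone is probably not the right remedy.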

This applies to other systematic properties in your data that must be preserved. For example, in our case, there are columns for the per capita expenditure, the poverty line, and the poverty status. We can use the relationship between them to check whether a generated record is self-consistent: compare the generated per capita expenditure against the poverty line and see whether the result matches the generated poverty status. 😀
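A minimal sketch of that cross-column check follows. The column names (`per_capita_expenditure`, `poverty_line`, `is_poor`) and the rule that a household counts as poor when its expenditure is strictly below the line are assumptions made for illustration; adapt both to the definitions used in your data.

```python
def is_self_consistent(row):
    """Return True if the generated poverty status agrees with the
    status implied by the expenditure and the poverty line.

    Assumption: "poor" means expenditure strictly below the poverty line.
    """
    implied_poor = row["per_capita_expenditure"] < row["poverty_line"]
    return implied_poor == row["is_poor"]


# Toy generated records; the second one contradicts itself.
generated = [
    {"per_capita_expenditure": 80.0, "poverty_line": 100.0, "is_poor": True},
    {"per_capita_expenditure": 150.0, "poverty_line": 100.0, "is_poor": True},
    {"per_capita_expenditure": 120.0, "poverty_line": 100.0, "is_poor": False},
]
consistent = [row for row in generated if is_self_consistent(row)]
print(f"kept {len(consistent)} of {len(generated)} records")
```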