worldbank / REaLTabFormer

A suite of auto-regressive and Seq2Seq (sequence-to-sequence) transformer models for tabular and relational synthetic data generation.
https://worldbank.github.io/REaLTabFormer/
MIT License

Can the model learn relationships of columns in long distance tables? #9

Closed echatzikyriakidis closed 1 year ago

echatzikyriakidis commented 1 year ago

@avsolatorio Hi!

Assume the following 3 tables:

Table A: Column A.1, Column A.2, Column A.3

Table B: Column B.1, Column B.2, Column B.3

Table C: Column C.1, Column C.2, Column C.3

In my example I skip the primary and foreign keys for simplicity.

The relationships are:

Table A [1..N] Table B
Table B [1..N] Table C

The parent Tabular model type can be used to model each table separately, and the Relational model type can be used to model the two relationships above. The parent and child models will capture the column relationships/correlations that exist inside each table or relationship. But what happens with correlations between columns of Table A and Table C, or between any pair of columns from any pair of tables in the database? How can someone learn such relationships? Is it possible with REaLTabFormer?

What if Column A.1 in Table A is correlated with Column C.2 in Table C? Each child model is conditioned only on the parent row, and thus can learn dependencies only between directly connected tables.

What do you think? Is it possible to overcome this somehow?

avsolatorio commented 1 year ago

Hello @echatzikyriakidis, if I understand your use case correctly, you could implement a hierarchical data input.

If Table A is a non-relational table, you could use the standard tabular model to generate the synthetic data A'. Then, you can use the Seq2Seq model to fit Tables A and B to generate observations B'.

Then, assuming values in both A and B strongly correlate with observations in Table C, you could concatenate data from A and B and train a second Seq2Seq model for C.

To generate synthetic samples, use the parent table model to generate A'. Then use the generated sample (A') as input to the first Seq2Seq model to generate B' observations. Concatenate the A' and B' observations, and use these as input to the second Seq2Seq model to generate C'.
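As a rough illustration of the concatenation step described above, the sketch below uses plain pandas to attach Table A's columns to each Table B row, producing a combined input that a model for Table C could be conditioned on. This is only one possible reading of "concatenate data from A and B". The key columns `a_id`, `b_id`, and `c_id`, the toy values, and the helper `attach_grandparent` are all hypothetical, since the original example omits keys. No REaLTabFormer calls are shown.

```python
import pandas as pd

# Toy stand-ins for Tables A, B, and C. The key columns (a_id, b_id, c_id)
# and all values are hypothetical; the original post omits keys.
table_a = pd.DataFrame({"a_id": [1, 2], "A1": [10, 20]})
table_b = pd.DataFrame(
    {"b_id": [11, 12, 13], "a_id": [1, 1, 2], "B1": ["x", "y", "z"]}
)
table_c = pd.DataFrame(
    {"c_id": [101, 102, 103, 104], "b_id": [11, 12, 13, 13], "C2": [0.1, 0.2, 0.3, 0.4]}
)


def attach_grandparent(parent: pd.DataFrame, grandparent: pd.DataFrame, on: str) -> pd.DataFrame:
    """Left-join grandparent columns onto each parent row.

    validate="many_to_one" raises if the grandparent key is not unique.
    """
    return parent.merge(grandparent, on=on, how="left", validate="many_to_one")


# One row per B row, now carrying A's columns as well. This frame would play
# the role of the "parent" input when modelling Table C, joined on b_id.
ab_input = attach_grandparent(table_b, table_a, on="a_id")

# Number of C rows attached to each B row (for inspection only).
c_per_b = table_c.groupby("b_id").size()
```

The same helper could be applied to the synthetic A' and B' frames at generation time, assuming each B' row keeps an identifier of the A' row it was generated from.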

echatzikyriakidis commented 1 year ago

Thank you @avsolatorio,

That is a very smart way of solving the problem. It is similar to the idea of n-gram language modelling, where the next token is predicted from the previous N-1 tokens. Here, we can predict the child rows based on both parent and grandparent information.

Thanks!

limhasic commented 5 months ago

In this passage:

""" If Table A is a non-relational table, you could use the standard tabular model to generate the synthetic data A'. Then, you can use the Seq2Seq model to fit Tables A and B to generate observations B'. """

Does "Tables A and B" refer to a denormalized, joined table?