worldbank / REaLTabFormer

A suite of auto-regressive and Seq2Seq (sequence-to-sequence) transformer models for tabular and relational synthetic data generation.
https://worldbank.github.io/REaLTabFormer/
MIT License
200 stars, 23 forks

Could the order of columns affect the quality of synthetic data? #65

Open efstathios-chatzikyriakidis opened 6 months ago

efstathios-chatzikyriakidis commented 6 months ago

Hi @avsolatorio!

Could the order of columns (categorical first, then numerical/datetime, or the opposite) affect the quality of the synthetic data? Furthermore, the categorical columns could themselves be ordered by cardinality. Correlations exist across all columns, so I am wondering whether placing the categorical columns first or last, or sorting them by ascending or descending cardinality, would allow the model to learn better.
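To make the question concrete, here is a minimal sketch of the four orderings I have in mind. It is plain Python and does not use REaLTabFormer itself; the helper names (`is_numeric_or_datetime`, `candidate_orderings`) and the toy table are hypothetical, and treating every non-numeric, non-datetime column as categorical is a simplifying assumption.

```python
from datetime import date
from typing import Any, Dict, List, Sequence


def is_numeric_or_datetime(values: Sequence[Any]) -> bool:
    """Assumption: a column is numerical/datetime if every value is a number or a date."""
    return all(
        isinstance(v, (int, float, date)) and not isinstance(v, bool)
        for v in values
    )


def candidate_orderings(table: Dict[str, Sequence[Any]]) -> Dict[str, List[str]]:
    """Return the four column orderings discussed above (hypothetical helper)."""
    other = [c for c, v in table.items() if is_numeric_or_datetime(v)]
    categorical = [c for c in table if c not in other]

    def cardinality(col: str) -> int:
        # Number of distinct values in the column.
        return len(set(table[col]))

    asc = sorted(categorical, key=cardinality)
    desc = sorted(categorical, key=cardinality, reverse=True)
    return {
        "categorical_first_asc": asc + other,
        "categorical_first_desc": desc + other,
        "categorical_last_asc": other + asc,
        "categorical_last_desc": other + desc,
    }


# Toy data, purely illustrative.
table = {
    "amount": [10.5, 3.0, 7.25, 3.0],
    "country": ["GR", "US", "GR", "PH"],
    "signup": [date(2024, 1, 1), date(2024, 2, 1), date(2024, 3, 1), date(2024, 4, 1)],
    "gender": ["F", "M", "F", "M"],
}

orderings = candidate_orderings(table)
for name, order in orderings.items():
    print(name, order)
```

With a pandas DataFrame `df`, each ordering could then be applied as `df[order]` before fitting the model, so that only the column order differs between runs.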

Thanks!

echatzikyriakidis commented 5 months ago

I have run some tests, and it seems that column order doesn't matter. I observed similar results in every case: categorical columns first or last, sorted by increasing or decreasing cardinality.

echatzikyriakidis commented 5 months ago

This can be closed.