unable to detect column names in generated data.

yandex-research / tab-ddpm

[ICML 2023] The official implementation of the paper "TabDDPM: Modelling Tabular Data with Diffusion Models"

https://arxiv.org/abs/2209.15421

MIT License

393 stars 86 forks

unable to detect column names in generated data. #13

Closed Sanchita333 closed 1 year ago

Sanchita333 commented 1 year ago

I am using the provided churn dataset as input, and the output contains the generated categorical and numerical columns in .npy format. There are 4 categorical and 7 numerical columns. How can I identify the names of those columns?

shamikdhar commented 1 year ago

[Screenshot from 2023-02-14 12-55-13]

I am facing the same issue. I trained the model on my own data and sampled from it, but in the output all the generated columns are shuffled and the column names are missing. How can I recover the column names from such output? @rotot0 or @erjanmx, please help asap.

rotot0 commented 1 year ago

Hi, sorry for the late answer. Please see this answer: https://github.com/rotot0/tab-ddpm/issues/3#issuecomment-1287205628

In short, there is no way to reconstruct the column names. All I can say is that the columns very likely follow the order of the columns in the original .csv file. So, if the .csv file has [num1, num2, cat1, num3, cat2], then X_num=[num1, num2, num3] and X_cat=[cat1, cat2]. The easiest way is to do the data partitioning yourself.
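The ordering rule described above can be turned into a small helper that reattaches names to the generated arrays. This is only a sketch, assuming the generated columns really do follow the original .csv order. The function names `split_column_names` and `label_generated`, the header list, and the stand-in arrays are all hypothetical and not part of tab-ddpm. In practice the arrays would come from `np.load` on the generated .npy files (categorical arrays stored as objects may need `allow_pickle=True`).

```python
import numpy as np
import pandas as pd


def split_column_names(csv_columns, cat_columns, target=None):
    """Split CSV header names into numerical and categorical lists, keeping CSV order."""
    cat_set = set(cat_columns)
    features = [c for c in csv_columns if c != target]
    num_names = [c for c in features if c not in cat_set]
    cat_names = [c for c in features if c in cat_set]
    return num_names, cat_names


def label_generated(x_num, x_cat, num_names, cat_names):
    """Attach names to generated arrays; fails loudly if the column counts disagree."""
    if x_num.shape[1] != len(num_names) or x_cat.shape[1] != len(cat_names):
        raise ValueError("column count mismatch between arrays and names")
    df_num = pd.DataFrame(x_num, columns=num_names)
    df_cat = pd.DataFrame(x_cat, columns=cat_names)
    return pd.concat([df_num, df_cat], axis=1)


# Hypothetical header matching the example in the comment above.
csv_columns = ["num1", "num2", "cat1", "num3", "cat2"]
num_names, cat_names = split_column_names(csv_columns, cat_columns=["cat1", "cat2"])

# Stand-ins for the arrays that would normally be loaded from the generated .npy files.
x_num = np.zeros((4, 3))
x_cat = np.array([["a", "x"]] * 4, dtype=object)

df = label_generated(x_num, x_cat, num_names, cat_names)
print(list(df.columns))  # ['num1', 'num2', 'num3', 'cat1', 'cat2']
```

Deriving both name lists from the same .csv header used for partitioning keeps the names and the arrays in one consistent order, and the shape check catches a mismatch early.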

shamikdhar commented 1 year ago

@rotot0 I have already partitioned the whole csv data into X_num and X_cat myself, but after generation the columns of both the generated cat dataframe and the generated num dataframe are shuffled. Please have a look at my screenshot above. It shows only the original categorical dataframe and the generated categorical dataframe, and in the generated categorical data all the columns have been shuffled. Please fix this issue; otherwise the library is of no use.

rotot0 commented 1 year ago

@shamikdhar Sorry, but I cannot reproduce your problem in my experiments. The original and generated columns are aligned. It may be a bug on your side. Otherwise, please provide additional code/info on your problem. Also, you might want to open another issue.
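One way to gather the kind of additional info requested here is to check, for each generated categorical column, which original columns could have produced its values. The sketch below is an illustrative diagnostic, not part of tab-ddpm: `match_cat_columns` is a hypothetical helper and the arrays are toy data. It is inconclusive when several columns share the same category set (for example, multiple yes/no columns).

```python
import numpy as np


def match_cat_columns(x_real, x_gen):
    """For each generated column, list the real columns whose category set contains its values."""
    real_sets = [set(x_real[:, j]) for j in range(x_real.shape[1])]
    matches = []
    for k in range(x_gen.shape[1]):
        gen_set = set(x_gen[:, k])
        matches.append([j for j, s in enumerate(real_sets) if gen_set <= s])
    return matches


# Toy stand-ins for the original and generated categorical arrays.
x_real = np.array([["yes", "FR"], ["no", "DE"], ["yes", "ES"]], dtype=object)
x_gen = np.array([["no", "ES"], ["yes", "FR"]], dtype=object)

print(match_cat_columns(x_real, x_gen))  # [[0], [1]]
```

If generated column k matches only original column k, the two arrays are aligned at that position. A match at a different index would suggest the columns were reordered somewhere in the user's own preprocessing or loading code.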