yandex-research / tab-ddpm

[ICML 2023] The official implementation of the paper "TabDDPM: Modelling Tabular Data with Diffusion Models"
https://arxiv.org/abs/2209.15421
MIT License

Trouble training/sampling on data with high-cardinality categorical features #23

Open reed-peterson-947 opened 1 year ago

reed-peterson-947 commented 1 year ago

I've had success training and generating data with this package on a variety of datasets, but when a dataset contains a very high-cardinality feature, the run fails with a very uninformative error message: "Killed" and nothing else. As soon as I remove the high-cardinality feature, it runs fine. By high cardinality I mean on the order of tens of thousands of unique values in a single column. I'm not sure how to debug this or where to start, given how little the error message says. The last line of code that seems to execute before the process gets killed is line 579 in lib/data.py. Any ideas? Has anyone else run into this issue?

rotot0 commented 1 year ago

Hello,

I am not sure, but you may be running out of RAM because of the OneHotEncoder combined with the high cardinality of the feature.
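
A bare "Killed" on Linux is usually the kernel's out-of-memory killer, so a quick back-of-the-envelope check may help confirm this. The sketch below assumes the encoder materializes a dense matrix with 4 bytes per value (float32); the function name, the row count, and the cardinality are illustrative and not taken from the repository.

```python
def dense_one_hot_bytes(n_rows, cardinalities, bytes_per_value=4):
    """Estimate the size of a dense one-hot matrix.

    Assumes one output column per unique category and a fixed number of
    bytes per cell (4 for float32, 8 for float64).
    """
    n_columns = sum(cardinalities)
    return n_rows * n_columns * bytes_per_value


# Hypothetical dataset: 100k rows, one categorical column with 50k unique values.
n_rows = 100_000
cardinalities = [50_000]

estimate = dense_one_hot_bytes(n_rows, cardinalities)
print(f"Dense one-hot matrix: about {estimate / 2**30:.1f} GiB")
```

If the estimate is close to or above the machine's available RAM, the encoding step is a likely culprit.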

paulduf commented 1 year ago

Moreover, even such a sophisticated model won't work magic on a categorical feature with so many modalities, unless you have millions of rows, and even then I bet many modalities will be unrepresented in the synthetic data. So you could try pre-processing this column using domain knowledge.
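
When no domain-based grouping is available, a generic fallback is to merge rare values into a single bucket before training. This is a frequency-based stand-in, not the domain-based pre-processing suggested above, and the helper name, the threshold, and the `__other__` label are all hypothetical.

```python
from collections import Counter


def collapse_rare_categories(values, min_count=50, other_label="__other__"):
    """Replace every category seen fewer than min_count times with other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


# Toy column: "a" and "b" are frequent, "c" and "d" each appear once.
column = ["a"] * 5 + ["b"] * 3 + ["c", "d"]
collapsed = collapse_rare_categories(column, min_count=3)

print(sorted(set(collapsed)))
```

The threshold trades fidelity for memory: a higher `min_count` leaves fewer one-hot columns, but more of the original values disappear from the synthetic data.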