yandex-research / tab-ddpm

[ICML 2023] The official implementation of the paper "TabDDPM: Modelling Tabular Data with Diffusion Models"
https://arxiv.org/abs/2209.15421
MIT License

Dataset inquiry #34

Closed: iamamiramine closed this issue 5 months ago

iamamiramine commented 5 months ago

May I ask what X_cat, X_num and y refer to? Can you also please explain "is_y_cond"?

JiangLei1012 commented 5 months ago

is_y_cond: whether the model is conditioned on y. Set it to false for regression and true for classification.

JiangLei1012 commented 5 months ago

X_cat holds the categorical attributes, X_num holds the numerical attributes, and y holds the labels.
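
To make the three names concrete, here is a minimal sketch of how one table splits into X_num, X_cat and y. The dataset and its column names are made up for illustration, and plain Python lists stand in for the NumPy arrays the repository works with, so that the snippet needs only the standard library.

```python
import csv
import io

# Hypothetical toy dataset; the column names are invented for this example.
RAW = """age,income,job,city,default
34,52000.0,engineer,Berlin,0
51,38000.5,teacher,Madrid,1
"""

num_columns = ["age", "income"]  # numerical features -> X_num
cat_columns = ["job", "city"]    # categorical features -> X_cat
y_column = "default"             # target / label -> y

rows = list(csv.DictReader(io.StringIO(RAW)))

# One inner list per row, one entry per selected column.
X_num = [[float(r[c]) for c in num_columns] for r in rows]
X_cat = [[r[c] for c in cat_columns] for r in rows]
y = [int(r[y_column]) for r in rows]

print(X_num)
print(X_cat)
print(y)
```

Every column of the table ends up in exactly one of the three groups, and all three keep the same row order.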

iamamiramine commented 5 months ago

Thank you so much!

JiangLei1012 commented 5 months ago

I have a question: if I want to train and test on a new dataset, how do I proceed? Do I need to write my own initial configuration file?

iamamiramine commented 5 months ago

You can modify the read_pure_data function to suit your case. In my case, I wanted it to read from a CSV file, so I placed my CSV file inside my _real_data_path_ and replaced the code of read_pure_data with the following:

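import pandas as pd

def read_pure_data(real_data_path, y_column, cat_columns=None, num_columns=None):
    # Load the whole dataset from a single CSV file.
    data = pd.read_csv(real_data_path)
    # Select columns by name; data.loc[cols] would look up row labels instead.
    X_cat = data[cat_columns].to_numpy() if cat_columns else None
    X_num = data[num_columns].to_numpy() if num_columns else None
    y = data[y_column].to_numpy()
    return X_num, X_cat, y
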
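
A CSV-based reader like this depends on the column names matching the file header exactly. As a small sketch of a sanity check you could run first, the helper below lists any requested columns that are missing. The name missing_columns is hypothetical and not part of the TabDDPM code, and the sample header is made up.

```python
import csv
import io

def missing_columns(csv_file, y_column, cat_columns=None, num_columns=None):
    """Return the requested column names that are absent from the CSV header."""
    # The first row of the file is taken to be the header.
    header = next(csv.reader(csv_file), [])
    wanted = [y_column] + list(cat_columns or []) + list(num_columns or [])
    return [name for name in wanted if name not in header]

# Hypothetical file contents: "income" is requested but not present.
sample = io.StringIO("age,job,default\n34,engineer,0\n")
print(missing_columns(sample, "default", cat_columns=["job"], num_columns=["age", "income"]))
```

An empty result means every requested column exists in the file. For a file on disk, pass an open file object, for example one returned by open(path, newline="").
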
JiangLei1012 commented 5 months ago

Thank you so much! If I modify read_pure_data, a lot of other code has to be changed as well, and I don't really understand that part. Could you share your complete code? I would really appreciate it.

iamamiramine commented 5 months ago

@JiangLei1012 Can you send me your email?

JiangLei1012 commented 5 months ago

@iamamiramine Of course! My pleasure! My email is "jiangleijl@outlook.com", I'm looking forward to your letter.

Javiermateor commented 2 months ago

@iamamiramine Hi, I have a question. Were you able to run the hyperparameter tuning for your dataset with: python scripts/tune_ddpm.py [ds_name] [train_size] synthetic [catboost|mlp] [exp_name] --eval_seeds? I've been trying to do it, but there is always something else to change and it drives me crazy. Can you tell me if you were able to run the hyperparameter tuning?

iamamiramine commented 2 months ago

@Javiermateor I modified the code to be able to run tune_ddpm.py. However, I ended up using Synthcity, which implements TabDDPM along with other tabular data models. I suggest you look into their code here and here to understand their implementation of TabDDPM. They also implemented a hyperparameter search space based on the TabDDPM paper.

Javiermateor commented 2 months ago

@iamamiramine I need a heroooooo! haha. Thank you very much for that hint! :) 👍. The raw implementation of TabDDPM is kinda confusing. Would you mind if I ask how exactly you implemented it in the end? Or if you have a publication, I would be more than glad to read it. My contact details are in my profile; if you need something, let me know.