yandex-research / tab-ddpm

[ICML 2023] The official implementation of the paper "TabDDPM: Modelling Tabular Data with Diffusion Models"
https://arxiv.org/abs/2209.15421
MIT License

Questions about paper figures and script execution order #8

Closed SvenGroen closed 1 year ago

SvenGroen commented 1 year ago

Hi, first of all: great work and well-written code! It's easy to follow and pretty self-explanatory.

I am trying to recreate your results from the paper and got the code running (only using the Adult Dataset so far). I have a few questions and was wondering if you might be able to help me out:

Can you explain what is shown in Figure 2 of your paper? Let's take Adult (AD) as an example. Q1: What do you mean by categorical_feature 3?

Q2: Is the code to recreate Figures 2 & 3 also publicly available?

Q3.1: Did I understand the function of your scripts correctly? tune_ddpm.py is used to train multiple model versions with different hyperparameters by internally calling pipeline.py with different configs, storing the best model found at the end. The best model version can then be evaluated over multiple seeds using the eval_seeds.py script. Q3.2: Is the above order of script execution correct to recreate your results?

Q4: I am not quite sure what the tune_evaluation_model.py script is used for. Aren't we already training the evaluation models in eval_seeds.py?

Cheers, Sven

rotot0 commented 1 year ago

Hi, thanks for your interest and questions!

Q1: "AD categorical_feature 3" means that we visualize the 3rd categorical feature from adult dataset (D.X_cat["train"][:, 3] if you have read the code). The picture itself visualises a histogram (i.e., frequencies of categories in the case of a categorical feature). The higher the bar, the more frequent the category is. We want orange bars (synthetic data) to be as similar as possible to blue bars (real data). It would mean that our model captured the distribution of features well.

Q2: No. I plan to add it but not sure when.

Q3.1: You are right.

Q3.2: It should recreate all results; otherwise, please open an issue. In exp/{ds}/ddpm_cb_best you can find the final tuned hyperparameters. You can run tune_ddpm.py with the --eval_seeds flag to run the evaluation right after the model is tuned.

Q4: We tune the hyperparameters of the CatBoost model on real data (using cross-validation) before training TabDDPM. You can find these hyperparameters in the tuned_models/ folder (sorry, the naming may be confusing :)). We then use these hyperparameters to evaluate synthetic data (via eval_seeds.py): eval_seeds.py takes the configs from tuned_models/ to initialize CatBoost, trains it on synthetic data, and tests it on real data. The idea is that we find the best CatBoost model on real data and then use it to evaluate synthetic data. Originally, we wanted to use synthetic data in addition to real data to enhance the performance of the best CatBoost model with these synthetic augmentations (in my code, the word merged refers to this), but it did not work in our experiments.
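The evaluation step works roughly like this (an illustrative sketch, not the actual eval_seeds.py; the function and argument names are placeholders):

```python
# Illustrative only: train CatBoost with "tuned-on-real" hyperparameters on
# synthetic data and test it on the real test split.
from catboost import CatBoostClassifier

def evaluate_synthetic(tuned_params, X_syn, y_syn, X_val, y_val, X_test, y_test):
    # initialize CatBoost with the hyperparameters found on real data
    model = CatBoostClassifier(**tuned_params, verbose=0)
    # train on synthetic data; the real validation split is used for early stopping
    model.fit(X_syn, y_syn, eval_set=(X_val, y_val))
    # evaluate on real test data
    return model.score(X_test, y_test)
```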

Feel free to ask me anything if something is still unclear!

SvenGroen commented 1 year ago

Thank you for the fast answer! You already helped me a lot :)

I have some follow-up questions: Q1: Was there any particular reason why you chose feature number 3, or did you pick it at random?

Q4: I see, so you train on the real dataset first to find out which hyperparameters work best for CatBoost. Once you know which hyperparameters work well, you train a new CatBoost model with the same hyperparameters on the synthetic training set, test it on the real test set, and compare it to the previous CatBoost model's performance (the one that used only real data). This way we eliminate the possibility that the hyperparameters themselves influence CatBoost's performance, which would hurt the comparability of the real-real CatBoost with the synthetic-real CatBoost model. Correct? :D

And I also have some new questions: Q5: Why did you change the validation set (the --change_val flag)? Looking at the code, I saw that you use it to re-split y/x["train"] and y/x["val"] (data.py, change_val function). Did you do this to ensure that you have the same val_size across all datasets (I can imagine that they probably all have different split sizes), or was there another reason?

Q6: In tune_ddpm.py you find the best model by sampling and evaluating with 5 different seeds and calculating the average score of the 5 runs. I assume you set "eval_model" to [mlp|catboost] to find the DDPM hyperparameters that synthesize a dataset which works best with either MLP or CatBoost (stored in exp/adult/ddpm_mlp_best/ and exp/adult/ddpm_cb_best respectively), correct? And for the final model used in the paper, you just used the CatBoost evaluation tuning option, correct?

Q7: If you find the best hyperparameters in tune_ddpm.py by evaluating (tune_ddpm.py, line 94) with train_catboost() and train_mlp() from pipeline.py, what is the purpose of running eval_seeds.py, where we also use train_catboost() and train_mlp()? Isn't eval_seeds.py doing basically the same sampling and evaluation that was previously done during tuning (except this time we just load the best model found)? Could we not just take the metric scores of the best model found during tuning in the tune_ddpm.py script? If not, could you point out the difference between the sampling + evaluation part in tune_ddpm.py and the one in eval_seeds.py?

Sorry for the long questions 😄 I am currently writing my master's thesis on tabular data synthesis and am planning to use your code for some further experiments, but I need to make sure I fully understand everything before I do that.

Cheers, Sven

rotot0 commented 1 year ago

Q1: We tried to present different types of features in that figure (different in terms of distribution as well), so there is nothing special about this feature except that it is a categorical feature with many categories.

Q4: Good question. We attempted to tune the CatBoost hyperparameters on synthetic data (in private experiments), but performance did not improve. So, yes, we just tune DDPM with respect to those "tuned-on-real" CatBoost hyperparameters.

Q5: We did a lot of experiments and tested a lot of ideas, so we did not want to overfit on the test data. That's why we tune DDPM on a changed validation split. Let's say we have train, val, test and train2, val2, which are the changed variants of train and val. The idea is that in our experiments we tuned DDPM on val2 and tested on val (we were okay with the possible leakage). In the final variant the following happens: DDPM is trained on train2 and tuned on the val2 score; then we use the tuned DDPM hyperparameters to train the final DDPM on train (note, no flag here); then we test it on test (val is only used for early stopping of CatBoost, I believe). It has nothing to do with the split sizes :)
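Schematically, the re-split is something like this (an illustrative sketch, not the actual change_val implementation in data.py):

```python
# Illustrative only: pool the original train and val parts and split them again
# with a different seed, so that tuning never scores on the original val split.
import numpy as np
from sklearn.model_selection import train_test_split

def make_changed_split(X_train, y_train, X_val, y_val, seed=0):
    X = np.concatenate([X_train, X_val])
    y = np.concatenate([y_train, y_val])
    val_frac = len(y_val) / len(y)  # keep the original validation size
    X_train2, X_val2, y_train2, y_val2 = train_test_split(
        X, y, test_size=val_frac, random_state=seed
    )
    return X_train2, y_train2, X_val2, y_val2
```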

Q6: I'm not sure what exactly you mean, but see Appendix A for context on the following explanation. ddpm_[mlp|cb]_best refers to the type of guidance, i.e., for cb we use the CatBoost score averaged over 5 sampling and 5 eval seeds to guide optuna; ddpm_mlp_best was only used to show that MLP guidance also works and that CatBoost guidance does not ruin MLP evaluation. The main protocol is CatBoost guidance with CatBoost evaluation.
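The guidance boils down to something like the following (a rough sketch, not the actual scripts/tune_ddpm.py; train_tabddpm, sample_synthetic and eval_catboost are hypothetical placeholders, and the searched parameters are only examples):

```python
# Illustrative only: "CatBoost guidance" means optuna maximizes a CatBoost
# score averaged over 5 sampling seeds and 5 evaluation seeds.
import numpy as np
import optuna

def make_objective(train_tabddpm, sample_synthetic, eval_catboost):
    def objective(trial):
        config = {
            "lr": trial.suggest_float("lr", 1e-5, 1e-2, log=True),
            "num_timesteps": trial.suggest_categorical("num_timesteps", [100, 1000]),
        }
        model = train_tabddpm(config)                              # placeholder
        scores = []
        for sample_seed in range(5):                               # 5 sampling seeds
            synthetic = sample_synthetic(model, seed=sample_seed)  # placeholder
            for eval_seed in range(5):                             # 5 evaluation seeds
                scores.append(eval_catboost(synthetic, seed=eval_seed))  # placeholder
        return float(np.mean(scores))  # the averaged score guides the search
    return objective

# study = optuna.create_study(direction="maximize")
# study.optimize(make_objective(...), n_trials=50)
```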

Q7: The first purpose is the train/val split from Q5. The second is that we use 10 eval seeds and 5 sampling seeds for the final evaluation (not 5 + 5). The third is that sometimes we use CatBoost guidance (i.e., the CatBoost score during tune_ddpm.py), but the final evaluation is done using MLP. However, you are correct in general; it is mostly done for my convenience :)
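Roughly, the final evaluation loop then looks like this (an illustrative sketch only; the callables passed in are placeholders, not the repo's actual functions, and the exact nesting of seeds is just one plausible reading):

```python
# Illustrative only: final evaluation with 5 sampling seeds and 10 evaluation
# seeds, compared to the 5 x 5 used during tuning.
import numpy as np

def final_evaluation(sample_synthetic, train_and_score, n_sample_seeds=5, n_eval_seeds=10):
    # sample_synthetic(seed) -> synthetic dataset; train_and_score(data, seed) -> metric
    # (both are placeholder callables)
    scores = []
    for sample_seed in range(n_sample_seeds):
        synthetic = sample_synthetic(seed=sample_seed)
        for eval_seed in range(n_eval_seeds):
            scores.append(train_and_score(synthetic, seed=eval_seed))
    return float(np.mean(scores)), float(np.std(scores))
```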

No worries, you ask good questions. Glad to help you 😀

SvenGroen commented 1 year ago

Thank you so much for answering my questions! You helped me a lot. I think for now all my questions are answered. But I might come back to you if something new comes up 😄

SvenGroen commented 1 year ago

Hey, 2 quick questions:

  1. Was there any particular reason why you chose 50 optuna trials during tuning? Was it set arbitrarily, or did you figure out through experiments that this is a good number of trials?

  2. Do you remember which train size you used for tuning DDPM on the Adult dataset? (python scripts/tune_ddpm.py [ds_name] [train_size] synthetic [catboost|mlp] [exp_name] --eval_seeds) You give an example for the churn2 dataset with a train size of 6500 in the readme – why was this number chosen? At the moment, I am just using the same size as the train dataset.

I am asking because I am currently trying to reduce my tuning time and both parameters affect the overall tuning time :D

Thanks in advance 😄!

rotot0 commented 1 year ago

Hi, sorry for the late answer.

  1. The higher the number of trials, the better. But it is quite computationally expensive, so we stuck with this number.
  2. You can use the actual train sizes (and it's better IMO), but I used these numbers in my experiments:

| dataset | train_size |
| --- | --- |
| abalone | 2670 |
| churn | 6500 |
| insurance | 850 |
| diabetes | 500 |
| wilt | 3100 |
| adult | 27000 |
| king | 13800 |
| california | 13000 |
| house | 13000 |
| default | 19200 |
| miniboone | 83000 |
| cardio | 45000 |
| higgs-small | 62700 |
JiangLei1012 commented 3 months ago

@SvenGroen Hi. Your conversation has helped me a lot! I have a question: I want to use this code on my own dataset, how do I go about it? Do I need to write a corresponding config file myself? How did you do it? Is it possible to share the details, or could you share your complete code? I would be grateful if you could! Thank you!