Closed by SvenGroen 1 year ago.
Hi, thanks for your interest and questions!
Q1: "AD categorical_feature 3" means that we visualize the 3rd categorical feature from the adult dataset (`D.X_cat["train"][:, 3]`, if you have read the code). The picture itself is a histogram (i.e., frequencies of categories, in the case of a categorical feature). The higher the bar, the more frequent the category. We want the orange bars (synthetic data) to match the blue bars (real data) as closely as possible; that would mean our model captured the distribution of the features well.
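A minimal sketch of the comparison behind such a figure: compute normalized category frequencies (the bar heights) for a real and a synthetic column, and quantify how similar the bars are. The toy data and the use of total variation distance are illustrative assumptions, not taken from the repository.

```python
from collections import Counter

def category_frequencies(values):
    """Normalised frequency of each category (the heights of the histogram bars)."""
    counts = Counter(values)
    total = len(values)
    return {cat: n / total for cat, n in counts.items()}

# toy columns standing in for the real feature D.X_cat["train"][:, 3]
# and the corresponding column of a sampled synthetic dataset
real = ["a", "a", "b", "c", "a", "b"]
synthetic = ["a", "a", "a", "b", "c", "c"]

real_freq = category_frequencies(real)
syn_freq = category_frequencies(synthetic)

# one way to quantify "how similar the bars are": total variation distance
tv = 0.5 * sum(abs(real_freq.get(c, 0) - syn_freq.get(c, 0))
               for c in set(real_freq) | set(syn_freq))
```

The closer `tv` is to 0, the better the synthetic bars overlap the real ones.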
Q2: No. I plan to add it but not sure when.
Q3.1: You are right.
Q3.2: It should recreate all results; otherwise, please open an issue. In `exp/{ds}/ddpm_cb_best` you can find the final tuned hyperparameters. You can run `tune_ddpm.py` with the `--eval_seeds` flag to run evaluation after the model is tuned.
Q4: We tune the hyperparameters of the CatBoost model on real data before training TabDDPM, using cross-validation. You can find the hyperparameters in the `tuned_models/` folder (sorry, the naming may be confusing :)). We then use these hyperparameters to evaluate synthetic data (using `eval_seeds.py`). So `eval_seeds.py` takes configs from `tuned_models/` to init CatBoost, and then we train it on synthetic data and test on real data. The idea is that we find the best CatBoost model on real data and then use it to evaluate synthetic data. Originally, we wanted to use synthetic data in addition to real data, to enhance the performance of the best CatBoost model with these synthetic augmentations (in my code the word `merged` refers to this). But it did not work in our experiments.
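The protocol just described can be sketched in a few lines. This is a hedged illustration with stand-in names (`MajorityClassifier`, `accuracy`, the toy data); the real pipeline uses CatBoost with hyperparameters tuned on real data via cross-validation.

```python
from collections import Counter

class MajorityClassifier:
    """Stand-in for CatBoost: predicts the most frequent training label."""
    def __init__(self, **hyperparams):
        # in the real protocol these would be the "tuned-on-real" CatBoost params
        self.hyperparams = hyperparams
    def fit(self, X, y):
        self.label = Counter(y).most_common(1)[0][0]
        return self
    def predict(self, X):
        return [self.label] * len(X)

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Step 1: hyperparameters are tuned on REAL data (fixed here for the sketch).
best_params = {"depth": 6}

# Step 2: train a fresh model with those params on SYNTHETIC data ...
X_syn, y_syn = [[0], [1], [2]], [1, 1, 0]
model = MajorityClassifier(**best_params).fit(X_syn, y_syn)

# Step 3: ... and evaluate it on the REAL test split.
X_test, y_test = [[3], [4]], [1, 0]
score = accuracy(y_test, model.predict(X_test))
```

Keeping the hyperparameters fixed across the real-real and synthetic-real runs isolates the effect of the training data itself.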
Feel free to ask me anything if something is still unclear!
Thank you for the fast answer! You have already helped me a lot :)
I have some follow-up questions: Q1: Was there any particular reason why you chose feature number 3, or did you pick it at random?
Q4: I see, so you train on the real dataset first to find out which hyperparameters work best for CatBoost. Once you know which hyperparameters work well, you train a new CatBoost model with the same hyperparameters on the synthetic training dataset, test it on the real test set, and compare it to the previous CatBoost model's performance (the one that used only real data). This eliminates the possibility that the hyperparameters themselves influence CatBoost's resulting performance, which would hurt the comparability of the real-real CatBoost with the synthetic-real CatBoost model. Correct? :D
And I also have some new Questions:
Q5: Why did you change the validation set (the `--change_val` flag)? Looking at the code, I saw that you use it to re-split `y/x["train"]` and `y/x["val"]` (the `change_val` function in `data.py`). Did you do this to ensure that you have the same val_size across all datasets (I can imagine they probably all have different split sizes), or was there another reason?
Q6: In `tune_ddpm.py` you find the best model by sampling and evaluating for 5 different seeds and averaging the scores of the 5 runs. I assume you change "eval_model" to [mlp|catboost] to find the best DDPM hyperparameters for synthesizing a dataset that works best with either MLP or CatBoost (stored in `exp/adult/ddpm_mlp_best/` and `exp/adult/ddpm_cb_best/` respectively), correct? And for the final model used in the paper you used the CatBoost evaluation tuning option, correct?
Q7: If you already find the best hyperparameters in `tune_ddpm.py` by evaluating (`tune_ddpm.py`, line 94) with `train_catboost()` and `train_mlp()` from `pipeline.py`, what is the purpose of running `eval_seeds.py`, where we also use `train_catboost()` and `train_mlp()`? Isn't `eval_seeds.py` doing basically the same sampling and evaluation that was previously done during training (except this time we just load the best found model)? Could we not just take the metric scores of the best model found during training in the `tune_ddpm.py` script? If not, could you point out the difference between the sampling+evaluation part in `tune_ddpm.py` and the sampling+evaluation part in `eval_seeds.py`?
Sorry for the long questions 😄 I am currently writing my master thesis on tabular data synthesis and I am planning to use your code for some further experiments. But I need to make sure to fully understand everything before I do that.
Cheers, Sven
Q1: We tried to present different types of features in that figure (different in terms of distribution), so there is nothing special about this feature except that it is a categorical feature with many categories.
Q4: Good question. We attempted to tune CatBoost hyperparameters on synthetic data (in private experiments), but performance did not improve. So, yes, we just tune DDPM with respect to those "tuned-on-real" CatBoost hyperparameters.
Q5: We ran a lot of experiments and tested a lot of ideas, so we did not want to overfit on the test data. That's why we tune DDPM on a changed validation split. Say we have `train, val, test`, and `train2, val2` are the changed variants of `train, val`. In our experiments we tuned DDPM on `val2` and tested on `val` (we were okay with the possible leakage). But in the final variant the following happens: DDPM is trained on `train2` and tuned on the `val2` score; then we use the tuned DDPM hyperparameters to train the final DDPM on `train` (note, no flag here); then we test it on `test` (`val` is only used for early stopping for CatBoost, I believe). Nothing to do with the split sizes :)
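A minimal sketch of what a `change_val`-style re-split does: pool the original train and validation parts and re-split them with a different seed, keeping the validation size fixed. The function name and details here are illustrative; the repository's `data.py` may differ.

```python
import random

def change_val(train, val, seed=0):
    """Pool train+val and re-split with a new seed, keeping |val| fixed."""
    pooled = train + val
    rng = random.Random(seed)
    idx = list(range(len(pooled)))
    rng.shuffle(idx)
    n_val = len(val)
    val2 = [pooled[i] for i in idx[:n_val]]
    train2 = [pooled[i] for i in idx[n_val:]]
    return train2, val2

# toy indices standing in for dataset rows
train = list(range(8))
val = list(range(8, 10))
train2, val2 = change_val(train, val, seed=42)
```

The pool is unchanged, only the boundary between `train2` and `val2` moves, so tuning on `val2` leaves the original `val` usable as a held-out check.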
Q6: Not sure what exactly you mean, but read Appendix A to understand my further explanation. `ddpm_[mlp|cb]_best` refers to the type of guidance: for `cb` we use the CatBoost score averaged over 5 sampling and 5 eval seeds to guide optuna; `ddpm_mlp_best` was only used to show that MLP guidance also works, and that CatBoost guidance does not ruin MLP evaluation. The main protocol is CatBoost guidance and CatBoost evaluation.
Q7: The first purpose is the train/val split from Q5. The second is that we use 10 eval seeds and 5 sampling seeds for the final evaluation (not 5 + 5). The third is that sometimes we use CatBoost guidance (i.e., the CatBoost score during `tune_ddpm.py`) but the final evaluation is done with MLP. In general, though, you are correct; it is mostly done for my convenience :)
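The seed-averaged objective described above can be sketched as follows. `sample_and_score` is a hypothetical stand-in for sampling from the trained DDPM and training/evaluating the downstream model on the result; during tuning the grid is 5x5 seeds, while the final evaluation uses more eval seeds.

```python
def sample_and_score(sampling_seed, eval_seed):
    # placeholder: a deterministic pseudo-score for illustration only;
    # in the real protocol this would sample a synthetic dataset and
    # train/evaluate CatBoost (or MLP) on it
    return 0.8 + 0.001 * ((sampling_seed * 7 + eval_seed * 3) % 5)

def seed_averaged_score(n_sampling=5, n_eval=5):
    """Average the metric over all (sampling seed, eval seed) pairs."""
    scores = [sample_and_score(s, e)
              for s in range(n_sampling)
              for e in range(n_eval)]
    return sum(scores) / len(scores)

# this seed-averaged number is what guides the optuna search
objective = seed_averaged_score()
```

Averaging over seeds reduces the variance of the objective, so optuna compares hyperparameter settings rather than lucky samples.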
No worries, you ask good questions. Glad to help you 😀
Thank you so much for answering my questions! You helped me a lot. I think for now all my questions are answered. But I might come back to you if something new comes up 😄
Hey, 2 quick questions:
Was there any particular reason why you chose 50 optuna trials during tuning? Was it set arbitrarily, or did you determine by experiment that this is a good number of trials?
Do you remember which train size you used for tuning DDPM on the adult dataset? (`python scripts/tune_ddpm.py [ds_name] [train_size] synthetic [catboost|mlp] [exp_name] --eval_seeds`) You give an example for the churn2 dataset with a train size of 6500 in the README; why was this number chosen? At the moment, I am just using the same size as the train dataset.
I am asking because I am currently trying to reduce my tuning time and both parameters affect the overall tuning time :D
Thanks in advance 😄!
Hi, sorry for the late answer.

| Dataset | Train size |
|---|---|
| abalone | 2670 |
| churn | 6500 |
| insurance | 850 |
| diabetes | 500 |
| wilt | 3100 |
| adult | 27000 |
| king | 13800 |
| california | 13000 |
| house | 13000 |
| default | 19200 |
| miniboone | 83000 |
| cardio | 45000 |
| higgs-small | 62700 |
@SvenGroen Hi. Your conversation has helped me a lot! I have a question: I want to use this code on my own dataset; how do I go about it? Do I need to write a corresponding config file myself? How did you do it? Is it possible to share the details, or can you share your complete code? I would be grateful if you could. Thank you!
Hi! First of all: great work and well-written code! It's easy to follow and pretty self-explanatory.
I am trying to recreate your results from the paper and got the code running (only using the Adult dataset so far). I have a few questions and was wondering if you might be able to help me out:
Can you explain what is shown in Figure 2 of your paper? Let's take Adult (AD) as an example: Q1: What do you mean by categorical_feature 3?
Q2: Is the code to recreate Figures 2 & 3 also publicly available?
Q3.1: Did I understand the function of your scripts correctly: `tune_ddpm.py` is used to train multiple model versions with different hyperparameters by internally calling `pipeline.py` with different configs and storing the best found model at the end. The best model version can then be evaluated over multiple seeds using the `eval_seeds.py` script.
Q3.2: Is the above order of script execution correct to recreate your results?
Q4: I am not quite sure what the `tune_evaluation_model.py` script is used for? Aren't we already training the evaluation models in `eval_seeds.py`?
Cheers, Sven