yandex-research / rtdl-num-embeddings

(NeurIPS 2022) On Embeddings for Numerical Features in Tabular Deep Learning
https://arxiv.org/abs/2203.05556
MIT License

How to use it to evaluate on other datasets and for other embedding algorithms? #11

Closed hedongyan closed 1 year ago

hedongyan commented 1 year ago

Should I convert the dataset to a CSV file, an Excel file, or some other format? Which lines or files should I change if I want to evaluate a new dataset and a new embedding algorithm while keeping the awesome hyperparameter tuning mechanism?

Yura52 commented 1 year ago

How to add new datasets

First, download and unpack the data as described here. This creates a new data/ directory in the repository, which contains the datasets used in the paper.
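Before adding a dataset, it can help to look at how an existing one is laid out. The sketch below is not part of the repository: describe_dataset_dir is a hypothetical helper that lists the files in a dataset directory and reports the shape and dtype of each .npy array, plus the parsed content of any .json file. It assumes the arrays were written with np.save and contain no pickled objects.

```python
import json
from pathlib import Path

import numpy as np


def describe_dataset_dir(path):
    """Summarize every entry in a dataset directory (hypothetical helper).

    Maps each file name to (shape, dtype) for .npy files, to the parsed
    content for .json files, and to None for anything else.
    """
    summary = {}
    for file in sorted(Path(path).iterdir()):
        if file.suffix == ".npy":
            # np.load refuses pickled object arrays by default.
            array = np.load(file)
            summary[file.name] = (array.shape, str(array.dtype))
        elif file.suffix == ".json":
            summary[file.name] = json.loads(file.read_text())
        else:
            summary[file.name] = None
    return summary


if __name__ == "__main__":
    # "data/california" is the dataset path mentioned in this thread.
    reference = Path("data/california")
    if reference.is_dir():
        for name, info in describe_dataset_dir(reference).items():
            print(name, info)
```

Running the same helper on your own directory afterwards and comparing the two summaries is a quick way to check that file names and dtypes line up with the reference dataset.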

Then, you have to add your dataset in the data/ directory following the format of other datasets. Let's say your dataset's name is iris. Then you should use np.save and create the directory data/iris with the following content:

Let's say you want to run the tuning & evaluation pipeline for MLP on your dataset. Then copy any existing config (for example, this one) and change the path inside the config to point to your dataset ("data/iris" instead of "data/california").
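The config edit described above can also be scripted. The following is a minimal sketch, not something the repository provides: retarget_config is a hypothetical helper that copies a config and replaces the old dataset path with the new one by plain text substitution. It assumes the path appears literally in the config; the name of the key that holds it is deliberately not assumed.

```python
from pathlib import Path


def retarget_config(src, dst, old_path, new_path):
    """Copy a config file, replacing old_path with new_path in its text.

    Hypothetical helper: this is a plain text substitution, so it also
    rewrites any longer string that merely contains old_path.
    """
    text = Path(src).read_text()
    if old_path not in text:
        raise ValueError(f"{old_path!r} does not occur in {src}")
    dst = Path(dst)
    # Create exp/<model>/<dataset>/ if it does not exist yet.
    dst.parent.mkdir(parents=True, exist_ok=True)
    dst.write_text(text.replace(old_path, new_path))
    return dst


if __name__ == "__main__":
    source = Path("exp/mlp/california/0_tuning.toml")
    if source.is_file():
        retarget_config(
            source,
            "exp/mlp/iris/0_tuning.toml",
            "data/california",
            "data/iris",
        )
```

It is still worth opening the new config afterwards, since any other dataset-specific settings it may contain are not touched by the substitution.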

Full script:

export CUDA_VISIBLE_DEVICES="0"
mkdir -p exp/mlp/iris
cp exp/mlp/california/0_tuning.toml exp/mlp/iris/0_tuning.toml
# Edit exp/mlp/iris/0_tuning.toml as described above before continuing.
python bin/tune.py exp/mlp/iris/0_tuning.toml
python bin/evaluate.py exp/mlp/iris/0_tuning 15
python bin/ensemble.py exp/mlp/iris/0_evaluation

How to add new embedding algorithms

I don't understand the question :) You can use bin/train4.py as a starting point.

Yura52 commented 1 year ago

Feel free to reopen the issue if needed.