This is the official implementation of the paper "TabR: Unlocking the Power of Retrieval-Augmented Tabular Deep Learning" (arXiv).
After setting up the environment, use this notebook to browse the main results (for now, you can scroll to the last cell to get an idea of what it looks like).
For this project, we highly recommend using a conda-like environment manager instead of pip to get things right for the libraries that use CUDA, especially for Faiss. The available options:
Then, run the following commands (replace micromamba with mamba or conda if needed):
git clone https://github.com/yandex-research/tabular-dl-tabr
cd tabular-dl-tabr
micromamba create -f environment.yaml
micromamba activate tabr
If the micromamba create command fails, try using environment-simple.yaml instead of environment.yaml.
If your machine does not have GPUs, use environment-simple.yaml, but replace faiss-gpu with faiss-cpu and remove pytorch-cuda.
(License: we do not impose any new license restrictions in addition to the original licenses of the datasets used. See the paper to learn about the dataset sources.)
Navigate to the repository root and run the following commands:
wget https://huggingface.co/datasets/puhsu/tabular-benchmarks/resolve/main/data.tar -O tabular-dl-tabr.tar.gz
tar -xvf tabular-dl-tabr.tar.gz
After that, the data/ directory should appear.
When running scripts, the environment variable CUDA_VISIBLE_DEVICES must be explicitly set. We therefore assume that you run the following command before any other commands:
export CUDA_VISIBLE_DEVICES="0"
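
If you launch runs from your own Python wrapper, a small guard can catch a missing variable early. This is only a sketch: the helper name require_cuda_visible_devices is hypothetical and is not part of this repository.

```python
import os


def require_cuda_visible_devices(env=None):
    """Return the value of CUDA_VISIBLE_DEVICES or raise if it is not set.

    Hypothetical helper (not part of this repository).
    """
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        raise RuntimeError(
            'CUDA_VISIBLE_DEVICES is not set; run `export CUDA_VISIBLE_DEVICES="0"` first'
        )
    return value


# An explicit mapping is passed so that the example does not depend on the shell.
print(require_cuda_visible_devices({"CUDA_VISIBLE_DEVICES": "0"}))
```

Calling the function without arguments checks the real process environment instead.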
To check that the environment is configured correctly, run the following command and wait for the training to finish (in this experiment, the hyperparameters and results are extremely suboptimal; the run is needed only to test the environment):
python bin/ffn.py exp/debug/0.toml --force
The last line of the output log should look like this:
[<<<] exp/debug/0 | <date & time>
Here, we reproduce the results for MLP on the California Housing dataset (in the paper, this dataset is referred to as "CA"). Reproducing the results for other algorithms and datasets is very similar with rare exceptions, which are commented in further sections.
The detailed description of the repository is provided later in the "Understanding the repository" section. Until then, simply copying and pasting the instructions should just work.
Technically, reproducing the results for MLP on the California Housing dataset means reproducing the content of these directories:
exp/mlp/california/0-tuning is the result of the hyperparameter tuning.
exp/mlp/california/0-evaluation is the result of evaluation of the tuned configuration from the previous step. This configuration is evaluated under 15 random seeds, which produces 15 single models.
exp/mlp/california/0-ensemble-5 is the result of ensembles of the single models from the previous step (three disjoint ensembles each consisting of five models).
To reproduce the above results, run the following commands (takes up to 30-60 minutes on a single GPU):
cp exp/mlp/california/0-tuning.toml exp/mlp/california/0-reproduce-tuning.toml
python bin/go.py exp/mlp/california/0-reproduce-tuning.toml
In fact, 0-reproduce-tuning is an arbitrary name and you can choose a different one, but it must end with -tuning.
Once the run is finished, the following directories should appear:
exp/mlp/california/0-reproduce-tuning
exp/mlp/california/0-reproduce-evaluation
exp/mlp/california/0-reproduce-ensemble-5
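
The naming pattern of these three directories can be sketched as follows. The helper expected_outputs is hypothetical, and the rule it encodes is inferred only from the example paths above, so treat it as an illustration rather than the logic of bin/go.py.

```python
from pathlib import Path


def expected_outputs(tuning_config, ensemble_size=5):
    """Map a `*-tuning.toml` config to the three directories listed above.

    Hypothetical helper: the naming rule is inferred from the example paths.
    """
    config = Path(tuning_config)
    suffix = "-tuning"
    if config.suffix != ".toml" or not config.stem.endswith(suffix):
        raise ValueError("the config name must end with '-tuning.toml'")
    prefix = config.stem[: -len(suffix)]
    return [
        config.parent / f"{prefix}-tuning",
        config.parent / f"{prefix}-evaluation",
        config.parent / f"{prefix}-ensemble-{ensemble_size}",
    ]


for path in expected_outputs("exp/mlp/california/0-reproduce-tuning.toml"):
    print(path.as_posix())
```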
After that, you can go to notebooks/results.ipynb and view your results (see the instructions just before the last cell of that notebook).
Note that bin/go.py is just a shortcut and the above commands are equivalent to this:
cp exp/mlp/california/0-tuning.toml exp/mlp/california/0-reproduce-tuning.toml
python bin/tune.py exp/mlp/california/0-reproduce-tuning.toml
python bin/evaluate.py exp/mlp/california/0-reproduce-tuning
python bin/ensemble.py exp/mlp/california/0-reproduce-evaluation
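
To illustrate what the last step produces, here is a minimal sketch of building disjoint ensembles from single models. It assumes that models are grouped in seed order and that predictions are combined by a plain mean; the actual rule is defined in bin/ensemble.py and may differ.

```python
def disjoint_ensembles(predictions_by_seed, ensemble_size=5):
    """Average predictions within consecutive, non-overlapping groups of models.

    predictions_by_seed: one list of predictions per single model.
    Assumption: models are grouped in seed order and combined by a plain mean.
    """
    n_models = len(predictions_by_seed)
    if n_models % ensemble_size != 0:
        raise ValueError("the number of models must be divisible by ensemble_size")
    ensembles = []
    for start in range(0, n_models, ensemble_size):
        group = predictions_by_seed[start : start + ensemble_size]
        # zip(*group) iterates over objects: each item holds one prediction per model.
        ensembles.append([sum(values) / len(values) for values in zip(*group)])
    return ensembles


# Toy data: 15 "models", each predicting two objects.
toy = [[float(seed), 2.0 * seed] for seed in range(15)]
for ensemble in disjoint_ensembles(toy):
    print(ensemble)
```

With 15 single models and groups of five, this yields three ensembles, which matches the layout described for 0-ensemble-5.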
General comments:
notebooks/results.ipynb covers many (but not all) results from the paper with their locations in exp/.
Evaluating specific configurations without tuning.
To evaluate a specific set of hyperparameters without tuning, you can use bin/go.py (to evaluate single models and ensembles) or bin/evaluate.py (to evaluate only single models).
For example, this is how you can reproduce the results for the default XGBoost on the California Housing dataset:
mkdir exp/xgboost_/california/default2-reproduce-evaluation
cp exp/xgboost_/california/default2-evaluation/0.toml exp/xgboost_/california/default2-reproduce-evaluation/0.toml
python bin/go.py exp/xgboost_/california/default2-reproduce-evaluation --function bin.xgboost_.main
Note that now we have to explicitly pass the function that is being evaluated (--function bin.xgboost_.main).
Again, default2-reproduce-evaluation is an arbitrary name; the only requirement is that it ends with -evaluation.
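
A dotted path such as bin.xgboost_.main can be resolved to a callable with the standard importlib module. The sketch below shows one way to do it; import_function is a hypothetical name, and we do not claim that bin/go.py resolves the --function argument in exactly this way.

```python
import importlib


def import_function(qualname):
    """Resolve a dotted path such as 'bin.xgboost_.main' to a callable.

    Assumption: the last component is the function, the rest is the module.
    """
    module_name, _, function_name = qualname.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.function', got {qualname!r}")
    function = getattr(importlib.import_module(module_name), function_name)
    if not callable(function):
        raise TypeError(f"{qualname} is not callable")
    return function


# A standard-library function is used so that the example runs anywhere.
join = import_function("os.path.join")
print(join("exp", "xgboost_"))
```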
Custom versions of TabR.
In bin/, there are several versions of the model. Each of them has a corresponding directory in exp/ with configs and results. See "Code overview" to learn more.
k Nearest Neighbors. To reproduce the results on the California Housing dataset:
cp exp/neighbors/california/0.toml exp/neighbors/california/0-reproduce.toml
python bin/neighbors.py exp/neighbors/california/0-reproduce.toml
mkdir exp/knn/california/0-reproduce-evaluation
cp exp/knn/california/0-evaluation/0.toml exp/knn/california/0-reproduce-evaluation/0.toml
python -c "
path = 'exp/knn/california/0-reproduce-evaluation/0.toml'
with open(path) as f:
    config = f.read()
with open(path, 'w') as f:
    f.write(config.replace(
        ':exp/neighbors/california/0',
        ':exp/neighbors/california/0-reproduce'
    ))
"
python bin/knn.py exp/knn/california/0-reproduce-evaluation/0.toml
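
If you repeat this substitution for several datasets, the inline snippet can be turned into a small function. The sketch below is a hypothetical helper (repoint_config is not part of the repository), and the config key in the demo is made up for illustration. Because the new value starts with the old one, the function skips files that were already updated, so running it twice does not produce 0-reproduce-reproduce.

```python
import tempfile
from pathlib import Path


def repoint_config(path, old, new):
    """Replace `old` with `new` in a text config; return True if the file changed.

    Hypothetical helper mirroring the inline `python -c` snippet above.
    """
    path = Path(path)
    text = path.read_text()
    if new in text:
        return False  # already updated; replacing again would corrupt the value
    if old not in text:
        raise ValueError(f"{old!r} not found in {path}")
    path.write_text(text.replace(old, new))
    return True


# Demo on a temporary file; the key name "neighbors" is a made-up placeholder.
with tempfile.TemporaryDirectory() as tmp:
    config = Path(tmp) / "0.toml"
    config.write_text('neighbors = ":exp/neighbors/california/0"\n')
    old = ":exp/neighbors/california/0"
    new = ":exp/neighbors/california/0-reproduce"
    print(repoint_config(config, old, new))
    print(repoint_config(config, old, new))
    print(config.read_text().strip())
```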
DNNR.
First, you need to run bin/dnnr_precompute_scaling.py and obtain results similar to exp/dnnr/precomputed_scaling ("loo" and "ohe" differ only in how the categorical features are encoded; we choose the better of the two approaches at the next step based on the performance on the validation set).
Then, you need to run bin/dnnr.py; the corresponding configs are located in exp/dnnr/<dataset name>.
NPT. To evaluate NPT, we use the official repository with modifications to allow using our datasets and preprocessing.
Read this if you are going to do more experiments/research in this repository.
bin contains high-level scripts which produce the main results
  tabr.py is the "main" implementation of TabR with many useful technical comments inside
  tabr_scaling.py is the version of tabr.py with the support for the "context freeze" technique described in the paper
  tabr_design.py is the version of tabr.py with more options for testing various design decisions and doing ablation studies
  tabr_add_candidates_after_training.py is the version of tabr.py for evaluating the addition of new unseen candidates after the training as described in the paper
  ffn.py implements the general "feed-forward network" approach (currently, only the MLP backbone is available, but adding new backbones is simple)
  ft_transformer.py implements FT-Transformer from the "Revisiting Deep Learning Models for Tabular Data" paper
  xgboost_.py implements XGBoost
  lightgbm_.py implements LightGBM
  catboost_.py implements CatBoost
  neighbors.py + knn.py implement k Nearest Neighbors
  dnnr_precompute_scaling.py + dnnr.py implement DNNR from the "DNNR: Differential Nearest Neighbors Regression" paper
  saint.py implements SAINT from the "SAINT: Improved Neural Networks for Tabular Data via Row Attention and Contrastive Pre-Training" paper
  anp.py implements the model from the "Attentive Neural Processes" paper
  dkl.py implements the model from the "Deep Kernel Learning" paper
  tune.py tunes hyperparameters
  evaluate.py evaluates a given config over multiple (by default, 15) random seeds
  ensemble.py ensembles predictions produced by evaluate.py
  go.py is a shortcut combining [tune.py + evaluate.py + ensemble.py]
notebooks contains Jupyter notebooks
lib contains common tools used by the scripts in bin and the notebooks in notebooks
exp contains experiment configs and results (metrics, tuned configurations, etc.)
  Usually, for a script in bin, there is a corresponding directory in exp. However, this is just a convention, and you can have any layout in exp.
For most scripts in bin, the pattern is as follows:
python bin/some_script.py exp/a/b/c.toml
When the run is successfully finished, the result will be the exp/a/b/c folder. In particular, the exp/a/b/c/DONE file will be created. Usually, the main part of the result is the exp/a/b/c/report.json file.
If you want to run the script with the same config again and overwrite the existing results, use the --force flag:
python bin/some_script.py exp/a/b/c.toml --force
Some scripts (bin/tune.py and bin/go.py) support the --continue flag.
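
The output convention described above (exp/a/b/c.toml produces exp/a/b/c with the DONE and report.json files) makes it easy to check programmatically whether a run has finished. The following is a minimal sketch; load_report is a hypothetical helper, and the report content in the demo is a placeholder rather than a real report.

```python
import json
import tempfile
from pathlib import Path


def load_report(config_path):
    """Return the parsed report.json of a finished run, or None if it is not finished.

    Follows the convention above: exp/a/b/c.toml -> exp/a/b/c/{DONE,report.json}.
    """
    output = Path(config_path).with_suffix("")
    if not (output / "DONE").exists():
        return None
    return json.loads((output / "report.json").read_text())


# Demo with a fake run directory; the report content is a made-up placeholder.
with tempfile.TemporaryDirectory() as tmp:
    config = Path(tmp) / "exp" / "a" / "b" / "c.toml"
    output = config.with_suffix("")
    output.mkdir(parents=True)
    (output / "report.json").write_text(json.dumps({"placeholder": 1}))
    print(load_report(config))  # the DONE file is absent at this point
    (output / "DONE").touch()
    print(load_report(config))
```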
The following scripts have a command-line interface instead of configs:
bin/go.py
bin/evaluate.py
bin/ensemble.py
Configs contain the data section which describes the input dataset.
  Use y_policy = "standard" unless you are absolutely sure that you need another value.
  For deep learning algorithms, the data section should be copied from the MLP config for the same dataset. For example, for the California Housing dataset, this "source of truth" for deep learning algorithms is the exp/mlp/california/0-tuning.toml config.
Use the lib.dump_config and lib.load_config functions (defined in lib/util.py) instead of bare TOML libraries.
lib.get_path function (defined in lib/env.py).
Scripts in bin can be used as modules if needed: import bin.ffn. For example, this is used by bin/evaluate.py and bin/tune.py.
To apply the scripts from this repository to your custom dataset, you need to create a new directory in the data/ directory and use the same file names and data types as in our datasets.
A good example is the data/adult dataset where all supported feature types are present (numerical, binary and categorical).
The .npy files are NumPy arrays saved with the np.save function (see the NumPy documentation).
Let's say your dataset is called my-dataset. Then, create the data/my-dataset directory with the following content:
X_num_train.npy, X_num_val.npy, X_num_test.npy: np.float32
X_bin_train.npy, X_bin_val.npy, X_bin_test.npy: np.float32, with the values 0.0 and 1.0
X_cat_train.npy, X_cat_val.npy, X_cat_test.npy: np.str_ (yes, the values must be strings)
Y_train.npy, Y_val.npy, Y_test.npy: np.float32 for regression, np.int64 for classification, with values in [0, ..., n_classes - 1]
info.json -- a JSON file with the following keys:
  "task_type": one of "regression", "binclass", "multiclass"
  "name": any string (a "pretty" name for your dataset, e.g. "My Dataset")
  "id": any string (must be unique among all "id" keys of all info.json files of all datasets in data/)
READY -- just an empty file
At this point, your dataset is ready to use!
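
Before the first run, it can help to verify that a new dataset directory follows the layout above. The checker below is a minimal sketch using only the standard library. check_dataset_dir is a hypothetical helper, it does not inspect array contents or dtypes, and it assumes that a feature kind (num, bin, cat) may be absent as long as every kind that is present has all three parts.

```python
import json
import tempfile
from pathlib import Path

PARTS = ("train", "val", "test")
TASK_TYPES = ("regression", "binclass", "multiclass")


def check_dataset_dir(path):
    """Return a list of layout problems found in a dataset directory (empty if none)."""
    path = Path(path)
    problems = []
    for part in PARTS:
        if not (path / f"Y_{part}.npy").exists():
            problems.append(f"missing Y_{part}.npy")
    kinds_found = 0
    for kind in ("num", "bin", "cat"):
        present = [(path / f"X_{kind}_{part}.npy").exists() for part in PARTS]
        if any(present):
            kinds_found += 1
            if not all(present):
                problems.append(f"incomplete X_{kind}_*.npy")
    if kinds_found == 0:
        problems.append("no X_*.npy files")
    info_path = path / "info.json"
    if info_path.exists():
        info = json.loads(info_path.read_text())
        for key in ("task_type", "name", "id"):
            if key not in info:
                problems.append(f"info.json lacks the {key!r} key")
        if "task_type" in info and info["task_type"] not in TASK_TYPES:
            problems.append(f"unknown task_type: {info['task_type']!r}")
    else:
        problems.append("missing info.json")
    if not (path / "READY").exists():
        problems.append("missing READY")
    return problems


# Demo with empty placeholder files (real datasets need arrays saved with np.save).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "my-dataset"
    root.mkdir()
    for part in PARTS:
        (root / f"X_num_{part}.npy").touch()
        (root / f"Y_{part}.npy").touch()
    info = {"task_type": "regression", "name": "My Dataset", "id": "my-dataset"}
    (root / "info.json").write_text(json.dumps(info))
    print(check_dataset_dir(root))  # READY is still missing at this point
    (root / "READY").touch()
    print(check_dataset_dir(root))
```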
The "main" metric which is optimized in this repository is referred to as "score". Score is always maximized. By default:
In the _SCORE_SHOULD_BE_MAXIMIZED dictionary in lib/data.py, you can find other supported scores. To use any of them, set the "score" field in the [data] section of a config:
...
[data]
seed = 0
path = ":data/california"
...
score = "r2"
...
To implement a custom metric, add its name to the _SCORE_SHOULD_BE_MAXIMIZED dictionary and compute it in the lib/metrics.py:calculate_metrics function.
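
To make the "score is always maximized" convention concrete, here is a self-contained sketch. It does not reproduce the code in lib/data.py or lib/metrics.py: the dictionary below is a hypothetical stand-in that maps a metric name to whether larger values are better, the "rmse" entry is an assumed example, and negating metrics that should be minimized is our assumption about how such a convention is typically implemented.

```python
import math

# Hypothetical stand-in for the dictionary in lib/data.py:
# metric name -> whether larger values are better. The actual contents may differ.
SCORE_SHOULD_BE_MAXIMIZED = {"r2": True, "rmse": False}


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def score(name, y_true, y_pred):
    """Turn a metric into a score that is always maximized (assumed: negate if needed)."""
    value = {"rmse": rmse, "r2": r2}[name](y_true, y_pred)
    return value if SCORE_SHOULD_BE_MAXIMIZED[name] else -value


y_true, y_pred = [1.0, 2.0, 3.0], [2.0, 3.0, 4.0]
print(score("rmse", y_true, y_pred), score("r2", y_true, y_pred))
```

With this convention, model selection code can compare scores with a single "greater is better" rule regardless of the underlying metric.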
We do not provide instructions for adding new task types. While adding new task types is definitely possible, overall, the code is written without other task types in mind. For example, there may be places where the code implicitly assumes that the task is either regression or classification. So adding a new task type will require carefully reviewing the whole codebase to find places where the new task type should be taken into account.
@article{gorishniy2023tabr,
title={TabR: Unlocking the Power of Retrieval-Augmented Tabular Deep Learning},
author={
Yury Gorishniy and
Ivan Rubachev and
Nikolay Kartashev and
Daniil Shlenskii and
Akim Kotelnikov and
Artem Babenko
},
journal={arXiv},
volume={2307.14338},
year={2023},
}