yandex-research / rtdl-revisiting-models

(NeurIPS 2021) Revisiting Deep Learning Models for Tabular Data
https://arxiv.org/abs/2106.11959
Apache License 2.0

How to prepare a new dataset for use? #3

Closed: classopen24 closed this issue 3 years ago

classopen24 commented 3 years ago

Thanks for sharing this as open source! I am trying to figure out how to convert a new dataset into the format required for the models, and I can't seem to find code for that in the repo. For example, to take a csv file and then convert it into the right files required to input to the model. Could you please help?

Yura52 commented 3 years ago

We do not have code that transforms an arbitrary dataset to the required format; we did it for each dataset separately. So, the best thing you can do is to take our archive and transform your dataset to the format that is used in the archive, namely:

- N_train/val/test.npy, C_train/val/test.npy, and y_train/val/test.npy arrays with the numerical features, categorical features, and targets for each split;
- an info.json file with the dataset's metadata.

We usually inherited splits when they were available; otherwise, we created the splits ourselves.
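As an illustration of that format, here is a minimal conversion sketch. The input file name, the column names, the dtypes, and the random split are all assumptions for illustration, not the repository's own tooling; compare the results against an existing folder in the archive:

```python
# Hypothetical sketch: convert a CSV into the archive's N/C/y layout.
# The input file, column names, dtypes, and random split are assumptions.
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('my_dataset.csv')   # hypothetical input
num_cols = ['num_0', 'num_1']        # your numerical feature columns
cat_cols = ['cat_0']                 # your categorical feature columns
target = 'target'                    # your target column

# A random 64/16/20 split; inherit the official split instead if one exists.
df_train, df_test = train_test_split(df, test_size=0.2, random_state=0)
df_train, df_val = train_test_split(df_train, test_size=0.2, random_state=0)

out = Path('data/my_dataset')
out.mkdir(parents=True, exist_ok=True)
for part, part_df in (('train', df_train), ('val', df_val), ('test', df_test)):
    np.save(out / f'N_{part}.npy', part_df[num_cols].to_numpy(np.float32))
    np.save(out / f'C_{part}.npy', part_df[cat_cols].to_numpy(str))
    np.save(out / f'y_{part}.npy', part_df[target].to_numpy())
```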

classopen24 commented 3 years ago

I see; I should be able to create the N/C/y files myself thanks to your help. I think only a couple of other pieces remain.

Yura52 commented 3 years ago

> When looking at the dataset (from the Dropbox link), I see an "idx_test.npy" file in all the folders, and I don't know how it is used. Would you please clarify?

"idx"-files are not used (I think they are saved just as an additional result of manual splitting or if they were available in original sources)

> In info.json, what is the purpose of name vs. basename? And what are the allowed values for the "split" field? All the other options seem to map directly to items in the code, but I'm not sure about these three.

You can just set split=0, basename=<your dataset name>, and name={basename}___{split}. (Initially, the idea was that multiple splits could exist, and dataset folders had names like "adult0", but we used just one split for each dataset and removed the "0" suffix from all folder names for simplicity.)
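So for a hypothetical dataset folder data/my_dataset, these three fields would look like the snippet below; the remaining fields should be copied by analogy from an existing info.json in the archive:

```json
{
    "name": "my_dataset___0",
    "basename": "my_dataset",
    "split": 0
}
```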

classopen24 commented 3 years ago

Ok, will reopen if I run into any issues. Thanks.

Wajih88 commented 2 years ago

Hi, I was wondering: after the dataset preparation, what scripts should be run to get the results? I see that the tuning needs a toml file; how do we get one for a new dataset? Thanks for your help.

Yura52 commented 2 years ago

Hi! I recommend going through the tutorial on reproducing results. Compared to the tutorial, the only change needed for new datasets is the path in the tuning config. For example, if you copy this config, then you will have to replace path = 'data/california_housing' with path = 'data/<your dataset name>'.

Note that the parameter space for tuning may need some adjustments for datasets of a different "scale" (compared to the ones in this repository) to obtain the best results.
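In other words, the only required edit in the copied tuning config is a single line; here 'data/my_dataset' is a placeholder for your dataset folder:

```toml
path = 'data/my_dataset'  # was: path = 'data/california_housing'
```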

Wajih88 commented 2 years ago

Thank you for your answer. If I understand correctly, the config file provides those hyperparameters, and Optuna will try to retrieve the best ones? (Please correct me if I'm wrong.) Another question, if you may: you mentioned adjustments on "scale" in your answer. Did you use some rules or heuristics in this phase?

Yura52 commented 2 years ago

> will try to retrieve the best ones?

Yes, exactly. But you can still launch, say, bin/mlp.py with hyperparameters of your choice, without tuning.

> adjustments on "scale"

I mean that if your dataset size is significantly different from the sizes in the repository (for example, it contains only hundreds of objects or, the opposite, hundreds of millions of objects), then you may need to adjust the config. Otherwise, you can just copy the config for the dataset that is most similar to yours and use it as is.

g1644222 commented 2 years ago

Hello. I have a problem with the Yahoo and Microsoft datasets.

It is about how to use q_train/test/val.npy: I have checked the code, and there is no indication that these files are being used. How are you using them?

Yura52 commented 2 years ago

Hello.

The q_<...>.npy arrays contain query identifiers for the original ranking problems. In our project, we treat ranking problems as regression problems, so the identifiers are not used. Note that the identifiers in the train, validation, and test parts do NOT intersect, and this is a strict requirement for new ranking datasets as well. For example:

```python
In [15]: import numpy as np

In [16]: a = np.load('data/microsoft/q_train.npy')

In [17]: b = np.load('data/microsoft/q_val.npy')

In [18]: c = np.load('data/microsoft/q_test.npy')

In [19]: set(a) & set(b)
Out[19]: set()

In [20]: set(a) & set(c)
Out[20]: set()

In [21]: set(b) & set(c)
Out[21]: set()
```
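For a new ranking dataset, the same check can be scripted; here is a small sketch, where 'data/my_dataset' is a placeholder path:

```python
# Sketch: verify that query ids do not overlap across the three splits.
import itertools

import numpy as np

parts = {p: np.load(f'data/my_dataset/q_{p}.npy') for p in ('train', 'val', 'test')}
for a, b in itertools.combinations(parts, 2):
    overlap = np.intersect1d(parts[a], parts[b])
    assert overlap.size == 0, f'{a} and {b} share query ids, e.g. {overlap[:5]}'
```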

g1644222 commented 2 years ago

Thank you very much for your answer. Does this mean that the q_train/test/val.npy data will not be loaded during training? Also, is there anything else I should be aware of when creating the data?

Yura52 commented 2 years ago

> Does this mean that the q_train/test/val.npy data will not be loaded during training?

Yes, this is correct.

> Also, is there anything else I should be aware of when creating the data?

It should be enough to follow the format of other datasets (including data types). This comment can be helpful. Feel free to ask questions in case of any problems.
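For example, a quick sanity check of a prepared folder could look like the sketch below. The layout follows this thread, but the exact dtypes are assumptions, so compare against an existing dataset folder to be sure:

```python
# Sketch: sanity-check a prepared dataset folder.
# The expected dtypes (float32 features, string categories) are assumptions;
# verify them against an existing folder from the archive.
import json
from pathlib import Path

import numpy as np

root = Path('data/my_dataset')  # hypothetical folder
info = json.loads((root / 'info.json').read_text())
print(info['name'], info['basename'], info['split'])

for part in ('train', 'val', 'test'):
    y = np.load(root / f'y_{part}.npy')
    for prefix in ('N', 'C'):
        f = root / f'{prefix}_{part}.npy'
        if f.exists():  # C_* files may be absent if there are no categorical features
            x = np.load(f)
            assert len(x) == len(y), f'{f.name}: {len(x)} rows vs {len(y)} targets'
            print(f.name, x.shape, x.dtype)
```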