We do not have code that transforms an arbitrary dataset to the required format; we did it for each dataset separately. So the best thing you can do is to take our archive and transform your dataset to the format that is used in the archive, namely:
- N_train|val|test.npy — numerical features (np.float32)
- C_train|val|test.npy — categorical features (str)
- y_train|val|test.npy — target (np.float32 for regression, np.int64 for classification within range(0, n_classes))
- info.json — some information about the dataset (see the archive to learn what fields are required)

We usually inherited splits when they were available. Otherwise, we used:
- sklearn.model_selection.train_test_split (with a stratify argument for classification problems) with test_size=0.2 to split all objects into trainval and test
- sklearn.model_selection.train_test_split (with a stratify argument for classification problems) with test_size=0.2 (yes, the same as above) to split trainval into train and val
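The two-step split described above can be sketched as follows (the toy arrays, sizes, and random seed are illustrative, not from the repository):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for a real classification dataset (names are illustrative).
rng = np.random.default_rng(0)
N = rng.standard_normal((1000, 4)).astype(np.float32)  # numerical features
y = rng.integers(0, 2, size=1000).astype(np.int64)     # labels in range(0, 2)

# Step 1: split all objects into trainval and test.
N_trainval, N_test, y_trainval, y_test = train_test_split(
    N, y, test_size=0.2, stratify=y, random_state=0
)
# Step 2: split trainval into train and val with the same settings.
N_train, N_val, y_train, y_val = train_test_split(
    N_trainval, y_trainval, test_size=0.2, stratify=y_trainval, random_state=0
)

# Each part would then be saved, e.g. np.save('N_train.npy', N_train).
```

With 1000 objects this yields 640/160/200 objects in train/val/test, and stratify keeps the class proportions roughly equal across the parts.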
I see, I should be able to create the N/C/y files myself thanks to your help. I think only a couple of other pieces remain.
When looking at the dataset (from the Dropbox link), I see an "idx_test.npy" file in all the folders, and I don't know how it is used. Would you please clarify?
The "idx" files are not used (I think they were saved just as an additional result of manual splitting, or because they were available in the original sources).
In the info.json, what is the purpose of name vs. basename? And what are the valid values for the "split" field? All the other options seem to map directly to items in the code, but I'm not sure about these three.
You can just set split=0, basename=<your dataset name> and name={basename}___{split}. (Initially, the idea was that multiple splits could exist and dataset folders had names like "adult0", but we used just one split for each dataset and removed the "0" suffix from all folder names for simplicity.)
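Putting those three fields together for a hypothetical dataset called my_dataset, the relevant part of info.json would look like this (any other required fields from the archive are omitted here; check the archive for the full list):

```json
{
  "name": "my_dataset___0",
  "basename": "my_dataset",
  "split": 0
}
```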
Ok, will reopen if I run into any issues. Thanks.
Hi, I was wondering: after the dataset preparation, which scripts should be run to get the results? I see that the tuning needs a toml file; how do we get one for a new dataset? Thanks for your help.
Hi! I recommend going through the tutorial on reproducing results. Compared to the tutorial, the only change needed for new datasets is the path in the tuning config. For example, if you copy this config, then you will have to replace path = 'data/california_housing' with path = 'data/<your dataset name>'.
Note that the parameter space for tuning may need some adjustments for datasets of other "scale" (compared to the ones in this repository) for obtaining the best results.
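Concretely, for a hypothetical dataset stored in data/my_dataset, the edit to the copied tuning config is just the path line; the rest of the config stays as in the original:

```toml
path = 'data/my_dataset'  # was: path = 'data/california_housing'
```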
Thank you for your answer. If I understand correctly, the config file provides those hyperparameters, and Optuna will try to retrieve the best ones? (Please correct me if I'm wrong.) Another question, if I may: you mentioned in your answer adjustments on "scale". Did you use some rules or heuristics in this phase?
will try to retrieve the best ones?
Yes, exactly. But you can still launch, say, bin/mlp.py with hyperparameters of your choice, without tuning.
adjustments on "scale"
I mean that if your dataset size is significantly different from the ones in the repository (for example, it contains only hundreds of objects, or, the opposite, hundreds of millions of objects), then you may need to adjust the config. Otherwise, you can just copy the config for the dataset that is most similar to your dataset and use it as is.
Hello. I have a problem with the Yahoo and Microsoft datasets.
It is about how to use q_train/test/val.npy. I have checked the program and there is no indication that q_train/test/val.npy is being used. How are you using it?
Hello.
The q_<...>.npy arrays contain query identifiers for the original ranking problems. In our project, we treat ranking problems as regression problems, so the identifiers are not used. Note that the identifiers in the train, validation, and test parts do NOT intersect, and this is a strict requirement for new ranking datasets as well. For example:
In [15]: import numpy as np
In [16]: a = np.load('data/microsoft/q_train.npy')
In [17]: b = np.load('data/microsoft/q_val.npy')
In [18]: c = np.load('data/microsoft/q_test.npy')
In [19]: set(a) & set(b)
Out[19]: set()
In [20]: set(a) & set(c)
Out[20]: set()
In [21]: set(b) & set(c)
Out[21]: set()
Thank you very much for your answer. Does this mean that the q_train/test/val.npy data will not be loaded during training? Also, is there anything else I should be aware of when creating the data?
Does this mean that the q_train/test/val.npy data will not be loaded during training?
Yes, this is correct.
Also, is there anything else I should be aware of when creating the data?
It should be enough to follow the format of other datasets (including data types). This comment can be helpful. Feel free to ask questions in case of any problems.
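As a sanity check for the data types discussed in this thread, something like the following could be run on the prepared arrays before saving them (the function name is made up for this example):

```python
import numpy as np

def check_arrays(N, y, regression):
    """Verify that prepared arrays follow the dtypes used in the archive."""
    assert N.dtype == np.float32, 'numerical features must be np.float32'
    expected = np.float32 if regression else np.int64
    assert y.dtype == expected, f'target must be {np.dtype(expected).name}'
    if not regression:
        # Classification labels must lie in range(0, n_classes).
        assert y.min() == 0
        assert np.array_equal(np.unique(y), np.arange(y.max() + 1))

# Example usage on toy arrays.
N = np.zeros((10, 3), dtype=np.float32)
y = (np.arange(10) % 2).astype(np.int64)  # labels 0 and 1
check_arrays(N, y, regression=False)
```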
Thanks for sharing this as open source! I am trying to figure out how to convert a new dataset into the format required for the models, and I can't seem to find code for that in the repo. For example, to take a csv file and then convert it into the right files required to input to the model. Could you please help?