How to generate `train_v1.txt` for datasets such as Amazon-670k?

yourh / AttentionXML

Implementation for "AttentionXML: Label Tree-based Attention-Aware Deep Model for High-Performance Extreme Multi-Label Text Classification"


How to generate `train_v1.txt` for datasets such as Amazon-670k? #32

Closed celsofranssa closed 1 year ago

celsofranssa commented 1 year ago

The Amazon-670k dataset config has an additional parameter, `sparse: data/Amazon-670k/train_v1.txt`, and this file is not generated by the run_preprocess.sh script.

What is `train_v1.txt`, and how do I generate it?

yourh commented 1 year ago

It's the bag-of-words (BOW) feature file provided with the dataset.

celsofranssa commented 1 year ago

I need to generate this BOW feature file for different training/testing splits because I am applying k-fold cross-validation. Could you please give me directions on how to generate it?

yourh commented 1 year ago

This file and the raw text file data/Amazon-670K/train_texts.txt correspond to each other, so you can just apply the same partition to both files.
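One way to apply the same partition to both files can be sketched as follows. This is only an illustration, not the repository's own tooling: it assumes the two files are aligned line by line (one sample per line, no header line), and the helper names `kfold_indices` and `split_parallel_lines` are hypothetical.

```python
import random


def kfold_indices(n_samples, n_folds, seed=0):
    """Yield (train_idx, test_idx) lists of row indices, one pair per fold."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    # Deal the shuffled indices round-robin into n_folds groups.
    folds = [order[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test_idx = sorted(folds[k])
        train_idx = sorted(
            i for j, fold in enumerate(folds) if j != k for i in fold
        )
        yield train_idx, test_idx


def split_parallel_lines(text_lines, feature_lines, n_folds=5, seed=0):
    """Apply the same fold indices to two line-aligned lists.

    Assumption: text_lines[i] and feature_lines[i] describe the same sample.
    """
    if len(text_lines) != len(feature_lines):
        raise ValueError("files are not aligned: different number of lines")
    for train_idx, test_idx in kfold_indices(len(text_lines), n_folds, seed):
        yield {
            "train_texts": [text_lines[i] for i in train_idx],
            "train_features": [feature_lines[i] for i in train_idx],
            "test_texts": [text_lines[i] for i in test_idx],
            "test_features": [feature_lines[i] for i in test_idx],
        }
```

In practice the two lists would come from reading `train_texts.txt` and `train_v1.txt` with `readlines()`, and each fold's lists would be written back out to per-fold files. If the feature file turns out to carry a header line, it would have to be stripped before splitting.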

celsofranssa commented 1 year ago

> This file and the raw text file data/Amazon-670K/train_texts.txt correspond to each other, so you can just apply the same partition to both files.

And how could I do the same for the other folds?

celsofranssa commented 1 year ago

I was able to generate this feature file by running TfidfVectorizer on the texts and dumping the output in svmlight format. I hope that's correct.
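A minimal sketch of that approach with scikit-learn is shown below. Several things here are assumptions rather than confirmed details of the dataset or of AttentionXML: the helper name `write_tfidf_svmlight` is hypothetical, the vectorizer uses default settings (the vocabulary and weighting behind the original `train_v1.txt` are not stated in this thread), the label column is filled with placeholder zeros on the assumption that labels are kept in a separate file, and feature indices are written zero-based.

```python
import numpy as np
from sklearn.datasets import dump_svmlight_file
from sklearn.feature_extraction.text import TfidfVectorizer


def write_tfidf_svmlight(train_texts, test_texts, train_path, test_path):
    """Fit TF-IDF on the training fold and dump both folds in svmlight format."""
    vectorizer = TfidfVectorizer()  # default settings: an assumption
    # Fit on the training fold only, so the test fold does not leak into
    # the vocabulary or the IDF weights.
    x_train = vectorizer.fit_transform(train_texts)
    x_test = vectorizer.transform(test_texts)

    # Placeholder targets: dump_svmlight_file requires a y argument.
    # Assumption: the real labels live in a separate label file.
    y_train = np.zeros(x_train.shape[0])
    y_test = np.zeros(x_test.shape[0])

    # zero_based=True is an assumption about the expected index convention.
    dump_svmlight_file(x_train, y_train, train_path, zero_based=True)
    dump_svmlight_file(x_test, y_test, test_path, zero_based=True)
    return vectorizer
```

Calling this once per fold, with that fold's training and testing texts, would give one pair of feature files per split. Whether features built this way match the ones shipped with the dataset is not confirmed here.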