Hi Yajie,
I have implemented the options "context", "ignore-label", and "map-label" for reading data in any format.
Suppose you have a "train.pfile" with 10 classes (0-9), and you want to do the following:
* Treat classes 3, 4, and 5 as one class, and class 6 as the other;
* Train a classifier for the two classes defined above, ignoring all other classes;
* Pad all the features with 5 frames of context on each side.
You can specify --train-data "train.pfile,context=5,ignore-label=0-2:7-9,map-label=3-5:0/6:1" to achieve this.
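To make the effect on the labels concrete, here is a minimal sketch in plain Python. The function name filter_and_map and the sample label list are made up for illustration, and the order of operations (ignore first, then map) is an assumption; it does not matter for this example because the ignored and mapped labels do not overlap.

```python
def filter_and_map(labels, ignore, mapping):
    """Drop frames whose label is in `ignore`, then relabel the rest.

    Returns (frame_index, new_label) pairs for the frames that are kept.
    Hypothetical helper for illustration, not the actual reader code.
    """
    kept = []
    for index, label in enumerate(labels):
        if label in ignore:
            continue  # this frame is not used for training
        kept.append((index, mapping.get(label, label)))
    return kept


ignore = set(range(0, 3)) | set(range(7, 10))  # ignore-label=0-2:7-9
mapping = {3: 0, 4: 0, 5: 0, 6: 1}             # map-label=3-5:0/6:1

labels = [0, 3, 6, 9, 5, 4, 2, 6]              # made-up frame labels
result = filter_and_map(labels, ignore, mapping)
print(result)  # [(1, 0), (2, 1), (4, 0), (5, 0), (7, 1)]
```

Frames 0, 3, and 6 are dropped because their labels (0, 9, 2) are in the ignored set; the remaining frames end up with label 0 or 1.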
Here, "context=5" can be replaced by "context=5:5" (specifying the left and right contexts separately), or "lcxt=5,rcxt=5" (as you originally supported for Kaldi feature files).
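The three equivalent spellings can be reduced to one (left, right) pair along these lines. This is only a sketch: parse_context is a hypothetical helper, and letting "lcxt"/"rcxt" override "context" when both are given is my assumption for the example, not a documented rule.

```python
def parse_context(options):
    """Return (left, right) context sizes from a dict of option strings.

    Hypothetical helper; accepts "context=N", "context=L:R",
    or separate "lcxt" / "rcxt" entries.
    """
    left = right = 0
    if "context" in options:
        parts = options["context"].split(":")
        if len(parts) == 1:
            left = right = int(parts[0])   # context=5
        else:
            left, right = int(parts[0]), int(parts[1])  # context=5:5
    # Assumption: explicit lcxt/rcxt take precedence over "context".
    if "lcxt" in options:
        left = int(options["lcxt"])
    if "rcxt" in options:
        right = int(options["rcxt"])
    return left, right


print(parse_context({"context": "5"}))             # (5, 5)
print(parse_context({"context": "5:5"}))           # (5, 5)
print(parse_context({"lcxt": "5", "rcxt": "5"}))   # (5, 5)
```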
The usage of punctuation marks is rather messy, but it can be summarized as:
* Commas separate options;
* Colons separate the numbers (or ranges) within the value of an option;
  the exception is "map-label", where slashes separate mappings, and colons separate the original labels from the mapped label;
* Dashes denote a range of labels.
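The punctuation rules above can be sketched as a small parser. All function names here are hypothetical and this is not the code in the repository; it also assumes labels are non-negative integers, since a leading minus sign would clash with the range dash.

```python
def parse_range_list(text):
    """Expand a colon-separated list of numbers/ranges, e.g. "0-2:7-9"."""
    values = []
    for part in text.split(":"):
        if "-" in part:
            low, high = part.split("-")
            values.extend(range(int(low), int(high) + 1))  # inclusive range
        else:
            values.append(int(part))
    return values


def parse_map_label(text):
    """Parse "3-5:0/6:1" into {3: 0, 4: 0, 5: 0, 6: 1}."""
    mapping = {}
    for rule in text.split("/"):           # slashes separate mappings
        source, target = rule.split(":")   # colon: original vs. mapped label
        for label in parse_range_list(source):
            mapping[label] = int(target)
    return mapping


def parse_data_spec(spec):
    """Split "file,opt=value,..." into the file name and an options dict."""
    fields = spec.split(",")               # commas separate options
    options = {}
    for field in fields[1:]:
        key, value = field.split("=", 1)
        options[key] = value
    return fields[0], options


path, options = parse_data_spec(
    "train.pfile,context=5,ignore-label=0-2:7-9,map-label=3-5:0/6:1")
print(path)                                       # train.pfile
print(parse_range_list(options["ignore-label"]))  # [0, 1, 2, 7, 8, 9]
print(parse_map_label(options["map-label"]))      # {3: 0, 4: 0, 5: 0, 6: 1}
```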
I tried to keep my implementation compatible with all existing functionality (e.g. both the stream and non-stream modes of pfile reading). I have tested it with pickle files and pfiles, but not with Kaldi files; if you have some Kaldi files, could you test it on them?