Have more balanced classes for training

zellerlab / stag

A hierarchical taxonomic classifier for metagenomic sequences

At the moment, when we train a node, we take all available genes from the positive and the negative class. This can result in an unbalanced training set, for example:

[2020-04-16 18:17:28,615]    TRAIN:"1729712 Candidatus Fermentibacteria":Find genes
[2020-04-16 18:17:28,639]       SEL_GENES:"1729712 Candidatus Fermentibacteria": 3 positive, 33086 negative
[2020-04-16 18:17:28,639]          TRAIN:"1729712 Candidatus Fermentibacteria":Train classifier

where we have 3 positive samples and 33,086 negative samples (genes).

We need to improve the function find_training_genes in create_db.py.

Partially solved in ba7aeae, where we do the following:

Limit the number of positive samples to 500 (sub-sample if there are more);
Limit the number of negative samples to 1,000 (sub-sample if there are more);
Sub-sample the negative samples if there are more than 20 times as many negatives as positives; this limit is reduced to 3 times if there is only one sibling (line 346);
Require at least 5 times as many negative as positive samples. If there are fewer, we pick additional negatives from outside the siblings: we randomly choose 5 positive samples, find the samples most similar to them outside the siblings, and add those to the negative samples we already have. Note (line 363): at kingdom level it is not possible to add samples from outside the siblings (and possible_neg = 0).
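
The rules above can be sketched roughly as follows. This is only an illustration, not the code from create_db.py: the function name balance_training_genes, its parameters, the constant names, and the order in which the rules are applied are all assumptions made for the example. The similarity search for negatives outside the siblings is not implemented; the sketch only reports how many extra negatives would be needed.

```python
import random

# Thresholds taken from the description above; the constant names are hypothetical.
MAX_POS = 500               # cap on positive samples
MAX_NEG = 1000              # cap on negative samples
MAX_RATIO = 20              # at most 20x as many negatives as positives
MAX_RATIO_ONE_SIBLING = 3   # stricter cap when there is only one sibling
MIN_RATIO = 5               # want at least 5x as many negatives as positives


def balance_training_genes(positives, negatives, n_siblings,
                           at_kingdom_level=False, seed=0):
    """Sub-sample positives/negatives and report how many negatives are missing.

    Hypothetical helper: the real find_training_genes may apply the rules
    in a different order or with additional conditions.
    """
    rng = random.Random(seed)
    pos = list(positives)
    neg = list(negatives)

    # Rules 1 and 2: hard caps on both classes.
    if len(pos) > MAX_POS:
        pos = rng.sample(pos, MAX_POS)
    if len(neg) > MAX_NEG:
        neg = rng.sample(neg, MAX_NEG)

    # Rule 3: cap the negative-to-positive ratio.
    max_ratio = MAX_RATIO_ONE_SIBLING if n_siblings == 1 else MAX_RATIO
    max_neg = max_ratio * len(pos)
    if len(neg) > max_neg:
        neg = rng.sample(neg, max_neg)

    # Rule 4: number of negatives that would have to come from outside the
    # siblings; at kingdom level there is nothing outside the siblings.
    if at_kingdom_level:
        missing_neg = 0
    else:
        missing_neg = max(0, MIN_RATIO * len(pos) - len(neg))

    return pos, neg, missing_neg
```

Under these assumptions, the "Candidatus Fermentibacteria" case from the log (3 positives, 33086 negatives, more than one sibling) would keep all 3 positives and 60 negatives, with no extra negatives required.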

Can we do better?

Have more balanced classes for training #8