Partially solved in ba7aeae, where we do the following:
limit the number of positive samples to 500 (sub-sample if there are more);
limit the number of negative samples to 1,000 (sub-sample if there are more);
sub-sample the negative samples if there are more than 20 times as many negative as positive samples; this threshold is reduced to 3 times if there is only one sibling (line 346).
We want to have at least 5 times more negative than positive samples. If there are fewer, we pick the missing ones from outside the siblings: we randomly choose 5 positive samples, find the samples most similar to them outside of the siblings, and add those to the negative samples we already have. Note (line 363): if we are at kingdom level, it is not possible to add samples from outside the siblings (and possible_neg = 0).
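The rules above can be sketched roughly as follows. This is not the code from ba7aeae; the function names, the argument names, and the order in which the limits are applied are assumptions made for illustration. The similarity search outside the siblings is left out, so the sketch only reports how many extra negatives would be needed.

```python
import random

# Limits taken from the description above.
MAX_POS = 500               # cap on positive samples
MAX_NEG = 1000              # cap on negative samples
MAX_RATIO = 20              # max negatives per positive
MAX_RATIO_ONE_SIBLING = 3   # max negatives per positive with a single sibling
MIN_RATIO = 5               # desired minimum negatives per positive


def subsample(items, limit, rng):
    """Return all items if there are at most `limit`, else a random subset of that size."""
    items = list(items)
    if len(items) <= limit:
        return items
    return rng.sample(items, limit)


def balance_training_genes(positives, negatives, n_siblings,
                           at_kingdom_level=False, seed=0):
    """Hypothetical helper: apply the caps and ratio rules.

    Returns (positives, negatives, missing_negatives), where the last value
    is how many negatives would have to come from outside the siblings.
    """
    rng = random.Random(seed)
    pos = subsample(positives, MAX_POS, rng)
    neg = subsample(negatives, MAX_NEG, rng)

    # Upper bound on the negative/positive ratio (stricter with one sibling).
    max_ratio = MAX_RATIO_ONE_SIBLING if n_siblings == 1 else MAX_RATIO
    neg = subsample(neg, max_ratio * len(pos), rng)

    # Lower bound: count the negatives still missing to reach 5x the positives.
    missing = max(0, MIN_RATIO * len(pos) - len(neg))
    if at_kingdom_level:
        # Nothing exists outside the siblings at kingdom level.
        missing = 0
    return pos, neg, missing


pos, neg, missing = balance_training_genes(range(3), range(33000), n_siblings=4)
print(len(pos), len(neg), missing)  # 3 60 0
```

With 3 positives and 33k negatives, the negatives are first cut to 1,000 and then to 20 × 3 = 60. Under the ordering assumed here, a node with a single sibling can end up below the 5× minimum after the 3× cap, in which case `missing` is greater than zero.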
At the moment, when we train a node, we take all possible genes from the positive and negative classes. This can result in an unbalanced training set, for example one
where we have 3 positive classes and 33k negative classes.
We need to improve the function find_training_genes in create_db.py.