@ethanwhite we need to re-assess whether the sampling is ideal. The weighted random sampler balances data within an epoch, not within a batch, so the degree of balance trades off against seeing the same rare samples repeatedly. Check out this example from the trees.
True frequency
>>> train.taxonID.value_counts()
QURU 128
ACRU 122
PIST 48
TSCA 40
BELE 22
BEPAP 17
FRAM2 11
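To make the tradeoff concrete, here is a rough simulation of one epoch. It is a sketch, not the actual training code: it assumes inverse-class-frequency weights and one draw per training sample with replacement, and it uses the standard library's `random.choices` as a stand-in for the weighted sampler. With 388 samples and 7 classes, each class expects about 388 / 7 ≈ 55 draws, so fewer than half of the 128 QURU samples can show up in any one epoch, while each of the 11 FRAM2 samples is drawn about 5 times.

```python
import random
from collections import Counter

# Class counts copied from the value_counts() output above.
counts = {"QURU": 128, "ACRU": 122, "PIST": 48, "TSCA": 40,
          "BELE": 22, "BEPAP": 17, "FRAM2": 11}

# One label per training sample (388 in total).
labels = [taxon for taxon, n in counts.items() for _ in range(n)]

# Assumption: each sample is weighted by the inverse of its class frequency.
weights = [1.0 / counts[taxon] for taxon in labels]

rng = random.Random(0)  # arbitrary seed, only for repeatability

# One simulated "epoch": len(labels) weighted draws with replacement.
epoch = rng.choices(range(len(labels)), weights=weights, k=len(labels))

drawn_per_class = Counter(labels[i] for i in epoch)        # total draws
unique_per_class = Counter(labels[i] for i in set(epoch))  # distinct samples seen

print("taxon  available  drawn  unique")
for taxon, n in counts.items():
    print(taxon, n, drawn_per_class[taxon], unique_per_class[taxon])
```

The `unique` column is the one to watch: for the big classes it sits well below `available`, which is the part of the data that goes unseen in that epoch.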
So while we are achieving the balanced part of the goal, we are dropping hundreds of the primary class on an epoch basis. It's possible that over hundreds of epochs this just comes out the same, but I am currently chasing sources of (massive) intra-run variance and I think this is a big part of it. I'm reading and mulling over what the ideal setup is, but I think we want the minimum number of samples such that we see every unique data point in every epoch, padded with oversampled rare classes. So in the above case every class should have about 128 points per epoch, but those 128 from the majority class should be basically without replacement. I think this is orbiting the right idea
https://discuss.pytorch.org/t/how-to-enable-the-dataloader-to-sample-from-each-class-with-equal-probability/911
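A minimal sketch of the scheme described above, in plain Python so it doesn't depend on our data loader. The function name `balanced_epoch_indices` is made up for illustration. It takes every sample exactly once, then pads each smaller class with random repeats (with replacement) until it matches the largest class, so the majority class is effectively sampled without replacement.

```python
import random
from collections import Counter, defaultdict

def balanced_epoch_indices(labels, rng=None):
    """Return shuffled sample indices for one epoch (hypothetical helper).

    Every sample appears at least once; smaller classes are padded with
    random repeats until each class matches the largest class count.
    """
    rng = rng or random.Random()
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    target = max(len(idxs) for idxs in by_class.values())

    epoch = []
    for idxs in by_class.values():
        epoch.extend(idxs)                                      # each unique sample once
        epoch.extend(rng.choices(idxs, k=target - len(idxs)))   # pad with repeats
    rng.shuffle(epoch)
    return epoch

# Usage with the class counts from the value_counts() output above.
counts = {"QURU": 128, "ACRU": 122, "PIST": 48, "TSCA": 40,
          "BELE": 22, "BEPAP": 17, "FRAM2": 11}
labels = [taxon for taxon, n in counts.items() for _ in range(n)]

epoch = balanced_epoch_indices(labels, random.Random(0))
per_class = Counter(labels[i] for i in epoch)
print(len(epoch), dict(per_class))
```

The cost is a longer epoch (7 × 128 = 896 indices instead of 388) and each FRAM2 sample showing up roughly 128 / 11 ≈ 12 times per epoch, so the repeated-rare-sample side of the tradeoff doesn't go away; it just stops costing us majority-class coverage.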
This could all be for nothing, but I'm going to chase it for a couple of days.