weecology / EvergladesSpeciesModel

A deepforest model for wading bird species prediction.
MIT License

revisit imbalance sampling #21

Closed bw4sz closed 2 years ago

bw4sz commented 2 years ago

@ethanwhite we need to re-assess whether the sampling is ideal. The weighted random sampler balances data within an epoch, not within a batch. The degree of balance therefore trades off against seeing the same rare samples repeatedly. Check out this example from the trees.

True frequency

>>> train.taxonID.value_counts()
QURU     128
ACRU     122
PIST      48
TSCA      40
BELE      22
BEPAP     17
FRAM2     11
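>>> for epoch in range(10):
...     labels = []
...     # one pass over the sampler yields one epoch of dataset indices
...     for idx in dm.train_sampler:
...         individuals, inputs, label = dm.train_ds[idx]
...         labels.append(label)
...     # count how often each class index was drawn in this epoch
...     counts = torch.tensor(labels).unique(return_counts=True)
...     print(counts)
...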
(tensor([0, 1, 2, 3, 4, 5, 6]), tensor([52, 57, 57, 66, 59, 52, 45]))
(tensor([0, 1, 2, 3, 4, 5, 6]), tensor([57, 41, 45, 58, 60, 68, 59]))
(tensor([0, 1, 2, 3, 4, 5, 6]), tensor([40, 60, 56, 61, 56, 57, 58]))
(tensor([0, 1, 2, 3, 4, 5, 6]), tensor([62, 51, 63, 50, 53, 52, 57]))
(tensor([0, 1, 2, 3, 4, 5, 6]), tensor([52, 52, 60, 57, 70, 50, 47]))
...
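
As a rough check on what these counts imply, here is a minimal sketch that estimates how many unique samples per class are never drawn in one epoch. It assumes (not confirmed above) that the sampler draws with replacement, that each epoch has as many draws as the dataset has samples (388, which matches the sums of the printed counts), and that draws are split evenly across the 7 classes. The names `expected_unique` and `expected_missed` are made up for illustration.

```python
# True class frequencies, copied from train.taxonID.value_counts() above.
counts = {"QURU": 128, "ACRU": 122, "PIST": 48, "TSCA": 40,
          "BELE": 22, "BEPAP": 17, "FRAM2": 11}

total = sum(counts.values())           # assumed number of draws per epoch
draws_per_class = total / len(counts)  # assumed even split across classes


def expected_unique(n, k):
    """Expected number of distinct items seen after k uniform draws
    with replacement from a pool of n items."""
    return n * (1 - (1 - 1 / n) ** k)


# Expected number of unique samples per class that an epoch never sees.
expected_missed = {
    name: n - expected_unique(n, draws_per_class)
    for name, n in counts.items()
}
total_missed = sum(expected_missed.values())

for name, missed in expected_missed.items():
    print(f"{name:6s} size={counts[name]:4d} expected missed={missed:6.1f}")
print(f"total expected missed per epoch: {total_missed:.1f} of {total}")
```

Under these assumptions the two largest classes account for most of the unseen samples, while the rare classes are seen almost in full and then repeated.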

So while we are achieving the balanced part of the goal, we are dropping a large share of the majority-class samples in every epoch: each class gets only about 55 draws, against 128 QURU and 122 ACRU samples available. It's possible that over hundreds of epochs this just comes out the same, but I am currently chasing sources of (massive) intra-run variance and I think this is a big part of it. I'm still reading and mulling over what the ideal setup is, but I think we want the minimum number of samples per epoch such that we see every unique data point in every epoch, padded with oversampled rare classes. So in the above case every class should have about 128 points per epoch, with the 128 from the majority class drawn essentially without replacement. I think this thread is orbiting the right idea:

https://discuss.pytorch.org/t/how-to-enable-the-dataloader-to-sample-from-each-class-with-equal-probability/911
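
To make the idea concrete, here is a minimal sketch of the epoch construction described above: every unique sample appears once, and smaller classes are padded with resampled copies up to the size of the largest class. This is not the sampler currently in the repo and not something from the linked thread; `balanced_epoch_indices` is a hypothetical helper written in plain Python, and the integer labels are stand-ins for the tree classes.

```python
import random
from collections import Counter, defaultdict


def balanced_epoch_indices(labels, seed=None):
    """Return one epoch of dataset indices in which every sample appears
    at least once and every class appears as often as the largest class."""
    rng = random.Random(seed)

    # Group dataset indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    target = max(len(indices) for indices in by_class.values())

    epoch = []
    for indices in by_class.values():
        # Every unique sample once (no replacement) ...
        epoch.extend(indices)
        # ... then pad the class up to the target with resampled copies.
        epoch.extend(rng.choices(indices, k=target - len(indices)))

    # Shuffle so classes are mixed across the epoch. Note this still only
    # balances within the epoch, not within each batch.
    rng.shuffle(epoch)
    return epoch


# Stand-in labels with the class sizes from the tree example above.
sizes = [128, 122, 48, 40, 22, 17, 11]
labels = [cls for cls, n in enumerate(sizes) for _ in range(n)]

epoch = balanced_epoch_indices(labels, seed=0)
per_class = Counter(labels[i] for i in epoch)
print(len(epoch), dict(per_class))
```

The epoch grows from 388 to 7 x 128 draws, so the rare classes are repeated heavily (each FRAM2 sample appears about a dozen times). To redraw the padding every epoch, a list like this could be generated inside the `__iter__` of a custom `torch.utils.data.Sampler`, which is an untested assumption about how it would be wired in here.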

This could all be for nothing, but I'm going to chase it for a couple of days.

bw4sz commented 2 years ago

Closing. I could not improve the model with any of the alternatives. There is still something to this, but it's not worth getting distracted by.