zaandahl / mewc-train

Docker implementation of TensorFlow and EfficientNet v2 for training wildlife camera trap classifiers
BSD 3-Clause "New" or "Revised" License

Unexpected drop in accuracy #2

Closed PetervanLunteren closed 10 months ago

PetervanLunteren commented 10 months ago

First of all: thanks for building this awesome repo! It is really helpful. I am running training with the default settings, except for N_SAMPLES=35000. At some point during stage 1/3, the accuracy drops from 0.94 to 0.03 in a single epoch. Any idea what is going on? See the console output below.

Saving class list to class_list.yaml
Dataset: Training Data
Number of images in the dataset: 875000
zebra                    35000
hyrax                    35000
gemsbok                  35000
cattle                   35000
rhinoceros               35000
porcupine                35000
springbok                35000
spotted hyaena           35000
elephant                 35000
caracal                  35000
baboon                   35000
klipspringer+steenbok    35000
hare                     35000
mongoose                 35000
kudu                     35000
brown hyaena             35000
cheetah                  35000
ostrich                  35000
leopard                  35000
bird                     35000
canid                    35000
african wild cat         35000
other                    35000
lion                     35000
giraffe                  35000
Name: Label, dtype: int64

Number of classes: 25
Dataset: Validation Data
Number of images in the dataset: 87608
cattle                   29470
elephant                 12575
giraffe                   6589
gemsbok                   5555
bird                      4714
african wild cat          3795
zebra                     3713
lion                      3472
hare                      3422
spotted hyaena            2953
ostrich                   2291
baboon                    2195
springbok                 1780
cheetah                    911
canid                      830
other                      819
brown hyaena               804
kudu                       565
rhinoceros                 330
mongoose                   238
porcupine                  203
leopard                    161
klipspringer+steenbok      158
caracal                     44
hyrax                       21
Name: Label, dtype: int64

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/efficientnet_v2/efficientnetv2-b0_notop.h5
24274472/24274472 [==============================] - 4s 0us/step
Epoch 1/15
218750/218750 [==============================] - 4416s 20ms/step - loss: 0.1289 - accuracy: 0.8160 - val_loss: 0.0899 - val_accuracy: 0.8761
Epoch 2/15
218750/218750 [==============================] - 4444s 20ms/step - loss: 0.0976 - accuracy: 0.8560 - val_loss: 0.0815 - val_accuracy: 0.8869
Epoch 3/15
218750/218750 [==============================] - 4426s 20ms/step - loss: 0.0888 - accuracy: 0.8682 - val_loss: 0.0805 - val_accuracy: 0.8882
Epoch 4/15
218750/218750 [==============================] - 4421s 20ms/step - loss: 0.0842 - accuracy: 0.8749 - val_loss: 0.0774 - val_accuracy: 0.8936
Epoch 5/15
218750/218750 [==============================] - 4429s 20ms/step - loss: 0.0816 - accuracy: 0.8787 - val_loss: 0.0767 - val_accuracy: 0.8947
Epoch 6/15
218750/218750 [==============================] - 4436s 20ms/step - loss: 0.0794 - accuracy: 0.8820 - val_loss: 0.0753 - val_accuracy: 0.8969
Epoch 7/15
218750/218750 [==============================] - 4415s 20ms/step - loss: 0.0777 - accuracy: 0.8845 - val_loss: 0.0726 - val_accuracy: 0.9004
Epoch 8/15
218750/218750 [==============================] - 4421s 20ms/step - loss: 0.0767 - accuracy: 0.8858 - val_loss: 0.0740 - val_accuracy: 0.8993
Epoch 9/15
218750/218750 [==============================] - 4419s 20ms/step - loss: 0.0758 - accuracy: 0.8875 - val_loss: 0.0746 - val_accuracy: 0.8987
Epoch 10/15
218750/218750 [==============================] - 4405s 20ms/step - loss: 0.0751 - accuracy: 0.8885 - val_loss: 0.0736 - val_accuracy: 0.8988
Epoch 11/15
218750/218750 [==============================] - 4407s 20ms/step - loss: 0.0749 - accuracy: 0.8891 - val_loss: 0.0760 - val_accuracy: 0.8962
Epoch 12/15
218750/218750 [==============================] - 4423s 20ms/step - loss: 0.0745 - accuracy: 0.8901 - val_loss: 0.0758 - val_accuracy: 0.8971
Epoch 13/15
218750/218750 [==============================] - 4405s 20ms/step - loss: 0.0743 - accuracy: 0.8903 - val_loss: 0.0760 - val_accuracy: 0.8974
Epoch 14/15
218750/218750 [==============================] - 4420s 20ms/step - loss: 0.0737 - accuracy: 0.8913 - val_loss: 0.0747 - val_accuracy: 0.8980
Epoch 15/15
218750/218750 [==============================] - 4412s 20ms/step - loss: 0.0738 - accuracy: 0.8911 - val_loss: 0.0741 - val_accuracy: 0.9000
Total trainable base-model layers: 91
21902/21902 [==============================] - 228s 10ms/step - loss: 0.0593 - accuracy: 0.9260
Frozen model: test loss, test acc: [0.05925498157739639, 0.9259656667709351]
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #
=================================================================
 efficientnetv2-b0 (Functional)  (None, 1280)          5919312

 base_dropout (Dropout)      (None, 1280)              0

 compression_bottleneck (Dense)  (None, 256)           327936

 top_dropout (Dropout)       (None, 256)               0

 classification (Dense)      (None, 25)                6425

=================================================================
Total params: 6,253,673
Trainable params: 5,658,061
Non-trainable params: 595,612
_________________________________________________________________
100%|██████████| 50/50 [07:59<00:00,  9.59s/it]
>>>> Stage: 1/3, target_shape: 224, dropout: 0.1, magnitude: 5, batch_size: 4
>>>> This stage runs from epoch 1 to 10 out of 50 total epochs
218750/218750 [==============================] - 7233s 33ms/step - loss: 0.0516 - accuracy: 0.9245 - val_loss: 0.0504 - val_accuracy: 0.9375 - lr: 2.4950e-05
21902/21902 [==============================] - 235s 11ms/step - loss: 0.0440 - accuracy: 0.9472
New best-performing epoch of model (size = 224px) saved as: /data/output/mewc_model_224px_best.h5
Epoch 2/2
218750/218750 [==============================] - 7240s 33ms/step - loss: 1.6605 - accuracy: 0.8278 - val_loss: 14.7829 - val_accuracy: 0.0332 - lr: 5.0000e-05
21902/21902 [==============================] - 229s 10ms/step - loss: 14.7626 - accuracy: 0.0346
Epoch 3/3
 16634/218750 [=>............................] - ETA: 1:46:36 - loss: 14.6078 - accuracy: 0.0395

When running a training on your example dataset with N_SAMPLES=4000, it worked perfectly. It also worked well when I trained on my own dataset with N_SAMPLES=10000.

I have a dataset of about 900,000 images in total, with classes ranging from a few hundred images to more than 100,000. Hence, I'd prefer not to downsample my dataset too much... Or am I misinterpreting N_SAMPLES? I assumed that it upsampled small classes and downsampled large ones.

Or do I need to adjust other default values too, if I'm training with a large value for N_SAMPLES?

Thanks in advance :)

PetervanLunteren commented 10 months ago

I just realised that I pulled the latest image (zaandahl/mewc-train) instead of version 1.0, as the documentation says (zaandahl/mewc-train:v1.0). Could this have an effect?

BTW: unrelated, but I believe there is a typo in the documentation, as the tag for version 1.0 is actually 1.0 instead of v1.0.

zaandahl commented 10 months ago

Hi Peter,

I haven't tested with a very high number of samples, and I'm not sure why the accuracy drops. It might be worth gradually increasing the number of samples from 10000 to see where the problem occurs. The learning rate looks like it changes drastically at the stage where accuracy is lost, so it might be worth testing different schedules for magnitudes and/or dropouts.
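For reference, one way to experiment with a gentler ramp is a plain warmup function that can be wrapped in a `tf.keras.callbacks.LearningRateScheduler`. This is only an illustrative sketch, not mewc-train's built-in schedule, and the peak/warmup values here are made-up examples:

```python
def gentle_warmup(epoch, peak_lr=1e-5, warmup_epochs=5):
    """Linearly ramp the learning rate over warmup_epochs, then hold at peak_lr.

    Illustrative values only; peak_lr and warmup_epochs are not mewc-train
    defaults and would need tuning against your own training run.
    """
    if epoch < warmup_epochs:
        return peak_lr * (epoch + 1) / warmup_epochs
    return peak_lr

# Usage sketch with Keras (callback receives epoch and current lr):
# cb = tf.keras.callbacks.LearningRateScheduler(lambda epoch, lr: gentle_warmup(epoch))
# model.fit(..., callbacks=[cb])
```

A flatter peak like this can help show whether the collapse is triggered by the jump from 2.5e-05 to 5.0e-05 visible in your log.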

The N_SAMPLES parameter samples each class with replacement up to that value, so sparse classes are oversampled and large classes are downsampled.
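In pandas terms, that balancing step can be sketched like this (a minimal stand-in, not mewc-train's actual code; the column names are hypothetical):

```python
import pandas as pd

N_SAMPLES = 5  # illustrative value; mewc-train defaults differ

# Toy dataset: one large class and one sparse class.
df = pd.DataFrame({
    "Label": ["zebra"] * 12 + ["caracal"] * 2,
    "File": [f"img_{i}.jpg" for i in range(14)],
})

# Sampling with replacement pulls every class to exactly N_SAMPLES rows:
# "caracal" (2 images) is oversampled, "zebra" (12 images) is downsampled.
balanced = df.groupby("Label").sample(n=N_SAMPLES, replace=True, random_state=42)
print(balanced["Label"].value_counts())
```

So a class with only 21 validation images, like hyrax, ends up with many duplicated rows in the training split once N_SAMPLES is large.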

I'll update the documentation with the correct version tag. The GitHub tag is v1.0.x but DockerHub uses 1.0.x, so I probably got the two confused. :)

Cheers, Zach

PetervanLunteren commented 10 months ago

Hi Zach,

It did the same thing with N_SAMPLES=15000 a bit further down. I'll try some different approaches. Thanks!

Cheers,

Peter