Age accuracy issues - For both IMDB and UTK

idanh commented 6 years ago

Hi,

I've been doing some simple analysis of both pre-trained models, validating on the appa-real dataset. Here's what I'm doing:

For each cropped and rotated image (i.e. files with the _face.jpg suffix):

Resize it to 256x256.
Create 2 copies of each image, one gray-scaled and the other RGB.
Pass each copy through the dlib frontal face detector.
If needed, crop the face: for UTK with margin=0, and for IMDB with a margin of .4 (the default in your code). For the margin I did a grid search and found that those values work (this is also noted in your documentation, and after looking at UTKFace I'm assuming that's how the models were trained).
At the prediction step the image is resized to 64x64; this part is copy-pasted from the demo.py code.
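The margin step in the list above can be sketched roughly as follows. This is only an illustration: it assumes the detector gives a box as (x1, y1, x2, y2) pixel coordinates, and the helper name `crop_with_margin` is hypothetical rather than taken from demo.py.

```python
import numpy as np


def crop_with_margin(img, box, margin=0.4):
    """Crop a face box from an H x W x C image, enlarged by `margin` on each side.

    `box` is (x1, y1, x2, y2) in pixels (assumed format); the enlarged box
    is clamped so that it never leaves the image.
    """
    img_h, img_w = img.shape[:2]
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    xw1 = max(int(x1 - margin * w), 0)
    yw1 = max(int(y1 - margin * h), 0)
    xw2 = min(int(x2 + margin * w), img_w - 1)
    yw2 = min(int(y2 + margin * h), img_h - 1)
    return img[yw1:yw2 + 1, xw1:xw2 + 1]


image = np.zeros((256, 256, 3), dtype=np.uint8)  # stand-in for a 256x256 image
face = crop_with_margin(image, (64, 64, 192, 192), margin=0.4)   # IMDB-style margin
tight = crop_with_margin(image, (64, 64, 192, 192), margin=0.0)  # UTK-style, no margin
```

The crop would then be resized to the model's input size before prediction.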

Note that I'm using real_age as supplied by the appa dataset and not the expected value of apparent_age. In total I've sampled 1978 images, and I see that:

Both models struggle with younger and older ages. IMDB handles them a bit better, for reasons I have yet to check (I have some assumptions, though). The predictions from both models fall roughly in the range 12-68, whereas the ages in my data range from 0 to 100.
Using gray-scaled images gives better results, even though I'm assuming that at least the UTK model was trained on color images.

I'm attaching some plots I've made. My goal is to find out whether I'm doing something incorrectly, or whether the accuracy is just a bit off. Would it be best to re-train on another dataset?

[Four plot screenshots, 2018-07-29]

Again, thank you @yu4u for this project, I'm assuming I'm at fault here. Would love to get your feedback.

idanh commented 6 years ago

I'd also add that, if it's not apparent from the plots, both models are under-predicting, sometimes by a margin of 20 years, and it gets worse as the age increases. I can plot the residuals to show the effect, but I think the plots above show this clearly. Let me know if I can help by sharing more data.

yu4u commented 6 years ago

Thank you for your useful results! I'm surprised to see 1) that gray-scaled images work better and 2) that the model trained on the utk dataset does not perform better than the model trained on the imdb dataset.

Let me confirm the following description:

Create 2 copies of each image, one is gray scaled and the other RGB

The input to the model should be BGR images (not RGB) because the training images are loaded via OpenCV. If you are feeding RGB images, changing the channel order might solve the problem.
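A minimal sketch of the channel-order change, assuming the image was loaded as an RGB NumPy array (for example via PIL) rather than through OpenCV. Reversing the last axis swaps RGB and BGR; the helper name `rgb_to_bgr` is just for illustration.

```python
import numpy as np


def rgb_to_bgr(img_rgb):
    """Reverse the channel axis of an H x W x 3 array (RGB <-> BGR)."""
    return np.ascontiguousarray(img_rgb[..., ::-1])


rgb = np.zeros((2, 2, 3), dtype=np.uint8)
rgb[..., 0] = 255          # pure red when read in RGB order
bgr = rgb_to_bgr(rgb)      # red now sits in the last channel
```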

Both models are struggling with younger and older ages

This may be caused by 1) the distribution of ages in the training dataset and 2) the estimated age being calculated as an expectation (sum_{age} age * P(age)) instead of argmax_{age} P(age).
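To make the difference between the two estimators concrete, here is a small sketch. It assumes the model outputs a probability vector over the 101 age classes 0..100; the toy distribution is made up for illustration.

```python
import numpy as np

ages = np.arange(101)  # age classes 0..100


def expected_age(probs):
    """Expectation: sum over ages of age * P(age)."""
    return float(np.dot(probs, ages))


def argmax_age(probs):
    """Mode: the single most probable age."""
    return int(np.argmax(probs))


# Toy distribution: most mass at 70, some mass at 30
probs = np.zeros(101)
probs[70] = 0.6
probs[30] = 0.4

print(expected_age(probs), argmax_age(probs))
```

Here the expectation lands between the two peaks while argmax stays at the mode, which is one way the expectation can pull predictions away from the extremes of the age range.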

idanh commented 6 years ago

@yu4u thank you! I was surprised by the results as well, so I posted here to get your feedback, and I applied your suggestions (regarding argmax, I don't know why I didn't check that first.. :))

Regarding the distribution of ages in the training dataset: for IMDB my assumption was that it may pose an issue, but I thought the ages in UTK faces were distributed evenly? If not, I'd guess the model is biased. Did you get the chance to train with APPA-REAL?

In any case, I applied your suggestion and you were correct about the BGR images (I'm assuming the same input for both models, as they were trained using the same code?)

Here are my results still using appa-real for validation:

As you can tell, the results with BGR images are slightly better than our best gray-scaled results from before.

Using argmax:

I'd say there is more variance with argmax. Using the expected value seems to be the correct way. Do you think there is any further improvement to be had in image cropping, or anything else I should take a look at?

idanh commented 6 years ago

Hey, some more updates.. To verify which age groups the IMDB model is mostly wrong on in its current iteration (after the fixes above), I did the following:

For each prediction, calculate the IQR of the discrete RV and use median ± iqr * wanted_confidence to get an age-range estimation, and from there I calculated 2 things:

Accuracy within age group
Average 'distance' between high and low (high-low)
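One possible reading of the procedure above, sketched in NumPy. It assumes each prediction is a probability vector over ages 0..100, that quantiles are taken from the cumulative distribution, and that a prediction counts as accurate when the real age falls inside the range; the function names are hypothetical.

```python
import numpy as np


def quantile_age(probs, q):
    """Smallest age whose cumulative probability reaches q."""
    cdf = np.cumsum(probs)
    return int(np.searchsorted(cdf, q))


def age_range(probs, wanted_confidence=0.9):
    """Return (low, high) = median -/+ IQR * wanted_confidence."""
    q1, median, q3 = (quantile_age(probs, q) for q in (0.25, 0.5, 0.75))
    spread = (q3 - q1) * wanted_confidence
    return median - spread, median + spread


def is_hit(probs, real_age, wanted_confidence=0.9):
    """Assumed accuracy criterion: the real age lies inside the range."""
    low, high = age_range(probs, wanted_confidence)
    return low <= real_age <= high


# Toy distribution over ages 0..100 with equal mass on five ages
probs = np.zeros(101)
probs[[20, 25, 30, 35, 40]] = 0.2
low, high = age_range(probs, 0.9)  # IQR = 35 - 25 = 10, median = 30
```

Under this reading, high - low equals 2 * IQR * wanted_confidence, so lowering wanted_confidence narrows the range and makes hits less likely.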

The following results show what I got on the same dataset appa-real that I'm using with the same images, where wanted_confidence = .9:

Read as: (From Age Inclusive, To Age Inclusive): (accuracy, mean distance around high-low points)

(0, 18): (38.94472361809046, 17.33668341708543),
(0, 30): (65.29100529100529, 17.288888888888888),
(0, 100): (65.3094026252344, 19.60567907848915),
(18, 30): (81.71140939597315, 17.18959731543624),
(18, 40): (81.30165289256198, 17.674586776859503),
(20, 30): (85.76923076923076, 17.33269230769231),
(20, 40): (83.6322869955157, 17.79932735426009),
(30, 40): (81.30530973451327, 18.601769911504423),
(30, 99): (66.84972541188218, 21.767348976535196),
(40, 50): (68.21428571428572, 21.239285714285714)

You can see that although the model works best from 20 to 30, it has a high deviation of 17 years on average. Setting wanted_confidence to .5 to get a lower deviation, we see that:

(0, 18): (14.949748743718594, 9.57286432160804),
(0, 30): (41.95767195767196, 9.612698412698412),
(0, 100): (42.72702919903563, 10.907045271899277),
(18, 30): (57.63422818791947, 9.60234899328859),
(18, 40): (57.74793388429752, 9.880165289256198),
(20, 30): (63.74999999999999, 9.682692307692308),
(20, 40): (61.32286995515695, 9.95067264573991),
(30, 40): (60.06637168141593, 10.393805309734514),
(30, 99): (45.63155267099351, 12.112830753869197),
(40, 50): (45.357142857142854, 11.82857142857143)

Accuracy drops but the deviation is a bit more manageable. I'm still not sure whether I have a bug somewhere; from the analysis I'm doing it all seems like it should work, but I get the feeling that I'm still doing something wrong.

@yu4u When you get the chance could you suggest next steps you think I should take to minimize the deviation and maximize accuracy of the model? Thank you.

yu4u commented 6 years ago

Thank you again for detailed analysis. I checked the age distribution of the appa-real dataset and found it is highly biased...

https://github.com/yu4u/age-gender-estimation/blob/master/appa-real/check_age_distribution.ipynb

idanh commented 6 years ago

@yu4u oh. That explains UTK results on >60 but not at <20 . Is it possible for you to plot imdb's?

yu4u commented 6 years ago

age distribution of the imdb dataset:

[figure_1: plot of the imdb age distribution]

idanh commented 6 years ago

@yu4u I think that's part of the issue, but since age group 18-40 gave me 80% accuracy at ±17 years, I think there's still some digging left to be done..

yu4u commented 6 years ago

age distribution of the utk dataset: https://github.com/yu4u/age-gender-estimation/blob/master/utkface/check_age_distribution.ipynb

yu4u commented 6 years ago

I think there are several things that can be done:

create a new dataset "utk + appa-real-train" for training
train a model using the imdb dataset and then finetune using "utk + appa-real-train"
change the model architecture: increase the input size (e.g. 224x224), deepen or widen the WideResNet (though I have wanted to change the base model from WideResNet to the models supported by keras.applications)
add several augmentations, e.g. related to contrast change

idanh commented 6 years ago

@yu4u When you say train and then finetune, do you mean train again with prior weights? Unfortunately I don't have the resources to train those models.. If I can be of any help in other ways to get more accuracy, I wouldn't mind doing more analysis. Thanks!

yu4u commented 6 years ago

When you say train and then finetune, do you mean train again with prior weights?

Yes.

If I can be of any help in other ways to get more accuracy

One widely used approach is test-time augmentation (TTA): at test time, augment the test images (e.g. flip and/or random crop and so forth) and take the mean of the results from the augmented images. This would boost the accuracy, but the gain might be marginal.
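A minimal TTA sketch using only a horizontal flip. It assumes a `predict_fn` that maps a batch of images (N x H x W x C) to a batch of class distributions, like a Keras model's predict method; `dummy_predict` below is a made-up stand-in so the example runs without a trained model.

```python
import numpy as np


def predict_with_tta(predict_fn, img):
    """Average the predicted distributions of an image and its left-right flip."""
    batch = np.stack([img, img[:, ::-1]])  # original + horizontally flipped copy
    probs = predict_fn(batch)              # shape (2, n_classes)
    return probs.mean(axis=0)


def dummy_predict(batch):
    """Stand-in model: a two-class 'distribution' driven by the left half's brightness."""
    p = batch[:, :, : batch.shape[2] // 2].mean(axis=(1, 2, 3))
    return np.stack([p, 1.0 - p], axis=1)


img = np.zeros((4, 4, 3))
img[:, :2] = 1.0                           # bright left half, dark right half
averaged = predict_with_tta(dummy_predict, img)
```

With a real model, the averaged distribution would then be turned into an age estimate in the same way as a single prediction.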

idanh commented 6 years ago

Thank you for your suggestions. I tried augmenting the images by applying affine transformations (scaling, rotation, translation). For each image I created 50 copies, where one is the original and the others are augmented versions of it. In total, I had around 200k pictures for validation. At the end I average the predictions and look at MAE again; it actually worsened the results, by around 25%. My guess is that this is due to how the model was trained. Did you train it using augmented data as well?

idanh commented 6 years ago

Adding some interesting results: for the age groups (0,2), (4,6), (8,13), (15,20), (25,32), (38,43), (48,53), (60, 99) described in https://arxiv.org/pdf/1710.02985.pdf (DeX, p10), with the IMDB pretrained model validated on appa-real for apparent age estimation (not real age), I get 62.2% accuracy for exact match and 85.6% for 1-off.
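For reference, exact and 1-off accuracy over these groups could be computed along the following lines. The groups leave gaps (e.g. ages 3 or 21-24), and the thread does not say how those were handled, so assigning such ages to the nearest group is purely an assumption of this sketch, as are the function names.

```python
AGE_GROUPS = [(0, 2), (4, 6), (8, 13), (15, 20), (25, 32), (38, 43), (48, 53), (60, 99)]


def age_to_group(age):
    """Index of the group containing `age`; ages in a gap go to the nearest group (assumption)."""
    def distance(bounds):
        low, high = bounds
        if low <= age <= high:
            return 0
        return min(abs(age - low), abs(age - high))

    return min(range(len(AGE_GROUPS)), key=lambda i: distance(AGE_GROUPS[i]))


def group_accuracy(real_ages, predicted_ages):
    """Return (exact, one_off): same group, and same or adjacent group."""
    exact = one_off = 0
    for real, pred in zip(real_ages, predicted_ages):
        diff = abs(age_to_group(real) - age_to_group(pred))
        exact += diff == 0
        one_off += diff <= 1
    n = len(real_ages)
    return exact / n, one_off / n


real = [1, 10, 28, 40, 70]   # made-up ages for illustration
pred = [2, 16, 30, 50, 27]
exact_acc, one_off_acc = group_accuracy(real, pred)
```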

yu4u commented 6 years ago

The default model was not trained using data augmentation. Would you try this model? https://github.com/yu4u/age-gender-estimation/releases/download/v0.5/weights.28-3.73.hdf5

The above model was trained with augmentation:

width_shift_range=0.1, height_shift_range=0.1, horizontal_flip=True,

The augmentation you tried might be a little bit too strong. Please try the same augmentation as at training time, or a weaker one.

idanh commented 6 years ago

@yu4u thanks for sharing! Yes of course. I'm using imgaug for augmentation, with Fliplr(.5) and Affine with translate_percent={"x": (-.1, .1), "y": (-.1, .1)}, which, if I understood the Keras definition correctly, should be equivalent to what you wrote.

Here are the results:

For 10 total copies (one original, the others augmented) I get:
For my last configuration, non-augmented, I get:

The augmented version has less variance, as you'd expect from bootstrapping, but they are still pretty comparable, which hints to me that I might have done something wrong with the augmentation?

yu4u commented 6 years ago

Hmm, the TTA seems to be correct. The results indicate that TTA is not effective in this case. Currently I do not have a good approach other than training using "utk + appa-real-train".

idanh commented 6 years ago

I think training on 128x128 would also boost accuracy a bit. Unfortunately this is something that could take me about a month (already tried doing so on my machine) and probably will fail before. Thank you for all the help, highly appreciate it!

idanh commented 6 years ago

Something didn't sit right with me about the results above, and I just noticed that I had loaded the same pretrained non-augmented imdb weights for both the augmented and non-augmented results. So I corrected that, and these are the results:

There is definitely an improvement. With the augmented IMDB model on age groups above I get 67.5% accuracy on exact and 86.5% on 1-off.

LC9 commented 6 years ago

Dear @idanh, how are you doing? I have a question: how did you produce the plots above? Did you run specific code?

Cheers

idanh commented 6 years ago

@LC9 Using seaborn and matplotlib with python

yu4u commented 6 years ago

Hi @idanh

I trained a model with "utk + appa-real-train": https://github.com/yu4u/age-gender-estimation/tree/master/age_estimation

Please note that the new model is different from the previous models (different base CNN and the new one only predicts ages).

idanh commented 6 years ago

@yu4u Going to test it right away! A few questions first so that I test it correctly:

Should I use augmentation and if so, is it the same?
What is the image size the model was trained on?
Should I use the same margin and cropping?
RGB or BGR?
(small edit :) InceptionResNetV2 or ResNet50?

Really appreciate your help. I'll post my results vs the older model, validated on the appa-real test set plus some contamination from IMDB, to see if it generalizes well.

Thank you!

yu4u commented 6 years ago

Thank you for the immediate reply!

Should I use augmentation and if so, is it the same?

No. Test-time augmentation might not be effective. BTW, I used different augmentation from the previous models. It can be found here: https://github.com/yu4u/age-gender-estimation/blob/master/age_estimation/generator.py#L12-L30

What is the image size the model was trained on?

224x224

Should I use the same margin and cropping?

Yes. I cropped faces with a 40% margin. The cropped faces in the appa-real dataset can be used as they are.

RGB or BGR?

BGR

(small edit :) InceptionResNetV2 or ResNet50?

ResNet50

Thanks!

idanh commented 6 years ago

Thank you again @yu4u I started testing the model but I think something is off with my cropping. If that is not too much trouble, do you mind sharing your cropping code for UTK?

And regarding augmentation, I'm interested in why you think TTA won't be effective if the model was trained using augmentation?

I loaded the model using:

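from keras.applications import ResNet50
from keras.layers import Dense
from keras.models import Model

# ResNet50 backbone with global average pooling, followed by a 101-way softmax over ages
base_model = ResNet50(include_top=False, weights='imagenet', input_shape=(224, 224, 3), pooling="avg")
prediction = Dense(units=101, kernel_initializer="he_normal", use_bias=False, activation="softmax",
                   name="pred_age")(base_model.output)

age_only_model = Model(inputs=base_model.input, outputs=prediction)
age_only_model.load_weights(age_only_weight_file)  # path to the downloaded weights, defined elsewhere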

yu4u commented 6 years ago

I started testing the model but I think something is off with my cropping.

Evaluation on UTK? In terms of evaluation on appa-real, there is no need for cropping.

If that is not too much trouble, do you mind sharing your cropping code for UTK?

Please refer to the following script. https://github.com/yu4u/age-gender-estimation/blob/master/utkface/create_db_utkface_with_margin.py

And regarding augmentation, I'm interested in why you think TTA won't be effective if the model was trained using augmentation?

From your evaluation results with TTA. Is there improvement on accuracy?

idanh commented 6 years ago

Evaluation on UTK?

I'm evaluating on appa test + adding .1% from UTK. Thank you for the margin code, seems like we're doing exactly the same margin.

From your evaluation results with TTA. Is there improvement on accuracy?

With the older model, yes, it did improve (see above). I'll have more complete results (with and without augmentation) to share tomorrow. For now, just by sampling pictures that I knew the older models had issues with, the current one is more certain about their age.

idanh commented 6 years ago

Hi, I took all test images from appa-real, to compare with above scores (same dataset) and plotted the results:

For augmented images (20 copies: 1 original + 19 augmented), with the augmentation defined as:

from imgaug import augmenters as iaa

iaa.Sequential([
    iaa.Fliplr(0.1),
    iaa.Sometimes(.5, iaa.Affine(
        scale=(.8, 1.2)
    )),
    iaa.Sharpen(alpha=(0, .2), lightness=(0.8, 1.2))
])

For non-augmented images:

My process is as follows:

Load each image with PIL and create a thumbnail (224,224).
Convert to RGB.
Convert to a numpy array and use dlib's detector to detect the face edges.
Resize the image to (224,224) if needed, since thumbnail preserves the aspect ratio. No scaling is done here.
Convert to BGR.
Do the prediction step; repeat for the 20x augmented images.

For the same age groups as before, with TTA, I see 67.3% exact and 85% 1-off accuracy, which is a bit lower than the older model.

I'd say that in total the results look better, maybe a bit noisier but with lower mae. Can you think of anything that could be improved?

yu4u commented 6 years ago

Thank you for your analysis.

Do you think on anything that could be improved?

Hmm... How about an ensemble with the previous models? There would be several things to be done in training (e.g. pre-training on IMDB, stronger augmentation). I cannot work on this for several days ;(

idanh commented 6 years ago

@yu4u I deeply thank you for your help and advice; it's been very helpful and very enjoyable! Please don't take my comments as TODOs; brainstorming through your project helped me learn and look at different approaches to solving this.

The results are very promising as well, and I'll continue improving and posting updates, either in a comment or wherever you think is the correct place.

Ensembling is something I've considered, might just do it as last resort. Fine tuning is another thing I was about to try, so might just do that first. Thank you again and have a nice night!

alvenchen commented 6 years ago

@idanh
Hi, thanks for your excellent work. Did you apply some smoothing method to your plots? They seem to show nice accuracy. What I do is simply plot the points (real_age, estimated_age), which gives a lot of noisy points (tested on the appa validation dataset).

yu4u commented 6 years ago

Hi @idanh, How's it going?

By using the Adam optimizer and label augmentation, the MAE improved to 4.410. If you have time, please try the following model:

https://github.com/yu4u/age-gender-estimation/releases/download/v0.5/age_only_resnet50_weights.061-3.300-4.410.hdf5

[figure: result]

idanh commented 6 years ago

Hi @yu4u, All good, How are you doing? Yes of course, I'll share some results tomorrow. Can you share which label augmentations you used? thank you!

@alvenchen I used seaborn. I don't think (or remember) any smoothing being done.

yu4u commented 6 years ago

Label augmentation is very simple but was found to be effective. Simply add Gaussian noise to the labels (ages):
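A rough sketch of what such label noise could look like. The noise scale `std`, the rounding and the clipping to 0..100 are assumptions made for illustration and are not copied from generator.py; the function name is hypothetical.

```python
import numpy as np


def augment_age_label(age, std=2.0, rng=None, max_age=100):
    """Add zero-mean Gaussian noise to an integer age label and clip to [0, max_age].

    `std` is an illustrative value, not the one used in the repository.
    """
    if rng is None:
        rng = np.random.default_rng()
    noisy = int(round(age + rng.normal(0.0, std)))
    return min(max(noisy, 0), max_age)


rng = np.random.default_rng(0)
noisy_labels = [augment_age_label(30, std=2.0, rng=rng) for _ in range(5)]
```

The noisy label would be drawn afresh each time a sample is used during training, so the same image sees slightly different targets across epochs.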

https://github.com/yu4u/age-gender-estimation/blob/master/age_estimation/generator.py#L63

idanh commented 6 years ago

@yu4u Hey, I re-ran the same analysis from before, and got the following:

Definitely an improvement over the last model, though my mae is 4.96, which is a bit far from the results you were seeing. I've used the same TTA as above, and I peeked at your code and didn't see any changes there.

P.S. I wonder if the increase in performance is due to the label augmentation. More specifically, we saw that age isn't distributed uniformly in the datasets, so augmenting should intuitively smooth the neighbourhood around the mode. Do you think it was the change in optimizer that accounted for the difference?

And thank you for sharing the new model!

Idan

yu4u commented 6 years ago

The improvement mainly comes from the label augmentation. Adam: MAE 5.511; Adam + label augmentation: MAE 4.410.

The label augmentation would prevent a model from overfitting. Simply using a Gaussian distribution as the target distribution instead of a one-hot vector might bring the same effect. The variance of the Gaussian distribution might have to vary from sample to sample according to the labelers' label variance (they are included in the annotation data).
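The Gaussian-target idea could be sketched like this, assuming 101 age classes (0..100) and a discretized, renormalized Gaussian centred on the labelled age. The function name and the default `sigma` are hypothetical; a per-sample `sigma` could be passed in to reflect the labelers' variance.

```python
import numpy as np


def gaussian_target(age, sigma=2.0, n_classes=101):
    """Soft target over age classes: a discretized Gaussian centred on `age`."""
    ages = np.arange(n_classes)
    weights = np.exp(-0.5 * ((ages - age) / sigma) ** 2)
    return weights / weights.sum()  # normalize so the target sums to 1


target = gaussian_target(30, sigma=2.0)       # replaces a one-hot vector at index 30
wide_target = gaussian_target(30, sigma=6.0)  # larger variance for a less certain label
```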

idanh commented 6 years ago

The variance of the Gaussian distribution might have to be vary from sample to sample according to the labelers' label variance (they are included in the annotation data).

I might have missed that in your code, mind pointing where you do that?

Thanks. I'll try and see what's causing my code to get a worse MAE. It might be because I'm using 1% UTK images.

yu4u commented 6 years ago

I might have missed that in your code, mind pointing where you do that?

Sorry for confusing you. That is possible future work to further improve the performance.

Riolite5 commented 5 years ago


Could you please explain what exactly you mean by confidence (wanted_confidence), and how it affects the mean and the standard deviation of the accuracy? Thanks

yu4u / age-gender-estimation

Age accuracy issues - For both IMDB and UTK #48