Open idanh opened 6 years ago
I'd also add that, if it's not apparent from the plots, both models are under-predicting, sometimes by a margin of 20 years, and it gets worse as age increases. I can plot the residuals to show the effect, but I think the plots above show this clearly. Let me know if I can help by sharing more data.
Thank you for your useful results! I'm surprised to see 1) that gray-scaled images work better and 2) that the model trained on the UTK dataset does not perform better than the model trained on the IMDB dataset.
Let me confirm the following description:
Create 2 copies of each image, one is gray scaled and the other RGB
The input to the model should be BGR images (not RGB) because the training images are loaded via OpenCV. If you are feeding RGB images, changing the channel order might solve the problem.
"Both models are struggling with younger and older ages": this may be caused by 1) the age distribution of the training dataset and 2) the estimated age being calculated as the expectation (sum_{age} age*P(age)) instead of argmax_{age} P(age).
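To illustrate the difference between the two estimators mentioned above, here is a minimal sketch. It assumes the model outputs a probability vector over the age bins 0..100; the function names and the toy distribution are made up for illustration and are not taken from the repository.

```python
import numpy as np

def expected_age(probs):
    """Expectation sum_age age * P(age) over the bins 0..len(probs)-1."""
    probs = np.asarray(probs, dtype=np.float64)
    ages = np.arange(probs.shape[0])
    return float(np.sum(ages * probs))

def argmax_age(probs):
    """Most probable age bin."""
    return int(np.argmax(probs))

# Toy bimodal distribution (hypothetical): the mass is split between ages 20 and 60.
probs = np.zeros(101)
probs[20] = 0.6
probs[60] = 0.4

print(expected_age(probs))  # lies between the two modes
print(argmax_age(probs))    # picks the larger mode
```

The expectation is pulled toward the middle of the age range whenever the distribution is spread out, which is one plausible reason for errors at the youngest and oldest ages.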
@yu4u thank you! I was surprised by the results as well, so I posted here to get your feedback, and I applied your suggestions (regarding argmax, I don't know why I didn't check that first.. :))
Regarding the distribution of age in the training dataset: for IMDB my assumption was that it may pose an issue, but with UTK faces I thought the ages were distributed evenly? If not, I'd guess the model is biased. Did you get the chance to train with APPA-REAL?
In any case, I applied your suggestion and you were correct about the BGR images (I'm assuming the same input for both models, as they were trained using the same code?)
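For reference, the channel-order fix discussed above can be sketched as follows, assuming the image is an H x W x 3 NumPy array in RGB order (for example, loaded with a library other than OpenCV). The helper name is hypothetical; with OpenCV available, cv2.cvtColor(image, cv2.COLOR_RGB2BGR) should give the same result.

```python
import numpy as np

def rgb_to_bgr(image):
    """Reverse the channel axis of an H x W x 3 array (RGB <-> BGR)."""
    return np.ascontiguousarray(image[..., ::-1])

# A tiny image that is pure red in RGB order.
rgb = np.zeros((2, 2, 3), dtype=np.uint8)
rgb[..., 0] = 255

bgr = rgb_to_bgr(rgb)
print(bgr[0, 0])  # red now sits in the last channel
```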
Here are my results still using appa-real for validation:
As you can tell the results are slightly better with BGR images over our best gray scaled results from before.
Using argmax:
I'd say there is more variance with argmax. Using the expected value seems to be the correct way. Do you think there is any further improvement to be had from image cropping, or anything else I should take a look at?
Hey, some more updates.. To verify which age groups the IMDB model is mostly wrong on in its current iteration (after the above fixes) I did the following:
For each prediction, calculate the IQR of the discrete RV and use median ± iqr * wanted_confidence to get an age-range estimation, and from there I calculated 2 things:
- Accuracy within age group
- Average 'distance' between high and low (high-low)
The following results show what I got on the same dataset appa-real that I'm using with the same images, where wanted_confidence = .9:
Read as: (From Age Inclusive, To Age Inclusive): (accuracy, mean distance around high-low points)
(0, 18): (38.94472361809046, 17.33668341708543),
(0, 30): (65.29100529100529, 17.288888888888888),
(0, 100): (65.3094026252344, 19.60567907848915),
(18, 30): (81.71140939597315, 17.18959731543624),
(18, 40): (81.30165289256198, 17.674586776859503),
(20, 30): (85.76923076923076, 17.33269230769231),
(20, 40): (83.6322869955157, 17.79932735426009),
(30, 40): (81.30530973451327, 18.601769911504423),
(30, 99): (66.84972541188218, 21.767348976535196),
(40, 50): (68.21428571428572, 21.239285714285714)
You can see that although the model works best from 20 to 30, it has a high deviation of 17 years on average. Setting wanted_confidence to .5 to get a lower deviation, we see that:
(0, 18): (14.949748743718594, 9.57286432160804),
(0, 30): (41.95767195767196, 9.612698412698412),
(0, 100): (42.72702919903563, 10.907045271899277),
(18, 30): (57.63422818791947, 9.60234899328859),
(18, 40): (57.74793388429752, 9.880165289256198),
(20, 30): (63.74999999999999, 9.682692307692308),
(20, 40): (61.32286995515695, 9.95067264573991),
(30, 40): (60.06637168141593, 10.393805309734514),
(30, 99): (45.63155267099351, 12.112830753869197),
(40, 50): (45.357142857142854, 11.82857142857143)
Accuracy drops but the deviation is a bit more manageable. I'm still not sure whether I have a bug somewhere. From the analysis I'm doing it all seems like it should work, but I'm getting the feeling that I'm still doing something off.
@yu4u When you get the chance could you suggest next steps you think I should take to minimize the deviation and maximize accuracy of the model? Thank you.
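The interval analysis described in this comment (median ± iqr * wanted_confidence, then accuracy and mean high-low distance) could be sketched roughly as below. This is a reconstruction under assumptions, not the code that produced the numbers above: quantiles are taken as the smallest age whose cumulative probability reaches q, and all function names are hypothetical.

```python
import numpy as np

def quantile_from_probs(probs, q):
    """Smallest age bin whose cumulative probability reaches q (assumed definition)."""
    cdf = np.cumsum(probs)
    return int(min(np.searchsorted(cdf, q), len(probs) - 1))

def age_interval(probs, wanted_confidence):
    """Return (low, high) = median -/+ iqr * wanted_confidence."""
    probs = np.asarray(probs, dtype=np.float64)
    q1 = quantile_from_probs(probs, 0.25)
    median = quantile_from_probs(probs, 0.50)
    q3 = quantile_from_probs(probs, 0.75)
    half_width = (q3 - q1) * wanted_confidence
    return float(median - half_width), float(median + half_width)

def interval_stats(prob_rows, real_ages, wanted_confidence):
    """Accuracy (% of real ages inside the interval) and mean high-low distance."""
    hits, widths = [], []
    for probs, age in zip(prob_rows, real_ages):
        low, high = age_interval(probs, wanted_confidence)
        hits.append(low <= age <= high)
        widths.append(high - low)
    return float(100.0 * np.mean(hits)), float(np.mean(widths))

# Toy prediction (hypothetical): equal mass on ages 20, 30, 40 and 50.
probs = np.zeros(101)
probs[[20, 30, 40, 50]] = 0.25
print(age_interval(probs, 0.5))
print(interval_stats([probs, probs], [25, 45], 0.5))
```

Under this definition the interval width is 2 * iqr * wanted_confidence, so lowering wanted_confidence shrinks the range and, as seen in the results above, lowers the accuracy.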
Thank you again for the detailed analysis. I checked the age distribution of the appa-real dataset and found it is highly biased...
https://github.com/yu4u/age-gender-estimation/blob/master/appa-real/check_age_distribution.ipynb
@yu4u oh. That explains UTK results on >60 but not at <20 . Is it possible for you to plot imdb's?
age distribution of the imdb dataset:
@yu4u I think that's part of the issue, but since age group 18-40 gave me 80% accuracy at ±17 years, I think there's still some digging left to be done..
age distribution of the utk dataset: https://github.com/yu4u/age-gender-estimation/blob/master/utkface/check_age_distribution.ipynb
I think there are several things that can be done:
@yu4u When you say train and then finetune, do you mean train again with prior weights? Unfortunately I don't have the resources to train those models.. If I can be of any help in other ways to get more accuracy I wouldn't mind doing more analysis. Thanks!
When you say train and then finetune, do you mean train again with prior weights?
Yes.
If I can be of any help in other ways to get more accuracy
One approach widely used is test-time augmentation (TTA); at test time, augment test images (e.g. flip and/or random crop and so forth), and take the mean of the results from the augmented images. This would boost the accuracy but it might be marginal.
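A minimal sketch of the test-time augmentation idea described above, assuming a predict function that maps a batch of images to per-image age distributions. Here dummy_predict is a hypothetical stand-in for the real model's predict call, and only a horizontal flip is used as the augmentation.

```python
import numpy as np

def horizontal_flip(image):
    """Mirror an H x W x C image left to right."""
    return image[:, ::-1, :]

def predict_with_tta(predict_fn, image, augmentations):
    """Average the predicted distributions over the original image and its augmented copies."""
    batch = np.stack([image] + [aug(image) for aug in augmentations])
    probs = predict_fn(batch)  # shape: (n_copies, n_bins)
    return probs.mean(axis=0)

def dummy_predict(batch):
    # Hypothetical stand-in for a model: two "age bins" chosen by the top-left pixel.
    left_bright = batch[:, 0, 0, 0] > 0
    return np.stack([left_bright, ~left_bright], axis=1).astype(np.float64)

image = np.zeros((2, 2, 3))
image[:, 0, :] = 1.0  # bright left column, dark right column

avg = predict_with_tta(dummy_predict, image, [horizontal_flip])
print(avg)
```

The age estimate is then computed from the averaged distribution in the same way as for a single image.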
Thank you for your suggestions. I tried augmenting the images by applying affine transformations (scaling, rotation, translation). For each image I created 50 copies, where one is the original and the others are augmented versions of it. In total I had around 200k pictures for validation. At the end I average the predictions and look at MAE again; it actually worsened the results by around 25%. My guess is that this comes from how the model was trained. Did you train it using augmented data as well?
Adding some interesting results: for the age groups (0,2), (4,6), (8,13), (15,20), (25,32), (38,43), (48,53), (60, 99) described in https://arxiv.org/pdf/1710.02985.pdf (DeX, p10), with the IMDB pretrained model validated on appa-real for apparent age estimation (not real age), I get 62.2% accuracy for exact match and 85.6% for 1-off.
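The exact-match and 1-off group accuracies mentioned above could be computed along these lines. This is only a sketch: the thread does not say how ages that fall in the gaps between groups (for example 3 or 7) were handled, so mapping them to the nearest group is an assumption, and the function names are hypothetical.

```python
AGE_GROUPS = [(0, 2), (4, 6), (8, 13), (15, 20), (25, 32), (38, 43), (48, 53), (60, 99)]

def group_index(age):
    """Index of the group containing age, or of the nearest group if age falls in a gap (assumption)."""
    def distance(bounds):
        low, high = bounds
        if low <= age <= high:
            return 0
        return min(abs(age - low), abs(age - high))
    return min(range(len(AGE_GROUPS)), key=lambda i: distance(AGE_GROUPS[i]))

def group_accuracies(true_ages, predicted_ages):
    """Return (exact %, 1-off %), where 1-off also accepts the neighbouring group."""
    diffs = [abs(group_index(t) - group_index(p)) for t, p in zip(true_ages, predicted_ages)]
    exact = 100.0 * sum(d == 0 for d in diffs) / len(diffs)
    one_off = 100.0 * sum(d <= 1 for d in diffs) / len(diffs)
    return exact, one_off

# Made-up ages, only to exercise the functions.
exact, one_off = group_accuracies([1, 27, 40, 70], [2, 36, 50, 10])
print(exact, one_off)
```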
The default model was not trained using data augmentation. Would you try this model? https://github.com/yu4u/age-gender-estimation/releases/download/v0.5/weights.28-3.73.hdf5
The above model was trained with augmentation:
width_shift_range=0.1, height_shift_range=0.1, horizontal_flip=True,
The augmentation you tried might be a little too strong. Please try the same augmentation as at training time, or a weaker one.
@yu4u thanks for sharing! Yes of course, I'm using imgaug for augmentation and so using Fliplr(.5) and affine translate_percent={"x": (-.1, .1), "y": (-.1, .1)} which by Keras definition if I understood it correctly should be equivalent to what you wrote.
Here are the results:
For 10 total copies, one original and the others augmented, I get:
For my last configuration, non-augmented I get:
The augmented version has less variance as you'd expect from bootstrapping, but still they are pretty comparable, which hints to me that I might've done something wrong with augmentation maybe?
Hmm, the TTA seems to be correct. The results indicate that TTA is not effective in this case. Currently I do not have a good approach other than training on "utk + appa-real-train".
I think training on 128x128 would also boost accuracy a bit. Unfortunately this is something that could take me about a month (already tried doing so on my machine) and probably will fail before. Thank you for all the help, highly appreciate it!
Something didn't sit right with me about the above results, and I just noticed that I had loaded the same pretrained non-augmented IMDB weights for both the augmented and non-augmented results. I corrected that, and these are the results:
There is definitely an improvement. With the augmented IMDB model on age groups above I get 67.5% accuracy on exact and 86.5% on 1-off.
Dear @idanh, how are you doing? I have a question: how did you get the plots shown above on this page? Do you run specific code?
Cheers
@LC9 Using seaborn and matplotlib with python
Hi @idanh
I trained a model with "utk + appa-real-train": https://github.com/yu4u/age-gender-estimation/tree/master/age_estimation
Please note that the new model is different from the previous models (different base CNN and the new one only predicts ages).
@yu4u Going to test it right away! A few questions first, so that I test it correctly:
Really appreciate your help, I'll post my results vs older model validated on appa real test + some contamination from IMDB to see if it generalized well.
Thank you!
Thank you for the immediate reply!
- Should I use augmentation and if so, is it the same?
No. Test-time augmentation might not be effective. BTW, I used different augmentation from the previous models. It can be found here: https://github.com/yu4u/age-gender-estimation/blob/master/age_estimation/generator.py#L12-L30
- What is the image size the model was trained on?
224x224
- Should I use the same margin and cropping?
Yes. I cropped faces with 40% margin. The cropped faces in the appa-real dataset can be used "as it is".
- RGB or BGR?
BGR
- (small edit :) InceptionResNetV2 or ResNet50?
ResNet50
Thanks!
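As a rough illustration of the 40% margin mentioned in the answers above, the crop can be sketched as follows. It assumes a detected face box (x1, y1, x2, y2) in pixel coordinates and that the margin is added on every side as a fraction of the box size; the function name and this exact convention are assumptions, so please check the repository scripts for the actual behaviour.

```python
import numpy as np

def crop_with_margin(image, x1, y1, x2, y2, margin=0.4):
    """Expand the face box by margin * box size on each side, clipped to the image."""
    img_h, img_w = image.shape[:2]
    w, h = x2 - x1, y2 - y1
    xw1 = max(int(x1 - margin * w), 0)
    yw1 = max(int(y1 - margin * h), 0)
    xw2 = min(int(x2 + margin * w), img_w)
    yw2 = min(int(y2 + margin * h), img_h)
    return image[yw1:yw2, xw1:xw2]

image = np.zeros((100, 100, 3), dtype=np.uint8)
center_crop = crop_with_margin(image, 40, 40, 60, 60)  # box fully inside the image
corner_crop = crop_with_margin(image, 0, 0, 20, 20)    # box touching the image border
print(center_crop.shape, corner_crop.shape)
```

The crop would then be resized to the model's input size (224x224 for the new model) before prediction.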
Thank you again @yu4u I started testing the model but I think something is off with my cropping. If that is not too much trouble, do you mind sharing your cropping code for UTK?
And regarding augmentation, I'm interested in why you think TTA won't be effective if the model was trained using augmentation?
I loaded the model using:
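from keras.applications import ResNet50
from keras.layers import Dense
from keras.models import Model

# ResNet50 backbone with global average pooling, plus a 101-way softmax over ages 0..100
base_model = ResNet50(include_top=False, weights='imagenet', input_shape=(224, 224, 3), pooling="avg")
prediction = Dense(units=101, kernel_initializer="he_normal", use_bias=False, activation="softmax",
                   name="pred_age")(base_model.output)
age_only_model = Model(inputs=base_model.input, outputs=prediction)
age_only_model.load_weights(age_only_weight_file)  # path to the downloaded weights file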
I started testing the model but I think something is off with my cropping.
Evaluation on UTK? In terms of evaluation on appa-real, there is no need for cropping.
If that is not too much trouble, do you mind sharing your cropping code for UTK?
Please refer to the following script. https://github.com/yu4u/age-gender-estimation/blob/master/utkface/create_db_utkface_with_margin.py
And regarding augmentation, I'm interested in why you think TTA won't be effective if the model was trained using augmentation?
From your evaluation results with TTA. Is there an improvement in accuracy?
Evaluation on UTK?
I'm evaluating on appa test + adding .1% from UTK. Thank you for the margin code, seems like we're doing exactly the same margin.
From your evaluation results with TTA. Is there an improvement in accuracy?
From the older model yes, it did improve (see above). I'll have more complete results (with and without augmentation) to share tomorrow. For now just by sampling pictures I knew had issues with older models, the current one is more certain about their age.
Hi, I took all test images from appa-real, to compare with above scores (same dataset) and plotted the results:
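For augmented images (20 augmentations: 1 original + 19 defined as follows):
from imgaug import augmenters as iaa

iaa.Sequential([
    iaa.Fliplr(0.1),  # horizontal flip with probability 0.1
    iaa.Sometimes(.5, iaa.Affine(
        scale=(.8, 1.2)  # random zoom between 80% and 120%, applied to half of the images
    )),
    iaa.Sharpen(alpha=(0, .2), lightness=(0.8, 1.2))
])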
For non-augmented images:
My process is as follows:
For the same age groups as before and with TTA, I see 67.3% exact and 85% 1-off accuracy, which is a bit lower than the older model.
I'd say that overall the results look better, maybe a bit noisier but with lower MAE. Can you think of anything that could be improved?
Thank you for your analysis.
Can you think of anything that could be improved?
Hmm... How about ensemble with previous models? There would be several things to be done in training (e.g. pre-training on IMDB, stronger augmentation). I cannot work on this for several days ;(
@yu4u I deeply thank you for your help and advice, it's been very helpful and very enjoyable! Please don't take my comments as TODOs; brainstorming through your project helped me learn and look at different approaches to solving this.
The results are very promising as well, and I'll continue on improving and updating either in a comment or where you think is the correct place to.
Ensembling is something I've considered, might just do it as last resort. Fine tuning is another thing I was about to try, so might just do that first. Thank you again and have a nice night!
@idanh
Hi, thanks for your excellent work. Did you apply any smoothing to your plots? They seem to show nice accuracy.
What I do is simply plot the points (real_age, estimated_age), which gives a lot of noisy points.
(test on appa validation dataset)
Hi @idanh, How's it going?
By using Adam optimizer and label augmentation, MAE is improved to 4.410. If you have time, please try the following model:
Hi @yu4u, All good, How are you doing? Yes of course, I'll share some results tomorrow. Can you share which label augmentations you used? thank you!
@alvenchen I used seaborn. I don't think/remember any smoothing being done.
Label augmentation is very simple but found to be effective. Simply add Gaussian noise to labels (ages):
https://github.com/yu4u/age-gender-estimation/blob/master/age_estimation/generator.py#L63
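A minimal sketch of the label augmentation described above: Gaussian noise is added to the integer age label before it is used as a training target. The standard deviation of 2 years, the rounding, and the clipping to 0..100 are assumptions made for illustration; the linked generator.py is the authoritative version.

```python
import numpy as np

def augment_age_label(age, rng, sigma=2.0, max_age=100):
    """Add rounded Gaussian noise to an age label and clip it to the valid bins."""
    noisy = age + int(np.rint(rng.normal(0.0, sigma)))
    return int(np.clip(noisy, 0, max_age))

rng = np.random.default_rng(0)
labels = [augment_age_label(30, rng) for _ in range(1000)]
print(min(labels), max(labels), sum(labels) / len(labels))
```

Each time a sample is drawn during training it can receive a slightly different label, which spreads the supervision over neighbouring ages.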
@yu4u Hey, I re-run the same analysis from before, and got the following:
Definitely an improvement over the last model, though my MAE is 4.96, which is a bit far from the results you were seeing. I've used the same TTA as above, and I peeked at your code and didn't see any changes there.
P.S. I wonder if the increase in performance is due to the label augmentation. More specifically, we saw that age isn't distributed uniformly in the datasets so augmenting intuitively should smooth the neighbourhood around the mode. Do you think it was the change in optimizer that accounted for difference?
And thank you for sharing the new model!
The improvement mainly comes from the label augmentation. Adam: MAE 5.511; Adam + label augmentation: MAE 4.410.
The label augmentation would prevent a model from overfitting. Simply using a Gaussian distribution as the target distribution instead of a one-hot vector might bring the same effect. The variance of the Gaussian distribution might have to vary from sample to sample according to the labelers' label variance (these are included in the annotation data).
The variance of the Gaussian distribution might have to vary from sample to sample according to the labelers' label variance (these are included in the annotation data).
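The Gaussian target idea mentioned above can be sketched like this, assuming 101 age bins (0..100). The function name is hypothetical, and sigma is a free parameter here; per the comment, it could eventually be set per sample from the annotators' label variance.

```python
import numpy as np

def gaussian_target(age, sigma, n_bins=101):
    """Discretized Gaussian centred on age, normalized to sum to 1 over the age bins."""
    ages = np.arange(n_bins)
    weights = np.exp(-0.5 * ((ages - age) / sigma) ** 2)
    return weights / weights.sum()

target = gaussian_target(age=30, sigma=3.0)
print(target.argmax(), target[28:33])
```

Such a vector could replace the one-hot target in a cross-entropy loss, so that neighbouring ages receive some probability mass.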
I might have missed that in your code, mind pointing where you do that?
Thanks. I'll try and see what's causing my code to get a worse MAE. Might be because I'm using 1% UTK images.
I might have missed that in your code, mind pointing where you do that?
Sorry for confusing you. That is possible future work to further improve the performance.
Could you please explain what exactly you mean by confidence, and how it affects the mean and the standard deviation of the accuracy? Thanks
Hi,
I've been doing some simple analysis over both pre-trained models, validating using appa-real dataset. Here's what I'm doing:
For each cropped and rotated image (i.e. those with the _face.jpg suffix):
Note that I'm using real_age as supplied by the appa dataset and not the expected value of apparent_age. In total I've sampled 1978 images, and I see that:
I'm attaching some plots I've done, where my goal is to find if either I'm doing something incorrectly or maybe the accuracy is just off a bit? Maybe best to re-train on another dataset?
Again, thank you @yu4u for this project, I'm assuming I'm at fault here. Would love to get your feedback.