Open amirhfarzaneh opened 6 years ago
First of all, I advise you to read this paper: https://arxiv.org/abs/1801.07698 It contains a summary of the L-Softmax/A-Softmax family of ideas, including a very nice plot of the ψ functions from several of these methods.
Notice SoftMax in that plot. It looks like most of the functions are monotonically decreasing simply because the base SoftMax has this property. So your question really becomes: why is SoftMax monotonically decreasing over the whole interval? By SoftMax here we mean the last linear layer followed by SoftMax normalization (this is crucial for what follows).
To make things easier to explain (and it is also better for metric learning), we L2-normalize all features before the final layer, and we also L2-normalize the weights of the final layer. Multiplying features and weights is then just cosine similarity. What does the output of such a distance look like? It is just the cosine function, where x is the angle and y is a value from -1 to 1.
In this cosine plot we are interested in the interval 0-180 degrees; it then looks exactly like the plot in the ArcFace paper. What does the angle on the x axis mean? It is a measure of the similarity between two vectors considering only their direction (since magnitude is normalized away). If the vectors are similar, the angle is ~0 degrees (and this is our aim in training: the feature representing a class should be very similar to the weight representing the same class); if they point in completely opposite directions (lie on the same line but point opposite ways), the angle is ~180 degrees. So being monotonically decreasing is a natural property of cosine similarity.
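To make the normalization step concrete, here is a minimal NumPy sketch (the shapes 128 and 10 are illustrative assumptions, not values from the papers). It shows that after L2-normalizing features and class weights, the logit matrix is pure cosine similarity, so each logit maps directly to an angle in [0, 180] degrees:

```python
import numpy as np

rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 128))   # 4 samples, 128-dim features (assumed sizes)
W = rng.standard_normal((128, 10))     # final-layer weights for 10 classes

# L2-normalize feature rows and weight columns.
feat_n = feat / np.linalg.norm(feat, axis=1, keepdims=True)
W_n = W / np.linalg.norm(W, axis=0, keepdims=True)

# With both sides normalized, the "logit" is cosine similarity in [-1, 1].
logits = feat_n @ W_n

# Equivalently, theta = arccos(logit) is the angle between feature and
# class weight, lying in [0, 180] degrees.
theta_deg = np.degrees(np.arccos(np.clip(logits, -1.0, 1.0)))
```

Smaller angle means higher similarity, which is exactly the x axis of the ψ plots discussed above.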
This is just one way of explaining, there are many more.
But we can still ask: what would happen if it were not monotonically decreasing?
From a theoretical point of view, suppose the function looked like the cosine function, but with x values from 0 to 180 degrees and y values traversing [1, 0, -1, 0, 1] (i.e., the cosine squashed to half the x axis, like cos(2θ)).
Our aim is maximize the value of similarity for same classes and minimize for different classes.
But this function has two maxima for the same class; which one should the network choose? It should choose the '1' at 0 degrees, because at 180 degrees the vector is completely different. Also, the minimum for different classes is at 90 degrees, so different classes would be pushed to keep some similarity between them (complete nonsense).
This means that a non-monotonically-decreasing ψ can produce local minima with very bad outputs, which could be very hard to escape. So it is better to design a function that is monotonically decreasing.
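The thought experiment above can be checked numerically. This sketch (my own illustration, not from the paper) compares cos(θ) with the "squashed" cos(2θ) on [0, 180] degrees, confirming that the squashed version has two maxima and an ambiguous optimum:

```python
import numpy as np

theta = np.linspace(0.0, np.pi, 181)   # angles 0..180 degrees, 1-degree steps

psi_cos = np.cos(theta)          # plain cosine: monotonically decreasing
psi_squashed = np.cos(2 * theta) # cosine squashed to half the x axis

# cos(theta) decreases over the whole interval...
assert np.all(np.diff(psi_cos) <= 0)

# ...but cos(2*theta) reaches its maximum 1 at BOTH 0 and 180 degrees,
# and its minimum -1 at 90 degrees, so "most similar" is ambiguous.
assert np.isclose(psi_squashed[0], 1.0)
assert np.isclose(psi_squashed[180], 1.0)
assert np.isclose(psi_squashed[90], -1.0)
```

The two equal maxima are exactly the bad local minima described above: gradient descent has no reason to prefer 0 degrees over 180.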
This is my explanation, which comes from studying this topic for a while. It is not perfect; it would take a whole blog post to explain all the ideas behind it.
Great explanation. In a nutshell, the increasing part of the curve has opposite gradients, which means that increasing segments will push features away from the class center!
I'm sure you don't want such a property...
@melgor Thank you for your thorough response. It clarified a lot of things. I just have this question: in the first plot, it seems that the author only draws cos(theta) in the range 0 to 180, but the target logit is ||W|| ||x|| cos(theta), so the logit depends not only on the cosine function but also on the product of ||W|| and ||x||.
@amirhfarzaneh
Yes, that should not be the target logit. We have already discussed this issue in https://github.com/happynear/AMSoftmax/issues/8 .
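To illustrate the point of that discussion: once both sides are L2-normalized, the magnitudes ||W|| and ||x|| drop out of the logit, and methods in this family multiply by a fixed scale instead (the value s = 64 below is a common choice in ArcFace-style implementations, used here as an assumption). A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(128)   # feature vector (assumed 128-dim)
w = rng.standard_normal(128)   # one class-weight column

# Unnormalized logit: ||W|| * ||x|| * cos(theta) -- mixes magnitude and angle.
raw_logit = x @ w

# After L2 normalization, only the angle remains...
cos_theta = (x @ w) / (np.linalg.norm(x) * np.linalg.norm(w))

# ...and a fixed scale s replaces the variable magnitudes.
s = 64.0
normalized_logit = s * cos_theta
```

So after normalization the logit is a function of the angle alone, which is why the ψ plots only need the 0-180 degree axis.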
Can anybody please explain why ψ should be monotonically decreasing for every interval?