ykwon0407 / UQ_BNN

Uncertainty quantification using Bayesian neural networks in classification (MIDL 2018, CSDA)

About the eq.4 #3

Closed ykwon0407 closed 5 years ago

ykwon0407 commented 5 years ago

Could you please let me know whether Eq. 4 in the paper is applicable to multi-label segmentation or only to binary segmentation?

_Originally posted by @redsadaf in https://github.com/ykwon0407/UQ_BNN/issues/1#issuecomment-437551961_

ykwon0407 commented 5 years ago

Dear redsadaf,

Thank you for your interest! Eq. 4 in the paper is defined for multi-label segmentation, so you can apply the equation not only to binary segmentation but also to multi-class segmentation problems. Please note that Eq. 4 produces a K by K matrix if there are K categories in your dataset.

In the case of binary classification (K = 2), Eq. 4 produces a 2 by 2 matrix. However, the two diagonal elements are equal to each other, and likewise the two off-diagonal elements are equal. Thus we obtain numeric values, not matrices, for the uncertainty maps.

Please let me know if you have any further questions, and I hope this is informative!

(I copied and pasted my reply from #1.)
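
For readers following along without the paper at hand, here is a sketch of the Eq. 4 estimator in the form used later in this thread (notation mine; p_hat_t denotes the softmax output of the t-th of T stochastic forward passes and p_bar their mean; see the paper for the exact statement):

\[
\widehat{\operatorname{Var}}(y^{*})
  = \underbrace{\frac{1}{T}\sum_{t=1}^{T}\Big(\operatorname{diag}(\hat{p}_t) - \hat{p}_t \hat{p}_t^{\top}\Big)}_{\text{aleatoric}}
  + \underbrace{\frac{1}{T}\sum_{t=1}^{T}\big(\hat{p}_t - \bar{p}\big)\big(\hat{p}_t - \bar{p}\big)^{\top}}_{\text{epistemic}},
  \qquad \bar{p} = \frac{1}{T}\sum_{t=1}^{T}\hat{p}_t .
\]

Each term is a K by K matrix, which is why the binary case collapses to a single number per pixel as described above.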

mongoose54 commented 5 years ago

@ykwon0407 Thank you for the reply. I am posting here my reply to follow the thread properly.

A couple of clarification questions:

  1. For multi-label segmentation with K classes, do we then need to perform the algebra with transposes?

  2. In your paper, the proof of Eq. 4 is in Appendix A. Is that correct?

ykwon0407 commented 5 years ago

@mongoose54 Hi~ Here is the point-by-point response.

  1. Yes, you do: you need to use the transpose. Please note that the resulting uncertainty matrices are K by K.

  2. Quite close, but not exactly. Appendix A shows the derivation of Eq. (2), which is the population version of the uncertainties. In contrast, Eq. (4) is an estimator of Eq. (2)!! That is, Eq. (4) is what we can actually compute from data, and its limit is Eq. (2), the variance of the variational predictive distribution.
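
In other words (a sketch in my own notation; see the paper for the precise statement), Eq. (2) is the law-of-total-variance decomposition of the variational predictive distribution,

\[
\operatorname{Var}_{q}(y^{*})
  = \underbrace{\mathbb{E}_{q}\big[\operatorname{diag}(p(\omega)) - p(\omega)p(\omega)^{\top}\big]}_{\text{aleatoric}}
  + \underbrace{\operatorname{Var}_{q}\big[p(\omega)\big]}_{\text{epistemic}},
\]

where p(omega) is the softmax output under weights omega drawn from the variational posterior q, and Eq. (4) is the Monte Carlo estimator obtained by replacing these expectations with averages over T sampled weight draws.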

mongoose54 commented 5 years ago

@ykwon0407 Thanks again for the wonderful explanation. Regarding the KxK matrix for K classes: what does each element represent (is it the degree of uncertainty between two classes), and what is the best way to get a single uncertainty value?

ykwon0407 commented 5 years ago

@mongoose54 Hello! :)

  1. The K by K matrix can be considered a proxy for the variance matrix of a multinomial distribution, so each element in the uncertainty matrix is nothing but the variance of one component of the outcome (diagonal) or the covariance between two components (off-diagonal). [Additional information for 1.] If you write the dependent variable Y in a one-hot encoding, then Y is a K-length vector and can be assumed to follow a multinomial distribution. You can find essentially the same thing in the following wiki: Multinomial Wiki

  2. First of all, as explained in the previous point, each element has a meaning, so picking a specific element may already give you useful information. Many summaries are possible, but the most interesting one is the sum of the diagonal elements of the aleatoric uncertainty matrix, which can be shown to behave very similarly to Shannon's entropy (see the small sketch below).
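
As a small numerical illustration of the "sum of diagonal elements" summary (my own example, not repo code): the trace of the aleatoric matrix equals the average Gini impurity of the sampled probability vectors, which behaves much like Shannon entropy as a single-number uncertainty score.

import numpy as np

T, K = 10, 3
p_hat = np.random.dirichlet(np.ones(K), size=T)       # hypothetical (T, K) softmax samples
p_bar = p_hat.mean(axis=0)
aleatoric = np.diag(p_bar) - p_hat.T.dot(p_hat) / T   # K by K aleatoric matrix

gini_like = np.trace(aleatoric)                        # = mean_t sum_k p_tk * (1 - p_tk)
entropy = -np.sum(p_bar * np.log(p_bar + 1e-12))       # Shannon entropy of the mean prediction
print(gini_like, entropy)                              # both increase as predictions become less confident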

ShellingFord221 commented 5 years ago

Sorry, why does Eq. 4 provide a K*K matrix when there are K classes?

ShellingFord221 commented 5 years ago

Besides, p_hat is just a list of 10 probabilities of a certain class (according to line 63 in /retina/utils.py), so why does it have a diagonal matrix?

ykwon0407 commented 5 years ago

@ShellingFord221 Hi~~ Here is the point-by-point response.

  1. Sorry, why does Eq. 4 provide a K*K matrix when there are K classes? -> If you are solving a K-class classification problem, a probability estimate (p_hat) is represented as a K-length vector. The proposed uncertainties, which can be considered a naive variance of that vector, are then nothing but a K by K matrix.

  2. Besides, p_hat is just a list of 10 probabilities of a certain class (according to line 63 in /retina/utils.py), so why does it have a diagonal matrix? -> If you run the code with the number of random draws T set to 10, then p_hat will be a (10,) numpy array. I am not sure what you mean by the diagonal matrix...

ShellingFord221 commented 5 years ago

The diagonal matrix is mentioned in Eq. 4 in your paper: diag(p_hat).

ShellingFord221 commented 5 years ago

emmm... p_hat should be a matrix of size (num_samples, num_classes) (i.e. (10, 3) in my settings)?

ykwon0407 commented 5 years ago

@ShellingFord221

The diagonal matrix is mentioned in Eq. 4 in your paper: diag(p_hat). -> Ah, I see. The diagonal matrix comes from the covariance matrix of the multinomial distribution. Please see the link above.

emmm... p_hat should be a matrix of size (num_samples, num_classes) (i.e. (10, 3) in my settings)? -> Yes, it is. Sorry for my binary classification code... (it assumes a lot.)

Let me clarify all the details below.

In /retina/utils.py

p_hat = np.array(p_hat) # line number 64
prediction = np.mean(p_hat, axis=0) # line number 67

p_hat should be a numpy array of size (num_samples, num_classes), and prediction should be a numpy array of size (num_classes,).

Then the aleatoric and epistemic matrices are computed as follows.

aleatoric = np.diag(prediction) - p_hat.T.dot(p_hat)/p_hat.shape[0]  # 3 by 3 matrix (I corrected an error here after the discussion with ShellingFord221 below)
tmp = p_hat - prediction  # 10 by 3 matrix
epistemic = tmp.T.dot(tmp)/tmp.shape[0]

Hope this information helps you!
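
To summarize the comment above, here is a small self-contained sketch (my own wrapper, not part of the repo) that packages the corrected formulas and shows the expected shapes:

import numpy as np

def predictive_uncertainty(p_hat):
    # p_hat: (num_samples, num_classes) array of softmax outputs from T stochastic forward passes
    p_hat = np.asarray(p_hat)
    T = p_hat.shape[0]
    prediction = p_hat.mean(axis=0)                                  # (num_classes,)
    aleatoric = np.diag(prediction) - p_hat.T.dot(p_hat) / T         # (num_classes, num_classes)
    epistemic = (p_hat - prediction).T.dot(p_hat - prediction) / T   # (num_classes, num_classes)
    return aleatoric, epistemic

# Example with T = 10 samples and K = 3 classes (random stand-ins for real predictions)
p_hat = np.random.dirichlet(np.ones(3), size=10)
aleatoric, epistemic = predictive_uncertainty(p_hat)
print(aleatoric.shape, epistemic.shape)   # (3, 3) (3, 3)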

ShellingFord221 commented 5 years ago

Thank you so much!!! But there is still one small question. In Eq. 4 of your paper the diag is of p_hat, but in your code above the diag seems to be of prediction (the mean of p_hat).

ShellingFord221 commented 5 years ago

And why should the dot product be divided by shape[0]? (p_hat.T.dot(p_hat)/prediction.shape[0])

ykwon0407 commented 5 years ago

@ShellingFord221 You're welcome! :)

1. In Eq. 4 of your paper the diag is of p_hat, but in your code above the diag seems to be of prediction (the mean of p_hat). -> Because averaging diag(p_hat_t) over the samples gives exactly diag(prediction), so the two expressions are the same.

2. And why should the dot product be divided by shape[0]? (p_hat.T.dot(p_hat)/prediction.shape[0]) -> In Eq. 4 we need to divide by the number of random samples T, so I divide p_hat.T.dot(p_hat) by prediction.shape[0].
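
A quick numerical check of point 1 (my own example, not repo code): by linearity, averaging the per-sample diagonal matrices diag(p_hat_t) gives exactly diag of the mean prediction.

import numpy as np

p_hat = np.random.dirichlet(np.ones(3), size=10)              # (T, K) = (10, 3)
mean_of_diags = np.mean([np.diag(p) for p in p_hat], axis=0)  # average of per-sample diagonal matrices
diag_of_mean = np.diag(p_hat.mean(axis=0))                    # diag of the mean prediction
print(np.allclose(mean_of_diags, diag_of_mean))               # True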

ShellingFord221 commented 5 years ago

Thanks again for your kind reply! But prediction.shape[0] is the number of classes, not the number of samples.

ykwon0407 commented 5 years ago

@ShellingFord221 You are right, my bad! It should be p_hat.shape[0], not prediction.shape[0]. I have corrected the code above as well. Thanks!!!

ShellingFord221 commented 5 years ago

The sum of the diagonal elements of the aleatoric uncertainty matrix is meaningful; is the sum of the diagonal elements of the epistemic uncertainty matrix meaningful too? Besides, does aleatoric uncertainty mean the uncertainty about the test data, and epistemic uncertainty the uncertainty about the model?

ykwon0407 commented 5 years ago

@ShellingFord221

The sum of the diagonal elements of the aleatoric uncertainty matrix is meaningful; is the sum of the diagonal elements of the epistemic uncertainty matrix meaningful too? -> I would say yes, roughly.

Besides, does aleatoric uncertainty mean the uncertainty about the test data, and epistemic uncertainty the uncertainty about the model? -> They are not exactly separated that way, but they can be interpreted like that.

ShellingFord221 commented 5 years ago

The claim that aleatoric uncertainty is the uncertainty about the test data and epistemic uncertainty is the uncertainty about the model also appears in this paper, Bayesian Convolutional Neural Networks with Variational Inference (the paragraph above Section 6, Experiments). But having read that code, I think the author mixes up the uncertainty calculation for binary and multi-label classification, so his results for the two uncertainties are single numbers rather than K*K matrices (Table 2 in his paper).

ShellingFord221 commented 5 years ago

Besides, if I want to calculate the overall uncertainty (i.e. the sum of the two uncertainties), should I:

  1. first compute the sum of the diagonal elements of the aleatoric matrix and the sum of the diagonal elements of the epistemic matrix, and then add these two numbers, or
  2. first add the aleatoric and epistemic matrices to get a final matrix, and then take the sum of the diagonal elements of that final matrix?

ykwon0407 commented 5 years ago

@ShellingFord221 Either way is fine!
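
Both orderings give the same number because the trace is linear; a tiny check with stand-in matrices (my own example):

import numpy as np

A = np.random.rand(3, 3)   # stand-in for an aleatoric matrix
E = np.random.rand(3, 3)   # stand-in for an epistemic matrix
print(np.isclose(np.trace(A) + np.trace(E), np.trace(A + E)))   # True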

ShellingFord221 commented 4 years ago

Hi, after half a year it seems I am confused about the code above again o(╥﹏╥)o. The diag of p_hat is averaged, but the p_hat.T.dot(p_hat) part seems to be only divided by the number of samples, whereas in Eq. 4 this part should first be summed over the T samples and then divided (i.e. first \sum_{t=1}^{T} p_hat_t p_hat_t^T, then divide by the number of samples). The situation is the same for the tmp.T.dot(tmp) part (it is also only divided by the number of samples in the code above, with no explicit sum over the T parts).

ykwon0407 commented 4 years ago

@ShellingFord221 Hi again! The dot product operation already performs the sum over the samples. Please see this link as well: https://docs.scipy.org/doc/numpy/reference/generated/numpy.dot.html
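
To make the point concrete (my own example, not repo code): the matrix product p_hat.T.dot(p_hat) already carries out the sum over the T samples that appears in Eq. 4, so only the division by T remains.

import numpy as np

p_hat = np.random.dirichlet(np.ones(3), size=10)         # (T, K) softmax samples
via_dot = p_hat.T.dot(p_hat)                              # K by K
via_explicit_sum = sum(np.outer(p, p) for p in p_hat)     # sum_t of p_t p_t^T, also K by K
print(np.allclose(via_dot, via_explicit_sum))             # True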