ziyin-dl / word-embedding-dimensionality-selection

On the Dimensionality of Word Embedding
https://nips.cc/Conferences/2018/Schedule?showEvent=12567
MIT License

Noise estimation in signal_matrix.py #17

Open JamesTuna opened 5 years ago

JamesTuna commented 5 years ago

The line `self.noise = np.std(diff) * 0.5` seems to be inconsistent with the Frobenius-norm term used in the paper, ||M1-M2||_F / (2*sqrt(mn)). Shouldn't it be `np.sum(diff**2)**0.5 / (2*n)` instead?

Btw, isn't m = n = vocabulary size? Under this assumption, what is the physical meaning of the noise in the paper? To me, it doesn't make sense to apply the same variance to every PMI entry. Can you help me understand?

JamesTuna commented 5 years ago

The line of code is in signal_matrix.py.

JamesTuna commented 5 years ago

I mean `np.sqrt(np.sum(diff**2)) / (2*n)`.

ziyin-dl commented 5 years ago

They are the same, right? Assuming (as in the paper) that the noise is i.i.d. zero-mean Gaussian for each entry of the PMI matrix, 0.5 * np.std(M1-M2) is the same as ||M1-M2||_F/(2*sqrt(mn)) = ||M1-M2||_F/(2n), which is the MLE for the standard deviation. (Yes, m = n for the PMI matrix.)

I would think of "noise" as the estimation error of the PMI matrix. The ij-th entry of the PMI matrix is a function of the co-occurrence probabilities: log(Pij / (Pi * Pj)). Note that these co-occurrence probabilities have to be estimated from the corpus -- we have to actually count the co-occurrences. So the PMI matrix we use (the one built from counted co-occurrences) is a noisy estimate of the true PMI matrix.

The noise being i.i.d. additive Gaussian is a choice we made in the model assumptions. At least in the asymptotic sense, I can argue that the noise has to take an additive Gaussian form: the counting process gives an efficient estimator, whose noise approaches a Gaussian distribution by the Hájek-Le Cam convolution theorem. Ideally, the most flexible model would assume that this gigantic additive noise follows an N(0, Σ) distribution (note Σ is actually n^2 by n^2!), with Σ estimated from data. But that is way too prohibitive; there is no way to estimate such a Σ. So instead, a reasonable model just assumes the additive noise follows N(0, σ²I), which is what we did.

JamesTuna commented 5 years ago

I don't think they are the same. Doesn't np.std(M) compute the standard deviation over all entries of the matrix M? If every entry of M1-M2 equals the same constant (all 1, all 0, or any other value), then np.std(M1-M2) returns 0. For example, if M1-M2 is the all-ones matrix, the Frobenius formula gives ||M1-M2||_F/(2n) = n/(2n) = 0.5, not 0. To compute the Frobenius norm, I think you meant to first take the sum of squares over all entries.
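
To make the discrepancy concrete, here is a quick numpy check of the all-ones case (the size n is arbitrary and purely illustrative):

```python
import numpy as np

n = 100
diff = np.ones((n, n))  # M1 - M2 with every entry equal to 1

print(0.5 * np.std(diff))                    # 0.0  (std of a constant matrix)
print(np.sqrt(np.sum(diff ** 2)) / (2 * n))  # 0.5  (Frobenius formula: n / (2n))
```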

ziyin-dl commented 5 years ago

Well, that won't happen if the noise is zero-mean (I believe the paper states that assumption), and this is a reasonable assumption because of the symmetry between M1 and M2. The only difference between the two formulas is whether we subtract the sample mean, correct? I believe the sample mean is effectively zero (if I remember correctly, it is on the order of 1e-5).

I'm pretty sure that if you cross-compare the results of the two methods on a reasonable corpus, the difference will be negligible.
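
For what it's worth, here is a minimal check with synthetic zero-mean Gaussian noise standing in for M1 - M2 (the 0.2 scale and the size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
diff = rng.normal(0.0, 0.2, size=(n, n))  # synthetic M1 - M2, zero mean

std_based  = 0.5 * np.std(diff)                    # what the code computes
frob_based = np.sqrt(np.sum(diff ** 2)) / (2 * n)  # what the paper writes

print(std_based, frob_based)  # both ~0.1; they differ only by the sample-mean term
```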

JamesTuna commented 5 years ago

Oh I see, thanks for the explanation. Another thing that confused me a lot is why the variance of the entries in M1 and M2 is double the variance of those in M. I know M uses twice as many samples to compute the frequencies, but I don't know how to prove the factor of two. The variance assumption is made on a random variable of the form V(x,y,N) = log(F(x,y)/(F(x)F(y))), where F is the frequency function and N is the sample size. How do we derive the relation between Var(V(x,y,N)) and Var(V(x,y,N/2))?

ziyin-dl commented 5 years ago

We are not using anything specific to PMI. The assumption is that M1 = M + E1 and M2 = M + E2, where M is the actual PMI matrix, M1 and M2 are the two estimates, and E1, E2 are their estimation noises. As a result, M1 - M2 = E1 - E2, so the variance of (M1-M2)_{ij} is twice that of (E1)_{ij} or (E2)_{ij} (again, assuming E1 and E2 are independent and identically distributed).
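
A quick sanity check of that doubling, with synthetic i.i.d. noise entries (scale chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(1)
E1 = rng.normal(0.0, 0.1, size=1_000_000)  # synthetic estimation noise for M1
E2 = rng.normal(0.0, 0.1, size=1_000_000)  # synthetic estimation noise for M2

print(np.var(E1 - E2), 2 * np.var(E1))     # both ~0.02
```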

JamesTuna commented 5 years ago

Sure, I know what you just said. I think you misunderstand my question. I know the variance of (M1-M2) is twice that of M1 (and M2), but I don't know why the variance of M1 is twice the variance of M. M and M1 don't seem to have any relationship except the sample size.

JamesTuna commented 5 years ago

To make my question clear: why is the variance of E1 (or E2) equal to 2σ², where σ² is the variance for the matrix computed using all the samples?

ziyin-dl commented 5 years ago

For any efficient estimator, the noise standard deviation decreases by a factor of sqrt(2) when the sample size doubles, and M has twice as much data as M1.
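
As a quick illustration with the sample mean as the efficient estimator (the Exp(1) distribution and the sizes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
trials, n = 4_000, 500

# Sample mean of n vs 2n iid Exp(1) draws, repeated over many trials.
half = rng.exponential(1.0, size=(trials, n)).mean(axis=1)
full = rng.exponential(1.0, size=(trials, 2 * n)).mean(axis=1)

print(np.var(half) / np.var(full))  # ~2: twice the data halves the variance
```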


JamesTuna commented 5 years ago

Thanks, I will think through the variance problem later. Regarding the matrix symmetry problem in the class MonteCarloEstimator, I have several questions.

  1. Why do you generate X as (U_gen * D_gen).dot(V_gen.T)? V_gen seems redundant, and not quite right: once X is symmetric, the form should be (U_gen * D_gen).dot(U_gen.T). I understand X as the clean signal matrix in the paper, but if U_gen and V_gen are generated separately, the result won't necessarily be symmetric, so X won't correspond to a true unobserved PMI matrix.

  2. Same problem when generating the noise matrix E: E = np.random.normal(0, sigma, size=shape) does not preserve symmetry either (a sketch of a symmetric construction follows below).
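
For reference, here is one way to enforce symmetry in both the signal and the noise (a minimal sketch, not the repo's actual code: U_gen, D_gen, sigma, and shape echo the names in the issue; the sizes and the QR step are purely illustrative):

```python
import numpy as np

n, k = 100, 10
rng = np.random.default_rng(3)

# Symmetric signal: reuse U_gen on both sides, X = U diag(d) U^T.
U_gen = np.linalg.qr(rng.normal(size=(n, k)))[0]  # orthonormal columns
D_gen = rng.uniform(1.0, 5.0, size=k)             # spectrum
X = (U_gen * D_gen).dot(U_gen.T)                  # X == X.T by construction

# Symmetric noise: symmetrize a Gaussian matrix; dividing by sqrt(2)
# keeps the variance of each off-diagonal entry at sigma**2.
sigma, shape = 0.1, (n, n)
G = rng.normal(0.0, sigma, size=shape)
E = (G + G.T) / np.sqrt(2)                        # E == E.T

print(np.allclose(X, X.T), np.allclose(E, E.T))   # True True
```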

JamesTuna commented 5 years ago

For the variance-of-error problem, I have trouble thinking through even a simple constructed problem. Suppose X1, X2, ... are iid random variables. Consider two cases:

1) Y(n) = (X1 + X2 + ... + Xn) / n
2) Z(n) = log Y(n)

For 1) it's easy to show Var(Y(n)) = 2 Var(Y(2n)), using Var(A+B) = Var(A) + Var(B) for independent A, B and Var(cA) = c^2 Var(A).

But for 2), how can I show Var(Z(n)) = 2 Var(Z(2n))? I even doubt whether it holds.

This is critical for me to understand the noise estimation in the paper, because I think when we talk about the variance relation between two random variables, we need to consider not only the sample size but also the functional form of those variables.

If you can help me prove mathematically that Var(M) = 0.5 Var(M1), I would really appreciate it.

ziyin-dl commented 5 years ago

Under local asymptotic normality this is just the delta method, right?

https://en.wikipedia.org/wiki/Delta_method
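
Concretely, for the toy problem above the delta method gives Var(log Y(n)) ≈ Var(Y(n)) / E[Y(n)]^2, so halving the sample size doubles the asymptotic variance. A minimal Monte Carlo check (the Exp(1) distribution and sizes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)
trials, n = 4_000, 500

# Z(n) = log of the sample mean of n iid Exp(1) variables (mean 1).
Z_n  = np.log(rng.exponential(1.0, size=(trials, n)).mean(axis=1))
Z_2n = np.log(rng.exponential(1.0, size=(trials, 2 * n)).mean(axis=1))

# Delta method: Var(Z(n)) ~ Var(Y(n)) / mu^2 = 1/n here, so the ratio -> 2.
print(np.var(Z_n) / np.var(Z_2n))  # ~2
```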
