zackchase / mxnet-the-straight-dope

An interactive book on deep learning. Much easy, so MXNet. Wow. [Straight Dope is growing up] ---> Much of this content has been incorporated into the new Dive into Deep Learning Book available at https://d2l.ai/.

mistake in Bayes by Backprop from scratch #564

Open Toooodd opened 5 years ago

Toooodd commented 5 years ago

```python
def evaluate_accuracy(data_iterator, net, layer_params):
    numerator = 0.
    denominator = 0.
    for i, (data, label) in enumerate(data_iterator):
        data = data.as_in_context(ctx).reshape((-1, 784))
        label = label.as_in_context(ctx)
        output = net(data, layer_params)
        predictions = nd.argmax(output, axis=1)
        numerator += nd.sum(predictions == label)
        denominator += data.shape[0]
    return (numerator / denominator).asscalar()
```

I think layer_params should not be held fixed when you use the model for prediction; the sampled weights should change on every forward pass.
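For concreteness, here is a minimal sketch of what per-batch re-sampling could look like. The helper name `sample_layer_params` and the `mus`/`rhos` lists are hypothetical (the notebook keeps its variational parameters under different names), and `ctx` is the global context already used in the function above:

```python
from mxnet import nd

def sample_layer_params(mus, rhos):
    # Softplus link used in Bayes by Backprop: sigma = log(1 + exp(rho)),
    # then w = mu + sigma * epsilon with epsilon ~ N(0, 1).
    params = []
    for mu, rho in zip(mus, rhos):
        sigma = nd.log(1.0 + nd.exp(rho))
        epsilon = nd.random.normal(shape=mu.shape, ctx=mu.context)
        params.append(mu + sigma * epsilon)
    return params

def evaluate_accuracy_sampled(data_iterator, net, mus, rhos):
    numerator = 0.
    denominator = 0.
    for data, label in data_iterator:
        data = data.as_in_context(ctx).reshape((-1, 784))
        label = label.as_in_context(ctx)
        layer_params = sample_layer_params(mus, rhos)  # fresh draw per batch
        output = net(data, layer_params)
        predictions = nd.argmax(output, axis=1)
        numerator += nd.sum(predictions == label)
        denominator += data.shape[0]
    return (numerator / denominator).asscalar()
```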

ykun91 commented 5 years ago

I guess the author did that on purpose. When evaluating accuracy on a classification problem, we just take the arg max of the output mean as the network's answer and ignore the output variance, so the author skipped the weight sampling and made the network output only the mean.

Toooodd commented 5 years ago

you are right, Yang. But I still think that when we predict, we should keep the perturbation term on W, compute the output multiple times, and average the results. That is more in line with the original intent of the article.
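As a sketch, that Monte Carlo averaging could look like this (again using the hypothetical `sample_layer_params` helper from the sketch above; `nd.softmax` turns the net's output scores into class probabilities before averaging):

```python
def predict_mc(data, net, mus, rhos, n_samples=10):
    # Average class probabilities over several independent weight draws,
    # then pick the class with the highest mean probability.
    probs = None
    for _ in range(n_samples):
        layer_params = sample_layer_params(mus, rhos)
        p = nd.softmax(net(data, layer_params))
        probs = p if probs is None else probs + p
    return nd.argmax(probs / n_samples, axis=1)
```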

ykun91 commented 5 years ago

yeah but in my opinion, the mean of the network output is determined by the μ parameter of the network, and the variance of the output is determined by the σ parameter, where σ = log(1 + exp(ρ)). We can get the average simply by disabling weight sampling and predicting once with μ alone.

If you use μ + σ·ε to predict multiple times and then average, I think the average will just converge back to μ, so it may be pointless to do it that way, I think...
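A quick numeric check of that convergence claim at the level of a single weight (a self-contained sketch, not from the notebook):

```python
from mxnet import nd

mu, rho = 0.5, -1.0
sigma = nd.log(1.0 + nd.exp(nd.array([rho]))).asscalar()  # softplus(rho), ~0.31
draws = mu + sigma * nd.random.normal(shape=(100000,))    # mu + sigma * eps
print(nd.mean(draws).asscalar())                          # ~0.5, i.e. back to mu
```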

Toooodd commented 5 years ago

yeah, you are absolutely right, and I take your point. What I want to say is that sampling at prediction time may be more in line with the original intent of the article, and it shows the advantage of this method when predicting on unseen data and plotting the results. haha, nice to meet u, Yang! you are so active :)

ykun91 commented 5 years ago

nice to meet u too :) And I think the problem is: if you want to take advantage of the σ·ε term in a classification problem, you need an accuracy-evaluation method that takes the variance into account. For example, if the network outputs its top answer with large variance and its runner-up with small variance, take the runner-up as the final answer.
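One hedged sketch of such a variance-aware rule, reusing the hypothetical `sample_layer_params` helper: score each class by its mean probability across weight draws minus a penalty on its variance, so a low-variance runner-up can beat a high-variance top answer.

```python
def predict_variance_aware(data, net, mus, rhos, n_samples=10, penalty=1.0):
    # Stack per-draw class probabilities: shape (n_samples, batch, classes).
    samples = nd.stack(*[nd.softmax(net(data, sample_layer_params(mus, rhos)))
                         for _ in range(n_samples)])
    mean = nd.mean(samples, axis=0)               # per-class mean probability
    var = nd.mean((samples - mean) ** 2, axis=0)  # per-class variance
    # Penalize uncertain classes; penalty is a tunable trade-off knob.
    return nd.argmax(mean - penalty * var, axis=1)
```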

Toooodd commented 5 years ago

that's a great solution, and I suddenly realized you are right from both a practical and an academic perspective.