xfim / ggmcmc

Graphical tools for analyzing Markov Chain Monte Carlo simulations from Bayesian inference

stat_bin warning #12

Closed dmenne closed 11 years ago

dmenne commented 11 years ago

Plotting a histogram produces many ugly messages:

stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

I know these are not critical, but it would be better if they were suppressed by providing a fixed binwidth.

zmjones commented 11 years ago

It would be easy to just suppress the warning internally. Alternatively, an argument could be added to the function, with range/30 as the default and the option for the user to override it.
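A minimal sketch of the second option, assuming hypothetical helpers `default_binwidth()` and `plot_histogram()` that are not part of ggmcmc. It only illustrates the idea of always passing an explicit binwidth (range/30 unless the user overrides it) so that stat_bin has nothing to report:

```r
# Hypothetical helper (not part of ggmcmc): reproduce the range/30 default
# explicitly so that a binwidth is always supplied to geom_histogram().
default_binwidth <- function(x, bins = 30) {
  diff(range(x, na.rm = TRUE)) / bins
}

# Hypothetical wrapper: the user may override binwidth; otherwise range/30.
plot_histogram <- function(x, binwidth = default_binwidth(x)) {
  if (!requireNamespace("ggplot2", quietly = TRUE)) {
    stop("ggplot2 is required for plotting")
  }
  ggplot2::ggplot(data.frame(value = x), ggplot2::aes(x = value)) +
    ggplot2::geom_histogram(binwidth = binwidth)
}

draws <- seq(0, 3, length.out = 301)  # stand-in for posterior draws
print(default_binwidth(draws))        # binwidth that would be used by default
```

Calling `plot_histogram(draws)` uses the computed default, while `plot_histogram(draws, binwidth = 0.5)` overrides it.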

dmenne commented 11 years ago

Any chance to have this minor issue corrected?

zmjones commented 11 years ago

The maintainer has been MIA for a bit. I added a default and pushed it to my fork, which you can install from its repo with devtools.

dmenne commented 11 years ago

Thanks, zmjones. I had already corrected it in my local installation, but it was a bit of a PITA to keep my colleagues' installations up to date.

jucor commented 11 years ago

The problem here is how to set the binwidth for each variable independently. That's quite cumbersome, if I'm not mistaken...

dmenne commented 11 years ago

kohske's solution on

http://stackoverflow.com/questions/17271968/different-breaks-per-facet-in-ggplot2-histogram

is not really simple, but quite general, and he knows this stuff.

I have an internal (lattice) version that does it slightly differently: it forces the same bin width and the same scale for all subitems, e.g. v_0, v_1. In Stan output these are easy to find, but for a general solution one should do some blocking along the lines of

unique(grep part before [])

This choice is much more helpful than individual scaling, because it makes comparison easier.
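A minimal sketch of that blocking step in base R, assuming parameter names of the form `name[index]`. The vector `parameter_names` and the helper `param_family()` are hypothetical and only illustrate the idea:

```r
# Hypothetical parameter names, as they might appear in Stan or JAGS output.
parameter_names <- c("v[1]", "v[2]", "sigma", "beta[1,1]", "beta[1,2]")

# Drop the index part: everything from the first "[" onwards.
param_family <- function(x) sub("\\[.*$", "", x)

# One entry per family; parameters in a family would share range and binwidth.
families <- unique(param_family(parameter_names))
print(families)
```

Names without brackets, such as `sigma`, form a family of their own.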

xfim commented 11 years ago

I'm not a big fan of setting independent binwidths for each variable. The purpose of ggs_histogram() is to give a sense of the overall distribution of the posterior. Basically, the idea is to check convergence by looking for non-normal anomalies in the histogram. The aim is not to produce a fine-grained histogram of each variable separately.

One of the nice things about ggplot2 is the possibility of adding layers. In this case, if the user simply wants a different binwidth, it is as simple as adding "+ geom_histogram(binwidth=X)" to the ggs_histogram() call:

ggs_histogram(S) + geom_histogram(binwidth=X)

So my preferred way would be to suppress the messages (as suggested by zmjones) and use the implicit ggplot2 default (range/30), leaving the user the possibility of explicitly changing it by adding another layer.

jucor commented 11 years ago

When your variables have different units or ranges, a single binwidth for all variables is unlikely to display many of them correctly. I think it will therefore be difficult to check for convergence as you suggest, Xavier: you will likely not be able to see the anomalies in most variables, since all points may be merged into a single bin due to an inadequate binwidth -- at least if I understand correctly what you suggest.
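A small base R sketch of that concern, using made-up data and a simplified binning rule (bins starting at 0, which is not necessarily how stat_bin places its breaks). The names `wide`, `narrow`, and `count_bins()` are hypothetical:

```r
# Two made-up variables on very different scales.
wide   <- seq(0, 300, length.out = 301)
narrow <- seq(0, 1, length.out = 301)

# One common binwidth derived from the pooled range (range/30).
binwidth <- diff(range(c(wide, narrow))) / 30

# Simplified binning: count how many distinct bins each variable occupies.
count_bins <- function(x, binwidth) length(unique(floor(x / binwidth)))

print(count_bins(wide, binwidth))    # many bins: the shape is visible
print(count_bins(narrow, binwidth))  # a single bin: the shape is lost
```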

dmenne commented 11 years ago

Agreed; as long as the warnings are gone, I am happy. My internal (lattice) alternative does not use independent binwidths, but common ones for each type of variable (e.g. all v[1], v[2] have common ranges/widths). This gives a much better overview; I remember a case where I had forgotten to add data for one variable, so the posterior = prior, but I did not notice it for some time because of the separate scaling.

jucor commented 11 years ago

Yes, scaling by type of variable sounds like the best of both worlds!

xfim commented 11 years ago

Scaling by family of variable should be trivial (using your notation):

ggs_histogram(S, family="v\\[") + xlim(c(a, b))

Where you can set 'a' and 'b' to be any values that you want.

Again, the idea is to make ggmcmc simple but robust, so as to allow the user to build on it. In this sense, the objects returned by the ggs_...() functions are intended to be minimal, leaving the rest to the user's post-processing.

dmenne commented 11 years ago

Good point; I should have thought of the fact that ggs_histogram is nothing other than a disguised q/ggplot. I suggest adding this as an example.

xfim commented 11 years ago

Ok, I will add it to the documentation.

I'm currently working on a lower level for the implementation of histograms. In fact, geom_histogram() is itself a shortcut for geom_bar().

xfim commented 11 years ago

In the end, allowing ggs_histogram() to use a pre-specified number of bins has turned out to be harder than I expected. The code could be improved so that it does not need to unlist and use a nested **ply function. But it works and does the job (and suppresses the warning messages, which in the end was the objective).

Solved by commit #6ce184ddc.

jucor commented 11 years ago

Well done!