weecology / macroecotools

Tools for Macroecological Analyses Using Python
Other
9 stars 13 forks source link

Upper truncation affects parameter range #29

Closed rueuntal closed 8 years ago

rueuntal commented 8 years ago

Because some of our distributions are upper-truncated, there wouldn't be any issue with convergence, and the parameters could exceed the bounds necessary in the original distributions. For the existing distributions, I think this is an issue for the upper-truncated geometric and the upper-truncated logseries. trunc_logser_solver() accounts for cases where p > 1 but probably is not written in the most elegant or reliable way. None of the functions for the truncated geometric takes into account cases where p < 0.

davharris commented 8 years ago

Not sure I understand. Are there cases for exponential family distributions (eg geometric, negative binomial) where p>1 or p<0 are desirable?

If not, log and logit transforms can automatically constrain the parameter ranges.

Dave

On Feb 15, 2016, at 14:08, Xiao Xiao notifications@github.com wrote:

Because some of our distributions are upper-truncated, there wouldn't be any issue with convergence, and the parameters could exceed the bounds necessary in the original distributions. For the existing distributions, I think this is an issue for the upper-truncated geometric and the upper-truncated logseries. trunc_logser_solver() accounts for cases where p > 1 but probably is not written in the most elegant or reliable way. None of the functions for the truncated geometric takes into account cases where p < 0.

— Reply to this email directly or view it on GitHub.

rueuntal commented 8 years ago

The current version of negative binomial is not upper-truncated (most likely because it would be too difficult to deal with) so this wouldn't be an issue. For the logseries, we've run into cases where the MLE p > 1 when it's upper-truncated. I'm not sure about the geometric, but this may account for some of the convergence problems in the solver, if there have been any.

rueuntal commented 8 years ago

This actually occurred to me while I was trying to implement the logit transformation to logseries - "Why is the upper bound for p at 2 instead of 1? Oh right, because it could be larger than 1." (line 511 in macroeco_distributions) For the SADs that we looked at for the MaxEnt projects p almost always hovers around 1, either below (most cases) or above (only rarely), but I think hypothetically it could take any positive value given the upper bound on abundance.

davharris commented 8 years ago

Still confused, sorry. Are we talking about a truncated version of the PMF on this page? https://en.wikipedia.org/wiki/Negative_binomial_distribution

Negative values of p or of 1-p seem like they'd break all sorts of things.

rueuntal commented 8 years ago

Sorry about the confusion, negbin is fine (and we have the untruncated version). On the other hand p < 0 for the geometric and p > 1 for the logseries would both lead to increasing pmfs with abundance, but they'd still be valid because of the upper-truncation. Does that make a bit more sense?

davharris commented 8 years ago

Still confused. I get negative probabilities when p < 0 for the geometric distribution.

p = -.1 k = seq(0, 10) plot((1 - p)^k * p)

image

rueuntal commented 8 years ago

That's because there is a constant p in the expression. When there's truncation, it's replaced with another constant that ensures proper normalization.

p = -0.1
k = seq(0, 10)
norm_factor = sum((1 - p)^k)
plot((1-p)^k / norm_factor)
davharris commented 8 years ago

I don't personally like the idea of calling that a geometric distribution--it breaks the coin-flipping story behind it and the connection to the binomial and negative binomial. It also isn't obvious to me that the optimization problem is convex anymore, so there might be local optima

Is this common in the literature?

Dave

On Feb 15, 2016, at 14:51, Xiao Xiao notifications@github.com wrote:

That's because there is a constant p in the expression. When there's truncation, it's replaced with another constant that ensures proper normalization.

p = -0.1 k = seq(0, 10) norm_factor = sum((1 - p)^k) plot((1-p)^k / norm_factor) — Reply to this email directly or view it on GitHub.

rueuntal commented 8 years ago

You are right, the distribution defined this way doesn't have the nice Bernoulli interpretation anymore, though I'd like to argue that the pmfs still follow a geometric sequence :)

I don't think it's commonly used in literature for SADs. John Harte's METE predicts upper-truncated logseries, which is probably why we put an upper bound for geometric as well in the first place. Neither got implemented in Elita's paper though, so this is just a more general issue for macroeco_distributions not directly affecting our work at hand.

As for maximum likelihood, I feel pretty sure there is a single solution for each combination of (S, N, Nmax). The code below is a bit messy but hopefully illustrates the point. For simplicity I'm assuming Nmax = N, and the MLE solution is where the function crosses zero.

ll_solver = function(S, N, p){
  out = S/p + S/((1-p)^N - 1) * N * (1-p)^(N - 1) - (N - S) / (1 - p)
  return(out)
}

S = 10
N = 1000
p = seq(-.9999, .9999, .001)
plot(p, ll_solver(S, N, p), pch = 19, type = 'l', lwd = 2, col = 'red',
     ylim = c(-10000, 10000), xlim = c(-0.1, 0.1))

S = 20
lines(p, ll_solver(S, N, p), pch = 19, type = 'l', lwd = 2, col = 'green')

S = 100
lines(p, ll_solver(S, N, p), pch = 19, type = 'l', lwd = 2, col = 'blue')

S = 2
lines(p, ll_solver(S, N, p), pch = 19, type = 'l', lwd = 2, col = 'purple')

legend('bottomleft', c('S = 2', 'S = 10', 'S = 20', 'S = 100'), 
       col = c('purple', 'red', 'green', 'blue'), lwd = 2)
abline(h = 0, lty = 'dashed')
abline(v = 0, lty = 'dashed')

By the way, how do I insert figures here?

davharris commented 8 years ago

The easiest way to insert a figure is to copy it to your computer’s clipboard and paste it into the text box. Github automatically adds the Markdown incantation for a picture link and uploads the image to a server. Alternatively, if the image is already on a server, you can use Markdown syntax for images (same as the link syntax but with an exclamation point at the beginning).

I don’t mind adding an upper bound for the geometric distribution, but I still think we should use a different word to describe it if we change the bounds on p.

On Feb 15, 2016, at 9:13 PM, Xiao Xiao notifications@github.com wrote:

You are right, the distribution defined this way doesn't have the nice Bernoulli interpretation anymore, though I'd like to argue that the pmfs still follow a geometric sequence :)

I don't think it's commonly used in literature for SADs. John Harte's METE predicts upper-truncated logseries, which is probably why we put an upper bound for geometric as well in the first place. Neither got implemented in Elita's paper though, so this is just a more general issue for macroeco_distributions not directly affecting our work at hand.

As for maximum likelihood, I feel pretty sure there is a single solution for each combination of (S, N, Nmax). The code below is a bit messy but hopefully illustrates the point. For simplicity I'm assuming Nmax = N, and the MLE solution is where the function crosses zero.

ll_solver = function(S, N, p){ out = S/p + S/((1-p)^N - 1) * N * (1-p)^(N - 1) - (N - S) / (1 - p) return(out) }

S = 10 N = 1000 p = seq(-.9999, .9999, .001) plot(p, ll_solver(S, N, p), pch = 19, type = 'l', lwd = 2, col = 'red', ylim = c(-10000, 10000), xlim = c(-0.1, 0.1))

S = 20 lines(p, ll_solver(S, N, p), pch = 19, type = 'l', lwd = 2, col = 'green')

S = 100 lines(p, ll_solver(S, N, p), pch = 19, type = 'l', lwd = 2, col = 'blue')

S = 2 lines(p, ll_solver(S, N, p), pch = 19, type = 'l', lwd = 2, col = 'purple')

legend('bottomleft', c('S = 2', 'S = 10', 'S = 20', 'S = 100'), col = c('purple', 'red', 'green', 'blue'), lwd = 2) abline(h = 0, lty = 'dashed') abline(v = 0, lty = 'dashed') By the way, how do I insert figures here?

— Reply to this email directly or view it on GitHub https://github.com/weecology/macroecotools/issues/29#issuecomment-184476361.

rueuntal commented 8 years ago

Thanks!

I spent some time to look at the behavior of the solver for the truncated geometric and it turned out that I was wrong - p is always > 0 for the well-defined cases where the upper bound is higher than total abundance N and S is at least 2. Assuming that we and others don't throw bizarre values at the function, we don't have to change the bounds on p (and the name of the distribution). What do you think?

And what about the log-series - do you think that having p > 1 not log-series anymore? Any suggestions for transformation when p could be > 1?

davharris commented 8 years ago

Keeping the bounds for the geometric distribution sounds good to me.

I don’t know enough about the log series to have a strong opinion, but I’d lean against using the name when p>1.

On Feb 16, 2016, at 8:47 PM, Xiao Xiao notifications@github.com wrote:

Thanks!

I spent some time to look at the behavior of the solver for the truncated geometric and it turned out that I was wrong - p is always > 0 for the well-defined cases where the upper bound is higher than total abundance N and S is at least 2. Assuming that we and others don't throw bizarre values at the function, we don't have to change the bounds on p (and the name of the distribution). What do you think?

And what about the log-series - do you think that having p > 1 not log-series anymore? Any suggestions for transformation when p could be > 1?

— Reply to this email directly or view it on GitHub https://github.com/weecology/macroecotools/issues/29#issuecomment-184969321.

rueuntal commented 8 years ago

I don't have any objection against giving it a different name, feel free to make changes if you have one in mind! But I wonder if it would be better 1. to make changes to docstrings and keep the function names unchanged (so that it doesn't break the code calling the functions), and 2. not to divide cases with p>1 and p<1 into separate functions (since they are of the same form with no abrupt change in between).

ethanwhite commented 8 years ago

My feeling is that giving p>1 a different name isn't a worthwhile change at the moment. It's an edge case and one you can't really figure out if it applies until you've done the fitting. So, we'd have to have a different name for the truncated distribution, and that ship has kind of sailed for the moment since Harte and everyone building off his work was been referring to it that way. Changing it in the software before some larger discussion of renaming will just make things confusing.

davharris commented 8 years ago

Okay. As long as it's clear when the bounds on p differ from the standard pmf, I have no objections.

rueuntal commented 8 years ago

Thanks, Dave & Ethan! I've add notes to trunc_logser_gen and trunc_logser_solver.