Closed rueuntal closed 8 years ago
My current feelings about pln:
Correctness:
Length/complexity:
LogPMF:
Thanks for the insights! I don't know enough about Gauss-Hermite quadrature to comment, but I'm with you on shorter code is better code.
It would definitely be good to double check. I'll leave it as it is for now, and come back to it when the other more preeminent issues are fixed.
I've read through the code and all of the "code" changes look good to me. I haven't checked the optimizer changes or the equations for the pmfs since @davharris will be able to do a better job with those anyway.
@davharris - when you're happy with this things I think this is ready to go in.
Thanks to both of your for all of your work on this.
I don't see any problems. Should we write a few tests before merging, though?
Sorry for being the time-limiting step here. I did some rudimentary tests with made-up numbers to make sure that the functions returned the same outputs before/after the changes, but it's probably a good idea to do it more systematically. We already have the tests for pmf and cdf of PLN, gleaned from the original paper. @davharris what would you suggest for the other distributions and the solvers? Should we try to replicate the functions in R?
Replicating in R sounds good to me, unless it substantially complicates the testing infrastructure.
Dave
On Mar 7, 2016, at 12:56 PM, Xiao Xiao notifications@github.com wrote:
Sorry for being the time-limiting step here. I did some rudimentary tests with made-up numbers to make sure that the functions returned the same outputs before/after the changes, but it's probably a good idea to do it more systematically. We already have the tests for pmf and cdf of PLN, gleaned from the original paper. @davharris what would you suggest for the other distributions and the solvers? Should we try to replicate the functions in R?
— Reply to this email directly or view it on GitHub.
Looks like there are still numeric problems with the negative binomial solver.
import macroeco_distributions as md
ab = [42L, 21L, 7L, 2L, 21L, 2L, 3L, 1L, 3L, 7L, 2L, 8L, 6L, 7L,
2L, 2L, 1710L, 38L, 8L, 2L, 5L, 72L, 50L, 6L, 208L, 6L, 24L,
4L, 3L, 1L, 2L, 1L, 40L, 52L, 503L, 50L, 47L, 232L, 100L, 30L,
95L, 8L, 4L, 67L]
md.nbinom_lower_trunc_solver(ab)
md.nbinom_lower_trunc_solver(ab)
/Users/davidharris/anaconda/lib/python2.7/site-packages/macroeco_distributions/macroeco_distributions.py:611: RuntimeWarning: overflow encountered in exp
return -nbinom_lower_trunc_ll(ab, exp(x[0]), expit(x[1]))
/Users/davidharris/anaconda/lib/python2.7/site-packages/scipy/stats/_discrete_distns.py:167: RuntimeWarning: invalid value encountered in subtract
coeff = gamln(n+x) - gamln(x+1) - gamln(n)
(6.9645032936894593e-16, 1.8722355604094302e-09)
I don't know why scipy's optimizers get stuck in these low-likelihood areas. My R code finds the optimum at n=1.84667069193733e-06, p=0.00202186468733917, which is about 50 log-likelihood points better. Maybe we need better starting values for these extreme cases?
It does look like the starting values aren't very good for the case I posted in my previous comment. Using the code I linked to, I get a starting value of 0.99887 as the starting value for p0, when the optimum as at p=0.0020.
I've also confirmed that---at least for this abundance vector---I get good results starting p0 closer to 0.001 instead of .99887.
The starting value is consistent with that in fitdistr()
in R though. Given how sensitive this optimization is to the starting value, I'm not sure if fixing the starting value at 0.001 would not cause problems when p is at the other end.
The negative binomial in R also takes a different form (mu, size instead of n, p). Do you think that would make any difference (making it easier or harder for the optimizer)?
Okay, thanks. I was wondering where those starting values came from. To clarify, I wasn't suggesting that we fix the starting value there, just noting that it worked better as a starting value for this particular problem.
Given that the negative binomial is known to be hard to optimize, I wouldn't be opposed to trying several starting values and returning the best set of results.
I don't know which parameterization would be better in general. We could probably figure it out by looking at the condition number, but I forget exactly what that means. Maybe it's the ratio of the first eigenvalue of the Hessian matrix to the second eigenvalue?
I'm afraid that I'm not familiar with the condition number; but having multiple starting values sounds like a brilliant idea. Let me modify the function to include all three values (one computed from data, one close to zero, one close to 1), and see how that works.
Thanks for doing the tests!
On Tue, Mar 8, 2016 at 2:08 PM, David J. Harris notifications@github.com wrote:
Okay, thanks. I was wondering where those starting values came from. To clarify, I wasn't suggesting that we fix the starting value there, just noting that it worked better as a starting value for this particular problem.
Given that the negative binomial is known to be hard to optimize, I wouldn't be opposed to trying several starting values and returning the best set of results.
I don't know which parameterization would be better in general. We could probably figure it out by looking at the condition number, but I forget exactly what that means. Maybe it's the ratio of the first eigenvalue of the Hessian matrix to the second eigenvalue?
— Reply to this email directly or view it on GitHub https://github.com/weecology/macroecotools/pull/39#issuecomment-193920436 .
@davharris - I think the new function works for the example you posted. If you run into other issues, please let me know.
@ethanwhite - what's the level of robustness we'd want in tests for this module? Do we want to make sure that no error is introduced with each update (then having 5 or so random values for each distribution/solver/likelihood function from R is probably fine)? Or do we want to check the extreme cases as well (in which case we'd probably want to add the cases from sad-comparisons
that caused us trouble)?
@rueuntal If we can fix #40, we might not need multiple starting values. Multiple starting values certainly wouldn't hurt, though.
@rueuntal I vote for including some extreme cases in the test suite.
I added three simple test cases for the pmf/pdf and cdf of each distribution. These are more like "sanity checks", making sure that the distributions behave as expected if/when we make changes. All tests have passed.
There are still no tests on the solvers though, which are probably more likely to be broken by future updates. I'm hoping that @davharris has already coded up some of them in R so that I don't have to duplicate the work :) Dave would you mind putting a few tests in if you already have the code?
@davharris - can you give this PR and discussion another look through and see if it's good to go in. Since it's the only open PR it would be nice if we could merge it before tagging 0.3 and publishing on Zenodo.
In case cleaning is important
https://github.com/weecology/macroecotools/pull/39/files#diff-0c22aa27a8370a07c9f00d7c8e7f53a2R616
https://github.com/weecology/macroecotools/pull/39/files#diff-0c22aa27a8370a07c9f00d7c8e7f53a2R537
https://github.com/weecology/macroecotools/pull/39/files#diff-0c22aa27a8370a07c9f00d7c8e7f53a2R236
https://github.com/weecology/macroecotools/pull/39/files#diff-0c22aa27a8370a07c9f00d7c8e7f53a2R242
https://github.com/weecology/macroecotools/pull/39/files#diff-0c22aa27a8370a07c9f00d7c8e7f53a2R239
https://github.com/weecology/macroecotools/pull/39/files#diff-0c22aa27a8370a07c9f00d7c8e7f53a2R236
https://github.com/weecology/macroecotools/pull/39/files#diff-0c22aa27a8370a07c9f00d7c8e7f53a2R233
https://github.com/weecology/macroecotools/pull/39/files#diff-0c22aa27a8370a07c9f00d7c8e7f53a2R212
https://github.com/weecology/macroecotools/pull/39/files#diff-0c22aa27a8370a07c9f00d7c8e7f53a2R607
https://github.com/weecology/macroecotools/pull/39/files#diff-0c22aa27a8370a07c9f00d7c8e7f53a2R608
https://github.com/weecology/macroecotools/pull/39/files#diff-0c22aa27a8370a07c9f00d7c8e7f53a2R616
https://github.com/weecology/macroecotools/pull/39/files#diff-0c22aa27a8370a07c9f00d7c8e7f53a2R173
https://github.com/weecology/macroecotools/pull/39/files#diff-0c22aa27a8370a07c9f00d7c8e7f53a2R232
to
https://github.com/weecology/macroecotools/pull/39/files#diff-0c22aa27a8370a07c9f00d7c8e7f53a2R242
It would be nice to include a default return. but it is also perfect assuming only these two cases happen.
def _argcheck(self, mu, sigma, lower_trunc):
if lower_trunc is True:
self.a = 1
else:
self.a = 0
return sigma > 0
Hi @henrykironde - what do you mean by a default return?
sorry I missed a link https://github.com/weecology/macroecotools/pull/39/files#diff-0c22aa27a8370a07c9f00d7c8e7f53a2R173
def _logpmf(self, x, p, upper_bound):
x = np.array(x)
if p[0] < 1:
return stats.logser.logpmf(x, p) - stats.logser.logcdf(upper_bound, p)
else:
ivals = np.arange(1, upper_bound[0] + 1)
normalization = sum(p[0] ** ivals / ivals)
logpmf = x * log(p[0]) - log(x) - log(normalization)
return logpmf
def logpmf(p):
if p < 1:
return "yes"
else:
return "not"
return "this value may need to be debugged"
#test drive
logpmf(None)
But this may not be important if you are sure that p[0] will always be a number
Ah I see. Are you trying to catch the case where any of the inputs are not numeric?
In that case I'd be inclined to keep the current version, which throws a TypeError if that happens (this is also the style of scipy.stats
).
Wow, I haven't thought about this in a long time. I'll see what I can do.
Okay, here are my thoughts:
stats
package in R, so it would be good to know which function was used. Even better, could the tables be generated in R and saved as a CSV or something? Copy-pasting from the R console to the test suite makes me a little jumpy.test_distribution_Xdf
functions like test_nbinom_lower_trunc_cdf
and test_trunc_weibull_pdf
. Could some of that be abstracted away? It would be a lot less brittle and much easier to read.I think these changes might work well together (e.g. if all the R results are in a single table, it might be easier to write the tests with a single function).
Hi @davharris - Good call. Let me get those fixed.
Sorry I'm not familiar with Travis CI. What does the error message below mean?
@rueuntal, I think that is just Travis failing occasionally. I don't see anything that you changed leading to this failure.
try git commit --amend
save and force push with -f. git push -f ......
...... represents whatever branch you where pushing eg. git push origin master -> git push -f origin master
I’ve restarted Travis’s tests, so I don’t think the amend/force push will be necessary.
On Aug 11, 2016, at 1:17 AM, Henry Senyondo notifications@github.com wrote:
@rueuntal https://github.com/rueuntal, I think that is just Travis failing occasionally. I don't see anything that you changed leading to this failure.
try git commit --amend save and force push with -f. git push -f ...... ...... represents whatever branch you where pushing eg. git push origin master -> git push -f origin master
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/weecology/macroecotools/pull/39#issuecomment-239075814, or mute the thread https://github.com/notifications/unsubscribe-auth/AAzdCefW0RHy9hkBVESL9WM5LwOHt1OLks5qerCHgaJpZM4Hj93W.
Nope. Still failing. Not sure if this is the reason or not, but it might be related: https://www.traviscistatus.com/incidents/11hp8bhkrkn7 https://www.traviscistatus.com/incidents/11hp8bhkrkn7
On Aug 11, 2016, at 1:20 AM, Dave Harris harry491@gmail.com wrote:
I’ve restarted Travis’s tests, so I don’t think the amend/force push will be necessary.
On Aug 11, 2016, at 1:17 AM, Henry Senyondo <notifications@github.com mailto:notifications@github.com> wrote:
@rueuntal https://github.com/rueuntal, I think that is just Travis failing occasionally. I don't see anything that you changed leading to this failure.
try git commit --amend save and force push with -f. git push -f ...... ...... represents whatever branch you where pushing eg. git push origin master -> git push -f origin master
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/weecology/macroecotools/pull/39#issuecomment-239075814, or mute the thread https://github.com/notifications/unsubscribe-auth/AAzdCefW0RHy9hkBVESL9WM5LwOHt1OLks5qerCHgaJpZM4Hj93W.
yap, I did restart it too and it failed. Let me check the last time this repo was run. cos you mentioning sudo, reminds me that there were some changes on how and when to use sudo
okey i think this is it
based on Error: No packages found matching: conda-env >=2.5,<2.6
https://www.traviscistatus.com/incidents/11hp8bhkrkn7
If that is not the case we may look at this later https://github.com/conda/conda/issues/2822
Thanks everyone. I tried the simple fix for: https://www.traviscistatus.com/incidents/11hp8bhkrkn7
in #41, but that didn't work.
It looks to me like the instructions for using conda have changed a fair bit since we set it up in this repo: http://conda.pydata.org/docs/travis.html
So the .travis.yml
probably needs to be updated.
This is probably a super simple bug - what's the directory for the input test_data.csv file? Neither 'test_data.csv'
nor './test_data.csv'
worked.
Very cool. Do we still want to add tests for the solvers before merging?
This is probably a super simple bug - what's the directory for the input test_data.csv file? Neither 'test_data.csv' nor './test_data.csv' worked.
My guess is that you want macroeco_distributions/test_data.csv
.
Great! The checks have finally passed, thanks everyone!
Do we still want to add tests for the solvers before merging?
What's your feeling for it?
Seems like a good idea. Could also be a big rabbit hole if we get too worried about edge cases like #40. I'm leaning yet. But I don't think it has to be on this PR, so I think we can merge.
Yeah I agree, especially given python's suboptimal behavior for optimization.
I'll go ahead and merge this PR then, and start an issue to discuss our options regarding the solvers.
PLN has a more complicated form than the others, and it will take me some more time to figure out if it's worthwhile to add the logpmf.