weecology / macroecotools

Tools for Macroecological Analyses Using Python
Other
9 stars 13 forks source link

Add logpmf/pdf for all distributions except for PLN #39

Closed rueuntal closed 8 years ago

rueuntal commented 8 years ago

PLN has a more complicated form than the others, and it will take me some more time to figure out if it's worthwhile to add the logpmf.

davharris commented 8 years ago

My current feelings about pln:

Correctness:

Length/complexity:

LogPMF:

rueuntal commented 8 years ago

Thanks for the insights! I don't know enough about Gauss-Hermite quadrature to comment, but I'm with you on shorter code is better code.

It would definitely be good to double check. I'll leave it as it is for now, and come back to it when the other more preeminent issues are fixed.

ethanwhite commented 8 years ago

I've read through the code and all of the "code" changes look good to me. I haven't checked the optimizer changes or the equations for the pmfs since @davharris will be able to do a better job with those anyway.

@davharris - when you're happy with this things I think this is ready to go in.

Thanks to both of your for all of your work on this.

davharris commented 8 years ago

I don't see any problems. Should we write a few tests before merging, though?

rueuntal commented 8 years ago

Sorry for being the time-limiting step here. I did some rudimentary tests with made-up numbers to make sure that the functions returned the same outputs before/after the changes, but it's probably a good idea to do it more systematically. We already have the tests for pmf and cdf of PLN, gleaned from the original paper. @davharris what would you suggest for the other distributions and the solvers? Should we try to replicate the functions in R?

davharris commented 8 years ago

Replicating in R sounds good to me, unless it substantially complicates the testing infrastructure.

Dave

On Mar 7, 2016, at 12:56 PM, Xiao Xiao notifications@github.com wrote:

Sorry for being the time-limiting step here. I did some rudimentary tests with made-up numbers to make sure that the functions returned the same outputs before/after the changes, but it's probably a good idea to do it more systematically. We already have the tests for pmf and cdf of PLN, gleaned from the original paper. @davharris what would you suggest for the other distributions and the solvers? Should we try to replicate the functions in R?

— Reply to this email directly or view it on GitHub.

davharris commented 8 years ago

Looks like there are still numeric problems with the negative binomial solver.

import macroeco_distributions as md

ab = [42L, 21L, 7L, 2L, 21L, 2L, 3L, 1L, 3L, 7L, 2L, 8L, 6L, 7L, 
2L, 2L, 1710L, 38L, 8L, 2L, 5L, 72L, 50L, 6L, 208L, 6L, 24L, 
4L, 3L, 1L, 2L, 1L, 40L, 52L, 503L, 50L, 47L, 232L, 100L, 30L, 
95L, 8L, 4L, 67L]

md.nbinom_lower_trunc_solver(ab)
md.nbinom_lower_trunc_solver(ab)
/Users/davidharris/anaconda/lib/python2.7/site-packages/macroeco_distributions/macroeco_distributions.py:611: RuntimeWarning: overflow encountered in exp
  return -nbinom_lower_trunc_ll(ab, exp(x[0]), expit(x[1]))
/Users/davidharris/anaconda/lib/python2.7/site-packages/scipy/stats/_discrete_distns.py:167: RuntimeWarning: invalid value encountered in subtract
  coeff = gamln(n+x) - gamln(x+1) - gamln(n)
(6.9645032936894593e-16, 1.8722355604094302e-09)

I don't know why scipy's optimizers get stuck in these low-likelihood areas. My R code finds the optimum at n=1.84667069193733e-06, p=0.00202186468733917, which is about 50 log-likelihood points better. Maybe we need better starting values for these extreme cases?

davharris commented 8 years ago

It does look like the starting values aren't very good for the case I posted in my previous comment. Using the code I linked to, I get a starting value of 0.99887 as the starting value for p0, when the optimum as at p=0.0020.

I've also confirmed that---at least for this abundance vector---I get good results starting p0 closer to 0.001 instead of .99887.

rueuntal commented 8 years ago

The starting value is consistent with that in fitdistr() in R though. Given how sensitive this optimization is to the starting value, I'm not sure if fixing the starting value at 0.001 would not cause problems when p is at the other end.

The negative binomial in R also takes a different form (mu, size instead of n, p). Do you think that would make any difference (making it easier or harder for the optimizer)?

davharris commented 8 years ago

Okay, thanks. I was wondering where those starting values came from. To clarify, I wasn't suggesting that we fix the starting value there, just noting that it worked better as a starting value for this particular problem.

Given that the negative binomial is known to be hard to optimize, I wouldn't be opposed to trying several starting values and returning the best set of results.

I don't know which parameterization would be better in general. We could probably figure it out by looking at the condition number, but I forget exactly what that means. Maybe it's the ratio of the first eigenvalue of the Hessian matrix to the second eigenvalue?

rueuntal commented 8 years ago

I'm afraid that I'm not familiar with the condition number; but having multiple starting values sounds like a brilliant idea. Let me modify the function to include all three values (one computed from data, one close to zero, one close to 1), and see how that works.

Thanks for doing the tests!

On Tue, Mar 8, 2016 at 2:08 PM, David J. Harris notifications@github.com wrote:

Okay, thanks. I was wondering where those starting values came from. To clarify, I wasn't suggesting that we fix the starting value there, just noting that it worked better as a starting value for this particular problem.

Given that the negative binomial is known to be hard to optimize, I wouldn't be opposed to trying several starting values and returning the best set of results.

I don't know which parameterization would be better in general. We could probably figure it out by looking at the condition number, but I forget exactly what that means. Maybe it's the ratio of the first eigenvalue of the Hessian matrix to the second eigenvalue?

— Reply to this email directly or view it on GitHub https://github.com/weecology/macroecotools/pull/39#issuecomment-193920436 .

rueuntal commented 8 years ago

@davharris - I think the new function works for the example you posted. If you run into other issues, please let me know.

@ethanwhite - what's the level of robustness we'd want in tests for this module? Do we want to make sure that no error is introduced with each update (then having 5 or so random values for each distribution/solver/likelihood function from R is probably fine)? Or do we want to check the extreme cases as well (in which case we'd probably want to add the cases from sad-comparisons that caused us trouble)?

davharris commented 8 years ago

@rueuntal If we can fix #40, we might not need multiple starting values. Multiple starting values certainly wouldn't hurt, though.

davharris commented 8 years ago

@rueuntal I vote for including some extreme cases in the test suite.

rueuntal commented 8 years ago

I added three simple test cases for the pmf/pdf and cdf of each distribution. These are more like "sanity checks", making sure that the distributions behave as expected if/when we make changes. All tests have passed.

There are still no tests on the solvers though, which are probably more likely to be broken by future updates. I'm hoping that @davharris has already coded up some of them in R so that I don't have to duplicate the work :) Dave would you mind putting a few tests in if you already have the code?

ethanwhite commented 8 years ago

@davharris - can you give this PR and discussion another look through and see if it's good to go in. Since it's the only open PR it would be nice if we could merge it before tagging 0.3 and publishing on Zenodo.

henrykironde commented 8 years ago

In case cleaning is important

https://github.com/weecology/macroecotools/pull/39/files#diff-0c22aa27a8370a07c9f00d7c8e7f53a2R616
https://github.com/weecology/macroecotools/pull/39/files#diff-0c22aa27a8370a07c9f00d7c8e7f53a2R537
https://github.com/weecology/macroecotools/pull/39/files#diff-0c22aa27a8370a07c9f00d7c8e7f53a2R236
https://github.com/weecology/macroecotools/pull/39/files#diff-0c22aa27a8370a07c9f00d7c8e7f53a2R242
https://github.com/weecology/macroecotools/pull/39/files#diff-0c22aa27a8370a07c9f00d7c8e7f53a2R239
https://github.com/weecology/macroecotools/pull/39/files#diff-0c22aa27a8370a07c9f00d7c8e7f53a2R236
https://github.com/weecology/macroecotools/pull/39/files#diff-0c22aa27a8370a07c9f00d7c8e7f53a2R233
https://github.com/weecology/macroecotools/pull/39/files#diff-0c22aa27a8370a07c9f00d7c8e7f53a2R212
https://github.com/weecology/macroecotools/pull/39/files#diff-0c22aa27a8370a07c9f00d7c8e7f53a2R607
https://github.com/weecology/macroecotools/pull/39/files#diff-0c22aa27a8370a07c9f00d7c8e7f53a2R608
https://github.com/weecology/macroecotools/pull/39/files#diff-0c22aa27a8370a07c9f00d7c8e7f53a2R616
https://github.com/weecology/macroecotools/pull/39/files#diff-0c22aa27a8370a07c9f00d7c8e7f53a2R173
https://github.com/weecology/macroecotools/pull/39/files#diff-0c22aa27a8370a07c9f00d7c8e7f53a2R232

to 

https://github.com/weecology/macroecotools/pull/39/files#diff-0c22aa27a8370a07c9f00d7c8e7f53a2R242

It would be nice to include a default return. but it is also perfect assuming only these two cases happen.

def _argcheck(self, mu, sigma, lower_trunc):
    if lower_trunc is True:
        self.a = 1
    else:
        self.a = 0
    return sigma > 0
rueuntal commented 8 years ago

Hi @henrykironde - what do you mean by a default return?

henrykironde commented 8 years ago

sorry I missed a link https://github.com/weecology/macroecotools/pull/39/files#diff-0c22aa27a8370a07c9f00d7c8e7f53a2R173


    def _logpmf(self, x, p, upper_bound):
        x = np.array(x)
        if p[0] < 1:
            return stats.logser.logpmf(x, p) - stats.logser.logcdf(upper_bound, p)
        else:
            ivals = np.arange(1, upper_bound[0] + 1)
            normalization = sum(p[0] ** ivals / ivals)
            logpmf = x * log(p[0]) - log(x) - log(normalization)
            return logpmf
def logpmf(p):

    if p < 1:
        return "yes"
    else:
        return "not"
    return "this value may need to be debugged"

#test drive
logpmf(None)

But this may not be important if you are sure that p[0] will always be a number

rueuntal commented 8 years ago

Ah I see. Are you trying to catch the case where any of the inputs are not numeric?

In that case I'd be inclined to keep the current version, which throws a TypeError if that happens (this is also the style of scipy.stats).

davharris commented 8 years ago

Wow, I haven't thought about this in a long time. I'll see what I can do.

davharris commented 8 years ago

Okay, here are my thoughts:

I think these changes might work well together (e.g. if all the R results are in a single table, it might be easier to write the tests with a single function).

rueuntal commented 8 years ago

Hi @davharris - Good call. Let me get those fixed.

rueuntal commented 8 years ago

Sorry I'm not familiar with Travis CI. What does the error message below mean?

henrykironde commented 8 years ago

@rueuntal, I think that is just Travis failing occasionally. I don't see anything that you changed leading to this failure.

try git commit --amend save and force push with -f. git push -f ...... ...... represents whatever branch you where pushing eg. git push origin master -> git push -f origin master

davharris commented 8 years ago

I’ve restarted Travis’s tests, so I don’t think the amend/force push will be necessary.

On Aug 11, 2016, at 1:17 AM, Henry Senyondo notifications@github.com wrote:

@rueuntal https://github.com/rueuntal, I think that is just Travis failing occasionally. I don't see anything that you changed leading to this failure.

try git commit --amend save and force push with -f. git push -f ...... ...... represents whatever branch you where pushing eg. git push origin master -> git push -f origin master

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/weecology/macroecotools/pull/39#issuecomment-239075814, or mute the thread https://github.com/notifications/unsubscribe-auth/AAzdCefW0RHy9hkBVESL9WM5LwOHt1OLks5qerCHgaJpZM4Hj93W.

davharris commented 8 years ago

Nope. Still failing. Not sure if this is the reason or not, but it might be related: https://www.traviscistatus.com/incidents/11hp8bhkrkn7 https://www.traviscistatus.com/incidents/11hp8bhkrkn7

On Aug 11, 2016, at 1:20 AM, Dave Harris harry491@gmail.com wrote:

I’ve restarted Travis’s tests, so I don’t think the amend/force push will be necessary.

On Aug 11, 2016, at 1:17 AM, Henry Senyondo <notifications@github.com mailto:notifications@github.com> wrote:

@rueuntal https://github.com/rueuntal, I think that is just Travis failing occasionally. I don't see anything that you changed leading to this failure.

try git commit --amend save and force push with -f. git push -f ...... ...... represents whatever branch you where pushing eg. git push origin master -> git push -f origin master

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/weecology/macroecotools/pull/39#issuecomment-239075814, or mute the thread https://github.com/notifications/unsubscribe-auth/AAzdCefW0RHy9hkBVESL9WM5LwOHt1OLks5qerCHgaJpZM4Hj93W.

henrykironde commented 8 years ago

yap, I did restart it too and it failed. Let me check the last time this repo was run. cos you mentioning sudo, reminds me that there were some changes on how and when to use sudo

henrykironde commented 8 years ago

okey i think this is it based on Error: No packages found matching: conda-env >=2.5,<2.6 https://www.traviscistatus.com/incidents/11hp8bhkrkn7

If that is not the case we may look at this later https://github.com/conda/conda/issues/2822

ethanwhite commented 8 years ago

Thanks everyone. I tried the simple fix for: https://www.traviscistatus.com/incidents/11hp8bhkrkn7

in #41, but that didn't work.

It looks to me like the instructions for using conda have changed a fair bit since we set it up in this repo: http://conda.pydata.org/docs/travis.html

So the .travis.yml probably needs to be updated.

rueuntal commented 8 years ago

This is probably a super simple bug - what's the directory for the input test_data.csv file? Neither 'test_data.csv' nor './test_data.csv' worked.

davharris commented 8 years ago

Very cool. Do we still want to add tests for the solvers before merging?

ethanwhite commented 8 years ago

This is probably a super simple bug - what's the directory for the input test_data.csv file? Neither 'test_data.csv' nor './test_data.csv' worked.

My guess is that you want macroeco_distributions/test_data.csv.

rueuntal commented 8 years ago

Great! The checks have finally passed, thanks everyone!

Do we still want to add tests for the solvers before merging?

What's your feeling for it?

davharris commented 8 years ago

Seems like a good idea. Could also be a big rabbit hole if we get too worried about edge cases like #40. I'm leaning yet. But I don't think it has to be on this PR, so I think we can merge.

rueuntal commented 8 years ago

Yeah I agree, especially given python's suboptimal behavior for optimization.

I'll go ahead and merge this PR then, and start an issue to discuss our options regarding the solvers.