weecology / macroecotools

Tools for Macroecological Analyses Using Python
Other
9 stars 13 forks source link

Bug in pln? #7

Closed mqwilber closed 10 years ago

mqwilber commented 10 years ago

pln.pmf(0, 4.3, 100, 0) gives the following

../quadpack.py:295: UserWarning: The maximum number of subdivisions (100) has been achieved. If increasing the limit yields no improvement it is advised to analyze the integrand in order to determine the difficulties. If the position of a local difficulty can be determined (singularity, discontinuity) one will probably gain from splitting up the interval and calling the integrator on the subranges. Perhaps a special-purpose integrator should be used. warnings.warn(msg)

With a result of 0.39006074199175289. This is not in agreement with the macroeco package plnorm.pmf or dpolono from VGAM in R.

Also, pln.cdf([1,2], 4, 4, 0) raises

ValueError: operands could not be broadcast together with shapes (2) (3)

Just thought you guys might like to know!

ethanwhite commented 10 years ago

@rueuntal can you take a look at this when you get a few minutes?

ethanwhite commented 10 years ago

Thanks for the report @mqwilber!

Are you using the Grundy Biometrika implementation as well? Our tests against the tables in that paper are passing, so I'm just trying to help narrow in on whether this is a type of case that isn't being exercised by the tests or a difference in algorithm.

ethanwhite commented 10 years ago

Also, @mqwilber, is there just a difference when zero is used for the lower truncation (something we haven't worked with much) or is there a difference with one for the lower truncation as well? I tried to check it myself but after building the develop branch of macroeco I get errors on import.

ethanwhite commented 10 years ago

OK, so after a little exploration comparing our results to those in VGAM (because I'm kind of obsessive about bugs and I can't help myself) it looks like this is a probably a numerical precision issue at extreme values of the variance. We start to see small divergences from VGAM when the standard deviation on the log scale is large (in the case of the given mean, about 50, which is equivalent to a variance of the distribution ~2.5*10^43), and the mean is small relative to the variance (having a std_log of 50 is fine if the mean_log is >25).

mqwilber commented 10 years ago

We are using the Bulmer implementation, but we are unit testing against Grundy values (and passing) as well as against the your implementation and the R functions dpoilog and dpolono. They are all matching up except for those extreme values. It looks like I get a similar divergence and warning message for pln.pmf(1, 4.3, 100, 1):

quadpack.py:295: UserWarning: The maximum number of subdivisions (100) has been achieved. If increasing the limit yields no improvement it is advised to analyze the integrand in order to determine the difficulties. If the position of a local difficulty can be determined (singularity, discontinuity) one will probably gain from splitting up the interval and calling the integrator on the subranges. Perhaps a special-purpose integrator should be used. warnings.warn(msg) Out[13]: 0.0065323785347902903

VGAM polono (with a truncation) and macroeco are giving 0.0076703646892546141. As you said, it is a pretty extreme test case.

@ethanwhite, could you submit the macroeco import error as an issue on jkitzes/macroeco?

rueuntal commented 10 years ago

increasing the max number of subdivisions seems to partially fix the issue, e.g., changing limit in integrate.quad from 100 to 500 yields pln.pmf(1, 4.3, 100, 1) = 0.007662314604567174.

I will try to implement mpmath and see if it provides a better fix.

ethanwhite commented 10 years ago

As you said, it is a pretty extreme test case.

Yes, thankfully not something anyone should have run an analysis on, but definitely good to clean up nonetheless.

@ethanwhite, could you submit the macroeco import error as an issue on jkitzes/macroeco?

Way ahead of you :) https://github.com/jkitzes/macroeco/issues/70

I have a rule that if I find a bug I always report it. That's how we make all of the code better, so thanks again for reporting our issue.

increasing the max number of subdivisions seems to partially fix the issue, e.g., changing limit in integrate.quad from 100 to 500 yields pln.pmf(1, 4.3, 100, 1) = 0.007662314604567174.

I will try to implement mpmath and see if it provides a better fix.

Thanks Xiao!

rueuntal commented 10 years ago

Strange, switching to mpmath actually yields worse results:

>>>pln.pmf(0, 4.3, 100, 0)
0.158111050065873125
>>>pln.pmf(1, 4.3, 100, 1)
0.0047326331064326717

I will go ahead and change the max number of subdivisions to 500 then (which would still raise a warning), unless other folks have a better idea in mind?

rueuntal commented 10 years ago

The bug in pln.cdf is now fixed. Thanks for reporting @mqwilber.

ethanwhite commented 10 years ago

I will go ahead and change the max number of subdivisions to 500 then (which would still raise a warning), unless other folks have a better idea in mind?

We should probably also take a look at how our implementation differs from macreco since we're implementing the same algorithm (despite my confused comment earlier): https://github.com/jkitzes/macroeco/blob/develop/macroeco/models/_distributions.py#L729

rueuntal commented 10 years ago

The problem is now fixed with a transformation in pln.gen(), borrowed from macroeco (https://github.com/jkitzes/macroeco/blob/develop/macroeco/models/_distributions.py#L729). Thanks again @mqwilber !

ethanwhite commented 10 years ago

Thanks for fixing this @rueuntal! And thanks again @mqwilber for reporting the issue!