Open martinjankowiak opened 3 years ago
Hi @martinjankowiak, thank you for the bug report. I'm glad to hear you have found the library useful for sampling.
This is definitely unexpected behaviour. From my quick analysis, it appears that the implementation breaks down for sufficiently large values of the scale parameter h. Here is a plot of the pdf (obtained by calling exp() on the result of the logpdf) for h=100 compared to much smaller values:
As can be seen, for values of x just above 25, things suddenly go wrong. I did a quick print-statement debug of the C function, and the problem occurs on this line:
https://github.com/zoj613/polyagamma/blob/da31ac8d84c1363d349bf030fe72d0d53af19acb/src/pgm_density.c#L117
Here is the output of the line when using the input you provided:
```
i=0, curr=309.346381
i=1, sum=-12.647341
i=2, sum=77.701459
i=3, sum=-309.133375
i=4, sum=895.831478
i=5, sum=-2016.643901
i=6, sum=3672.975365
i=7, sum=-5566.288155
i=8, sum=7164.219293
i=9, sum=-7954.477246
i=10, sum=7713.228093
i=11, sum=-6597.153887
i=12, sum=5017.974055
i=13, sum=-3417.677855
i=14, sum=2096.509822
i=15, sum=-1164.135168
i=16, sum=587.682449
i=17, sum=-270.753110
i=18, sum=114.224948
i=19, sum=-44.259741
i=20, sum=15.793703
i=21, sum=-5.202792
i=22, sum=1.585683
i=23, sum=-0.448009
i=24, sum=0.117553
i=25, sum=-0.028693
i=26, sum=0.006525
i=27, sum=-0.001384
i=28, sum=0.000274
i=29, sum=-0.000051
i=30, sum=0.000009
i=31, sum=-0.000001
i=32, sum=0.000000
i=33, sum=-0.000000
i=34, sum=0.000000
i=35, sum=0.000000
i=36, sum=0.000000
i=37, sum=0.000000
i=38, sum=0.000000
i=39, sum=0.000000
i=40, sum=0.000000
```
The sum eventually vanishes after many iterations. It doesn't help that there is a division by h on that line, which I suspect underflows when h is big enough relative to the numerator. After exiting the loop, the logarithm function is called with sum, which then returns -inf.
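The debug output above also shows the classic failure mode of summing a large alternating series in floating point: the partial sums swing out to around ±8000 while the final value is tiny, so nearly all significant digits cancel. A minimal sketch of the same effect (using the textbook example of evaluating exp(-x) by its alternating Taylor series, not the PG density series itself):

```python
import math

# Evaluating exp(-x) from its alternating Taylor series for large x
# suffers catastrophic cancellation: intermediate terms reach ~1e12
# while the true sum is ~1e-13, so double precision loses everything.
def exp_neg_naive(x, n_terms=200):
    total, term = 0.0, 1.0
    for i in range(1, n_terms + 1):
        total += term
        term *= -x / i  # next alternating term (-x)^i / i!
    return total

def exp_neg_stable(x, n_terms=200):
    # Sum the all-positive series for exp(x) and take the reciprocal.
    total, term = 0.0, 1.0
    for i in range(1, n_terms + 1):
        total += term
        term *= x / i
    return 1.0 / total

x = 30.0
exact = math.exp(-x)
print(exp_neg_naive(x) - exact)   # error dwarfs the true value
print(exp_neg_stable(x) - exact)  # accurate to near machine precision
```

The PG density series may not admit such a simple rearrangement, but the symptom (huge alternating partial sums collapsing to a tiny total) is the same.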
As a quick workaround, I wrote a function that calculates the logpdf using a normal approximation. Since h is large, I suspect its accuracy won't be too bad?
```python
import math

def polyagamma_logpdf_large_h(x, h=1, z=0):
    if z == 0:
        mean = 0.25 * h
        var = 0.041666688 * h  # h / 24
    else:
        t = math.tanh(0.5 * z)  # separate name so the input x is not clobbered
        mean = 0.5 * h * t / z
        var = 0.25 * h * (math.sinh(z) - z) * (1 - t * t) / (z ** 3)
    return -math.log(math.sqrt(var)) - 0.5 * math.log(2 * math.pi) - 0.5 * (x - mean) ** 2 / var
```
and using it with the input, we get:
```python
>>> polyagamma_logpdf_large_h(25.80653973, h=100.0)
-1.7105576873858082
>>> polyagamma_logpdf_large_h(25.10653973, h=100.0)
-1.6338590520155094
```
Below is a pdf plot of the above normal approximation vs the pdf as it stands in the main branch, for large scale parameter values and z=0:
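As a quick consistency check on the moment formulas used in the workaround (my own sketch, not part of the library): the z ≠ 0 expressions for the mean and variance should recover the z = 0 values h/4 and h/24 in the limit z → 0.

```python
import math

# PG(h, z) moments as used in the normal-approximation workaround:
#   mean = h * tanh(z / 2) / (2 * z)
#   var  = h * (sinh(z) - z) * (1 - tanh(z / 2)**2) / (4 * z**3)
# As z -> 0 these should approach h/4 and h/24 respectively.
def pg_moments(h, z):
    t = math.tanh(0.5 * z)
    mean = 0.5 * h * t / z
    var = 0.25 * h * (math.sinh(z) - z) * (1 - t * t) / (z ** 3)
    return mean, var

h = 100.0
for z in (1.0, 0.1, 0.01):
    print(z, pg_moments(h, z))
# limits as z -> 0: mean -> h/4 = 25, var -> h/24 ≈ 4.1667
```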
I will have to investigate the real cause of the numerical instability, because I am sure it happens with other combinations of (h, z).
Also worth noting that this behaviour happens even if we call polyagamma_pdf without the return_log flag:
Also note that for larger values of the exponential tilting parameter (e.g. z >= 5), the numerical instability of the pdf seems to go away:
@martinjankowiak I submitted a rewrite at #81 that addresses part of the issue and ensures nans are not returned. However, the instability still persists for large h when z=0, leading to the wrong value being returned. For z >> 0 this does not seem to be an issue, for example:
Do you have suggestions on how to address this? The implementation seems to break down at around x >= 20 for very small values of the z parameter. I have tried all the obvious tricks I know of to deal with instability at extreme values, but nothing works. I'm interested in what use case requires z=0? In most applications I have seen, z tends to be reasonably larger than 1, so I suspect that most users who compute the logpdf with this library will not run into this issue.
@zoj613 thanks for taking a closer look at this! for my present use case i can actually compute the necessary MH ratio without explicitly evaluating the density. so i myself don't have a pressing need for this tricky regime at present.
i'm not sure how best to address the remaining instabilities as i haven't stared at the necessary formulae long enough. for what it's worth when i implemented a somewhat rough approximation to PG(1, 0) in pyro i found it helpful to group even and odd terms together. i have no idea if that would be helpful here. i guess one place to look for ideas might be generic references about summing alternating series.
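To illustrate the even/odd grouping idea with a toy example (my own sketch, not the PG series itself): when each term is computed from a formula, algebraically combining each consecutive +/- pair before evaluating it can eliminate a large common part that would otherwise cancel destructively in floating point.

```python
import math

# Toy alternating series whose terms share a huge common offset:
#   a(i) = 1e15 + 1/(i + 1), summed with alternating signs.
# The offsets cancel in exact arithmetic, but naive summation rounds
# each 1/(i + 1) against the ~1e15 partial-sum swing.
def naive_sum(n):
    s = 0.0
    for i in range(n):
        s += (-1) ** i * (1e15 + 1.0 / (i + 1))
    return s

# Grouping each even/odd pair algebraically before evaluating:
#   a(2k) - a(2k+1) = 1/(2k+1) - 1/(2k+2) = 1/((2k+1)(2k+2))
# gives an all-positive series with no cancellation at all.
def paired_sum(n):
    return sum(1.0 / ((2 * k + 1) * (2 * k + 2)) for k in range(n // 2))

exact = math.fsum((-1) ** i / (i + 1) for i in range(1000))
print(abs(naive_sum(1000) - exact))   # large error from cancellation
print(abs(paired_sum(1000) - exact))  # accurate to near machine precision
```

Whether the terms of the PG density series combine this cleanly is another question, but it shows the kind of rearrangement being suggested.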
I considered a saddle point approximation of the density as a viable solution for large values of (x, h), but came to the conclusion that it might not be the best choice. The derivative of the cumulant function of a PG(h, 0) is

K'(t) = 0.25 * h * tanh(sqrt(-0.5 * t)) / sqrt(-0.5 * t).

When computing the saddle point approximation, we have to solve for t in K'(t) = x. I noticed that for x >= 0.25 * h there is no real solution, since tanh(x) / x has range (0, 1). This might be related to why the numerical instabilities manifest at around x=25 when h=100 in the plots above. The math could be wrong, of course. I'm hoping that someone more knowledgeable might chime in and offer other alternatives.
hello @zoj613 thanks for the great library! i've been able to use it for sampling with great ease. recently i've been looking at using log pdfs to compute MH ratios and the like. i am encountering nans in strange places. for example:
yields
have you encountered anything like this? is the non-log version more stable and better tested?
i'm using `pip install --pre -U polyagamma` and thus version 1.3.1. incidentally, would it be possible to support some sort of precision argument to the pdf methods so that users can make trade-offs between speed and accuracy? or otherwise just dictate a fixed number of terms in the series to use? i'm not sure if that can help with parallelization....