Open martinjankowiak opened 3 years ago
Hi @martinjankowiak, thank you for the bug report. I'm glad to hear you have found the library useful for sampling.
This is definitely unexpected behaviour. From my quick analysis, it appears that the implementation breaks down for sufficiently large values of the scale parameter h. Here is a plot of the pdf (obtained by calling exp() on the result of the logpdf) for h=100 compared to much smaller values:
As can be seen, for values of x just above 25, things suddenly go wrong. I did a quick print-statement debug of the C function, and the problem occurs on this line:
https://github.com/zoj613/polyagamma/blob/da31ac8d84c1363d349bf030fe72d0d53af19acb/src/pgm_density.c#L117
Here is the output of the line when using the input you provided:
```
i=0, curr=309.346381
i=1, sum=-12.647341
i=2, sum=77.701459
i=3, sum=-309.133375
i=4, sum=895.831478
i=5, sum=-2016.643901
i=6, sum=3672.975365
i=7, sum=-5566.288155
i=8, sum=7164.219293
i=9, sum=-7954.477246
i=10, sum=7713.228093
i=11, sum=-6597.153887
i=12, sum=5017.974055
i=13, sum=-3417.677855
i=14, sum=2096.509822
i=15, sum=-1164.135168
i=16, sum=587.682449
i=17, sum=-270.753110
i=18, sum=114.224948
i=19, sum=-44.259741
i=20, sum=15.793703
i=21, sum=-5.202792
i=22, sum=1.585683
i=23, sum=-0.448009
i=24, sum=0.117553
i=25, sum=-0.028693
i=26, sum=0.006525
i=27, sum=-0.001384
i=28, sum=0.000274
i=29, sum=-0.000051
i=30, sum=0.000009
i=31, sum=-0.000001
i=32, sum=0.000000
i=33, sum=-0.000000
i=34, sum=0.000000
i=35, sum=0.000000
i=36, sum=0.000000
i=37, sum=0.000000
i=38, sum=0.000000
i=39, sum=0.000000
i=40, sum=0.000000
```
The sum eventually vanishes after many iterations. It doesn't help that there is a division by h on that line, which I suspect underflows when h is big enough relative to the numerator. After exiting the loop, the logarithm function is called with sum, which then returns -inf.
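The debug output above also shows the classic failure mode of summing a large alternating series in floating point: the partial sums swing out to around ±8000 while the final value is tiny, so nearly all significant digits cancel. A minimal sketch of the same effect (using the textbook example of evaluating exp(-x) by its alternating Taylor series, not the PG density series itself):

```python
import math

# Evaluating exp(-x) from its alternating Taylor series for large x
# suffers catastrophic cancellation: intermediate terms reach ~1e12
# while the true sum is ~1e-13, so double precision loses everything.
def exp_neg_naive(x, n_terms=200):
    total, term = 0.0, 1.0
    for i in range(1, n_terms + 1):
        total += term
        term *= -x / i  # next alternating term (-x)^i / i!
    return total

def exp_neg_stable(x, n_terms=200):
    # Sum the all-positive series for exp(x) and take the reciprocal.
    total, term = 0.0, 1.0
    for i in range(1, n_terms + 1):
        total += term
        term *= x / i
    return 1.0 / total

x = 30.0
exact = math.exp(-x)
print(exp_neg_naive(x) - exact)   # error dwarfs the true value
print(exp_neg_stable(x) - exact)  # accurate to near machine precision
```

The PG density series may not admit such a simple rearrangement, but the symptom (huge alternating partial sums collapsing to a tiny total) is the same.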
As a quick workaround, I wrote a function that calculates the logpdf using a normal approximation. Since h is large, I suspect its accuracy won't be too bad?
```python
import math

def polyagamma_logpdf_large_h(x, h=1, z=0):
    if z == 0:
        mean = 0.25 * h
        var = 0.041666688 * h  # h / 24
    else:
        t = math.tanh(0.5 * z)  # separate name so the input x is not clobbered
        mean = 0.5 * h * t / z
        var = 0.25 * h * (math.sinh(z) - z) * (1 - t * t) / (z ** 3)
    return -math.log(math.sqrt(var)) - 0.5 * math.log(2 * math.pi) - 0.5 * (x - mean) ** 2 / var
```
and using it with the input, we get:
```python
>>> polyagamma_logpdf_large_h(25.80653973, h=100.0)
-1.7105576873858082
>>> polyagamma_logpdf_large_h(25.10653973, h=100.0)
-1.6338590520155094
```
Below is a pdf plot of the above normal approximation vs the pdf as it stands in the main branch, for large scale parameter values and z=0:
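As a quick consistency check on the moment formulas used in the workaround (my own sketch, not part of the library): the z ≠ 0 expressions for the mean and variance should recover the z = 0 values h/4 and h/24 in the limit z → 0.

```python
import math

# PG(h, z) moments as used in the normal-approximation workaround:
#   mean = h * tanh(z / 2) / (2 * z)
#   var  = h * (sinh(z) - z) * (1 - tanh(z / 2)**2) / (4 * z**3)
# As z -> 0 these should approach h/4 and h/24 respectively.
def pg_moments(h, z):
    t = math.tanh(0.5 * z)
    mean = 0.5 * h * t / z
    var = 0.25 * h * (math.sinh(z) - z) * (1 - t * t) / (z ** 3)
    return mean, var

h = 100.0
for z in (1.0, 0.1, 0.01):
    print(z, pg_moments(h, z))
# limits as z -> 0: mean -> h/4 = 25, var -> h/24 ≈ 4.1667
```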
I will have to investigate the real cause of the numerical instability, because I am sure it happens with other combinations of (h, z).
Also worth noting that this behaviour happens even if we call polyagamma_pdf without the return_log flag:
Also note that for larger values of the exponential tilting parameter (e.g. z >= 5), the numerical instability of the pdf seems to go away:
@martinjankowiak I submitted a rewrite at #81 that addresses part of the issue and ensures nans are not returned. However, the instability still persists for large h when z=0, leading to the wrong value being returned. For z >> 0 this does not seem to be an issue, for example:
Do you have suggestions on how to address this? The implementation seems to break down at around x >= 20 for very small values of the z parameter. I have tried all the obvious tricks I know of to deal with instability at extreme values, but nothing works. I'm interested in what use case requires z=0? In most applications I have seen, z tends to be reasonably larger than 1, so I suspect that most users who compute the logpdf with this library will not run into this issue.
@zoj613 thanks for taking a closer look at this! for my present use case i can actually compute the necessary MH ratio without explicitly evaluating the density. so i myself don't have a pressing need for this tricky regime at present.
i'm not sure how best to address the remaining instabilities as i haven't stared at the necessary formulae long enough. for what it's worth when i implemented a somewhat rough approximation to PG(1, 0) in pyro i found it helpful to group even and odd terms together. i have no idea if that would be helpful here. i guess one place to look for ideas might be generic references about summing alternating series.
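To illustrate the even/odd grouping idea with a toy example (my own sketch, not the PG series itself): when each term is computed from a formula, algebraically combining each consecutive +/- pair before evaluating it can eliminate a large common part that would otherwise cancel destructively in floating point.

```python
import math

# Toy alternating series whose terms share a huge common offset:
#   a(i) = 1e15 + 1/(i + 1), summed with alternating signs.
# The offsets cancel in exact arithmetic, but naive summation rounds
# each 1/(i + 1) against the ~1e15 partial-sum swing.
def naive_sum(n):
    s = 0.0
    for i in range(n):
        s += (-1) ** i * (1e15 + 1.0 / (i + 1))
    return s

# Grouping each even/odd pair algebraically before evaluating:
#   a(2k) - a(2k+1) = 1/(2k+1) - 1/(2k+2) = 1/((2k+1)(2k+2))
# gives an all-positive series with no cancellation at all.
def paired_sum(n):
    return sum(1.0 / ((2 * k + 1) * (2 * k + 2)) for k in range(n // 2))

exact = math.fsum((-1) ** i / (i + 1) for i in range(1000))
print(abs(naive_sum(1000) - exact))   # large error from cancellation
print(abs(paired_sum(1000) - exact))  # accurate to near machine precision
```

Whether the terms of the PG density series combine this cleanly is another question, but it shows the kind of rearrangement being suggested.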
I considered a saddle point approximation of the density as a viable solution for large values of (x, h), but came to the conclusion that it might not be the best choice. The derivative of the cumulant function of a PG(h, 0) is

K'(t) = 0.25 * h * tanh(sqrt(-0.5 * t)) / sqrt(-0.5 * t).

When computing the saddle point approximation, we have to solve for t in K'(t) = x. I noticed that for x >= 0.25 * h there is no real solution, since tanh(x) / x has range (0, 1). This might be related to why the numerical instabilities manifest at around x=25 when h=100 in the plots above. The math could be wrong, of course. I'm hoping that someone more knowledgeable might chime in and offer other alternatives.
hello @zoj613 thanks for the great library! i've been able to use it for sampling with great ease. recently i've been looking at using log pdfs to compute MH ratios and the like. i am encountering nans in strange places. for example:
yields
have you encountered anything like this? is the non-log version more stable and better tested?
i'm using `pip install --pre -U polyagamma` and thus version 1.3.1. incidentally, would it be possible to support some sort of precision argument to the pdf methods so that users can make trade-offs between speed and accuracy? or otherwise just dictate a fixed number of terms in the series to use? i'm not sure if that can help with parallelization....