[Discussion] Expansion of Distributions and IDistribution Interface

yellowstonegames / SquidLib

Useful tools for roguelike, role-playing, strategy, and other grid-based games in Java. Feedback is welcome!

Other

448 stars 46 forks source link

[Discussion] Expansion of Distributions and IDistribution Interface #209

Open aus101 opened 2 years ago

aus101 commented 2 years ago

I'd really like to see (discrete) Poisson and Binomial distributions due their ubiquity in probability. The cumulative distribution of Poisson can be expressed in terms of the (continuous) Gamma distribution. Would imply adding fast factorial, log and exp(-x) calls. Poisson distribution is easy with lambda <= 20 for indexing a factorial that can be stored as a long. Poisson time between events is the Exponential Distribution that you already have.
For expansion of IDistribution, would be easy and useful to have returns of the first four moments - mean, variance, skewness and excess kurtosis of the respective distribution. I'd like to see 90/95/99 confidence intervals but that gets debatable on which exact method to use for each distribution. No perfect solution, even exact binomial calculation is overly conservative, i.e., 95% CI is more like 99%. At least moments and CI can be pre-computed on distribution construction and permanently fixed to the existing Gaussian since it's the normalized N( 0 , 1 ) of mean of 0 and variance of 1 with higher moments of 0.
Could have helper function for each distribution that take in an array or list of input and return the requested moment, assuming that distribution was the only source.
The ubiquitous (continuous) Student's t-distribution is computationally a nightmare but can take easy way out by limiting for confidence intervals and those are easily pre-computed for each degree of freedom.

tommyettinger commented 2 years ago

Out of order responses...

RE: point 2, I've started on coding a tiny little bit of this in a different project that needs point 2 for compatibility, and it is not easy because I really lack the fundamental background on some of this. However, that project, ShaiRandom, thankfully can pull much of this information (and implementations in C#) from Troschuetz.Random.

RE: point 1, in ShaiRandom I have implemented Poisson in a way that should be efficient for "small lambda," though "efficient" and "small" are both subjective. I see Binomial mentioned often, but still haven't been able to wrap my head around its Wikipedia article; I think it is probably much simpler than how it is explained. I don't know how hard Gamma is to implement; I seem to recall it can't be done with a method that only requests a finite fixed amount of random numbers, though that isn't a dealbreaker.

RE: point 4, Student's t also seems tough. I don't really understand the terms you used here (pretty much any of them).

Many of these distributions I can essentially copy the algorithms and information (such as mean, variance, etc.) from Troschuetz.Random, but it doesn't have excess kurtosis information that I can recall.

RE: point 3, I really am not the person to be implementing this, because I currently do not know what the moment of a distribution is.

Long story short, I am not at all well-versed in statistics or probability (I've never taken any class on it, and what I do know has been dragged clumsily out from Wikipedia). I can read the tools I use, and tell approximately what the p-values in PractRand and hwd mean, but more elaborate understanding of distributions is harder. I use MathVisualizer in the squidlib tests to check visually how some distributions are shaped (I used this for Kumaraswamy, which is really quite versatile), and this helps me understand a little, but it can't do much to show kurtosis or other subtle shape-related values.

I can try to implement these features in ShaiRandom first, so I understand them better. ShaiRandom helps fill a gap in the .NET random number generator libraries, and I want to get it usable soon so GoRogue can use it. Troschuetz.Random mostly has lackluster generators when it comes to quality, and the ones it has that pass tests are slower than they need to be. It also uses a rather-flawed method for producing its pseudo-random doubles that it feeds to all of its (much better) distribution code. I first saw the linked code on the webpage for the xoroshiro generators; since then it has been effectively discouraged from use by that same page.

Once I have these major distributions in ShaiRandom, I would need to expand IDistribution. This breaks backwards compatibility with any (probably just hypothetical) user code that implemented version 3.0.0's IDistribution. If we could use Java 8, which we still can't because of RoboVM limitations, we could use default methods to allow the older code to still at least build. That means SquidSquad could do this more easily than SquidLib, since it doesn't worry about backwards compatibility much, and it also uses Java 8 (in the expectation that RoboVM will eventually be able to support Java 8). I think I could fit distributions more deeply into new SquidSquad code than existing SquidLib code, anyway. There's a few key features I'd like to get into SquidSquad before a major semi-stable release, but I could probably start putting out 0.0.x releases soon. JitPack is our savior in the meanwhile.

aus101 commented 2 years ago

edited to fix Java Optional since I didn't realize > was the quotation character.

So you do C# as well. I'm jealous of Visual Studio compared the free buggy messes Java has in comparison.

I think it's fine to port stat code you may not understand the mechanics of. You have visualization aid and distributions are easy to check against in Excel or Google Sheets or whatever. The design of Troschuetz is really clever and that's genius defining Student's t-distribution pdf as a combination of Gaussian (Normal) and Chi Squared. I see they give the mean and variance of every distribution. You may be more familiar with standard deviation = sqrt(variance) but variance is far more fundamental to statistics.

I don't know C# well enough to port it but I could add user-specified mean and variance to your Gaussian distribution so it isn't only capable of 0 mean and 1 variance. One of your links shows the very small changes needed.

I think it's okay if you keep IDistribution as is. In my 10 years of corporate Java programming, interfaces are the things we add at the very end to make the design look better and pass code review. Can still have mean, variance, etc. accessors in each distribution. Fringe argument that not having interface methods is better because of small overhead cost of virtual function calls.

RoboVM, I've never done mobile Java programming. Guess I wouldn't use it if I start new project when it's capped to 7? Default methods are cool but I've never seen them in the wild. Lambda expressions and Java <Optional> pop up though.

Skew is important in stock market pricing models and determining your maximum error from using the Gaussian distribution approximation via Central Limit Theorem. To 90-95% people using stats, they probably just want mean and variance and sqrt(variance) = standard deviation. In stocks you also get into co-variance because stocks in the same industry can go up or down together. Not independent trials. Co-variance matrices give me headaches.

more stats background info below, can skip

My intro to hardcore stats was college course in calculus-based statistics for engineering. Blew my mind. Each distribution is uniquely characterized by a moment generating function (MGF). An alternative way of finding the mean is taking the derivative of that with respect to t then setting t = 0. The variance is found by taking the second derivative of that and setting t = 0. Thus the name "first moment" and "second moment" and so forth.

Usually you don't have to go calculus-deep but what if someone decides to replace lambda in Poisson with the entire Gaussian distribution? Happens in insurance claim modeling all the time. You may have to use calculus to find what that exact mean is or what the distribution is that is a combination of others. Add the MGFs together and see what happens.

aus101 commented 2 years ago

Thank you for the detailed explanation and elaborating on every point! I wanted to make separate comment on Binomial distribution. Maybe I can explain it. Is that what comments are for? I'm new here. Coding in my spare time is foreign concept.

You know what, people tell me I'm smart and your atan range reduction went right over my mind. I didn't understand the Binomial Wikipedia explanation when I found it a few years ago either. What I did was calculate the probability of something in Excel then work through the math the long way.

Let's say we're playing a game where chance I win is 60% and independent trials so winning 5x in a row doesn't mean I'm more or less likely to win the next time. Is "memoryless" property. Can say I'm flipping a rigged coin if you want. Now what are odds I win exactly 2 times in 5 games?

=BINOM.DIST(2, 5, 0.6, 0) = 0.2304 = 23.04% chance for exactly 2 successes out of 5 with chance of success of 0.6 (60%) By definition, chance of failure is (1 - chance of success) so 0.4 (40%).

Most people can work out odds of 5 successes in a row = 0.6^5 = BINOM.DIST(5, 5, 0.6, 0) = 0.0777 (7.8%) but they get 2 out of 5 wrong. You know has 3 failures but 0.6^2 * 0.4^3 = 0.02304 -> exactly 10 times lower than the correct answer. Standard notation is p for success chance (0.6) and q for failure chance (0.4). Note p + q = 1. Called "Binomial" distribution because each trial of the 5 has only 2 possible outcomes.

The trick is multiplying the 0.02304 by how many ways you can arrange this group of 5 trials. Could call the 2 we want P's and the 3 we don't Q's:

PPPQQ QPPPQ QQPPP PPQQP PPQPQ QPPQP QPQPP PQQPP PQPQP PQPPQ

10 total ways to arrange. Thus P(X=2) is 10 0.6^2 0.4^3 = 0.2304 = BINOM.DIST(2, 5, 0.6, 0) from above.

IRL no one manually counts the ways to arrange. We use the amazing "combinations" COMBIN function bundled in every spreadsheet: COMBIN(5, 2) = 10. How many ways to arrange 4 aces in a deck of cards? COMBIN(52, 4) = 270725.

Note that COMBIN(5, 2) is equal to COMBIN(5, 3) due to symmetry. Same ways to arrange (3P, 2Q) as (2P, 3Q). COMBIN(5, 5) = 1 as only 1 way to arrange PPPPP. So actually the 5 successes in a row example uses COMBIN too. This COMBIN is the "binomial coefficient" in the article of (n p) vertically arranged with n of 5 and p of 2 here. Also note that 5 failures in a row QQQQQ has only 1 way to arrange:

COMBIN(5, 0) qqqqq = (1) 0.4^5 = 0.01024 = BINOM.DIST(0, 5, 0.6, 0) = BINOM.DIST(5, 5, 0.4, 0).

If you added up chances for 0 to 5 successes then total odds = 1. Fun fact: the mean = expected value is n p = 3 and variance = n p * q = 1.2 and standard deviation of sqrt(1.2). Can see max variance is when p = q = 0.5, which makes sense if you think about it, as a symmetrical distribution is the widest.

tommyettinger commented 2 years ago

Wow, that helps quite a bit re: binomial distribution. I think the part I didn't pick up on initially was that the distribution has more variables (parameters?) than I thought it would. That would explain the lack of clear 2D graphs I could find. I think I understand this well enough to try porting Troschuetz.Random's binomial distribution now.

Also, I realized I can have an IDistributionDetailed that has extra information in addition to what IDistribution can do; I would just extend the interface with another interface. That wouldn't break backwards compatibility at all. Regarding virtual function overhead in interfaces, a getter for a constant value or unchanged member variable should typically be inlined easily by the JVM, avoiding some or all of that overhead -- the issue is when the getter needs a complex computation, which some of the parameterized distributions do need.

aus101 commented 2 years ago

Sounds good to me. The mean, variance, etc. are all fixed when the distribution gets created with the required parameters so can be calculated then and there.

I added the mean and standard deviation to GaussianDistribution but GitHub is blocking me from publishing to make a pull request due to no write access. I'm not a cryptocurrency miner I swear.

I'm not sure how to construct the proper IRNG here but testing looked solid when using Java 11 java.util.Random and Math.sin and Math.cos. I needed to multiply the variate by 2pi, else no value could be below the mean. I assume NumberTools.sin and NumberTools.cos account for that.

The webpage for the xoroshiro generators was new to me. I like how Sebastiano, one of the authors, says he implemented xoroshiro in Java 17. I looked up the OpenJDK implementation and found it. Looks easy to down port into Java 7.

tommyettinger commented 2 years ago

To make a pull request you need a fork, and can submit the PR from there. I believe it isn't especially complicated to fork on GitHub, clone your fork, and then copy your changes into the cloned code. You could then push that and handle the PR from the web interface (there's also a GitHub CLI to handle forks, PRs, repo creation, etc. without a browser, but I haven't used it).

sin() and cos() take their argument in turns, not radians, so yeah multiplying by 2PI would give you the same expected result, roughly, as Math.sin() and Math.cos(). Using turns makes a lot of sense in a lot of places; it also helps because you can just subtract the floor of an argument instead of using modulo and correcting for negative inputs. I also provide radian and degree versions in NumberTools. The underscored methods are only named like that because I didn't know the terminology for a rotation in turns at the time, and changing them would break a lot of compatibility.

Usually you don't need to construct an IRNG, just a RandomnessSource (which can be given in the constructor to RNG, and that handles the IRNG implementation). RandomnessSource only needs next(int), nextLong(), and copy() to be implemented; you can look at any of its many existing implementations for guidance.

Xoroshiro128+, which was recommended by Vigna for a long time (and now isn't recommended in many cases), is present in SquidLib as XoRoRNG. I more recently brought over XoshiroStarStar64RNG, which is xoshiro256** (the naming scheme is very confusing because there are two perfectly valid generators called xoshiro256**; one is 32-bit and one is 64-bit, and this one is the 64-bit version). I'm not fully on-board with the xoshiro256++ versions; they aren't as well-equidistributed as the ** generators and that's the main draw of a xoshiro generator (speed certainly isn't a reason to use xoshiro on the JVM, since many are faster with comparable quality). I also should say that I have never been able to verify the speed claims that Vigna's page makes for xoroshiro and xoshiro generators, even in C++: https://quick-bench.com/q/qADxNlE7LQ8TeYghIbusmylGcAY shows SplitMix64 (LightRNG in SquidLib) as faster than either xoshiro256** or xoroshiro128+ (where xoroshiro128+ is what Vigna claims is fastest).

aus101 commented 2 years ago

Thank you, wasn't as hard as I thought to make a pull request. I guess that's the point of GitHub really. It shouldn't be hard.

Cool so sin() and cos() work the way I expected. That is better to imbed the 2PI for engineering calculations such as transfer functions and you're saying it improves the range reduction as well.

Thanks for explaining IRNG. Would be nice to make the distribution easier to use with a no argument nextDouble() call. Like how base Java gives you a default one with new Random() for beginners but allowing more advanced users to call on a specific algorithm.

So you've already down ported Xoroshiro, I'm impressed. I don't see where the author no longer recommends Xoroshiro128+ but you clearly know more about this than I do. I thought the website's stated generation rate of < 1 ns for 64 bits was too good to be true. I'm jealous C++ world has an easy to use benchmarking website. You saved me real effort if I want to down port a good RNG for my own purposes. I didn't think Mersenne Twisters were the ideal in 2020 and beyond.

Being technical, Wikipedia alleged intelligence says the Box-Mueller method that you use is limited to values within 6.660 standard deviations of the mean. Full double precision would allow 9.419. I think that's completely fine because if you really care about extreme tail behavior, you know to stack on a different distribution to model it.

Messing with Excel NORM.INV function, I need a 1 in 100 billion chance to be outside 6.66 standard deviations. In proper statistics terminology, that is an alpha/2 Z Score of 6.71 for a ludicrous Confidence Level of 99.999999999% - and by then you have to be concerned with floating point error.

tommyettinger commented 2 years ago

Glad the PR wasn't too hard! I only had to adjust the package, the math was all solid. I think by now you have probably noticed that when including mean and standard deviation into an N(0, 1) distribution, the only change per-random-number that has to made is multiplying by standard deviation and adding mean. Your code does essentially that, but more efficiently (by multiplying only once per two generated values). Users who who want to take an N(0, 1) distribution and change the mean can do so by adding the mean to an N(0, 1) sample (in which case the standard deviation can be left as-is); it's similar for changing standard deviation by multiplying a sample. Changing both only needs to have the right order (multiply then add).

There actually is a nextGaussian() method in java.util.Random, and it also uses Box-Mueller (it also has only a mu and sigma of 0 and 1). The default implementation of IRNG, which is just RNG, has an asRandom() method that can get an object of an java.util.Random subclass that uses the same RandomnessSource; you can call nextGaussian() on this, multiply by std. dev, and add mean even if there was no devoted library code to handle mean and std. dev.

I use a different technique for approximating a Gaussian distribution in newer code, since I usually want to avoid caching part of a result for later (that half-result needs to be serialized like it is part of the state of the random number generator, and if it isn't, then saving and loading RNG states for later computation breaks). The new way I use is a probit() function called on a random double that is exclusive on both 0 and 1; probit() can't get as far away from the mean, but it's fairly quick to compute and needs only one random variate, with no trigonometry. The code for probit() is in MathExtras, in the same huge package as GaussianDistribution. MathExtras needs some cleaning up, but it has some useful code for various tasks, like a modular multiplicative inverse calculator for int and long (mod 2 to the 32, and 2 to the 64, respectively).

aus101 commented 2 years ago

Ah sorry I didn't change the package back. Yes, can use out of the box [0,1) nextDouble() with a multiply then an add but better to officially support versus make the user do the transform for every call.

Probit, that's cool, I know what it means but hadn't seen an implementation in code. So it uses a Remez minmax polynomial most likely initialized with Chebyshev (p/q) polynomial. That's basically what I'm waiting out a MATLAB license to do. Nice that you can avoid a lookup table and be more accurate than textbook Z Score of 2 decimal places.

The rational (p/q) form of order 5,4 lets it be more accurate than a 9 term ax^8+... expression would be (order 9,0) and match asymptotic behavior. I've not seen more than 3 decimal places used in a Z Score in financial research, so may want to consider removing the division and reducing the terms for a probit() specifically for probability estimates.

Serializing a random number generator is not something I ever thought of doing but then you do need to save the half-result if you want true deterministic behavior from a fixed seed.

I saw there is a polar form of Box-Muller but I presume it's slower than the existing trigonometry approach due to the division and having to call it more than 1.0 times per random number.