yellowstonegames / SquidLib

Useful tools for roguelike, role-playing, strategy, and other grid-based games in Java. Feedback is welcome!

Distribution increments #210

Closed aus101 closed 2 years ago

aus101 commented 2 years ago

The existing Gaussian distribution only handles a mean of 0 and a standard deviation of 1, the N(0,1) distribution. That's the most common usage by far, but I added a constructor to supply a different mean and standard deviation. The no-argument constructor remains the N(0,1) distribution.

I studied the Box–Muller transform enough, and tested against the default java.util.Random, to be confident in the addition of mu (mean) and sigma (standard deviation) to the calculations. If you consider the extra multiplication and addition to be significant, it makes sense to have a separate nextDouble without them for N(0,1).

Since any (double) mean and standard deviation are allowed, it makes a lot of sense to be able to read them back with get calls. However, I believe the author will add those through an extended interface.

I'd also like to see a nextDouble that doesn't need an IRNG implementation supplied, and instead provides one under the hood, for easier use by statistics beginners.
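Roughly the shape I have in mind, as a standalone sketch (class and method names here are just illustrative, not the exact SquidLib API, and it leans on java.util.Random rather than IRNG):

```java
import java.util.Random;

/**
 * Sketch of the idea in this PR: a Gaussian distribution that defaults to N(0, 1)
 * but can also be constructed with an arbitrary mean and standard deviation.
 */
public class GaussianSketch {
    private final double mu;    // mean
    private final double sigma; // standard deviation

    public GaussianSketch() {
        this(0.0, 1.0); // the classic N(0, 1)
    }

    public GaussianSketch(double mu, double sigma) {
        this.mu = mu;
        this.sigma = sigma;
    }

    public double getMu() {
        return mu;
    }

    public double getSigma() {
        return sigma;
    }

    /**
     * Basic Box-Muller transform: two uniforms in (0, 1] become one standard normal
     * deviate, which is then scaled by sigma and shifted by mu.
     */
    public double nextDouble(Random rng) {
        double u1 = 1.0 - rng.nextDouble(); // keep u1 in (0, 1] so log() is safe
        double u2 = rng.nextDouble();
        double standard = Math.sqrt(-2.0 * Math.log(u1)) * Math.cos(2.0 * Math.PI * u2);
        return mu + sigma * standard; // the extra multiply and add mentioned above
    }

    public static void main(String[] args) {
        Random rng = new Random(123);
        GaussianSketch claims = new GaussianSketch(5000.0, 400.0);
        for (int i = 0; i < 5; i++) {
            System.out.printf("%.2f%n", claims.nextDouble(rng));
        }
    }
}
```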

tommyettinger commented 2 years ago

This should be handy, for the Gaussian/normal distribution in particular. Math looks correct, but the package seems to have gotten scrambled -- I'd be (only slightly) surprised if this compiles as-is, since the file location doesn't match the package. It's a quick fix, though, and I might be able to do it through the web interface. I don't know what Z Score tables are, but they appear to be frequently-used terminology in statistics, so I'll trust what you wrote about them.

I might be able to go through the history later and figure out what parts of the code were originally from Go's standard library so I can remove them -- they used the Ziggurat method, but it wasn't as efficient here and was very hard to understand.

aus101 commented 2 years ago

Again, my bad on changing the package name. I exported the classes I wanted into my own project rather than importing SquidLib as a whole.

I can explain. You don't have to read this, it's long, but the Z Score is taught for weeks in a classroom. Every entry-level statistics textbook has 2-4 pages of what we'd call lookup tables: one table for Z Scores and one for T Scores. These distributions are symmetric, so the probability Z(1.5) = 1 - Z(-1.5). The tables are also cumulative, being numerical approximations of the integral of the probability density function. The integral of e^(-x^2/2) has no closed form, so we have to approximate the correct probability, which we can do to machine precision, but most people only care about 4 digits.
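To make that concrete, here is a small standalone sketch of what a Z table entry is numerically: the cumulative area Phi(z) under N(0, 1), approximated with the Abramowitz-Stegun erf formula. None of these names are SquidLib's; it's only to show the approximation idea:

```java
/**
 * What a Z Score table stores, computed numerically: Phi(z), the cumulative area
 * under N(0, 1) to the left of z. The integral has no closed form, so this uses the
 * Abramowitz-Stegun approximation of erf (error around 1.5e-7), which covers the
 * 4 digits a printed table gives you.
 */
public class ZTableSketch {
    /** Abramowitz & Stegun formula 7.1.26 for erf(x). */
    static double erf(double x) {
        double sign = Math.signum(x);
        x = Math.abs(x);
        double t = 1.0 / (1.0 + 0.3275911 * x);
        double poly = ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t
                - 0.284496736) * t + 0.254829592) * t;
        return sign * (1.0 - poly * Math.exp(-x * x));
    }

    /** Phi(z): the Z table entry for score z. */
    static double normalCdf(double z) {
        return 0.5 * (1.0 + erf(z / Math.sqrt(2.0)));
    }

    public static void main(String[] args) {
        // Symmetry: Phi(1.5) == 1 - Phi(-1.5)
        System.out.println(normalCdf(1.5));        // ~0.9332
        System.out.println(1.0 - normalCdf(-1.5)); // ~0.9332
        // The 95% confidence boundary used below: 2.5% of the area sits at or below -1.96
        System.out.println(normalCdf(-1.96));      // ~0.025
    }
}
```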

Z Score

Say you're calculating the average insurance claim for a business. You know the mean is supposed to be $5000 (or Euro, etc.) with a standard deviation of $400. Your estimate from that month's data is $4820. You want to know whether that is "statistically significant", i.e. outside the expected range around $5000, or not. If it's below the expected range, your business can expect to profit more! Maybe you can lower member rates. Choose the standard p value of 0.05 for 95% confidence.

In the Z Score table, we look for the value (really the area of probability) 0.025 and find -1.96. That is to say, 2.5% of the area is at that score and below. Looking up the equivalent 0.975 would yield +1.96. The areas on both sides sum to 0.05 = 5%. Equivalently, we expect 95% of our claims to be within 1.96 standard deviations of $5000.

Let's keep member rates where they are if the Z Score we calculate is within (-1.96, +1.96). What this means is, if our soon-to-be-calculated Z Score is -2, +2.8, etc., then the rates we charge are too high (or too low). At least, if we repeated this test 20 times, we'd be right 19/20 = 95% of the time, because bad luck is a thing.

The formula for the Z Score here is (4820 - 5000) / 400 = -0.45.

The general formula is z = (x - μ) / σ, where μ (mu) is the expected mean, σ (sigma) is the expected standard deviation, and x is the value we measured or calculated. You might also see an x with a bar on top (x̄, the sample mean); it takes the place of x when the measurement is an average.

The -0.45 is very safely within our range, so let's keep the null hypothesis that $4820 is not "statistically different" from $5000, given the standard deviation and expected variance we see month to month. Let's not change rates or our expected profit for next month.

What is not explained in stats class is what the (x - μ) / σ transform is doing. It normalizes a value from the N(μ, σ^2) distribution into a point on the N(0, 1) distribution. That's why you only see one table for the Z Score: it's the table for N(0, 1). You can apply this transform to every value you measure to make the whole distribution N(0, 1). In advanced statistics, that's often a requirement for proofs and theorems.
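A tiny standalone sketch of that transform, using the numbers from this example (the extra sample values are made up, and the helper name is just for illustration):

```java
/**
 * The (x - mu) / sigma transform applied to the insurance example above; the extra
 * sample values are invented just to show normalizing a whole batch onto N(0, 1).
 */
public class ZScoreSketch {
    static double zScore(double x, double mu, double sigma) {
        return (x - mu) / sigma;
    }

    public static void main(String[] args) {
        // Single measurement: (4820 - 5000) / 400 = -0.45
        System.out.println(zScore(4820, 5000, 400));

        // Normalizing every measurement maps an N(mu, sigma^2) sample onto N(0, 1),
        // which is why one Z table covers every normal distribution.
        double[] claims = {4820, 5350, 4990, 5600, 4410}; // made-up monthly figures
        for (double c : claims) {
            System.out.printf("%.3f ", zScore(c, 5000, 400));
        }
        System.out.println();
    }
}
```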

We could alternatively look up Z(-0.4) (rounding our -0.45 to the nearest table entry) and find 0.3446. This is the one-sided area, and we're interested in deviations either above or below the mean, so double the area to 0.6892 for the p value. This is way higher than the p value of 0.05, so it is within our expectations and we should not adjust rates.

It's better to account for how many samples you measure. As the sample size increases, the estimate becomes more accurate. It's the same idea as polling voters for political candidates: the greater the sample size, the smaller the +/- % you see, since the estimate approaches the true value you'd get from sampling the entire population. Just divide σ by the square root of the sample size; that's how fast the accuracy increases:

(x – μ) / ( σ / sqrt(n) )

Say we have 2000 paying customers => (-180) / (400 / 44.721) ≈ -20.12. So actually getting that far below $5000 with 2000 samples is an incredible 20 standard deviations, and we would absolutely reject the null hypothesis and raise our expected quarterly profit or something. We can also do the algebra of (1.96 / |-0.45|)^2 from above to find 18.97. We round up to 19 (technically, use a ceil, so 18.12 rounds to 19 as well) and say the same $4820 average from a sample size of just 19 would have been significant enough to reject the null hypothesis.
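The same arithmetic as a throwaway sketch (helper names are made up, only to show the two calculations):

```java
/**
 * The sample-size-adjusted Z Score, z = (xBar - mu) / (sigma / sqrt(n)), plus the
 * back-of-the-envelope minimum sample size from the (1.96 / |z|)^2 algebra above.
 */
public class SampleSizeSketch {
    static double zScoreWithN(double xBar, double mu, double sigma, int n) {
        return (xBar - mu) / (sigma / Math.sqrt(n));
    }

    /** Smallest n at which a single-value z score of this size becomes significant at 95%. */
    static int minSignificantN(double zSingle) {
        return (int) Math.ceil(Math.pow(1.96 / Math.abs(zSingle), 2));
    }

    public static void main(String[] args) {
        // 2000 customers: (4820 - 5000) / (400 / sqrt(2000)) is about -20.1
        System.out.println(zScoreWithN(4820, 5000, 400, 2000));
        // (1.96 / 0.45)^2 is about 18.97, so 19 samples would already have been enough.
        System.out.println(minSignificantN(-0.45));
    }
}
```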

In this case you can measure all customers with software automatically tracking payouts for claims, but what about polling the population of voters in an election? It's too expensive to poll a million people, let alone the whole population. Statistically, you really only need a sample size of a few thousand to be within, say, 3% of the true value, thanks to the 1/sqrt(n) factor.

T Score

That brings up the T Score. We usually do not know the true mean or true standard deviation of what we are measuring. If we poll a candidate at 45%, the true value could be 48%, right? The average insurance claim could be seasonal, and we may never know the exact true value. What we do here is use the T Distribution. It's like the Gaussian but with wider tails, to account for us being less sure of the population's mean versus what we are measuring.

It's the same calculation with 1/sqrt(n), except the chart you use has a degrees-of-freedom part. Take the sample size minus 1; that's your degrees of freedom. Use that line, or the nearest one, in the chart. You'll notice the variance calculation from a sample also uses an (n - 1) factor; it's there to avoid bias from small n. You'll also see in a T Score chart that the higher the degrees of freedom (the greater the sample size), the lower the score. This means the bell curve gets narrower, and at some high n value of, say, 80 or 1000 or whatever a textbook tells you, the T Distribution converges to the Gaussian Distribution, so you can use that instead and not care about degrees of freedom anymore. As in, we are so sure of our estimated mean that it is statistically the same as the true value you'd use with a Gaussian distribution.
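If you want to see that convergence without any tables, here is a quick standalone simulation sketch (not SquidLib code): it builds t-distributed samples as Z / sqrt(chiSq / df) and checks how often they land beyond +/- 1.96, which approaches the Gaussian's roughly 5% as df grows:

```java
import java.util.Random;

/**
 * A t-distributed value with df degrees of freedom can be simulated as
 * Z / sqrt(chiSq / df), where Z is standard normal and chiSq is a sum of df squared
 * standard normals. As df grows, chiSq / df concentrates around 1, so the ratio is
 * effectively just Z again -- the T Distribution collapses onto the Gaussian.
 */
public class TDistributionSketch {
    static double nextT(Random rng, int df) {
        double z = rng.nextGaussian();
        double chiSq = 0.0;
        for (int i = 0; i < df; i++) {
            double g = rng.nextGaussian();
            chiSq += g * g;
        }
        return z / Math.sqrt(chiSq / df);
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        int trials = 100_000;
        // Fraction of samples beyond +/-1.96; the Gaussian answer is roughly 5%.
        for (int df : new int[]{2, 18, 100}) {
            int beyond = 0;
            for (int i = 0; i < trials; i++) {
                if (Math.abs(nextT(rng, df)) > 1.96) {
                    beyond++;
                }
            }
            System.out.printf("df=%d: %.2f%% of samples beyond 1.96%n", df, 100.0 * beyond / trials);
        }
    }
}
```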

What you're actually looking up in a T Score table is the critical score, not the probability. For our 95% confidence (the 0.025-per-tail column we used above for the Z Score's p value), we'd need a T Score of about +/- 2.101 standard deviations from the mean with 18 df to reject the null hypothesis with a sample size of 19. That's slightly higher and harder to reach than the +/- 1.96 for the Z Score, because the wider tails account for the extra uncertainty; the Z cutoff doesn't get adjusted by sample size since it assumes the exact mean and standard deviation are known. Note that at a high df of 100, this T Score shrinks to +/- 1.984. Lay this T probability distribution on top of a Gaussian and they'd look the same.