xflouris / libpll

Phylogenetic Likelihood Library
GNU Affero General Public License v3.0

Per-category scalers #44

Closed. ddarriba closed this issue 8 years ago.

ddarriba commented 9 years ago

If the shape parameter of the Gamma distribution is low enough, the likelihoods of different categories may require different amounts of scaling. That means that at some point some per-category likelihoods are very likely to underflow. We can keep one individual scaler for each site and category (this requires 'c' times more memory for the scalers, where 'c' is the number of rate categories), and the likelihood computation is modified as follows:

$$L_i = K^{-s_{i,\min}} \sum_{j=1}^{c} \frac{1}{c}\, LK_{i,j}\, K^{-(s_{i,j} - s_{i,\min})}$$

where $s_{i,j}$ is the scaler for site $i$ and category $j$, $s_{i,\min} = \min_j s_{i,j}$, $LK_{i,j}$ is the scaled conditional likelihood of site $i$ under category $j$, and $K$ is the likelihood scaling threshold.

Apart from the additional memory required, which is not excessive compared to the CLVs, we need to check whether each scaled term in the sum (i.e., each term with s_{i,j} > s_{i,min}) may or may not fall below precision. If we assume that K is fixed to 2^256, then a scaler difference >= 2 is enough to discard the term. However, we might want to implement this in a more generic way and decide what the highest supported difference between scalers should be. For example, we could change the scaling threshold at some point for testing purposes.
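A minimal sketch of this summation in C, assuming a site-major layout where `scalers[i*c + j]` holds s_{i,j} and `clv_lh[i*c + j]` holds LK_{i,j}; all names (`site_loglikelihood`, `clv_lh`, `MAX_SCALER_DIFF`, ...) are illustrative and not part of the libpll API:

```c
#include <math.h>

#define K_EXP           256  /* K = 2^256, the scaling threshold assumed above   */
#define MAX_SCALER_DIFF 2    /* s_{i,j} - s_{i,min} >= 2 => term below precision */

/* hypothetical helper: per-site log-likelihood with per-category scalers */
double site_loglikelihood(const double * clv_lh,
                          const unsigned int * scalers,
                          const double * weights,
                          unsigned int site, unsigned int c)
{
  /* find the minimal scaler s_{i,min} over all categories of this site */
  unsigned int s_min = scalers[site * c];
  for (unsigned int j = 1; j < c; ++j)
    if (scalers[site * c + j] < s_min)
      s_min = scalers[site * c + j];

  double sum = 0.0;
  for (unsigned int j = 0; j < c; ++j)
  {
    unsigned int diff = scalers[site * c + j] - s_min;
    if (diff >= MAX_SCALER_DIFF)
      continue;  /* term cannot contribute within double precision: discard */

    /* undo the relative scaling: multiply by K^(-diff); a diff of 1 must be
       kept, since the stored values span (2^-256, 1] and such a term can
       still be comparable to the s_{i,min} term */
    sum += weights[j] * clv_lh[site * c + j] * ldexp(1.0, -(int)(K_EXP * diff));
  }

  /* undo the common scaling s_{i,min} in log space */
  return log(sum) - (double) s_min * K_EXP * log(2.0);
}
```

With four equiprobable Gamma categories, `weights[j]` would simply be 0.25; the `diff >= 2` cutoff matches the observation above that, for K = 2^256, only the categories at or one step above the minimal scaler can contribute.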

stamatak commented 9 years ago

for the sake of completeness:

the problem was first described here:

http://www.biomedcentral.com/1471-2105/12/470

an alternative solution for discarding (and saving more computations) would be to use a bit vector for each discrete rate and each CLV that can be set if the value shall be discarded because it is too small ...

alexis
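A minimal sketch of this bit-vector idea, with one bit per (site, rate) pair; the layout and names are illustrative, not actual libpll code:

```c
#include <stdlib.h>

/* hypothetical discard flags: bit (site, rate) is set when that per-rate
   CLV entry has underflowed and downstream kernels may skip it */
typedef struct
{
  unsigned int rates;
  unsigned char * bits;  /* ceil(sites * rates / 8) bytes */
} discard_bv_t;

static discard_bv_t * discard_bv_create(unsigned int sites, unsigned int rates)
{
  discard_bv_t * bv = malloc(sizeof(*bv));
  bv->rates = rates;
  bv->bits  = calloc((sites * rates + 7) / 8, 1);  /* all bits start unset */
  return bv;
}

static void discard_bv_set(discard_bv_t * bv, unsigned int site, unsigned int rate)
{
  unsigned int idx = site * bv->rates + rate;
  bv->bits[idx / 8] |= (unsigned char)(1u << (idx % 8));
}

static int discard_bv_get(const discard_bv_t * bv, unsigned int site, unsigned int rate)
{
  unsigned int idx = site * bv->rates + rate;
  return (bv->bits[idx / 8] >> (idx % 8)) & 1;
}
```

The update step would set the bit whenever a per-rate entry falls below the threshold, and the likelihood kernels would test it and skip the corresponding multiplications, which is where the extra computation would be saved.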


amkozlov commented 9 years ago

Please find my experimental implementation here:

https://gist.github.com/amkozlov/84eec65bcc4b22e48e88

(unfortunately it's not possible to attach files here - how stupid is this??)

The main changes are in the following functions:

- `newviewGTRGAMMA_AVX()`: compute & store individual per-rate scalers
- `evaluateGTRGAMMA()`: use individual scalers to compute the tree logLH
- `sumGAMMA()`: use individual scalers to pre-compute the sum[] vector for the derivatives

This is a very naive implementation and can be optimized further, but even now it is only about 5% slower than standard RAxML on my dataset (~1000 taxa, ~2500 sites, full tree search under GAMMA).

Of note, the resulting logLH scores I obtained with this method and with the "old" one are significantly different (-159510.676606 vs. -159402.082011), which means that either there were a lot of "underflow losses" with the old approach, or I've just screwed something up :)

So I'd appreciate it if you could peer-review the code...

ziheng-yang commented 9 years ago

"If the shape of the Gamma distribution is low enough, it might happen that the likelihood scores of different categories require different scaling. That means that at some point it is very likely that some per-category likelihoods go under precision."

at the risk of not knowing what you guys are talking about, why do you not just ignore the probability (it is not called likelihood) for the site and category if it goes under precision? the probability for a site is a sum (average) over the categories, and if the contribution from one category is nearly 0, much much smaller than the contributions from other categories, you simply treat it as 0, and there is no need to calculate it precisely. or am I missing something? ziheng


amkozlov commented 9 years ago

Dear Ziheng,

of course, this was my (and not only my) first thought as well. However, as we go up the tree and compute the conditionals, it might happen that at some internal node X there is not a single category with probability values above precision in both the left (L) and right (R) child vectors.

E.g. something like this:

|      | r1 | r2 | r3 | r4 |
|------|----|----|----|----|
| L    | 0  | 1  | 0  | 0  |
| R    | 1  | 0  | 1  | 0  |

where "0" denotes an "underflowed" category (= all zeros), and "1" - a category for which the probabilities are non-zero. Obviously, after multiplication we will get 16 zeros in the probability vector at the node X.

And actually I would not bother going into all this if I didn't have a real dataset at hand that exhibits this problem: I was unable to compute its GAMMA logLH value using the implementation we currently have in RAxML/ExaML/PLL.

So in the end, in my implementation I still use the same observation you made: contributions of the categories with probabilities near 0 (i.e., with a scaler value greater than s_min) are relatively small. However, I do two additional things to avoid the aforementioned pitfall:

1) the comparison of scalers, and thus the decision on which categories to ignore, is deferred until we are at the virtual root (in the evaluateGTRGAMMA() or sumGAMMA() function, see above); a sketch of the corresponding traversal-side scaling follows below

2) the categories with s_i > s_min are not completely ignored if the difference is reasonably small: e.g., if s1 = s_min = 1 and s2 = 2, the actual probabilities of the two categories might still be well comparable, depending on the values of LK1 and LK2 (see the formula in the original post)

Hope this explains our rationale a bit...
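For point 1), per-category scaling during the tree traversal could look roughly like this; a hedged sketch assuming a (site, rate, state) CLV layout, with illustrative names rather than the actual RAxML/libpll kernels:

```c
#include <math.h>

#define STATES 4  /* DNA */

static void scale_per_rate(double * clv, unsigned int * scalers,
                           unsigned int sites, unsigned int rates)
{
  const double threshold = ldexp(1.0, -256);  /* rescale below 2^-256 */
  const double factor    = ldexp(1.0,  256);  /* K = 2^256            */

  for (unsigned int i = 0; i < sites; ++i)
    for (unsigned int j = 0; j < rates; ++j)
    {
      double * v = clv + (i * rates + j) * STATES;

      /* rescale this category only if all of its entries are tiny */
      int tiny = 1;
      for (unsigned int k = 0; k < STATES; ++k)
        tiny &= (fabs(v[k]) < threshold);

      if (tiny)
      {
        for (unsigned int k = 0; k < STATES; ++k)
          v[k] *= factor;
        scalers[i * rates + j]++;  /* true value = stored value * K^-s */
      }
    }
}
```

The scalers are then only compared at the virtual root, as described in point 1), so no information is lost during the traversal no matter how far apart the categories drift.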

ziheng-yang commented 9 years ago

i actually am using different scalars for different categories in my code. i think it is possible to use one set of scalars for all categories, and dynamically change the scalars during the computation if necessary, but the effort would be too great for it to be worthwhile. anyway i haven't had the need to think about this in the past 15 years at least, and i never dealt with very large datasets. all in all my comments are fairly random.

is the concern the memory requirement for the scalars?

> of course, this was my (and not only my) first thought as well. However, as we go up the tree and compute the conditionals, it might happen that at some internal node X there is not a single category with probability values above precision in both the left (L) and right (R) child vectors.
>
> E.g. something like this:
>
> |      | r1 | r2 | r3 | r4 |
> |------|----|----|----|----|
> | L    | 0  | 1  | 0  | 0  |
> | R    | 1  | 0  | 1  | 0  |
>
> where "0" denotes an "underflowed" category (= all zeros) and "1" a category for which the probabilities are non-zero. Obviously, after multiplication we will get 16 zeros in the probability vector at node X.

i think you multiply L & R for each category, not across categories, so you should get four 0s in your example. i suppose if you use separate scalars for the 4 categories, none of the 0s should occur.

> And actually I would not bother going into all this if I didn't have a real dataset at hand that exhibits this problem: I was unable to compute its GAMMA logLH value using the implementation we currently have in RAxML/ExaML/PLL.
>
> So in the end, in my implementation I still use the same observation you made: contributions of the categories with probabilities near 0 (i.e., with a scaler value greater than s_min) are relatively small. However, I do two additional things to avoid the aforementioned pitfall:
>
> 1) the comparison of scalers, and thus the decision on which categories to ignore, is deferred until we are at the virtual root (in the evaluateGTRGAMMA() or sumGAMMA() function, see above)
>
> 2) the categories with s_i > s_min are not completely ignored if the difference is reasonably small: e.g., if s1 = s_min = 1 and s2 = 2, the actual probabilities of the two categories might still be well comparable, depending on the values of LK1 and LK2 (see the formula in the original post)

why do you have to make a decision as to whether to ignore? if you calculate a + b, where a = 0.1 and b = 1e-100, you will get 0.1. what do you save by checking whether b is small enough to be ignored? we are adding and not taking logs, so there is no need to know whether b is 0 or very small.

ziheng

> Hope this explains our rationale a bit...


amkozlov commented 9 years ago

> i actually am using different scalars for different categories in my code. i think it is possible to use one set of scalars for all categories, and dynamically change the scalars during the computation if necessary, but the effort would be too great for it to be worthwhile. anyway i haven't had the need to think about this in the past 15 years at least, and i never dealt with very large datasets.

Yes, this issue only comes up on datasets that are large in terms of the number of taxa (mine has ~10000 sequences).

> is the concern the memory requirement for the scalars?

not really; as Diego mentioned above, memory requirements are dominated by the CLVs (prob vectors) by a large margin

>> E.g. something like this:
>>
>> |      | r1 | r2 | r3 | r4 |
>> |------|----|----|----|----|
>> | L    | 0  | 1  | 0  | 0  |
>> | R    | 1  | 0  | 1  | 0  |
>>
>> where "0" denotes an "underflowed" category (= all zeros) and "1" a category for which the probabilities are non-zero. Obviously, after multiplication we will get 16 zeros in the probability vector at node X.

> i think you multiply L & R for each category, not across categories, so you should get four 0s in your example.

Yes, that's right. I was actually referring to 16 vector entries / probabilities (4 categories x 4 DNA states), but this was misleading, sorry.

> i suppose if you use separate scalars for the 4 categories, none of the 0s should occur.

Correct.

> why do you have to make a decision as to whether to ignore? if you calculate a + b, where a = 0.1 and b = 1e-100, you will get 0.1. what do you save by checking whether b is small enough to be ignored? we are adding and not taking logs, so there is no need to know whether b is 0 or very small.

yes, but we can save some computations: if we know that b will be below precision, we don't have to undo the scaling for b (i.e., we don't need to multiply it with K^(s_i - s_min)). But my main point here was that we use individual per-category scalers and preserve all probabilities during the tree-traversal stage (in the newview() function), no matter how small they get, and then compare the scalers at the virtual root while computing the final tree likelihood.