xflouris / libpll

Phylogenetic Likelihood Library
GNU Affero General Public License v3.0

[fix_dev] Fix persite functions #75

Closed: xflouris closed this issue 8 years ago

xflouris commented 8 years ago

Fix the per-site log-likelihood functions to account for tip patterns and for the new way of handling mixture models.

xflouris commented 8 years ago

Currently the per-site log-likelihood function is in fact a per-site-per-rate log-likelihood function. It computes incorrect per-site-per-rate values, as it does not (and cannot) account for scaling: scalers are per-site, not per-site-per-category.
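For context, the relationship between the two quantities: per-category likelihoods are averaged on the linear scale before the log is taken, and a per-site scaler cancels out of that average but cannot be attributed to individual categories. A minimal sketch of the combination step (not libpll API; assumes equally weighted rate categories):

```c
#include <math.h>
#include <stdio.h>

/* Minimal sketch (not libpll API): combining per-site-per-category
 * likelihoods into one per-site log-likelihood. Assumes the common case of
 * equally weighted rate categories (e.g. discrete GAMMA). */
static double persite_logl(const double * cat_lh, unsigned int rate_cats)
{
  double site_lh = 0;
  for (unsigned int c = 0; c < rate_cats; ++c)
    site_lh += cat_lh[c] / rate_cats;  /* average on the linear scale */
  return log(site_lh);                 /* log taken only after averaging */
}

int main(void)
{
  /* example per-category likelihoods for one site (made-up numbers) */
  double cat_lh[4] = { 1e-10, 1e-14, 1e-20, 1e-28 };
  printf("per-site logL = %f\n", persite_logl(cat_lh, 4));
  return 0;
}
```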

The function will now be changed to compute per-site values only, and will be incorporated into the pll_edge_loglikelihood(...) function (the original dedicated function will be removed to reduce code duplication).

Function pll_edge_loglikelihood(...) will now receive an additional parameter (persite_lh): if set to NULL it is ignored; otherwise it is filled with per-site log-likelihoods (hence it must be pre-allocated, with one entry per site).
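A usage sketch of the proposed interface; everything except the trailing persite_lh parameter (the helper function, names, and exact argument order) is an assumption for illustration, not the final API:

```c
#include <stdio.h>
#include <stdlib.h>
#include "pll.h"

/* Sketch of the proposed interface: the argument list of
 * pll_edge_loglikelihood below is assumed, not final. */
void print_persite_logl(pll_partition_t * partition, unsigned int sites,
                        unsigned int parent_clv, int parent_scaler,
                        unsigned int child_clv, int child_scaler,
                        unsigned int matrix_index,
                        const unsigned int * freqs_indices)
{
  /* caller pre-allocates one double per site, as described above */
  double * persite_lh = malloc(sites * sizeof(double));

  double logl = pll_edge_loglikelihood(partition,
                                       parent_clv, parent_scaler,
                                       child_clv, child_scaler,
                                       matrix_index, freqs_indices,
                                       persite_lh); /* or NULL to skip */

  printf("total log-likelihood: %f\n", logl);
  for (unsigned int i = 0; i < sites; ++i)
    printf("site %u: %f\n", i, persite_lh[i]);

  free(persite_lh);
}
```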

amkozlov commented 8 years ago

BTW how about implementing per-category scalers?

It seems to make sense, at least for GAMMA: https://github.com/xflouris/libpll/issues/44

xflouris commented 8 years ago

@amkozlov : Do you need per-category scalers to obtain per-site-per-category log-likelihoods, or for some new way of scaling? If it's the former, then I don't think it makes sense to have them in the higher-level (non-core, partition-level) LLPLL API, because they would a) increase memory usage 'rates'-fold and b) slow the log-likelihood function down quite a lot, all to satisfy a scenario that is not very common. However, it will be possible to compute per-site-per-category log-likelihoods using the equivalent core function for pll_compute_{edge,root}_loglikelihood (which is not yet implemented) by calling it once for each rate.

If it's the latter, then we can discuss it in the meetings.

amkozlov commented 8 years ago

@xflouris: I guess it's neither of the two. There is a well-known, long-standing bug in RAxML: analyses under GAMMA often crash on very large trees (say, > 10000 taxa) when the GAMMA rates become too distant and one scaler no longer fits all categories. As far as I understand, LLPLL has the same problem, as it uses a single per-site scaler.
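A toy illustration of the failure mode, under an assumed decay model and example rate values (nothing here is measured from RAxML or LLPLL):

```c
#include <math.h>
#include <stdio.h>

/* Toy model (assumed, not measured): a CLV entry for rate category r decays
 * roughly like exp(-r * T), with T the accumulated path length. A per-site
 * scaler can only fire once ALL categories are small, so the fast category
 * underflows to zero long before the slow one triggers any scaling. */
int main(void)
{
  const double r_slow = 0.03, r_fast = 3.0; /* example extreme GAMMA rates */
  for (double T = 100; T <= 100000; T *= 10)
    printf("T = %6.0f   slow = %9.3e   fast = %9.3e\n",
           T, exp(-r_slow * T), exp(-r_fast * T));
  return 0;
}
```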

We had an extensive discussion on that with Diego and Zihen last year; here is the link again, in case you missed it: https://github.com/xflouris/libpll/issues/44

Back then I implemented the multi-scaler solution (described in the issue above) in the RAxML branch I use for SATIVA, and it seems to work well. In terms of performance/memory, the effect was minimal:

  • memory consumption is dominated by CLVs (DNA: CLV 16 doubles, scaler 1 or 4 doubles - doesn't really matter; let alone AA)
  • there is not much additional computation needed: maybe more CMPs but potentially fewer MULs. If I remember correctly, I measured a slowdown of ~5% on my dataset, but this might not be representative.

The only problem is that you can't use RAxML-style "fast scaling" (without a scaling array) anymore, but you don't have that in the library anyway, right?

We can revisit this in the next LLPLL meeting if you like.

stamatak commented 8 years ago

This is an important issue, especially for single-gene 16S datasets with thousands of taxa, so we should have a way of dealing with it. The original numerical problem was first described in:

http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-12-470

alexis


xflouris commented 8 years ago

@amkozlov : Thanks, I'm aware of https://github.com/xflouris/libpll/issues/44. So indeed what you propose is a new way of scaling. "Fast scaling" is not present in the library, so that would not be a problem.

Perhaps I misunderstood something in your first bullet, but scalers in PLL are integers, not doubles. The memory consumption (in the case of 4 states with GTR+GAMMA) would rise from 3.125% to 12.5% of the total CLV vector size.
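For reference, a back-of-the-envelope check of these percentages:

```c
#include <stdio.h>

/* Sanity check of the figures above
 * (DNA under GTR+GAMMA: 4 states x 4 rate categories). */
int main(void)
{
  size_t clv_per_site    = 4 * 4 * sizeof(double); /* 128 bytes per site */
  size_t scaler_per_site = 1 * sizeof(int);        /*   4 bytes per site */
  size_t scaler_per_cat  = 4 * sizeof(int);        /*  16 bytes per site */

  printf("per-site scaler:      %6.3f%% of CLV size\n",
         100.0 * scaler_per_site / clv_per_site);  /* prints  3.125 */
  printf("per-category scalers: %6.3f%% of CLV size\n",
         100.0 * scaler_per_cat / clv_per_site);   /* prints 12.500 */
  return 0;
}
```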

I'm not sure a 5% slowdown is realistic, but I have never tried it, so I can't offer a firm opinion. However, the MULs will not decrease (unless you have non-vectorized versions in mind) and the number of CMPs will remain the same. Resolving which entries need to be multiplied by a scaler in order to correctly set the multiplier vector (I have the vectorized version in mind) will be harder (and slower), so I'd be a bit skeptical about that slowdown figure.

Then there will be some slowdown in the evaluation of the likelihood at the root, but there I believe a 5% slowdown relative to the original function is realistic.

Perhaps we can offer two methods of scaling - the current one with one scaler per site, and a new one with one scaler per category - and allow the user to select the type of scaling via the partition attributes. But this will require some very careful planning on how to implement it.

We can defer this to https://github.com/xflouris/libpll/issues/44 and the PLL meetings.

amkozlov commented 8 years ago

> Perhaps I misunderstood something in your first bullet, but scalers in PLL are integers, not doubles. The memory consumption (in the case of 4 states with GTR+GAMMA) would rise from 3.125% to 12.5% of the total CLV vector size.

Sure, they are ints, sorry.

> Perhaps we can offer two methods of scaling - the current one with one scaler per site, and a new one with one scaler per category - and allow the user to select the type of scaling via the partition attributes. But this will require some very careful planning on how to implement it.

Maybe we can have a PLL_ATTRIB_PER_RATE_SCALER partition attribute to switch between the implementations?
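A sketch of how such a flag might be used; PLL_ATTRIB_PER_RATE_SCALER is only the attribute proposed here (it does not exist in the library), while pll_partition_create and PLL_ATTRIB_ARCH_AVX follow the existing libpll API:

```c
#include "pll.h"

/* PLL_ATTRIB_PER_RATE_SCALER is the attribute proposed above; it does not
 * exist in the library, so this is only a sketch of the intended usage. */
pll_partition_t * create_partition_with_per_rate_scalers(void)
{
  unsigned int attributes = PLL_ATTRIB_ARCH_AVX         /* existing flag */
                          | PLL_ATTRIB_PER_RATE_SCALER; /* proposed flag */

  return pll_partition_create(5,    /* tips */
                              4,    /* CLV buffers (inner nodes) */
                              4,    /* states (DNA) */
                              1000, /* sites */
                              1,    /* rate matrices */
                              8,    /* probability matrices (branches) */
                              4,    /* rate categories (GAMMA) */
                              4,    /* scale buffers */
                              attributes);
}
```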

Here is my implementation; it's vectorized, but not thoroughly optimized and a bit messy: https://github.com/amkozlov/raxml-sativa/commit/d94f12a29fcb6bcf872218838c3b830c13642862#diff-c08d081bec6083c788f123eebeced25e

xflouris commented 8 years ago

Per-site-per-category log-likelihoods deferred to https://github.com/xflouris/libpll/issues/44