xflouris / libpll

Phylogenetic Likelihood Library
GNU Affero General Public License v3.0

Possible model extensions #101

Open stamatak opened 7 years ago

stamatak commented 7 years ago

Hello, I was wondering whether it is among your plans (and whether it is possible) to add protein mixture models to RAxML, particularly C10-C60 (http://bioinformatics.oxfordjournals.org/content/24/20/2317.short) and UL, EX, EHO (http://rstb.royalsocietypublishing.org/content/363/1512/3965.short). The aim, if feasible, is to be able to apply the mixture models in conjunction with protein GTR (estimated separately for each profile, or universally) or with different substitution matrices. Also, are there any plans to add free-rate heterogeneity as an option? Thanks a lot.

I think implementing UL, EX, and EHO might be a good idea (Olivier Gascuel liked those models). Regarding free rates, I am not sure whether they are already implemented.

ddarriba commented 7 years ago

Rate categories are already free in the library. In fact, the partition has no shape parameter, only the number of discrete rate categories and their values.
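For readers unfamiliar with the distinction, here is a minimal Python sketch of what "free" rate categories amount to: a list of rates and a list of weights, with no shape parameter generating them. This is not libpll code. The function names are hypothetical, and rescaling to a weighted mean rate of 1 is a common convention assumed here, not something stated in this thread.

```python
def normalize_free_rates(rates, weights):
    """Rescale so the weights sum to 1 and the weighted mean rate equals 1.

    Assumption: mean-rate-1 normalization is a common convention that keeps
    branch lengths interpretable; it is not taken from this thread.
    """
    total_weight = sum(weights)
    weights = [w / total_weight for w in weights]
    mean_rate = sum(r * w for r, w in zip(rates, weights))
    rates = [r / mean_rate for r in rates]
    return rates, weights


def free_parameter_count(n_categories):
    """K rates plus K weights, minus one constraint on each set."""
    return 2 * n_categories - 2


# Four categories given directly by their values, no gamma shape involved.
rates, weights = normalize_free_rates([0.1, 0.5, 1.5, 4.0], [1, 1, 1, 1])
```

Under these assumptions, K free categories carry 2K - 2 free parameters, whereas a discrete gamma model derives all K rates from a single shape parameter.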

stamatak commented 7 years ago

:-)

alexis

On 27.07.2016 10:20, ddarriba wrote:

Rate categories are already free in the library. In fact there is no shape parameter in the partition but just the number of discrete rate categories and their values.


Alexandros (Alexis) Stamatakis

Research Group Leader, Heidelberg Institute for Theoretical Studies Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology Adjunct Professor, Dept. of Ecology and Evolutionary Biology, University of Arizona at Tucson

www.exelixis-lab.org

bqminh commented 7 years ago

I already implemented all these models in IQ-TREE. One has to reimplement the likelihood kernel for these models. For model parameter optimization one can employ an EM algorithm; for estimating mixture weights, the EM algorithm is guaranteed to find optimal solutions. So if you need help, let me know.

Minh

amkozlov commented 7 years ago

@bqminh: thanks for offering your help!

Do I understand correctly that for the UL/EX/EHO models the rates and weights are fixed:

http://www.atgc-montpellier.fr/download/datasets/models/mix_RatesProps.txt

so there are actually no parameters to optimize?

bqminh commented 7 years ago

That's right, these models have default values for rates and weights. However, one should offer the option to optimize the weights (while keeping the rates fixed); I observed significant gains in likelihood. Moreover, there is a special PhyML version (very slow) that also allows optimizing the weights. As I noted, the EM algorithm can be used for this purpose.
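A minimal Python sketch of the weight optimization described above, assuming the likelihood of every site under each fixed mixture component has already been computed by some likelihood kernel. The names `site_likelihoods`, `em_update`, and `optimize_weights` are hypothetical; this is neither IQ-TREE nor PhyML code.

```python
import math


def mixture_log_likelihood(site_likelihoods, weights):
    """Sum over sites of log(sum_k w_k * L_k(site))."""
    return sum(
        math.log(sum(w * l for w, l in zip(weights, row)))
        for row in site_likelihoods
    )


def em_update(site_likelihoods, weights):
    """One EM step: average the posterior component probabilities over sites."""
    n_sites = len(site_likelihoods)
    new_weights = [0.0] * len(weights)
    for row in site_likelihoods:
        joint = [w * l for w, l in zip(weights, row)]
        total = sum(joint)
        for k, j in enumerate(joint):
            new_weights[k] += j / total / n_sites
    return new_weights


def optimize_weights(site_likelihoods, weights, max_iters=1000, tol=1e-10):
    """Iterate EM steps until the log-likelihood gain drops below tol."""
    loglik = mixture_log_likelihood(site_likelihoods, weights)
    for _ in range(max_iters):
        candidate = em_update(site_likelihoods, weights)
        candidate_loglik = mixture_log_likelihood(site_likelihoods, candidate)
        converged = candidate_loglik - loglik < tol
        weights, loglik = candidate, candidate_loglik
        if converged:
            break
    return weights, loglik


# Rows are sites, columns are fixed components (toy numbers, not real data).
toy = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.7], [0.85, 0.1]]
fitted_weights, fitted_loglik = optimize_weights(toy, [0.5, 0.5])
```

Because only the ratio of entries within a row matters in `em_update`, each row may carry its own scaling factor, which is convenient when site likelihoods are stored in scaled form. The sketch ignores site-pattern compression; with pattern counts, each site's contribution would be multiplied by its count.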

Minh

ziheng-yang commented 7 years ago

here are some random comments. the discrete-rate model has been in paml/baseml since 1994. it is described in YANG, Z., 1995 A space-time process model for the evolution of DNA sequences. Genetics 139: 993-1005. table 2 has some real data results. i use BFGS so that the optimisation is similar to that for the discrete gamma model. if you estimate both the frequencies and the rates as free parameters, you can't fit many categories (like 5 or 6) in real data analysis, but that may be because i tested using small datasets without many sequences in the alignment.

i think that if the interest is in the phylogeny and branch lengths, there is not that much difference among the different rate models.

also my impression is that EM is inefficient as an optimisation algorithm.

best, ziheng

At 13:39 31/07/2016 -0700, Bui Quang Minh wrote:

that's right, these models have default values for rates and weights. However, one should give a possibility to optimize the weights (while still fixing rates). I observed significant gain in likelihoods. Moreover, there is special PhyML version (very slow), which also allows to optimize weights. As I noticed, the EM algorithm can be used for this purpose.

Minh


stamatak commented 7 years ago

dear ziheng,

many thanks for your insights, these pretty much reflect my intuition about the problem.

all the best,

alexis



bqminh commented 7 years ago

Hi Ziheng,

thanks for your comments! please see my replies below,

On Aug 1, 2016, at 12:45 PM, ziheng-yang notifications@github.com wrote:

> here are some random comments. the discrete-rate model is in paml/baseml since 1994. this is described in YANG, Z., 1995 A space-time process model for the evolution of DNA sequences. Genetics 139: 993-1005.

Yes, I know your paper. It is quite interesting that some authors later reintroduced this model (e.g. http://mbe.oxfordjournals.org/content/29/11/3345.full), apparently unaware of your paper. I like the model because it does not assume any distribution.

> table 2 has some real data results. i use BFGS so that the optimisation is similar to the discrete gamma model. if you estimate both the frequencies and the rates as free parameters, you can't fit many categories (like 5 or 6) in real data analysis, but that may be because i tested using small datasets without many sequences in the alignment.

It is exactly with big data sets that the two models give different results. I can show you the data once our paper is published.

> i think that if the interest is in the phylogeny and branch lengths, there is not that much difference among the different rate models.

> also my impression is that EM is inefficient as an optimisation algorithm.

This was also my thought at the beginning. I originally implemented the BFGS algorithm, but with simulated data (very long alignments) we observed that it sometimes fails to find the true rates and weights, which is weird. Afterward I implemented the EM algorithm, and it always found the true estimates. That’s why I switched to the EM algorithm.

Note that BFGS and EM are both local optimization methods, so one can never be sure that the optimal estimates have been reached.
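To illustrate the point about local optimization, here is a toy Python sketch that restarts EM from several random starting rates and keeps the best log-likelihood. Everything in it is an assumption made for illustration: a Poisson mixture over per-site counts stands in for the phylogenetic likelihood, the function names are hypothetical, and multiple restarts are one common mitigation rather than something proposed in this thread.

```python
import math
import random


def poisson_pmf(x, lam):
    return math.exp(-lam) * lam ** x / math.factorial(x)


def em_rates_and_weights(counts, rates, weights, iters=200):
    """Toy EM estimating both the rates and the weights of a Poisson mixture."""
    n_cat = len(rates)
    for _ in range(iters):
        # E-step: posterior probability of each category for each site.
        post = []
        for x in counts:
            joint = [w * poisson_pmf(x, r) for w, r in zip(weights, rates)]
            total = sum(joint)
            post.append([j / total for j in joint])
        # M-step: closed-form updates for weights and rates.
        mass = [max(sum(p[k] for p in post), 1e-300) for k in range(n_cat)]
        weights = [m / len(counts) for m in mass]
        rates = [
            sum(p[k] * x for p, x in zip(post, counts)) / mass[k]
            for k in range(n_cat)
        ]
    loglik = sum(
        math.log(sum(w * poisson_pmf(x, r) for w, r in zip(weights, rates)))
        for x in counts
    )
    return loglik, rates, weights


def multi_start(counts, n_categories, n_starts, seed=0):
    """Run EM from several random starting rates; return the best result."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_starts):
        rates = [rng.uniform(0.1, 10.0) for _ in range(n_categories)]
        weights = [1.0 / n_categories] * n_categories
        result = em_rates_and_weights(counts, rates, weights)
        if best is None or result[0] > best[0]:
            best = result
    return best


# Toy per-site counts with a slow and a fast group of sites.
toy_counts = [0, 1, 0, 2, 1, 8, 9, 7, 0, 1, 10, 8]
best_loglik, best_rates, best_weights = multi_start(toy_counts, 2, 5)
```

Restarts do not guarantee that the global optimum is found; they only reduce the chance of reporting a poor local one.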

Minh


Bui Quang Minh Center for Integrative Bioinformatics Vienna (CIBIV) Campus Vienna Biocenter 5, VBC5, Ebene 1 A-1030 Vienna, Austria Phone: ++43 1 4277 74326 Email: minh.bui (AT) univie.ac.at