xflouris / libpll

Phylogenetic Likelihood Library
GNU Affero General Public License v3.0
27 stars 6 forks source link

Change of data types #22

Open xflouris opened 9 years ago

xflouris commented 9 years ago

Perhaps now it would be a good idea to decide the data types of the following pll_partition_t variables:

type Variable Role Suggested type
int tips Number of tip sequences unsigned long
int clv_buffers Number of CLV buffers (for inner nodes) unsigned long
int states Number of states int
int sites Number of sites in partition unsigned long
int rate_matrices Number of different rate matrices that will be used unsigned int
int prob_matrices Number of probability matrices to allocate unsigned long
int rate_cats Number of rate categories unsigned int
int scale_buffers Number of scalers unsigned long
unsigned int ** scale_buffer Scalers `unsigned long `**
int * invariant Indicates whether a site is invariant or not *`unsigned long `**
int * eigen_decomp_valid Indicates whether the eigen decomposition for a pmatrix is valid int *

My justification for the proposed changes is that int (or unsigned int) could be an insufficient range for large phylogenomic datasets. Changing tips, clv_buffers, sites, prob_matrices and scale_buffers would enable the analysis of datasets comprising of more than 4,294,967,295 sites. The drawback is that the memory requirements will double.

Also, if the proposed changed are done, it would also make sense to change scale_buffers, invariant to unsigned long ** and unsigned long * respectively, because the new data type (unsigned long) would then be of the same size (8 bytes) as a CLV entry (double), and this would improve vectorization efficiency.

What do you think?

ddarriba commented 9 years ago

I would suggest to use type definitions such that if for some reason we change our mind later we don't need to update the whole library.

mtholder commented 9 years ago

Most of changes and typedefs make sense to me.

Is there a reason that rate_cats, rate_matrices and states should not be unsigned ints? I realize that they won't hit limit where the extra bit will matter. But, they can't really be negative, so unsigned seems to make sense.

If using stdbool.h is not a problem, then it seems like eigen_decomp_valid could be bool *

ddarriba commented 9 years ago

I agree changing the types related to the number of sites, but I guess an unsigned integer is more than enough for the variables related to the number of taxa or nodes (tips / scale buffers / clvs/ ...)

http://www.nature.com/news/2011/110823/full/news.2011.498.html

xflouris commented 9 years ago

@ddarriba sure, but we have to decide on the data types first.

@mtholder No reason, rate_cats and rate_matrices should be unsigned int (updated in the table)

I'll check for the portability of uint64_t (stdint.h) and bool (stdbool.h) for Win/Mac and make a decision whether to typedef new data types or use the existing ones from c99.

xflouris commented 9 years ago

New data types table as discussed with Alexis and Diego:

type Variable Role Suggested type
int tips Number of tip sequences unsigned int
int clv_buffers Number of CLV buffers (for inner nodes) unsigned int
int states Number of states unsigned int
int sites Number of sites in partition unsigned int
int rate_matrices Number of different rate matrices that will be used unsigned int
int prob_matrices Number of probability matrices to allocate unsigned int
int rate_cats Number of rate categories unsigned int
int scale_buffers Number of scalers unsigned int
unsigned int ** scale_buffer Scalers `unsigned int `**
int * invariant Indicates whether a site is invariant or not int *
int * eigen_decomp_valid Indicates whether the eigen decomposition for a pmatrix is valid *`bool `**
xflouris commented 9 years ago

finished. eigen_decomp_valid kept to int. We may change data types according to C99 standard in the future or use typedefs.

amkozlov commented 8 years ago

Sorry for reopening this old issue, but probably we should consider using unsigned long at least for the number of sites? There were already a couple of user requests on raxml google groups about multi-GB analyses...

For the tips/taxa count, it's less probable to hit the 4G limit, but still possible: taxa are not necessarily species, but could be also individuals, viral sequences, raw reads etc. etc.

xflouris commented 8 years ago

@amkozlov : I guess sites would indeed make sense. I vaguely remember I wanted to change sites to unsigned long last year. As for the taxa, have you got a data set with over 4G of sequences? Even raw reads are in the order of hundreds of millions...

Anyway, we can discuss about data changes at the meetings, and do any changes after the first public release, as data type changes may (not sure yet) require quite a lot of work and validation.

amkozlov commented 8 years ago

@amkozlov https://github.com/amkozlov : I guess |sites| would indeed make sense. I vaguely remember I wanted to change sites to |unsigned long| last year.

Great.

As for the taxa, have you got a data set with over 4G of sequences?

Not yet.

Even raw reads are in the order of hundreds of millions...

Well, one could combine reads from multiple runs for a meta-study, we have developments going into this direction in one of the projects (UniEuk). But this is more of a theoretical possibility so far and could be most probably solved in a different way (streaming, clustering). So I'm not insisting on this.

Anyway, we can discuss about data changes at the meetings, and do any changes after the first public release, as data type changes may (not sure yet) require quite a lot of work and validation.

Just thought that maybe you'd like to avoid changing the interface afterwards... but I agree it would require too many changes, so probably unrealistic before release.

stamatak commented 8 years ago

we definitely need unsigned long for the sites, I had to change this at some point in ExaML because the integers were not sufficient any more, regarding the taxa it may make sense as well, especially for the EPA

alexis

On 04.08.2016 10:14, Alexey Kozlov wrote:

@amkozlov https://github.com/amkozlov : I guess |sites| would indeed make sense. I vaguely remember I wanted to change sites to |unsigned long| last year.

Great.

As for the taxa, have you got a data set with over 4G of sequences?

Not yet.

Even raw reads are in the order of hundreds of millions...

Well, one could combine reads from multiple runs for a meta-study, we have developments going into this direction in one of the projects (UniEuk). But this is more of a theoretical possibility so far and could be most probably solved in a different way (streaming, clustering). So I'm not insisting on this.

Anyway, we can discuss about data changes at the meetings, and do any changes after the first public release, as data type changes may (not sure yet) require quite a lot of work and validation.

Just thought that maybe you'd like to avoid changing the interface afterwards... but I agree it would require too many changes, so probably unrealistic before release.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/xflouris/libpll/issues/22#issuecomment-237484310, or mute the thread https://github.com/notifications/unsubscribe-auth/AA1w-sxe5jsqpE877TSZXgvTGsc6ZGJhks5qcZ9rgaJpZM4E5uMP.

Alexandros (Alexis) Stamatakis

Research Group Leader, Heidelberg Institute for Theoretical Studies Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology Adjunct Professor, Dept. of Ecology and Evolutionary Biology, University of Arizona at Tucson

www.exelixis-lab.org

stamatak commented 6 years ago

this needs to be done if it has not already happened, critical for code stability