Open xflouris opened 9 years ago
I would suggest to use type definitions such that if for some reason we change our mind later we don't need to update the whole library.
Most of changes and typedefs make sense to me.
Is there a reason that rate_cats
, rate_matrices
and states
should not be unsigned int
s? I realize that they won't hit limit where the extra bit will matter. But, they can't really be negative, so unsigned seems to make sense.
If using stdbool.h
is not a problem, then it seems like eigen_decomp_valid
could be bool *
I agree changing the types related to the number of sites, but I guess an unsigned integer is more than enough for the variables related to the number of taxa or nodes (tips / scale buffers / clvs/ ...)
http://www.nature.com/news/2011/110823/full/news.2011.498.html
@ddarriba sure, but we have to decide on the data types first.
@mtholder No reason, rate_cats
and rate_matrices
should be unsigned int
(updated in the table)
I'll check for the portability of uint64_t (stdint.h) and bool (stdbool.h) for Win/Mac and make a decision whether to typedef new data types or use the existing ones from c99.
New data types table as discussed with Alexis and Diego:
type | Variable | Role | Suggested type |
---|---|---|---|
int |
tips |
Number of tip sequences | unsigned int |
int |
clv_buffers |
Number of CLV buffers (for inner nodes) | unsigned int |
int |
states |
Number of states | unsigned int |
int |
sites |
Number of sites in partition | unsigned int |
int |
rate_matrices |
Number of different rate matrices that will be used | unsigned int |
int |
prob_matrices |
Number of probability matrices to allocate | unsigned int |
int |
rate_cats |
Number of rate categories | unsigned int |
int |
scale_buffers |
Number of scalers | unsigned int |
unsigned int ** |
scale_buffer |
Scalers | `unsigned int `** |
int * |
invariant |
Indicates whether a site is invariant or not | int * |
int * |
eigen_decomp_valid |
Indicates whether the eigen decomposition for a pmatrix is valid | *`bool `** |
finished. eigen_decomp_valid kept to int. We may change data types according to C99 standard in the future or use typedefs.
Sorry for reopening this old issue, but probably we should consider using unsigned long
at least for the number of sites? There were already a couple of user requests on raxml google groups about multi-GB analyses...
For the tips/taxa count, it's less probable to hit the 4G limit, but still possible: taxa are not necessarily species, but could be also individuals, viral sequences, raw reads etc. etc.
@amkozlov : I guess sites
would indeed make sense. I vaguely remember I wanted to change sites to unsigned long
last year. As for the taxa, have you got a data set with over 4G of sequences? Even raw reads are in the order of hundreds of millions...
Anyway, we can discuss about data changes at the meetings, and do any changes after the first public release, as data type changes may (not sure yet) require quite a lot of work and validation.
@amkozlov https://github.com/amkozlov : I guess |sites| would indeed make sense. I vaguely remember I wanted to change sites to |unsigned long| last year.
Great.
As for the taxa, have you got a data set with over 4G of sequences?
Not yet.
Even raw reads are in the order of hundreds of millions...
Well, one could combine reads from multiple runs for a meta-study, we have developments going into this direction in one of the projects (UniEuk). But this is more of a theoretical possibility so far and could be most probably solved in a different way (streaming, clustering). So I'm not insisting on this.
Anyway, we can discuss about data changes at the meetings, and do any changes after the first public release, as data type changes may (not sure yet) require quite a lot of work and validation.
Just thought that maybe you'd like to avoid changing the interface afterwards... but I agree it would require too many changes, so probably unrealistic before release.
we definitely need unsigned long for the sites, I had to change this at some point in ExaML because the integers were not sufficient any more, regarding the taxa it may make sense as well, especially for the EPA
alexis
On 04.08.2016 10:14, Alexey Kozlov wrote:
@amkozlov https://github.com/amkozlov : I guess |sites| would indeed make sense. I vaguely remember I wanted to change sites to |unsigned long| last year.
Great.
As for the taxa, have you got a data set with over 4G of sequences?
Not yet.
Even raw reads are in the order of hundreds of millions...
Well, one could combine reads from multiple runs for a meta-study, we have developments going into this direction in one of the projects (UniEuk). But this is more of a theoretical possibility so far and could be most probably solved in a different way (streaming, clustering). So I'm not insisting on this.
Anyway, we can discuss about data changes at the meetings, and do any changes after the first public release, as data type changes may (not sure yet) require quite a lot of work and validation.
Just thought that maybe you'd like to avoid changing the interface afterwards... but I agree it would require too many changes, so probably unrealistic before release.
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/xflouris/libpll/issues/22#issuecomment-237484310, or mute the thread https://github.com/notifications/unsubscribe-auth/AA1w-sxe5jsqpE877TSZXgvTGsc6ZGJhks5qcZ9rgaJpZM4E5uMP.
Alexandros (Alexis) Stamatakis
Research Group Leader, Heidelberg Institute for Theoretical Studies Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology Adjunct Professor, Dept. of Ecology and Evolutionary Biology, University of Arizona at Tucson
www.exelixis-lab.org
this needs to be done if it has not already happened, critical for code stability
Perhaps now it would be a good idea to decide the data types of the following
pll_partition_t
variables:int
tips
unsigned long
int
clv_buffers
unsigned long
int
states
int
int
sites
unsigned long
int
rate_matrices
unsigned int
int
prob_matrices
unsigned long
int
rate_cats
unsigned int
int
scale_buffers
unsigned long
unsigned int **
scale_buffer
int *
invariant
int *
eigen_decomp_valid
int *
My justification for the proposed changes is that
int
(orunsigned int
) could be an insufficient range for large phylogenomic datasets. Changingtips
,clv_buffers
,sites
,prob_matrices
andscale_buffers
would enable the analysis of datasets comprising of more than 4,294,967,295 sites. The drawback is that the memory requirements will double.Also, if the proposed changed are done, it would also make sense to change
scale_buffers
,invariant
tounsigned long **
andunsigned long *
respectively, because the new data type (unsigned long
) would then be of the same size (8 bytes) as a CLV entry (double
), and this would improve vectorization efficiency.What do you think?