xflouris / libpll

Phylogenetic Likelihood Library
GNU Affero General Public License v3.0
27 stars 6 forks source link

Site pattern compression should work with states, not with characters #78

Closed amkozlov closed 8 years ago

amkozlov commented 8 years ago

Currently, pattern compression is implemented on a "character level", i.e. it works with the strings as they were read from a FASTA/PHYLIP file. This has at least one unpleasant side effect: if several characters are used to encode one state (e.g., - and ? for gaps), they will be considered distinct in the process of pattern compression.

Although it could be hacked around for this specific case (gaps), I guess a much cleaner solution would be to compress patterns after raw sequences have been converted to the internal PLL representation (based on partition->charmap). If my understanding is correct, this representation allows for state ambiguity, but eliminates encoding ambiguity.

stamatak commented 8 years ago

yes, good point,

alexis

On 24.04.2016 02:42, Alexey Kozlov wrote:

Currently, pattern compression is implemented on a "character level", i.e. it works with the strings as they were read from a FASTA/PHYLIP file. This has at least one unpleasant side effect: if several characters are used to encode one state (e.g., - and ? for gaps), they will be considered distinct in the process of pattern compression.

Although it could be hacked around for this specific case (gaps), I guess a much cleaner solution would be to compress patterns /after/ raw sequences have been converted to the internal PLL representation (based on partition->charmap). If my understanding is correct, this representation allows for /state ambiguity/, but eliminates /encoding ambiguity/.

— You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub https://github.com/xflouris/libpll/issues/78

Alexandros (Alexis) Stamatakis

Research Group Leader, Heidelberg Institute for Theoretical Studies Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology Adjunct Professor, Dept. of Ecology and Evolutionary Biology, University of Arizona at Tucson

www.exelixis-lab.org

xflouris commented 8 years ago

@amkozlov : I think you didn't understand it correctly. Have a look at how the re-mapping is done -- it is quite complicated. I don't know where you observed "character-level" access - this is not present in the code.

For 4x4, pattern compression is done using states which are given by the map, e.g. the predefined pll_map_nt. If - and ? are defined as one state (which in this case are), there is no problem. Have a look how pll_map_nt is defined in maps.c.

For a general number of states, you still give a map (you can define - and ? as the same symbol). But the binary representation of the state could exceed the byte range as opposed to 4x4. Imagine you have a 10 state configuration and characters ? and - are encoded as 1110000000. The map M you provide is of type long and thus you can provide this type of encoding. Since that map can have upto 256 values only (number of ascii characters), it is re-coded using a different mapping such that the values range from 0 to 255, and tip sequences are recoded using this map and stored in the tipchars array. You can think of this array as the re-coded states you provide in byte-range instead of long-range, which is possible because you have only 256 possible ASCII chars - perhaps this part is what you thought as character-level access?. This re-encoding is used to compute patterns. The CLVs obtained from the re-encoded states are computed by looking up the state from the new map (0-255) to the old map M (long). This process is slower because of the addiotional lookup, but this way we can work with arbitrary number of states and encodings.

Of course specific cases for popular data types (DNA,AA) are implemented that do not require this remapping (AA is still not there).

Let me know if this clears things up.

Note also, that the current TIP_PATTERN implementation will be changed in the sense, that only "p-matrix entry sums" for each nucleotide at each node will be pre-computed, instead of the complete CLV, which will be faster, i.e., we will compute 16 entries instead of 256 for each tip-tip node but will require an extra multiplication per site per category at the inner node. Not sure if I discussed this with Alexis only.

amkozlov commented 8 years ago

@xflouris: hm, the way you describe it sounds very similar to what I'm suggesting above, but I cannot find it in the code. It might also be that I'm using the pattern compression incorrectly. Do we actually have an example of how it is supposed to be used (or can you please give one)?

xflouris commented 8 years ago

@amkozlov : I now realize there was probably a mis-communication between us. I was under the impression you were talking about tip pattern compression (pre-computation of conditional probability vectors), but I now realize that you might have been talking about site compression (in the alignment).

If it's the former, then to use it you only pass the PLL_ATTRIB_PATTERN_TIP to the attributes of the partition instance, but I suspect you were referring to the latter.

For site compression, if you are referring to the pll_compress_site_patterns function, you are right, it is done on the 'character level'. So, indeed, if ? and - refer to the same state two sites that, for example, look like '??' and '--' will not be treated as equal. The fix is rather simple, we can require the state map to be passed to the pll_compress_site_patterns and the comparison will be made on the state map looked-up characters (states) instead of the characters.

amkozlov commented 8 years ago

Sure, I was referring to the site pattern compression, sorry if it wasn't clear enough.

The fix is rather simple, we can require the state map to be passed to the pll_compress_site_patterns and the comparison will be made on the state map looked-up characters (states) instead of the characters.

Either this, or we might think about tighter integration with pll_partition, which already holds the charmap and has the re-mapping functionality in pll_set_tip_states. Not sure what would be the best solution though: add an optional sequences parameter to pll_partition_create and perform compression there (switchable via attrs)? Or provide a special version of pll_partition_create for this purpose? Or a separate call to pll_partition_compress_patterns after all tips were set with pll_set_tip_states?

BTW: I think memory allocation for CLVs in pll_partition_create should be made optional. It is essential for implementing the RAxML-style memory-saving techniques, and will give greater flexibility in other scenarios (e.g., post-creation pattern compression as described above).

xflouris commented 8 years ago

@amkozlov : no worries, I think you were clear, it's just that I have always associated 'pattern compression' with tip-tip pre-computation, and 'site compression' with what seems to be called 'pattern compression' in the field :)

The cleaner solution is to pass the map to the compress function since the alignment is completely independent of the partition structure. After all, one may use only the core functions and completely abandon the higher-level partition structure. The (minor) disadvantage is that characters->states encoding will take place twice (once in compress and once when setting tip clvs/chars), but in the end, it's a library we're developing and we have to allow for some extra flexibility at the cost of some slow-down (non-critical in this case). Otherwise, there isn't much point in developing PLL, and we can work directly on raxml-ng. Edit: What I forgot to (explicitely) state is that the compress function will re-map the compressed alignment back to the character level, after it's done.

BTW: I think memory allocation for CLVs in pll_partition_create should be made optional. It is essential for implementing the RAxML-style memory-saving techniques, and will give greater flexibility in other scenarios (e.g., post-creation pattern compression as described above).

Tthe memory-saving techniques shouldn't be necessary as the site repeats method is superior, in both memory savings and speed. The idea is that when passing the site repeats attribute, the partition will take care of re-allocating memory as necessary - this should be transparent to the application programmer, and will work out-of-the-box within the library. But even the raxml memory-saving techniques could be implemented within the library and have the partition automatically adjust CLV sizes as necessary.

amkozlov commented 8 years ago

The cleaner solution is to pass the map to the compress function since the alignment is completely independent of the partition structure. [...] it's a library we're developing and we have to allow for some extra flexibility at the cost of some slow-down (non-critical in this case)

OK, I expected this sort of reaction :) Maybe we can implement it in the MSA module, since I guess loading alignment and compressing the sites into patterns is a very common use case, not limited to raxml alone. And it's not about speed but rather API convenience for the app. programmer.

Tthe memory-saving techniques shouldn't be necessary as the site repeats method is superior, in both memory savings and speed.

:) What about this other type of memory-saving where we only use log(n) CLVs?

The idea is that when passing the site repeats attribute, the partition will take care of re-allocating memory as necessary - this should be transparent to the application programmer, and will work out-of-the-box within the library. But even the raxml memory-saving techniques could be implemented within the library and have the partition automatically adjust CLV sizes as necessary.

it's a library we're developing and we have to allow for some extra flexibility (c)