willfischer commented 4 years ago

Is it possible to add parsimony branch lengths to a topology-only tree?

This would be a useful example.

e.g. for the following tree: (seq1,(seq2,(seq4,(seq3,(seq5,seq6)))));

and sequences:

seq1 TTTGACAGCAGCCTAGCATTTCGTCACGTGGCCCGAGAGCTGCATCCGGA seq2 TTcGACAGCAGCCTAGCATTTCGTCACGTGGCCCGAGAGCTGCATCCGGA seq3 TTcGACAGCAGCCTAGCAaTTCGTCACGTGGCCCGAGAGCTGCATCCGGA seq4 TTcGACAGCAGCCTAGCAaTTCGTCACGTGGCCCGAGAGCTGCATCCGGA seq5 TTcGACAGCAGCCTAGCAaTTCGTCACGTGGCtCGAGAGCTGCATCCGGA seq6 TTTGACAGCAGCCTAGCAaTTCGTCACGTGGCtCGAGAGCTGCATCCGGA

computations commented 4 years ago

Hello Will,

I am a little unclear about the context of your question. LibPLL is a low level library, so optimizations like computing branch lengths are done at a higher level. As such this is a bit outside the scope of LibPLL, but fortunately this functionality exists in raxml-ng, if you have the tree and alignment as files. Just pass the tree in using the --tree switch, and add the --evaluate argument as well.

If you have any more questions, please ask.

willfischer commented 4 years ago

Hello (Tomas?),

Thanks for your rapid response. The context is parsimony trees based on rather large SARS-CoV-2 alignments (40,000 sequences x 30,000 columns); we’re using Goloboff’s "oblong" for parsimony search using sequence sets reduced to parsimony-informative sites and then made unique . The resulting tree gets expanded to include the full sequences, and at that point I need to recompute branch lengths — and I’d thought to use functions in libpll to do that.

If, as you suggest, raxml-ng can compute parsimony branch lengths for these rather large trees, all will be well. Thanks for the suggestion, I’ll try to make it work.

Best regards,

— Will Fischer

Will Fischer, PhD

net: wfischer@lanl.govmailto:wfischer@lanl.gov cell: 505-577-0592

Group T-6, Theoretical Biology Los Alamos National Laboratory

On Aug 11, 2020, at 12:43 PM, computations notifications@github.com<mailto:notifications@github.com> wrote:

Hello Will,

I am a little unclear about the context of your question. LibPLL is a low level library, so optimizations like computing branch lengths are done at a higher level. As such this is a bit outside the scope of LibPLL, but fortunately this functionality exists in raxml-nghttps://github.com/amkozlov/raxml-ng, if you have the tree and alignment as files. Just pass the tree in using the --tree switch, and add the --evaluate argument as well.

If you have any more questions, please ask.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/xflouris/libpll-2/issues/18#issuecomment-672184718, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AB2HLA37NHVGTKTEK7MS25TSAGGNJANCNFSM4P3LDFIA.

willfischer commented 4 years ago

Hello again,

I had looked at this before — can raxml-ng compute parsimony branch lengths using the --tree and --evaluate flags? I was unable to get that to work.

Best regards,

— Will

On Aug 11, 2020, at 1:04 PM, Fischer, William Mclean wfischer@lanl.gov<mailto:wfischer@lanl.gov> wrote:

Hello (Tomas?),

Thanks for your rapid response. The context is parsimony trees based on rather large SARS-CoV-2 alignments (40,000 sequences x 30,000 columns); we’re using Goloboff’s "oblong" for parsimony search using sequence sets reduced to parsimony-informative sites and then made unique . The resulting tree gets expanded to include the full sequences, and at that point I need to recompute branch lengths — and I’d thought to use functions in libpll to do that.

If, as you suggest, raxml-ng can compute parsimony branch lengths for these rather large trees, all will be well. Thanks for the suggestion, I’ll try to make it work.

Best regards,

— Will Fischer

Will Fischer, PhD

net: wfischer@lanl.govmailto:wfischer@lanl.gov cell: 505-577-0592

Group T-6, Theoretical Biology Los Alamos National Laboratory

On Aug 11, 2020, at 12:43 PM, computations notifications@github.com<mailto:notifications@github.com> wrote:

Hello Will,

I am a little unclear about the context of your question. LibPLL is a low level library, so optimizations like computing branch lengths are done at a higher level. As such this is a bit outside the scope of LibPLL, but fortunately this functionality exists in raxml-nghttps://github.com/amkozlov/raxml-ng, if you have the tree and alignment as files. Just pass the tree in using the --tree switch, and add the --evaluate argument as well.

If you have any more questions, please ask.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/xflouris/libpll-2/issues/18#issuecomment-672184718, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AB2HLA37NHVGTKTEK7MS25TSAGGNJANCNFSM4P3LDFIA.

xflouris commented 4 years ago

Hello @willfischer ,

I'm not sure what you ask is implemented in raxml-ng. As Ben pointed out, the --evaluate can be used in raxml-ng to evaluate the tree, but I think it's only available for ML.

In general, we have functions in libpll that compute the parsimony score of a tree and such functions are probably used in raxml-ng for quickly creating starting trees using methods like randomized stepwise addition.

Parsimony functions implemented in libpll follow the minimum evolution principle as an approximation to ML, and what the functions give is the minimum number of mutations required for the evolutionary process to give the sequences. If I understand your issue correctly, replacing the parsimony informative columns with all columns in the original alignment won't change the score. I suspect what you are interested in is some kind of expected number of mutations per site as branch lengths, which you probably already have for the reduced alignment, computed as y=branch_mutations / sites. Adding x constant sites makes sense and you can easily obtain the new branches as y'=y * sites/x since no mutations are introduced. For other uninformative sites e.g. aaaabcd choose the character appearing most times ('a' in the example) and add one mutation to each terminal branch corresponding to the other characters (b c and d). Finally, compute the expected mutations in a similar way e.g. y'=(y*sites + new_mutations)/x where x is the total number of sites. Does it make sense what I'm writing?

Best, Tomas

computations commented 4 years ago

Hello Will,

Sorry, I am Ben. I need to be really clear here that raxml-ng will compute branch lengths under a specified substitution model, and so cannot really be considered "parsimony branch lengths". To be honest, I am not entirely sure what you mean by this. I personally have never heard of researchers computing branch lengths under a parsimony assumption, and most of the literature that I can find on this topic suggests that distances are computed under some Markov substitution process, such as JC. If you want the minimum number of substitutions required the JC model should suit your needs for short distances.

At this point, I feel obligated to inform you that there is a good chance that even on the SARS-CoV-2 data of underestimated branch lengths when using parsimony, even though the distances between most of the sequences is small. Furthermore, in our own investigations on the SARS-CoV-2 data that the topologies produced by phylogenetic methods are highly suspect. In particular, there are likely to be many trees which either maximize the likelihood. I suspect that in this case, since the distances between the sequences are so small, the most parsimonious tree is similar to the maximum likelihood tree. This is to say, the parsimony surface is likely to exhibit many maxima, and it will be difficult to distinguish between different topologies from the signal present in the sequences alone.

Now that I have said that, if you really want the minimum number of substitutions required for a given tree, there is an existing bit of code that I can adapt to do this if you need it. (This was written before @xflouris wrote his example, but the offer still stand if you need it).

Best, Ben

willfischer commented 4 years ago

Hi Ben,

Thanks for this thoughtful response. I agree that there are significant difficulties with computing phylogenies in the highly sampled, low mutation rate SARS-CoV-2 datasets. In this context, it is extremely difficult, as you likely recognize, to distinguish between repeated de novo mutations and shared derived mutations, particularly when recombination occurs (as we know it does).

In actual fact, we are not trying to estimate phylogenies that are absolutely correct, but simply to cluster sequences by similarity to make analysis more convenient. It is unquestionable that there will be many most-parsimonious trees, as there will be very many nearly-most-likely trees (the likelihood and parsimony surfaces are both very nearly flat). For all its failings, the parsimony criterion is a simple set of assumptions, not vulnerable to overfitting, and with a substantial history (in the older literature). I’ve been using PAUP* for these purposes, but have run into a hard-coded limit (2**14 taxa) which is impractical to re-code.

In the end, yes, I really really do want (being very specific) to generate, for a given topology, branch lengths defined by the minimum number of substitutions. If you have code that could be trivially adapted to do this, I would be most grateful.

Very best regards,

— Will

On Aug 11, 2020, at 2:02 PM, computations notifications@github.com<mailto:notifications@github.com> wrote:

Hello Will,

Sorry, I am Ben. I need to be really clear here that raxml-ng will compute branch lengths under a specified substitution model, and so cannot really be considered "parsimony branch lengths". To be honest, I am not entirely sure what you mean by this. I personally have never heard of researchers computing branch lengths under a parsimony assumption, and most of the literature that I can find on this topic suggests that distances are computed under some Markov substitution process, such as JC. If you want the minimum number of substitutions required the JC model should suit your needs for short distances.

At this point, I feel obligated to inform you that there is a good chance that even on the SARS-CoV-2 data of underestimated branch lengths when using parsimony, even though the distances between most of the sequences is small. Furthermore, in our own investigationshttps://www.biorxiv.org/content/10.1101/2020.08.05.239046v1 on the SARS-CoV-2 data that the topologies produced by phylogenetic methods are highly suspect. In particular, there are likely to be many trees which either maximize the likelihood. I suspect that in this case, since the distances between the sequences are so small, the most parsimonious tree is similar to the maximum likelihood tree. This is to say, the parsimony surface is likely to exhibit many maxima, and it will be difficult to distinguish between different topologies from the signal present in the sequences alone.

Now that I have said that, if you really want the minimum number of substitutions required for a given tree, there is an existing bit of code that I can adapt to do this if you need it.

Best, Ben

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/xflouris/libpll-2/issues/18#issuecomment-672248869, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AB2HLA35B7HOHOR44KZIHJ3SAGPUHANCNFSM4P3LDFIA.

willfischer commented 4 years ago

Hello Tomas,

Thanks for this response — it is a pretty fair statement of my desires. As you say, the internal branches won’t change for the full dataset, and the lengths of each terminal branch need to be incremented to account for the non-informative mutations in each sequence. My problem is that I do not in fact have the branch lengths, and would like to compute them for the given topology.

I think I can put together some code using "set_missing_branch_length" (from examples/newick-fasta-rooted/newick-fasta-rooted.c in libpll-2), but (1) I’m not terribly adept with C, and (2) it isn’t clear how to compute the branch lengths — which was really my question. Can you give me any guidance on this approach?

On the other hand, Ben suggested he has code he can adapt for this, and I’d be thrilled if that works — so I can stop flailing and get back to data analysis ...

Thanks for your thoughts,

— Will

Will Fischer, PhD

net: wfischer@lanl.govmailto:wfischer@lanl.gov cell: 505-577-0592

Group T-6, Theoretical Biology Los Alamos National Laboratory

On Aug 11, 2020, at 2:00 PM, Tomas Flouri notifications@github.com<mailto:notifications@github.com> wrote:

Hello @willfischerhttps://github.com/willfischer ,

I'm not sure what you ask is implemented in raxml-ng. As Ben pointed out, the --evaluate can be used in raxml-ng to evaluate the tree, but I think it's only available for ML.

In general, we have functions in libpll that compute the parsimony score of a tree and such functions are probably used in raxml-ng for quickly creating starting trees using methods like randomized stepwise addition.

Parsimony functions implemented in libpll follow the minimum evolution principle as an approximation to ML, and what the functions give is the minimum number of mutations required for the evolutionary process to give the sequences. If I understand your issue correctly, replacing the parsimony informative columns with all columns in the original alignment won't change the score. I suspect what you are interested in is some kind of expected number of mutations per site as branch lengths, which you probably already have for the reduced alignment, computed as y=branch_mutations / sites. Adding x constant sites makes sense and you can easily obtain the new branches as y'=y sites/x since no mutations are introduced. For other uninformative sites e.g. aaaabcd choose the character appearing most times ('a' in the example) and add one mutation to each terminal branch corresponding to the other characters (b c and d). Finally, compute the expected mutations in a similar way e.g. y'=(ysites + new_mutations)/x where x is the total number of sites. Does it make sense what I'm writing?

Best, Tomas

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/xflouris/libpll-2/issues/18#issuecomment-672248191, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AB2HLAZTIZ5GMA7PLJBGLCDSAGPPFANCNFSM4P3LDFIA.

computations commented 4 years ago

In regards to the choice of parsimony, I think it is a fine choice for this application (I did some snooping and found what I strongly suspect is your preprint for this paper), but I just wanted to be sure you were aware of the problems. I wish you luck in your endeavors!

Here is a quick little program that was adapted from @xflouris examples. I didn't work to hard on it, so I don't offer any strong guarantees here. To build it, in the root directory of libpll run

cmake -Bbuild -H.
pushd build
make
popd
gcc npr-pars.c -Isrc build/src/libpll.a -lm

where npr-pars.c will be what you choose to name the code below. A few caveats: It only works on rooted binary trees, and it only accepts phylip format files. It also computes the parsimony score associated with a node and not a branch. The argument order is [tree] [alignment]. Also, this is uncorrected by the number of sites. If you want to correct it on line 215 change parscore to (double)parscore/(double)pars->sites

I know this is a bit rough, but I hope this helps! Ben

EDIT: the previous code posted here computed the total parsimony score for the subtree. This new version computes the marginal parsimony score instead.

/*
    Copyright (C) 2015 Tomas Flouri

    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU Affero General Public License as
    published by the Free Software Foundation, either version 3 of the
    License, or (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU Affero General Public License for more details.

    You should have received a copy of the GNU Affero General Public License
    along with this program.  If not, see <http://www.gnu.org/licenses/>.

    Contact: Tomas Flouri <Tomas.Flouri@h-its.org>,
    Exelixis Lab, Heidelberg Instutute for Theoretical Studies
    Schloss-Wolfsbrunnenweg 35, D-69118 Heidelberg, Germany
*/

#include "pll.h"
#include <search.h>
#include <stdarg.h>
#include <time.h>

#define STATES 4
#define MAP pll_map_nt

static void fatal(const char *format, ...) __attribute__((noreturn));

static void *xmalloc(size_t size) {
  void *t;
  t = malloc(size);
  if (!t)
    fatal("Unable to allocate enough memory.");

  return t;
}

static char *xstrdup(const char *s) {
  size_t len = strlen(s);
  char *p = (char *)xmalloc(len + 1);
  return strcpy(p, s);
}

/* a callback function for performing a full traversal */
static int cb_full_traversal(pll_rnode_t *node) { return 1; }

static void fatal(const char *format, ...) {
  va_list argptr;
  va_start(argptr, format);
  vfprintf(stderr, format, argptr);
  va_end(argptr);
  fprintf(stderr, "\n");
  exit(EXIT_FAILURE);
}

int main(int argc, char *argv[]) {
  unsigned int i, j, n;
  unsigned int tip_nodes_count, inner_nodes_count, nodes_count, branch_count;
  unsigned int ops_count;
  pll_parsimony_t *pars;
  pll_pars_buildop_t *operations;
  pll_pars_recop_t *recops;
  pll_rnode_t **travbuffer;

  /* we accept only two arguments - the newick tree (unrooted binary) and the
     alignment in the form of PHYLIP reads */
  if (argc != 3)
    fatal(" syntax: %s [newick] [phylip]", argv[0]);

  /* parse the unrooted binary tree in newick format, and store the number
     of tip nodes in tip_nodes_count */
  pll_rtree_t *tree = pll_rtree_parse_newick(argv[1]);
  if (!tree)
    fatal("Tree must be a rooted binary tree");

  /* compute and show node count information */
  tip_nodes_count = tree->tip_count;
  inner_nodes_count = tip_nodes_count - 1;
  nodes_count = inner_nodes_count + tip_nodes_count;
  branch_count = nodes_count - 1;

  printf("Number of tip/leaf nodes in tree: %d\n", tip_nodes_count);
  printf("Number of inner nodes in tree: %d\n", inner_nodes_count);
  printf("Total number of nodes in tree: %d\n", nodes_count);
  printf("Number of branches in tree: %d\n", branch_count);

  /* Uncomment to display the parsed tree ASCII tree together with information
     as to which CLV index, branch length and label is associated with each
     node. The code will also write (and print on screen) the newick format
     of the tree.

  pll_utree_show_ascii(tree->nodes[nodes_count-1],
                       PLL_UTREE_SHOW_LABEL |
                       PLL_UTREE_SHOW_BRANCH_LENGTH |
                       PLL_UTREE_SHOW_CLV_INDEX);
  char * newick = pll_utree_export_newick(tree->nodes[nodes_count-1],NULL);
  printf("%s\n", newick);
  free(newick);

  */

  /* create a libc hash table of size tip_nodes_count */
  hcreate(tip_nodes_count);

  /* populate a libc hash table with tree tip labels */
  unsigned int *data =
      (unsigned int *)xmalloc(tip_nodes_count * sizeof(unsigned int));
  for (i = 0; i < tip_nodes_count; ++i) {
    data[i] = tree->nodes[i]->clv_index;
    ENTRY entry;
#ifdef __APPLE__
    entry.key = xstrdup(tree->nodes[i]->label);
#else
    entry.key = tree->nodes[i]->label;
#endif
    entry.data = (void *)(data + i);
    hsearch(entry, ENTER);
  }

  /* read PHYLIP alignment */
  pll_phylip_t *fd = pll_phylip_open(argv[2], pll_map_phylip);
  if (!fd)
    fatal(pll_errmsg);

  pll_msa_t *msa = pll_phylip_parse_interleaved(fd);
  if (!msa)
    fatal(pll_errmsg);

  pll_phylip_close(fd);

  if ((unsigned int)(msa->count) != tip_nodes_count)
    fatal("Number of sequences does not match number of leaves in tree");

  /* create the PLL parsimony instance

     parameters:

     tips : number of tip sequences we want to have
     states : number of states that our data have
     sites : number of sites
     score_matrix : score matrix to use
     score_buffers: number of score buffers to allocate (typically one per
                    inner node)
     ancestral_buffers: number of ancestral state buffers to allocate
                        (typically one per inner node)
  */

  /* set a matrix where mutations are penalized equally by 1 */
  double score_matrix[STATES * STATES];
  for (i = 0; i < STATES * STATES; ++i)
    score_matrix[i] = 1;
  for (i = 0; i < STATES; ++i)
    score_matrix[i * STATES + i] = 0;

  pars =
      pll_parsimony_create(tip_nodes_count, STATES, (unsigned int)(msa->length),
                           score_matrix, inner_nodes_count, inner_nodes_count);

  /* find sequences in hash table and link them with the corresponding taxa */
  for (i = 0; i < tip_nodes_count; ++i) {
    ENTRY query;
    query.key = msa->label[i];
    ENTRY *found = NULL;

    found = hsearch(query, FIND);

    if (!found)
      fatal("Sequence with header %s does not appear in the tree",
            msa->label[i]);

    unsigned int tip_clv_index = *((unsigned int *)(found->data));

    // printf("Setting sequence: %s\n", msa->sequence[i]);
    pll_set_parsimony_sequence(pars, tip_clv_index, MAP, msa->sequence[i]);
  }

  pll_msa_destroy(msa);

  /* destroy hash table */
  hdestroy();

  /* we no longer need these two arrays (keys and values of hash table... */
  free(data);

  /* allocate a buffer for storing pointers to nodes of the tree in postorder
     traversal */
  travbuffer = (pll_rnode_t **)xmalloc(nodes_count * sizeof(pll_rnode_t *));

  operations = (pll_pars_buildop_t *)xmalloc(inner_nodes_count *
                                             sizeof(pll_pars_buildop_t));

  /* perform a postorder traversal of the rooted tree */
  unsigned int traversal_size;
  if (!pll_rtree_traverse(tree->root, PLL_TREE_TRAVERSE_POSTORDER,
                          cb_full_traversal, travbuffer, &traversal_size))
    fatal("Function pll_rtree_traverse() requires inner nodes as parameters");

  /* given the computed traversal descriptor, generate the build operations
     structure */
  pll_rtree_create_pars_buildops(travbuffer, traversal_size, operations,
                                 &ops_count);

  // printf("Traversal size: %d\n", traversal_size);
  // printf("Operations: %d\n", ops_count);

  double minscore = pll_parsimony_build(pars, operations, ops_count);
  printf("Minimum parsimony score: %f\n", minscore);

  for (i = 0; i < traversal_size; ++i) {
    unsigned int id = travbuffer[i]->clv_index;
    uint64_t parscore = pll_parsimony_score(pars, id);
    travbuffer[i]->length = parscore;
  }

  // correct the scores so that they are marginal

  for (i = 0; i < traversal_size; ++i) {
    pll_rnode_t *cur = travbuffer[i];
    if (cur->left)
      cur->length -= cur->left->length;
    if (cur->right)
      cur->length -= cur->right->length;
  }

  char *newick_string = pll_rtree_export_newick(tree->root, 0);
  printf("%s\n", newick_string);

  /* free traversal, build operations and reconstruct operations */
  free(travbuffer);
  free(operations);
  free(newick_string);

  /* we will no longer need the tree structure */
  pll_rtree_destroy(tree, NULL);

  /* destroy the parsimony structure */
  pll_parsimony_destroy(pars);

  return (EXIT_SUCCESS);
}

willfischer commented 4 years ago

Thanks, Ben,

I’ll build this and check it out!

Very much obliged!

— Will

On Aug 12, 2020, at 1:58 AM, computations notifications@github.com<mailto:notifications@github.com> wrote:

Here is a quick little program that was adapted from @xflourishttps://github.com/xflouris examples. I didn't work to hard on it, so I don't offer any strong guarantees here. To build it, in the root directory of libpll run

cmake -Bbuild -H. pushd build make popd gcc npr-pars.c -Isrc build/src/libpll.a -lm

where npr-pars.c will be what you choose to name the code below. A few caveats: It only works on rooted binary trees, and it only accepts phylip format files. The argument order is [tree] [alignment].

I know this is a bit rough, but I hope this helps! Ben

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU Affero General Public License as
published by the Free Software Foundation, either version 3 of the
License, or (at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License
along with this program.  If not, see <http://www.gnu.org/licenses/>.

Contact: Tomas Flouri <Tomas.Flouri@h-its.org<mailto:Tomas.Flouri@h-its.org>>,
Exelixis Lab, Heidelberg Instutute for Theoretical Studies
Schloss-Wolfsbrunnenweg 35, D-69118 Heidelberg, Germany

*/

include "pll.h"

include

define STATES 4

define MAP pll_map_nt

static void fatal(const char *format, ...) attribute((noreturn));

static void xmalloc(size_t size) { void t; t = malloc(size); if (!t) fatal("Unable to allocate enough memory.");

return t; }

static char xstrdup(const char s) { size_t len = strlen(s); char p = (char )xmalloc(len + 1); return strcpy(p, s); }

/ a callback function for performing a full traversal / static int cb_full_traversal(pll_rnode_t *node) { return 1; }

static void fatal(const char *format, ...) { va_list argptr; va_start(argptr, format); vfprintf(stderr, format, argptr); va_end(argptr); fprintf(stderr, "\n"); exit(EXIT_FAILURE); }

int main(int argc, char argv[]) { unsigned int i, j, n; unsigned int tip_nodes_count, inner_nodes_count, nodes_count, branch_count; unsigned int ops_count; pll_parsimony_t pars; pll_pars_buildop_t operations; pll_pars_recop_t recops; pll_rnode_t **travbuffer;

/ we accept only two arguments - the newick tree (unrooted binary) and the alignment in the form of PHYLIP reads / if (argc != 3) fatal(" syntax: %s [newick] [phylip]", argv[0]);

/ parse the unrooted binary tree in newick format, and store the number of tip nodes in tip_nodes_count / pll_rtree_t *tree = pll_rtree_parse_newick(argv[1]); if (!tree) fatal("Tree must be a rooted binary tree");

/ compute and show node count information / tip_nodes_count = tree->tip_count; inner_nodes_count = tip_nodes_count - 1; nodes_count = inner_nodes_count + tip_nodes_count; branch_count = nodes_count - 1;

printf("Number of tip/leaf nodes in tree: %d\n", tip_nodes_count); printf("Number of inner nodes in tree: %d\n", inner_nodes_count); printf("Total number of nodes in tree: %d\n", nodes_count); printf("Number of branches in tree: %d\n", branch_count);

/* Uncomment to display the parsed tree ASCII tree together with information as to which CLV index, branch length and label is associated with each node. The code will also write (and print on screen) the newick format of the tree.

pll_utree_show_ascii(tree->nodes[nodes_count-1], PLL_UTREE_SHOW_LABEL | PLL_UTREE_SHOW_BRANCH_LENGTH | PLL_UTREE_SHOW_CLV_INDEX); char * newick = pll_utree_export_newick(tree->nodes[nodes_count-1],NULL); printf("%s\n", newick); free(newick);

*/

/ create a libc hash table of size tip_nodes_count / hcreate(tip_nodes_count);

/ populate a libc hash table with tree tip labels / unsigned int data = (unsigned int )xmalloc(tip_nodes_count * sizeof(unsigned int)); for (i = 0; i < tip_nodes_count; ++i) { data[i] = tree->nodes[i]->clv_index; ENTRY entry;

ifdef APPLE

entry.key = xstrdup(tree->nodes[i]->label);

else

entry.key = tree->nodes[i]->label;

endif

entry.data = (void *)(data + i);
hsearch(entry, ENTER);

}

/ read PHYLIP alignment / pll_phylip_t *fd = pll_phylip_open(argv[2], pll_map_phylip); if (!fd) fatal(pll_errmsg);

pll_msa_t *msa = pll_phylip_parse_interleaved(fd); if (!msa) fatal(pll_errmsg);

pll_phylip_close(fd);

if ((unsigned int)(msa->count) != tip_nodes_count) fatal("Number of sequences does not match number of leaves in tree");

/* create the PLL parsimony instance

 parameters:

 tips : number of tip sequences we want to have
 states : number of states that our data have
 sites : number of sites
 score_matrix : score matrix to use
 score_buffers: number of score buffers to allocate (typically one per
                inner node)
 ancestral_buffers: number of ancestral state buffers to allocate
                    (typically one per inner node)

*/

/ set a matrix where mutations are penalized equally by 1 / double score_matrix[STATES STATES]; for (i = 0; i < STATES STATES; ++i) score_matrix[i] = 1; for (i = 0; i < STATES; ++i) score_matrix[i * STATES + i] = 0;

pars = pll_parsimony_create(tip_nodes_count, STATES, (unsigned int)(msa->length), score_matrix, inner_nodes_count, inner_nodes_count);

/ find sequences in hash table and link them with the corresponding taxa / for (i = 0; i < tip_nodes_count; ++i) { ENTRY query; query.key = msa->label[i]; ENTRY *found = NULL;

found = hsearch(query, FIND);

if (!found)
  fatal("Sequence with header %s does not appear in the tree",
        msa->label[i]);

unsigned int tip_clv_index = *((unsigned int *)(found->data));

//printf("Setting sequence: %s\n", msa->sequence[i]);
pll_set_parsimony_sequence(pars, tip_clv_index, MAP, msa->sequence[i]);

}

pll_msa_destroy(msa);

/ destroy hash table / hdestroy();

/ we no longer need these two arrays (keys and values of hash table... / free(data);

/ allocate a buffer for storing pointers to nodes of the tree in postorder traversal / travbuffer = (pll_rnode_t *)xmalloc(nodes_count sizeof(pll_rnode_t *));

operations = (pll_pars_buildop_t )xmalloc(inner_nodes_count sizeof(pll_pars_buildop_t));

/ perform a postorder traversal of the rooted tree / unsigned int traversal_size; if (!pll_rtree_traverse(tree->root, PLL_TREE_TRAVERSE_POSTORDER, cb_full_traversal, travbuffer, &traversal_size)) fatal("Function pll_rtree_traverse() requires inner nodes as parameters");

/ given the computed traversal descriptor, generate the build operations structure / pll_rtree_create_pars_buildops(travbuffer, traversal_size, operations, &ops_count);

//printf("Traversal size: %d\n", traversal_size); //printf("Operations: %d\n", ops_count);

double minscore = pll_parsimony_build(pars, operations, ops_count); printf("Minimum parsimony score: %f\n", minscore);

for (i = 0; i < traversal_size; ++i) { unsigned int id = travbuffer[i]->clv_index; uint64_t parscore = pll_parsimony_score(pars, id); travbuffer[i]->length = parscore; }

char* newick_string = pll_rtree_export_newick(tree->root, 0); printf("%s\n", newick_string);

/ free traversal, build operations and reconstruct operations / free(travbuffer); free(operations); free(newick_string);

/ we will no longer need the tree structure / pll_rtree_destroy(tree, NULL);

/ destroy the parsimony structure / pll_parsimony_destroy(pars);

return (EXIT_SUCCESS); }

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/xflouris/libpll-2/issues/18#issuecomment-672707040, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AB2HLAYX7RQKU7NUTOF4MLTSAJDTFANCNFSM4P3LDFIA.

willfischer commented 4 years ago

Hello again, Ben,

I’m sorry to bother you again, but the output from npr-pars is not what I’d expect. It compiles and runs cleanly, but the computed branch lengths are odd.

With the attached toy example, I’d expect this output (see also attached tenseqs.tree.pdf) — this is from PAUP*:

(seq01:0,(((seq02:1,seq03:1):1,(seq04:0,(seq05:0,seq06:0):1):1):1,(seq07:1,(seq08:0,(seq09:0,seq10:1):1):1):3):1);

but the code produces this (right-of-decimal zeros removed from branch lengths; see also attached tenseqs.npr_pars.tree.pdf):

(seq01:0,(((seq02:0,seq03:0):2,(seq04:0,(seq05:0,seq06:0):0):1):5,(seq07:0,(seq08:0,(seq09:0,seq10:0):1):2):4):13):14;

So it looks like the postorder traversal is adding up changes from tips to root (but skipping the terminal-branch changes). Since I’m simple-minded, I tried changing to a preorder traversal, but it only populated a few of the internal branch lengths.

tenseq.tree and tenseqs.tree.pdf are what I’d hope to get — tenseqs.npr_pars.tree and tenseqs.npr_pars.tree.pdf are what I get.

I realize you are helping out of the goodness of your heart — is there enough goodness left to help a little further?

Thanks again and best regards,

— Will [cid:9edc767d-bfd2-47f4-befc-9a10011fac6c@win.lanl.gov]

On Aug 12, 2020, at 8:41 AM, Fischer, William Mclean wfischer@lanl.gov<mailto:wfischer@lanl.gov> wrote:

Thanks, Ben,

I’ll build this and check it out!

Very much obliged!

— Will

On Aug 12, 2020, at 1:58 AM, computations notifications@github.com<mailto:notifications@github.com> wrote:

Here is a quick little program that was adapted from @xflourishttps://github.com/xflouris examples. I didn't work to hard on it, so I don't offer any strong guarantees here. To build it, in the root directory of libpll run

cmake -Bbuild -H. pushd build make popd gcc npr-pars.c -Isrc build/src/libpll.a -lm

where npr-pars.c will be what you choose to name the code below. A few caveats: It only works on rooted binary trees, and it only accepts phylip format files. The argument order is [tree] [alignment].

I know this is a bit rough, but I hope this helps! Ben

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU Affero General Public License as
published by the Free Software Foundation, either version 3 of the
License, or (at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License
along with this program.  If not, see <http://www.gnu.org/licenses/>.

Contact: Tomas Flouri <Tomas.Flouri@h-its.org<mailto:Tomas.Flouri@h-its.org>>,
Exelixis Lab, Heidelberg Instutute for Theoretical Studies
Schloss-Wolfsbrunnenweg 35, D-69118 Heidelberg, Germany