yesdavid / designspace_culttax_article_2021

Research compendium for 'Matzig, Hussain, Riede (2021) Design space constraints and the cultural taxonomy of European Final Palaeolithic large tanged points – a comparison of typological, landmark-based and whole-outline geometric morphometric approaches'
5 stars 0 forks source link

RevBayes for Bayesian inference of phylogenies using continuous traits #1

Open benmarwick opened 3 years ago

benmarwick commented 3 years ago

There's so much fascinating work in this repo, really amazing stuff! It got me thinking about options for Bayesian inference with these data, and here's my attempt at a workflow using the RevBayes program (it's very similar to R!). I've added some comments throughout so hopefully you can make it work on your machine. Most of it is directly from Parins-Fukuchi 2018 (I know you mentioned this paper already, perhaps you've already tried it also) and the RevBayes code in her repository.


# Starting from the end of your 3_code_late_neolithic_early_bronze_age.R,
# having run all the code in that script...

# write a nex file for input into RevBayes
write.nexus.data(as.matrix(nicolas_2016_without_outliers_PCA_as_df_subset_typochron_FR),
                 file.path(output_folder, "nicolas_2016_without_outliers_PCA_as_df_subset_typochron_FR.nex"),
                 format = "continuous")

#----------------------------------------------------------------
# This is RevBayes code, not R, and needs to be run in the terminal

# This code file was downloaded and modified from: 
# Parins-Fukuchi, Caroline (2017), Data from: Use of continuous 
# traits can improve morphological phylogenetics, Dryad, 
# Dataset, https://doi.org/10.5061/dryad.40b70

# The code is cited in this paper: 
# Caroline Parins-Fukuchi, Use of Continuous Traits Can Improve Morphological 
# Phylogenetics, Systematic Biology, Volume 67, Issue 2, March 2018, 
# Pages 328–339, https://doi.org/10.1093/sysbio/syx072

## This is intended to be run on RevBayes v1.0.0
# v1.0.1 has changed several function names. see RevBayes documentation for more details. 
# This procedure was developed with the gratuitous aid of RevBayes example documents
# authored by Nicolas Lartillot, Michael Landis, and April Wright.

# BM: I downloaded RevBayes onto my computer and put in in my Applications folder (OSX)

# Assuming we are in an open RStudio Project, we can 
# start RevBayes the RStudio terminal with the commmand:

# /Applications/rb

# this should start RevBayes in the project directory

# run these lines below in the terminal after starting RevBayes (I copy-paste them
# into the terminal in RStudio)

getwd() # confirm that we are in the project directory where are input/output files are

fl <- "nicolas_2016_without_outliers_PCA_as_df_subset_typochron_FR.nex"  # continuous character set used for analysis in nexus format
outtr <- "OUTTREEFILE"   # file to write sampled trees
outlf <- "OUTLOGFILE"    # MCMC log
outsm <- "OUTSUMFILE"    # MAP summary tree file

# import the data into RevBayes, watch terminal for message indicating success
contData <- readContinuousCharacterData(fl)
# the most relevant official tutorial is probably 
# https://revbayes.github.io/tutorials/cont_traits/simple_bm.html

numTips = contData.ntaxa()
numNodes = numTips * 2 - 1
names = contData.names()
diversification ~ dnLognormal(0,1) # aka 'birth rate' for new varients 
turnover = 0 # we are going to parameterise the BD prior, 
# 'The turnover rate is the rate at which one species is replaced by another 
# species due to a birth plus death event.... the turnover rate represent the 
# longevity of a species' details: https://revbayes.github.io/tutorials/divrate/simple.html
speciation := diversification + turnover
extinction := turnover 
sampling_fraction <- 1

# instantiate a Birth-Death tree with the parameters set above
psi ~ dnBDP(lambda=speciation,                # The speciation rate
            mu=extinction,                    # The extinction rate.
            rho=sampling_fraction,            # The taxon sampling fraction
            rootAge=1, 
            samplingStrategy = "uniform",     # The sampling strategy of including taxa at the present.
            condition = "nTaxa",
            taxa=names) 
mvi = 0 # we are going to set tree rearrangement and node height scaling moves

# create our node height and topological rearrangement MCMC moves. These help us to explore
# the parameter space in each MCMC step. Good notes on these here: 
# http://www.peterbeerli.com/classes/images/5/5f/RB_CTMC_Tutorial_oct2015.pdf
moves[++mvi] = mvSubtreeScale(psi, weight=5.0) # change the ages of the internal nodes
moves[++mvi] = mvNodeTimeSlideUniform(psi, weight=10.0) # change the ages of the internal nodes
moves[++mvi] = mvNNI(psi, weight=5.0)  # nearest-neighbor interchange move
moves[++mvi] = mvFNPR(psi, weight=5.0) # a fixed-node height subtree-prune and regrafting move

logSigma ~ dnNormal(0,1) # place a prior on BM sigma parameter.
sigma := 10^logSigma
moves[++mvi] = mvSlide(logSigma, delta=1.0, tune=true, weight=2.0)

# For our MCMC analysis, we need to set up a vector of monitors to record 
# the states of our Markov chain.

monitors[1] = mnScreen(printgen=50000, sigma)
monitors[2] = mnFile(filename=outlf, printgen=400, separator = TAB, sigma)
monitors[3] = mnFile(filename=outtr, printgen=100, separator = TAB, psi)

# specify that we are going calculate BM likelihood using the REML PIC algorithm (see Felsenstein 1973)
# several chapters on fitting BM models here: https://lukejharmon.github.io/pcm/chapters/
traits ~ dnPhyloBrownianREML(psi, branchRates=1.0, siteRates=sigma, nSites=contData.nchar())

# When this clamp function is called, RevBayes sets each of the stochastic 
# nodes representing the tips of the tree to the corresponding 
# nucleotide sequence in the alignment. This essentially 
# tells the program that we have observed data for the sequences at the tips.
traits.clamp(contData) # match traits to tips

# we wrap the entire model to provide convenient access to the DAG. 
# To do this, we only need to give the model() function a 
# single node. With this node, the model() function can find all of the other
# nodes by following the arrows in the graphical model (see DOI:10.1093/sysbio/syw021
# for details of the typical graphical model)
bmv = model(sigma)     # link sigma param w/ BM model

# With a fully specified model, a set of monitors, and a set of moves, 
# we can now set up the MCMC algorithm that will sample parameter values in 
# proportion to their posterior probability. The mcmc() function will
# create our MCMC object:
chain = mcmc(bmv, monitors, moves)
chain.burnin(generations=500, tuningInterval=50) # originally 50000 and 500

chain.run(5000) # originally 500000
# !STOP! We have to rename OUTTREEFILE to OUTTREEFILE.trees before running the next line
# R code starts here to rename the file ----------------------------------------
file.rename(file.path(output_folder, "OUTTREEFILE"), 
            file.path(output_folder, "OUTTREEFILE.trees"))
# R code ends here -------------------------------------------------------------

# To summarize the trees sampled from the posterior distribution, 
# RevBayes can summarize the sampled trees by reading in the tree-trace file:
treetrace = readTreeTrace(file = outtr)
treefl <- outsm

# The mapTree() function will summarize the tree samples and write the 
# maximum a posteriori tree to file, we can also summarise trees as MCC 
# (maximum clade credibility) representation 
map = mapTree( file=treefl, treetrace )
mccTree( file="mcc.tre", treetrace )
q() # quits RevBayes

# End of RevBayes code
# resources beyond the official tutorials: 
# - https://ib.berkeley.edu/courses/ib200/labs/12/lab12.pdf
#- https://github.com/revbayes/revbayes_tutorial/blob/master/RB_PhyloComparative_BM_Tutorial/scripts/continuous_burst.Rev
# - https://github.com/search?p=3&q=dnPhyloBrownianREML&type=Code
#----------------------------------------------------------------

# ...back to R again 

outsumfile <- ape::read.nexus( file.path(output_folder, "OUTSUMFILE"))

ggtree(outsumfile) %<+% names_artefacts_ID_and_period + 
  theme_tree() + 
  geom_tiplab(size=2, 
              aes(label = ID_country,
                  color = ID_country)) +
  geom_treescale() 

# explore some plotting methods

# https://cran.r-project.org/web/packages/MCMCtreeR/vignettes/MCMCtree_plot.html
library(MCMCtreeR, quietly = TRUE, warn.conflicts = FALSE)

outsumfile.rb <- file.path(output_folder, "OUTSUMFILE")

MCMC.tree.plot(analysis.type='revbayes',
               directory.files=outsumfile.rb, 
               cex.tips=0.33,
               plot.type='phylogram', 
               lwd.bar=2,
               add.time.scale=FALSE,
               node.method='bar', 
               col.age='navy')

MCMC.tree.plot(
  directory.files=outsumfile.rb,
  analysis.type = "revbayes",
  cex.tips = 0.2,
  plot.type = "phylogram",
  lwd.bar = 2,
  node.method = "node.length",
  cex.labels = 0.5,
  add.time.scale=FALSE)

phy <- ape::read.tree(file.path(output_folder, "OUTTREEFILE.trees"))

# write it out to a format that crsl4/PhyloNetworks.jl can read
# cf. https://crsl4.github.io/PhyloNetworks.jl/latest/man/inputdata/
ape::write.tree(phy, file.path(output_folder, "phy.trees"))

ape::write.nexus(phy, file.path(output_folder, "phy.nexus"))

# I can't see any tree branches when the 
phangorn::densiTree(phy, type = "phylogram", cex = 0.5)

plot(phangorn::maxCladeCred(phy),  cex = 0.5)
plot(consensus(phy))

# Looks promising:
# https://revbayes.github.io/tutorials/intro/revgadgets.html
library(RevGadgets)

outtree.rb <- readTrees(outsumfile.rb)
ape::write.tree(outtree.rb[[1]][[1]]@phylo, file.path(output_folder, "out.trees"))

outtree.rb1 <- readTrees( file.path(output_folder, "out.trees"))

plot(outtree.rb1[[1]][[1]]@phylo)

# create the plot of the rooted tree
plot <- plotTree(tree = outtree.rb,
                 # label nodes the with posterior probabilities
                 # node_labels = "posterior", 
                 # offset the node labels from the nodes
                 node_labels_offset = 0.005,
                 # make tree lines more narrow
                 line_width = 0.5,
                 # italicize tip labels 
                 tip_labels_italics = TRUE)

# add scale bar to the tree and plot with ggtree
library(ggtree)
plot + geom_treescale(x = -0.35, y = -1)

#----------------------------------------------------------------------------
# PhyloNetworks

# prepare RevBayes output for input into PhyloNetworks 

phy <- ape::read.tree(file.path(output_folder, "OUTTREEFILE.trees"))

# write it out to a format that crsl4/PhyloNetworks.jl can read
# cf. https://crsl4.github.io/PhyloNetworks.jl/latest/man/inputdata/
# these are our 1000s of trees
ape::write.tree(phy, file.path(output_folder, "phy.trees"))

# this is our single MAP tree
library(RevGadgets)
outtree.rb <- readTrees(file.path(output_folder, "OUTSUMFILE"))
ape::write.tree(outtree.rb[[1]][[1]]@phylo, file.path(output_folder, "out.trees"))

This is the output from densiTree:

image

yesdavid commented 3 years ago

Dear @benmarwick, thank you so much for sharing your code! On first glance it looks very cryptic to me, but I'm sure your annotations will help me figuring it out eventually! Inferring phylogenies using continuous traits is something that I had already given up on, so this is a huge boost in motivation - thank you!

benmarwick commented 3 years ago

You're warmly welcome! I have just now updated my first post with additional inline comments as I learn more about the RevBayes functions. Here's a summary of the most important parts that I understand so far, largely taken word-for-word from Wright and Warnock 2020. As we see in the image below, for Bayesian inference of phylogenies, in addition to our data, we need to make decisions about three important inputs: (1) a substitution model to describe the evolution of characters (by what process did our measurement variables change in the past?), a clock model to describe the distribution of evolutionary rates across the tree (how fast did things change?), and (3) the tree model to describes the distribution of speciation events across the tree.

image

Substitution models

This describes how phylogenetic characters in the dataset evolve; how character change accumulates over time, leading to the observed phylogenetic data. Molecular sequence data has a big advantage here because there seem to be many well understood models. One I see often in the literature is the the general time reversible model (GTR), but its for nucleotides only. For discrete morphological characters, the Mk model is popular, standing for 'Markov k', This means that the probability of changing from one state to another depends only on the current state, and not on what has come before, and assumes that every state is equally likely to change to any other state. For continuous character data such as we are looking at here, we have three options: Brownian motion, or Ornstein–Uhlenbeck, or Lévy processes. In the code above we are using Brownian motion (cf. the RevBayes function dnPhyloBrownianREML), and in the reading I've done so far, this seems very common (e.g. 1, 2, and see Wright (2019) for an encouraging discussion of Bayesian inference of phylogenies using continuous characters and Brownian motion). Here's the RevBayes tutorial on Brownian motion models.

Clock Models

The clock model describes the way the rate of character change varies, or does not vary, across the tree. In the code above this are specified by branchRates (and maybe also siteRates, I'm not sure). We have a strict (or global) clock model, where all rates are assumed constant, but RevBayes allows for rate variation across the tree in several interesting ways (this seems to be an example of branch-specific rates). Here's the RevBayes tutorial on implementing different clock models.

Tree Models

We are using a birth-death process to model the probability of each species (i.e. lithic variant) either giving birth (speciating) or dying (going extinct). This is our We denote the per-lineage birth rate as λ and the per-lineage death rate as μ. We need to specify a distribution to draw a speciation rate from (λ), and a distribution to draw from for an extinction rate (μ); these are important priors for the MCMC process to search through. In the code above our birth-death model is specified in the function dnBDP.

Other interesting tree models to consider are the fossilized birth-death (FBD) process, which allows us to set dated specimens on tips or internal branches. This sounds like an exciting way to richly use lithics with well constrained ages. Coalescent tree models also sound intriguing, especially the multi-species coalescent model, which explicitly models the evolution of genes, in combination with the speciation process. I think this is promising if we consider certain morphological attributes akin to genes, and overall shapes as species. These could have different trees, just as genes and species sometimes have different trees. Another intriguing thing about the multi-species coalescent model is that it is also frequently mentioned in the literature on phylogenetic networks, hybridisation and reticulation, which I think must be part of a comprehensive modern archaeological phylogenetics.

The image from below shows four different tree models, and how different speciation and extinction parameters can affect the tree for each:

image

MCMC

The image below, from Wright (2019), summarises what MCMC is doing for us in a phylogenetic analysis (I guess you might already be familiar with MCMC from reading McElreath's book?)

image

felixriede commented 3 years ago

Hi Ben - lovely, those are some seriously useful references, too. I agree that Brownian motion is the most common model chosen in cultural phylo studies and, as far as I understand, the one making the least assumptions about process constriants. Regarding the tree models, it'd be amazing if we could implement the fossilized birth-death (FBD) process but data requirements would be steep. Perhaps the Nicolas data would be good as a platform for exploring these, or else perhaps Gopher's (1994) sample of Neolithic points from the Levant. Any idea how that tree models deals with missing parameter information?