wxguillo / machuruku

Is the climate variable reduction only between two species? #3

Closed Zeroo11 closed 3 months ago

Zeroo11 commented 1 year ago

Hello, Professor. Sorry to bother you again. QAQ I see that your article (A New Method for Integrating Ecological Niche Modeling with Phylogenetics to Estimate Ancestral Distributions) says "The variable reduction is performed by generating ENMs using boosted regression trees (Elith et al 2019) as implemented in the function “humboldt.top.env” from the R package humboldt (Brown and Carnaval 2019)". The “humboldt.top.env” function requires the parameters env1, env2, sp1, and sp2, so does it only work between two species? However, you used four species in your article, so how did you do the variable reduction with the “humboldt.top.env” function?

Looking forward to your reply.

wxguillo commented 1 year ago

Hi Zeroo, we have an equivalent function in Machuruku, machu.top.env, that works for multiple species. Try that one out instead!
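
For example (just a sketch, assuming all of your species are stacked in a single occurrence table occ and ClimCur is your current-climate raster stack):

machu.top.env(occ, ClimCur, method = "contrib")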

Zeroo11 commented 1 year ago

Thank you, I have tried it.

I used the example data from your paper.

  1. When I run machu.top.env(occ.rarefied, ClimCur, method = "contrib", contrib.greater = 10), I get an error message:
    **Error in var(threshold.stats, use = "complete.obs") : 
    no complete element pairs**
    In addition: There were 32 warnings (use warnings() to see them)
    > warnings()
    Warning messages:
    1: glm.fit: fitted probabilities numerically 0 or 1 occurred
    2: glm.fit: algorithm did not converge
    3: glm.fit: fitted probabilities numerically 0 or 1 occurred
    4: glm.fit: fitted probabilities numerically 0 or 1 occurred
    5: glm.fit: fitted probabilities numerically 0 or 1 occurred
    6: glm.fit: fitted probabilities numerically 0 or 1 occurred
    7: glm.fit: fitted probabilities numerically 0 or 1 occurred
    8: glm.fit: fitted probabilities numerically 0 or 1 occurred
    9: glm.fit: fitted probabilities numerically 0 or 1 occurred
    10: glm.fit: fitted probabilities numerically 0 or 1 occurred
    11: glm.fit: fitted probabilities numerically 0 or 1 occurred
    12: glm.fit: algorithm did not converge
    13: glm.fit: fitted probabilities numerically 0 or 1 occurred
    14: glm.fit: fitted probabilities numerically 0 or 1 occurred
    15: glm.fit: fitted probabilities numerically 0 or 1 occurred
    16: glm.fit: fitted probabilities numerically 0 or 1 occurred
    17: glm.fit: fitted probabilities numerically 0 or 1 occurred
    18: glm.fit: fitted probabilities numerically 0 or 1 occurred
    19: In cor(y_i, u_i) : the standard deviation is zero
    20: In cor(y_i, u_i) : the standard deviation is zero
    21: In cor(y_i, u_i) : the standard deviation is zero
    22: In cor(y_i, u_i) : the standard deviation is zero
    23: glm.fit: algorithm did not converge
    24: glm.fit: fitted probabilities numerically 0 or 1 occurred
    25: glm.fit: fitted probabilities numerically 0 or 1 occurred
    26: In cor(y_i, u_i) : the standard deviation is zero
    27: In cor(y_i, u_i) : the standard deviation is zero
    28: In cor(y_i, u_i) : the standard deviation is zero
    29: In cor(y_i, u_i) : the standard deviation is zero
    30: In cor(y_i, u_i) : the standard deviation is zero
    31: In cor(y_i, u_i) : the standard deviation is zero
    32: In cor(y_i, u_i) : the standard deviation is zero

    However, when I use the unreduced data, although there are warning messages, I can get the top environmental variables result:

    [1] "Important variables:"
    [1] "bio_1"  "bio_12" "bio_13" "bio_14" "bio_15" "bio_18" "bio_19"
    **There were 35 warnings (use warnings() to see them)**
  2. Using the "nvars" method, there is also an error:
    > machu.top.env(occ, ClimCur, method = "nvars", nvars.save = 6)
    **Error in h(simpleError(msg, call)) : 
    error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': could not find function "bind_rows"
    In addition: There were 37 warnings (use warnings() to see them)**

    With my own data, the first error occurs with both methods ("contrib" and "nvars").

    **Error in var(threshold.stats, use = "complete.obs") : 
    no complete element pairs**
    In addition: There were 12 warnings (use warnings() to see them)

Is it because of an NA problem? I extracted the climate data for each occurrence point and checked that there are no NAs:

> all.point<-occ[,-1]  
> coordinates(all.point) <- ~long_DD+lat_DD
> all.point.clim<-extract(ClimCur,all.point)
> class(all.point.clim)
[1] "matrix" "array" 
> anyNA(all.point.clim)
[1] FALSE

Do you know the reason for these two errors?

wxguillo commented 1 year ago

The issue with the example data might be because one of the species has too few occurrence points after rarefaction. The second problem is probably because the package dplyr is not loaded --- not importing it was my mistake; if you run library(dplyr) first, it should fix it.

As for your own data, I can't really say what the problem is without looking at the data myself. I also revamped this function for Machuruku 2; you can try that version and see if it fixes the issue. Do make sure that you've specified the columns in your occurrence data properly: the function assumes species is in column 1, longitude in column 2, and latitude in column 3, unless you specify otherwise (see the example after the function below).

Updated machu.top.env function:

# Note: this function relies on the raster package (nlayers, rasterToPoints, sampleRandom, extract),
# dismo (gbm.step, gbm.simplify), and dplyr (bind_rows), so load those packages first.
machu.top.env <- function(occ, clim, sp.col = 1, long.col = 2, lat.col = 3, learning.rt = 0.01,
                          pa.ratio = 4, steps = 50, method = "contrib", nvars.save = 5,
                          contrib.greater = 5, verbose = T) # method is one of 'estimate', 'contrib', or 'nvars'
{
  # input adjustments
  if (method=="ESTIMATE" | method=="Estimate") method <- "estimate"
  if (method=="CONTRIB" | method=="Contrib" | method=="contribution") method <- "contrib"
  if (method=="Nvars" | method=="NVARS") method <- "nvars"
  # this is necessary bc the gbm.step function works by setting "silent" rather than "verbose", which are opposites
  if (verbose == T) verbose <- F else if (verbose == F) verbose <- T

  # determine indices (i.e., which columns, corresponding to no. climate variables) of predictor vars
  e.var <- 4:(3+nlayers(clim))

  # identify max number of occurrence points across all species
  n.occ.max <- max(table(occ[[sp.col]]))
  # randomly sample n.occ.max*30 points from the raster climate data: these will be the pool from which pseudoabsences will be drawn
  # however if the data is low resolution there will likely be too few cells to sample w/o replacement
  # check if this is the case, and if so, cap the pool size at the number of available cells
  n.cells <- nrow(rasterToPoints(clim[[1]], fun=NULL, spatial=FALSE)) # doesn't count ocean cells
  n.pool <- n.occ.max*30
  if (n.pool > n.cells) n.pool <- n.cells
  if (verbose == F) { print("Extracting climate values from rasters. This could take a minute.")}
  pa.pool <- sampleRandom(clim, n.pool, xy=T)
  if (verbose == F) { print("Finished extracting climate values.")}
  ID <- rep(0, nrow(pa.pool))
  pa.pool <- cbind(ID, pa.pool)

  # initialize output list
  imp.vars <- list()
  # loop over each taxon
  for (i in 1:length(unique(occ[[sp.col]]))){

    # get species name
    sp.name <- unique(occ[[sp.col]])[i]

    # get number of points for that taxon
    n.occ <- sum(occ[[sp.col]] == sp.name)

    # randomly sample n.occ*4 pseudoabsence (pa) points from pa.pool (4 is changeable with pa.ratio)
    # there have to be at least 50 points for the gbm model to work
    # if n.occ*pa.ratio > n.cells, set n.pa = n.cells
    # perform various other checks and corrections
    n.pa <- n.occ*pa.ratio
    if (n.occ+(n.occ*pa.ratio) < 50) n.pa <- 50-n.occ
    if (n.occ*pa.ratio > n.cells) n.pa <- n.cells

    # sample pseudoabsences
    sp.pa <- pa.pool[sample(nrow(pa.pool), n.pa),]

    # get occurrence points for that species
    sp.occ <- occ[occ[[sp.col]] == sp.name, c(long.col, lat.col)]
    colnames(sp.occ) <- c("x", "y")
    ID <- rep(1, nrow(sp.occ))

    # extract climate values for actual occurrence points and put it all in a table
    sp.data <- data.frame(ID, sp.occ, extract(clim, sp.occ))

    # combine pseudoabsence and species occurrence tables
    datamodel <- rbind(sp.pa, sp.data)

    if (method == "nvars" || method == "contrib") {
      modelOut <- gbm.step(data = datamodel, gbm.x = e.var, gbm.y = 1, family = "bernoulli",
                            tree.complexity = 5, learning.rate = learning.rt, step.size = steps, bag.fraction = 0.5,
                            plot.main = FALSE, plot.folds = FALSE, silent = verbose)
    }
    if (method == "estimate") {
      modelOut1 <<- gbm.step(data = datamodel, gbm.x = e.var, gbm.y = 1, family = "bernoulli",
                              tree.complexity = 5, learning.rate = learning.rt, step.size = steps, bag.fraction = 0.5,
                              plot.main = FALSE, plot.folds = FALSE, silent = verbose)
      mod.simp <<- gbm.simplify(modelOut1)
      min.y <- min(c(0, mod.simp$deviance.summary$mean))
      n.vars1 <- match(min.y, c(0, mod.simp$deviance.summary$mean)) - 1
      vars1.use <- mod.simp$pred.list[[n.vars1]]
      # vars1.use now holds the column indices of the predictors retained by gbm.simplify
      modelOut <- gbm.step(data = datamodel, gbm.x = mod.simp$pred.list[[n.vars1]], gbm.y = 1,
                            family = "bernoulli", tree.complexity = 5, learning.rate = learning.rt, step.size = steps,
                            bag.fraction = 0.5, plot.main = FALSE, plot.folds = FALSE, silent = verbose)
    }
    if (method == "nvars") {
      varInc <- as.vector(modelOut$contributions[1:nvars.save, ])
    }
    if (method == "estimate") {
      varInc <- as.vector(modelOut$contributions[, 1])
    }
    if (method == "contrib") {
      varInc <- as.vector(modelOut$contributions$var[modelOut$contributions$rel.inf > contrib.greater])
    }
    imp.vars[[i]] <- varInc
  }
  # sort the combined variable importances and take the top nvars.save across all taxa
  if (method == "nvars") {
    h <- as.data.frame(bind_rows(imp.vars))
    k <- as.vector(unique(h$var))
    for (i in k){
      j<-h[h$var==i,]
      if (match(i, k) == 1){
        l <- j[j$rel.inf==max(j$rel.inf),]
      } else {
        l <- rbind(l, j[j$rel.inf==max(j$rel.inf),])
      }
    }
    l <- l[order(-l$rel.inf),]
    imp.var.out <- as.vector(l[1:nvars.save,][,1])
  }
  if (method == "contrib" || method == "estimate"){
    imp.var.out <- sort(unique(unlist(imp.vars)))
  }
  print("Important variables:")
  print(imp.var.out)
  return(imp.var.out)
}
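
For example, with dplyr loaded and the columns specified explicitly, a call might look like this (just a sketch; the column positions are illustrative and should match your own occurrence table, e.g. species, long_DD, lat_DD):

library(dplyr)   # provides bind_rows(), which machu.top.env uses internally

# illustrative call: species names in column 1, longitude in column 2, latitude in column 3
top.vars <- machu.top.env(occ, ClimCur,
                          sp.col = 1, long.col = 2, lat.col = 3,
                          method = "contrib", contrib.greater = 5)
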
wxguillo commented 1 year ago

Also, as an alternative, you can try the removeCollinearity function from the virtualspecies package, which does pretty much the same thing in a different way.
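
For example (just a sketch; the 0.7 correlation cutoff is an arbitrary illustrative value, not a recommendation):

library(virtualspecies)

# flag variables whose pairwise correlation exceeds the cutoff and keep one representative per correlated group
kept.vars <- removeCollinearity(ClimCur, multicollinearity.cutoff = 0.7,
                                select.variables = TRUE, plot = FALSE)
ClimCur.reduced <- ClimCur[[kept.vars]]   # subset the raster stack to the retained layers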