welch-lab / liger

R package for integrating and analyzing multiple single-cell datasets
GNU General Public License v3.0
381 stars 78 forks source link

selectGenes No genes were selected; lower var.thresh values or choose 'union' for combine parameter #208

Closed Roger-GOAT closed 3 years ago

Roger-GOAT commented 3 years ago

Hi, In "selectGenes", no matter how I change the parameter, it showed "No genes were selected; lower var.thresh values or choose 'union' for combine parameter". The var.thresh has been set from 0.5 to 0.0001. And I change the combine to union or intersection. If I set "do.plot = T", the picture is showed as below. image and notice "Error in int_abline(a = a, b = b, h = h, v = v, untf = untf, ...) : 'a' and 'b' must be finite" Could anyone give some tips? Thanks a lot!

With my code: ifnb_liger <- createLiger(list(stim = "./scdata/h5data/day2.h5", ctrl = "./scdata/h5data/day0.h5")) ifnb_liger <- liger::normalize(ifnb_liger) ifnb_liger <- selectGenes(ifnb_liger,var.thresh = 0.1)

sessionInfo() R version 4.0.3 (2020-10-10) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS

Matrix products: default BLAS/LAPACK: /opt/intel/compilers_and_libraries_2020.3.279/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so

locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets methods base

other attached packages: [1] clustree_0.4.3 ggraph_2.0.4 RCurl_1.98-1.2
[4] SeuratWrappers_0.3.0 forcats_0.5.0 stringr_1.4.0
[7] purrr_0.3.4 readr_1.4.0 tidyr_1.1.2
[10] tibble_3.0.6 tidyverse_1.3.0 dplyr_1.0.4
[13] dbplyr_2.0.0 harmony_1.0 Rcpp_1.0.6
[16] SingleR_1.4.0 SummarizedExperiment_1.20.0 Biobase_2.50.0
[19] GenomicRanges_1.42.0 GenomeInfoDb_1.26.2 IRanges_2.24.1
[22] S4Vectors_0.28.1 BiocGenerics_0.36.0 MatrixGenerics_1.2.0
[25] matrixStats_0.58.0 liger_0.5.0 Matrix_1.2-18
[28] cowplot_1.1.1 future_1.21.0 patchwork_1.1.1
[31] sctransform_0.3.2 ggplot2_3.3.3 SeuratObject_4.0.0
[34] Seurat_4.0.0

loaded via a namespace (and not attached): [1] readxl_1.3.1 snow_0.4-3 backports_1.2.1
[4] plyr_1.8.6 igraph_1.2.6 lazyeval_0.2.2
[7] splines_4.0.3 BiocParallel_1.24.1 listenv_0.8.0
[10] scattermore_0.7 digest_0.6.27 foreach_1.5.1
[13] htmltools_0.5.1.1 viridis_0.5.1 checkmate_2.0.0
[16] magrittr_2.0.1 tensor_1.5 cluster_2.1.0
[19] ROCR_1.0-11 limma_3.46.0 remotes_2.2.0
[22] graphlayouts_0.7.1 globals_0.14.0 modelr_0.1.8
[25] colorspace_2.0-0 rvest_0.3.6 ggrepel_0.9.1
[28] haven_2.3.1 crayon_1.4.1 jsonlite_1.7.2
[31] spatstat_1.64-1 spatstat.data_1.7-0 survival_3.2-7
[34] zoo_1.8-8 iterators_1.0.13 glue_1.4.2
[37] polyclip_1.10-0 gtable_0.3.0 zlibbioc_1.36.0
[40] XVector_0.30.0 leiden_0.3.7 DelayedArray_0.16.0
[43] BiocSingular_1.6.0 future.apply_1.7.0 abind_1.4-5
[46] scales_1.1.1 pheatmap_1.0.12 DBI_1.1.0
[49] miniUI_0.1.1.1 viridisLite_0.3.0 xtable_1.8-4
[52] riverplot_0.10 reticulate_1.18 bit_4.0.4
[55] rsvd_1.0.3 mclust_5.4.7 htmlwidgets_1.5.3
[58] httr_1.4.2 FNN_1.1.3 RColorBrewer_1.1-2
[61] ellipsis_0.3.1 ica_1.0-2 farver_2.0.3
[64] pkgconfig_2.0.3 uwot_0.1.10 deldir_0.2-9
[67] labeling_0.4.2 tidyselect_1.1.0 rlang_0.4.10
[70] reshape2_1.4.4 later_1.1.0.1 cellranger_1.1.0
[73] munsell_0.5.0 tools_4.0.3 cli_2.3.0
[76] generics_0.1.0 broom_0.7.2 ggridges_0.5.3
[79] fastmap_1.1.0 goftest_1.2-2 fs_1.5.0
[82] bit64_4.0.5 tidygraph_1.2.0 fitdistrplus_1.1-3
[85] RANN_2.6.1 pbapply_1.4-3 nlme_3.1-151
[88] sparseMatrixStats_1.2.0 mime_0.10 xml2_1.3.2
[91] rstudioapi_0.13 hdf5r_1.3.3 compiler_4.0.3
[94] plotly_4.9.3 png_0.1-7 spatstat.utils_2.0-0
[97] reprex_0.3.0 tweenr_1.0.1 stringi_1.5.3
[100] RSpectra_0.16-0 lattice_0.20-41 vctrs_0.3.6
[103] pillar_1.4.7 lifecycle_1.0.0 BiocManager_1.30.10
[106] lmtest_0.9-38 RcppAnnoy_0.0.18 BiocNeighbors_1.8.2
[109] data.table_1.13.6 bitops_1.0-6 irlba_2.3.3
[112] httpuv_1.5.5 R6_2.5.0 promises_1.1.1
[115] KernSmooth_2.23-18 gridExtra_2.3 parallelly_1.23.0
[118] codetools_0.2-18 MASS_7.3-53 assertthat_0.2.1
[121] withr_2.4.1 GenomeInfoDbData_1.2.4 hms_0.5.3
[124] mgcv_1.8-33 doSNOW_1.0.19 grid_4.0.3
[127] rpart_4.1-15 beachmat_2.6.2 DelayedMatrixStats_1.12.1 [130] Rtsne_0.15 ggforce_0.3.2 lubridate_1.7.9.2
[133] shiny_1.6.0

bschilder commented 3 years ago

Having similar issues here. selectGenes() is only returning 1 to 3 genes (depending on the dataset). Have also had it return 0 in the recent past. Bc there's no message output by default, turns out this has been silently causing problems with the ability of optimizeALS() (1-3 genes isn't a whole lot of info to use for scRNAseq data integration...) @NathanSkene

Is there a way I can manually set the variable genes in the liger object? Seurat::FindVariableFeatures() seems to identify plenty of genes (1000s) on the same dataset so I thought I might be able to insert them into the liger object.

seurat_obj <- Seurat::FindVariableFeatures(seurat_obj)
liger_obj@var.genes <- VariableFeatures(seurat_obj)

But this causes downstream issues: lig <- scaleNotCenter(lig)

Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'x' in selecting a method for function 't': invalid character indexing

I've also tried importing the var genes with seuratToLiger(). But using this function breaks other downstream liger functions. So I instead manually construct a expression matrix list:

exp_list <- lapply(unique(seurat_obj$dataset),function(x){
    Seurat::GetAssayData(seurat_obj[,seurat_obj$dataset==x], 
                         slot = "data") 
  }) %>% `names<-`(unique(seurat_obj$dataset))
 liger_object <- createLiger(exp_list, make.sparse = T,  
                     take.gene.union = FALSE) 

Same issues with both the rliger and liger installations.

cgao90 commented 3 years ago

Hi Roger @Roger-GOAT ,

Thank you for reporting this issue. It is likely due to that both of the input data (in h5 format) contain cells not expressing any genes. At this moment, online Liger is not able to filter them during the object creation, but we will work on it. A temporary solution for now is to exclude these cells from the dataset and regenerate the h5 files. Please let us know if this resolves the problem.

Best, Chao

cgao90 commented 3 years ago

Hi Brian @bschilder ,

Thanks very much for bring it up. Given the limited information, I am not sure what led to few selected variable genes. As for the failure of inserting genes identified by Seurat, it is because Liger (in memory approach) filters genes not expressed in any cells in each input dataset. If some of these genes (from Seurat) are among those get removed, there will be the "invalid character indexing" error.

To address this problem, a quick fix has been added to the package. Try re-installing Liger from github and setting remove.missing = F when calling both createLiger and normalize. Please keep us posted whether this allows you to proceed with your analysis.

Best, Chao

skpalan commented 3 years ago

Hi Brain @bschilder, what is the format of your input data, h5 or in-memory matrix object? Also have you tried setting different var.thresh for the selectGenes function?

bschilder commented 3 years ago

Many thanks to you both for the prompt replies! Here's a bit more info on what I'm seeing.

@cgao90 When i supply 8 datasets (fly, mouse, zebrafish and human): lig <- createLiger(seurat, take.gene.union = T) : 410 genes lig <- selectGenes(lig): ~250 genes

When I supply just 3 datasets (mouse and fly): lig <- createLiger(seurat, take.gene.union = T) : 4,297 genes lig <- selectGenes(lig): anywhere between 1 and 48 genes (but never much more than that regardless of how I tweak the parameters)

Giving the liger mods a go now, will let you know.

...As for the failure of inserting genes identified by Seurat, it is because Liger (in memory approach) filters genes not expressed in any cells in each input dataset. If some of these genes (from Seurat) are among those get removed, there will be the "invalid character indexing" error.

I made sure to filter out non-expressed genes and remove variable genes not in the final liger dataset before trying to insert them into the lig@var.genes slot. So I don't think that's the source of the errors.

@skpalan Tried a couple different formats when I extract the matrices with lapply, all of which were in-memory.

bschilder commented 3 years ago

@cgao90

Just tried out the new mods. Here's the script:

{
  var_genes <-  lapply(SplitObject(seurat_lvl.sub,split.by = "dataset"),function(x){
    VariableFeatures(FindVariableFeatures(x))
  } ) %>% unlist() %>% unique()

  exp_list <- lapply(unique(seurat_lvl.sub$dataset),function(x){
    Seurat::GetAssayData(seurat_lvl.sub[,seurat_lvl.sub$dataset==x], 
                         slot = "data")[var_genes,]
  }) %>% `names<-`(unique(seurat_lvl.sub$dataset))

  lig <- createLiger(exp_list, make.sparse = T, 
                     take.gene.union = T, 
                     remove.missing = F) 
  lig <- normalize(lig, remove.missing = F) 
  # lig <- selectGenes(lig, num.genes = 2000, tol = .00000000001, alpha.thresh = 8000)
  lig <- selectGenes(lig)
  # lig@var.genes <- var_genes[var_genes%in%row.names(lig@norm.data[[1]])]
  lig <- scaleNotCenter(lig)
}

{
  lig <- optimizeALS(lig, k = 20)
  lig <- quantile_norm(lig)
  lig <- louvainCluster(lig)
}

{
  lig <- runUMAP(lig,
                 n_neighbors = 30, 
                 min_dist = 0.3,
                 distance = 'cosine')

  all.plots <- plotByDatasetAndCluster(lig, 
                                       pt.size = 1,
                                       axis.labels = c('UMAP 1', 'UMAP 2'), 
                                       return.plots = T)
  patchwork::wrap_plots(all.plots)
}

It's able to run and uses 2546 variable genes, but seems to be struggling to integrate across datasets (shown here by color). I get quite a similar level of integration just by running UMAP on the raw dataset (without any batch correction/integration methods):

Screenshot 2021-02-19 at 12 55 25

In contrast, I ran the same seurat object through fastMNN which seemed to be able to integrate well across datasets and form clusters based on cell-type identity. Which is great, but I don't believe fastMNN returns a post-integration normalized matrix (like LIGER's H.norm), which is really what I'm after as I'd like to do some follow up analyses on this corrected gene x cell matrix.

seurat_lvl <- seurat_lvl.sub
seurat_lvl <- NormalizeData(seurat_lvl)
# seurat_lvl <- FindVariableFeatures(seurat_lvl)
var_genes <-  lapply(SplitObject(seurat_lvl,split.by = "dataset"),function(x){
    max_genes <- as.integer(2000/length(unique(seurat_lvl$dataset)))
    VariableFeatures(FindVariableFeatures(x))[1:max_genes]
  } ) %>% unlist() %>% unique()
seurat_lvl <- RunFastMNN(object.list = SplitObject(seurat_lvl, split.by = "batch"),
                         features = VariableFeatures(seurat_lvl), 
                         subset.row=var_genes)

seurat_lvl <- RunUMAP(seurat_lvl, reduction = "mnn", dims = 1:30) 
seurat_lvl <- FindNeighbors(seurat_lvl, reduction = "umap", dims = 1:2)
seurat_lvl <- FindClusters(seurat_lvl)
# NNPlot(seurat_lvl, nn.idx = 1, query.cells = 1) ## No instructions ???

umap_mnn_plot <- DimPlot(seurat_lvl, 
                         reduction = "umap",
                         shape.by = "species",
                         group.by = c("dataset","species","celltype"), 
                         ncol = 2) + NoLegend() 
Screenshot 2021-02-19 at 13 06 40

Any thoughts as to what might be going on here? Let me know if I'm not using LIGER properly!

Thanks, Brian

bschilder commented 3 years ago

Tried same LIGER code as above, but without removing non-variable genes when constructing my expression matrix list. Slightly better, but dataset 1 (salmon-colored) is still off on its own cluster (despite being composed of neurons/glia/etc like the other datasets).

Screenshot 2021-02-19 at 13 28 33
skpalan commented 3 years ago

Hi Brian @bschilder , glad to see that the pipeline is working fine now!

For the integration results, firstly you can try to set a higher k like 40 and lambda like 10 for the optimizeALS function, and probably also a larger knn_k like 30 for the quantile_norm function, to force a better alignment between the datasets. However, one thing that is worth to be noticed here is that LIGER does NOT do batch correction to integrate the data. Actually, LIGER was designed to reserve the differences between the datasets, and this unique feature can help users to further explore these differences (see functions runWilcoxon, plotFactors) and also serves as a "sanity check" on the integration results. In other words, if the input datasets do NOT share much biological similarities, LIGER just refuses to align them or only aligns the parts that share similarities.

We recommand that you can check out Box 1-3 in our recently published paper here, in which we covered many methods to properly evaluate the alignment results. We also included an example of the poor alignments between two unrelated samples.

Also, if you still have problems or updates on integrating datasets, please open a new issue and we will be happy to discuss further. This thread is aimming to solve questions for selectGenes and scaleNotCenter. Thanks!

bschilder commented 3 years ago

@skpalan I really appreciate the detailed explanation! Thank you so much for taking the time to write that out. I think I have a much clearer idea of what's going on with my data now.

Thanks again, Brian

cgao90 commented 3 years ago

@bschilder ,

1.Are those datasets (from mouse, fly, human, etc) from the same modality? say scRNA-seq?

2.It seems you tried selecting genes using Liger without removing non-expressed genes. In this way, the batch-effect might be amplified. Were you able to run lig@var.genes <- var_genes (selected from Seurat) without the indexing error while keeping those non-expressed genes in the liger object? If so, how does the result compare to the one from fastMNN?

jw156605 commented 3 years ago

Another issue with cross-species data is gene names. Liger requires homologous genes to have exactly the same names in every dataset (e.g., BRCA1 in human, Brca1 in mouse, and brca1 in fish would all need to be renamed to BRCA1).

bschilder commented 3 years ago

@cgao90

  1. They are all scRNAseq with overlapping tissues and cell-types. However i did transform them into pseudobulk within each dataset, by taking the mean expression of each cell-subtype. These pseudobulk expression matrices are what I'm feeding into LIGER. This was partly to try to reduce the dataset size for this initial exploration.
  2. In my case,remove.missing has no effect on the number of cells, but it obviously does have quite an impact on the number of genes (see next section).
  3. Manually inserting lig@var.genes <- var_genes now works when I make sure to set remove.missing=F at all possible steps.
var_genes <-  lapply(SplitObject(seurat_lvl.sub,split.by = "dataset"),function(x){
    VariableFeatures(FindVariableFeatures(x))
  } ) %>% unlist() %>% unique()

exp_list <- lapply(unique(seurat_lvl.sub$dataset),function(x){
    seur <- seurat_lvl.sub[,seurat_lvl.sub$dataset==x]
    seur <- seur[rowSums(seur)>0,] 
    Seurat::GetAssayData(seur, 
                         slot = "data")#[var_genes,] 
  }) %>% `names<-`(unique(seurat_lvl.sub$dataset))
  lapply(exp_list, dim)

  lig <- createLiger(exp_list, make.sparse = T, 
                     take.gene.union = T, 
                     # Remove cells without any expressed genes, 
                     ## and genes not expressed in any dataset.
                     remove.missing = F) 
  lig <- normalize(lig, remove.missing = F)    

  lig@var.genes <- var_genes
  lig <- scaleNotCenter(lig, remove.missing = F)

{
  lig <- optimizeALS(lig, k = 20)#, lambda = 10)
  lig <- quantile_norm(lig)
  lig <- louvainCluster(lig)
}

This produces the following UMAP projection, with similar separation between datasets as before:

Screenshot 2021-02-23 at 20 51 14

When I repeat the same steps but using only the gene intersect during createLiger(), I can manually select all variable genes (thought liger::selectGenes() does select ~250 of these by default):

gene_lists <- lapply(lig@norm.data, function(x){row.names(x)})
  gene_intersect <- Reduce(intersect, gene_lists)

With the intersecting data, zebrafish (blue) does integrate much better, as well as a subset of the fly data (salmon-colored)

Screenshot 2021-02-23 at 21 02 50

@jw156605

  1. I mapped all species to 1:1 human orthologues before creating the merged dataset. This does present a challenge, tho, because with increasing evolutionary divergence the fewer 1:1 orths remain. For example, this leaves 10k genes with my zebrafish dataset and 1k genes with the fly dataset. Meaning gene filtering steps after orthologue conversion can leave us with very few genes within certain dataset, or when taking the intersect of genes expressed in all datasets.

@skpalan

  1. Raising k past 24 seems to induce the following error when I go to run the optimiseALS() step when using the all-datasets with take.gene.union=F (leaves between 1k to 18k genes within each lig@raw.data object) . At k=24, the datasets show a similar level of separation.
    Error in if (trim > 0 && n) { : missing value where TRUE/FALSE needed
  2. Raising lambda to 10 (while keeping k=20) runs until 50% and then produces the same error as above. I've noticed this same error in a number of different situations (certain combinations of datasets, parameters) but I can't seem to figure out the logic to it.

While I'd expect there to be within-celltype-cluster separation between datasets (e.g. fly vs mouse neurons within the neuronal cluster), I'm seeing total separation by dataset regardless of cell-type. As you can see in the plots above, taking the gene intersection seems to better integrate the distant species because they no longer primarily being driven by a bunch of non-expressed genes (that are present in other species). However, this is at the cost reducing the number of obvious cell-type clusters within the mammals.

cgao90 commented 3 years ago

@bschilder It is also possible that transforming data into pseudobulk has an impact on the matrix factorization results.

bschilder commented 3 years ago

@cgao90 I think the pseudobulk was indeed affecting many of these issues, simply because there were far too few "cells" in each dataset (ranging between 6 pseudobulk cells to dozens). I've gone back to the raw scRNAseq data to try and integrate at that level, but this seems to cause R to crash every time (even with smallest datasets). I'll describe in more detail in a separate Issue since it goes a bit beyond what this thread is discussing.

Thanks as always, Brian