Closed Roger-GOAT closed 3 years ago
Having similar issues here. selectGenes()
is only returning 1 to 3 genes (depending on the dataset). Have also had it return 0 in the recent past. Bc there's no message output by default, turns out this has been silently causing problems with the ability of optimizeALS()
(1-3 genes isn't a whole lot of info to use for scRNAseq data integration...)
@NathanSkene
Is there a way I can manually set the variable genes in the liger object? Seurat::FindVariableFeatures()
seems to identify plenty of genes (1000s) on the same dataset so I thought I might be able to insert them into the liger object.
seurat_obj <- Seurat::FindVariableFeatures(seurat_obj)
liger_obj@var.genes <- VariableFeatures(seurat_obj)
But this causes downstream issues:
lig <- scaleNotCenter(lig)
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'x' in selecting a method for function 't': invalid character indexing
I've also tried importing the var genes with seuratToLiger()
. But using this function breaks other downstream liger functions. So I instead manually construct a expression matrix list:
exp_list <- lapply(unique(seurat_obj$dataset),function(x){
Seurat::GetAssayData(seurat_obj[,seurat_obj$dataset==x],
slot = "data")
}) %>% `names<-`(unique(seurat_obj$dataset))
liger_object <- createLiger(exp_list, make.sparse = T,
take.gene.union = FALSE)
Same issues with both the rliger and liger installations.
Hi Roger @Roger-GOAT ,
Thank you for reporting this issue. It is likely due to that both of the input data (in h5 format) contain cells not expressing any genes. At this moment, online Liger is not able to filter them during the object creation, but we will work on it. A temporary solution for now is to exclude these cells from the dataset and regenerate the h5 files. Please let us know if this resolves the problem.
Best, Chao
Hi Brian @bschilder ,
Thanks very much for bring it up. Given the limited information, I am not sure what led to few selected variable genes. As for the failure of inserting genes identified by Seurat, it is because Liger (in memory approach) filters genes not expressed in any cells in each input dataset. If some of these genes (from Seurat) are among those get removed, there will be the "invalid character indexing" error.
To address this problem, a quick fix has been added to the package. Try re-installing Liger from github and setting remove.missing = F
when calling both createLiger
and normalize
. Please keep us posted whether this allows you to proceed with your analysis.
Best, Chao
Hi Brain @bschilder, what is the format of your input data, h5 or in-memory matrix object? Also have you tried setting different var.thresh
for the selectGenes
function?
Many thanks to you both for the prompt replies! Here's a bit more info on what I'm seeing.
@cgao90 When i supply 8 datasets (fly, mouse, zebrafish and human): lig <- createLiger(seurat, take.gene.union = T) : 410 genes lig <- selectGenes(lig): ~250 genes
When I supply just 3 datasets (mouse and fly): lig <- createLiger(seurat, take.gene.union = T) : 4,297 genes lig <- selectGenes(lig): anywhere between 1 and 48 genes (but never much more than that regardless of how I tweak the parameters)
Giving the liger mods a go now, will let you know.
...As for the failure of inserting genes identified by Seurat, it is because Liger (in memory approach) filters genes not expressed in any cells in each input dataset. If some of these genes (from Seurat) are among those get removed, there will be the "invalid character indexing" error.
I made sure to filter out non-expressed genes and remove variable genes not in the final liger dataset before trying to insert them into the lig@var.genes
slot. So I don't think that's the source of the errors.
@skpalan Tried a couple different formats when I extract the matrices with lapply, all of which were in-memory.
@cgao90
Just tried out the new mods. Here's the script:
{
var_genes <- lapply(SplitObject(seurat_lvl.sub,split.by = "dataset"),function(x){
VariableFeatures(FindVariableFeatures(x))
} ) %>% unlist() %>% unique()
exp_list <- lapply(unique(seurat_lvl.sub$dataset),function(x){
Seurat::GetAssayData(seurat_lvl.sub[,seurat_lvl.sub$dataset==x],
slot = "data")[var_genes,]
}) %>% `names<-`(unique(seurat_lvl.sub$dataset))
lig <- createLiger(exp_list, make.sparse = T,
take.gene.union = T,
remove.missing = F)
lig <- normalize(lig, remove.missing = F)
# lig <- selectGenes(lig, num.genes = 2000, tol = .00000000001, alpha.thresh = 8000)
lig <- selectGenes(lig)
# lig@var.genes <- var_genes[var_genes%in%row.names(lig@norm.data[[1]])]
lig <- scaleNotCenter(lig)
}
{
lig <- optimizeALS(lig, k = 20)
lig <- quantile_norm(lig)
lig <- louvainCluster(lig)
}
{
lig <- runUMAP(lig,
n_neighbors = 30,
min_dist = 0.3,
distance = 'cosine')
all.plots <- plotByDatasetAndCluster(lig,
pt.size = 1,
axis.labels = c('UMAP 1', 'UMAP 2'),
return.plots = T)
patchwork::wrap_plots(all.plots)
}
It's able to run and uses 2546 variable genes, but seems to be struggling to integrate across datasets (shown here by color). I get quite a similar level of integration just by running UMAP on the raw dataset (without any batch correction/integration methods):
In contrast, I ran the same seurat object through fastMNN which seemed to be able to integrate well across datasets and form clusters based on cell-type identity. Which is great, but I don't believe fastMNN returns a post-integration normalized matrix (like LIGER's H.norm), which is really what I'm after as I'd like to do some follow up analyses on this corrected gene x cell matrix.
seurat_lvl <- seurat_lvl.sub
seurat_lvl <- NormalizeData(seurat_lvl)
# seurat_lvl <- FindVariableFeatures(seurat_lvl)
var_genes <- lapply(SplitObject(seurat_lvl,split.by = "dataset"),function(x){
max_genes <- as.integer(2000/length(unique(seurat_lvl$dataset)))
VariableFeatures(FindVariableFeatures(x))[1:max_genes]
} ) %>% unlist() %>% unique()
seurat_lvl <- RunFastMNN(object.list = SplitObject(seurat_lvl, split.by = "batch"),
features = VariableFeatures(seurat_lvl),
subset.row=var_genes)
seurat_lvl <- RunUMAP(seurat_lvl, reduction = "mnn", dims = 1:30)
seurat_lvl <- FindNeighbors(seurat_lvl, reduction = "umap", dims = 1:2)
seurat_lvl <- FindClusters(seurat_lvl)
# NNPlot(seurat_lvl, nn.idx = 1, query.cells = 1) ## No instructions ???
umap_mnn_plot <- DimPlot(seurat_lvl,
reduction = "umap",
shape.by = "species",
group.by = c("dataset","species","celltype"),
ncol = 2) + NoLegend()
Any thoughts as to what might be going on here? Let me know if I'm not using LIGER properly!
Thanks, Brian
Tried same LIGER code as above, but without removing non-variable genes when constructing my expression matrix list. Slightly better, but dataset 1 (salmon-colored) is still off on its own cluster (despite being composed of neurons/glia/etc like the other datasets).
Hi Brian @bschilder , glad to see that the pipeline is working fine now!
For the integration results, firstly you can try to set a higher k like 40 and lambda like 10 for the optimizeALS
function, and probably also a larger knn_k like 30 for the quantile_norm
function, to force a better alignment between the datasets. However, one thing that is worth to be noticed here is that LIGER does NOT do batch correction to integrate the data. Actually, LIGER was designed to reserve the differences between the datasets, and this unique feature can help users to further explore these differences (see functions runWilcoxon
, plotFactors
) and also serves as a "sanity check" on the integration results. In other words, if the input datasets do NOT share much biological similarities, LIGER just refuses to align them or only aligns the parts that share similarities.
We recommand that you can check out Box 1-3 in our recently published paper here, in which we covered many methods to properly evaluate the alignment results. We also included an example of the poor alignments between two unrelated samples.
Also, if you still have problems or updates on integrating datasets, please open a new issue and we will be happy to discuss further. This thread is aimming to solve questions for selectGenes
and scaleNotCenter
. Thanks!
@skpalan I really appreciate the detailed explanation! Thank you so much for taking the time to write that out. I think I have a much clearer idea of what's going on with my data now.
Thanks again, Brian
@bschilder ,
1.Are those datasets (from mouse, fly, human, etc) from the same modality? say scRNA-seq?
2.It seems you tried selecting genes using Liger without removing non-expressed genes. In this way, the batch-effect might be amplified. Were you able to run lig@var.genes <- var_genes
(selected from Seurat) without the indexing error while keeping those non-expressed genes in the liger object? If so, how does the result compare to the one from fastMNN?
Another issue with cross-species data is gene names. Liger requires homologous genes to have exactly the same names in every dataset (e.g., BRCA1 in human, Brca1 in mouse, and brca1 in fish would all need to be renamed to BRCA1).
@cgao90
remove.missing
has no effect on the number of cells, but it obviously does have quite an impact on the number of genes (see next section).lig@var.genes <- var_genes
now works when I make sure to set remove.missing=F
at all possible steps.var_genes <- lapply(SplitObject(seurat_lvl.sub,split.by = "dataset"),function(x){
VariableFeatures(FindVariableFeatures(x))
} ) %>% unlist() %>% unique()
exp_list <- lapply(unique(seurat_lvl.sub$dataset),function(x){
seur <- seurat_lvl.sub[,seurat_lvl.sub$dataset==x]
seur <- seur[rowSums(seur)>0,]
Seurat::GetAssayData(seur,
slot = "data")#[var_genes,]
}) %>% `names<-`(unique(seurat_lvl.sub$dataset))
lapply(exp_list, dim)
lig <- createLiger(exp_list, make.sparse = T,
take.gene.union = T,
# Remove cells without any expressed genes,
## and genes not expressed in any dataset.
remove.missing = F)
lig <- normalize(lig, remove.missing = F)
lig@var.genes <- var_genes
lig <- scaleNotCenter(lig, remove.missing = F)
{
lig <- optimizeALS(lig, k = 20)#, lambda = 10)
lig <- quantile_norm(lig)
lig <- louvainCluster(lig)
}
This produces the following UMAP projection, with similar separation between datasets as before:
When I repeat the same steps but using only the gene intersect during createLiger()
, I can manually select all variable genes (thought liger::selectGenes() does select ~250 of these by default):
gene_lists <- lapply(lig@norm.data, function(x){row.names(x)})
gene_intersect <- Reduce(intersect, gene_lists)
With the intersecting data, zebrafish (blue) does integrate much better, as well as a subset of the fly data (salmon-colored)
@jw156605
@skpalan
optimiseALS()
step when using the all-datasets with take.gene.union=F
(leaves between 1k to 18k genes within each lig@raw.data object) . At k=24, the datasets show a similar level of separation.
Error in if (trim > 0 && n) { : missing value where TRUE/FALSE needed
While I'd expect there to be within-celltype-cluster separation between datasets (e.g. fly vs mouse neurons within the neuronal cluster), I'm seeing total separation by dataset regardless of cell-type. As you can see in the plots above, taking the gene intersection seems to better integrate the distant species because they no longer primarily being driven by a bunch of non-expressed genes (that are present in other species). However, this is at the cost reducing the number of obvious cell-type clusters within the mammals.
@bschilder It is also possible that transforming data into pseudobulk has an impact on the matrix factorization results.
@cgao90 I think the pseudobulk was indeed affecting many of these issues, simply because there were far too few "cells" in each dataset (ranging between 6 pseudobulk cells to dozens). I've gone back to the raw scRNAseq data to try and integrate at that level, but this seems to cause R to crash every time (even with smallest datasets). I'll describe in more detail in a separate Issue since it goes a bit beyond what this thread is discussing.
Thanks as always, Brian
Hi, In "selectGenes", no matter how I change the parameter, it showed "No genes were selected; lower var.thresh values or choose 'union' for combine parameter". The var.thresh has been set from 0.5 to 0.0001. And I change the combine to union or intersection. If I set "do.plot = T", the picture is showed as below. and notice "Error in int_abline(a = a, b = b, h = h, v = v, untf = untf, ...) : 'a' and 'b' must be finite" Could anyone give some tips? Thanks a lot!
With my code: ifnb_liger <- createLiger(list(stim = "./scdata/h5data/day2.h5", ctrl = "./scdata/h5data/day0.h5")) ifnb_liger <- liger::normalize(ifnb_liger) ifnb_liger <- selectGenes(ifnb_liger,var.thresh = 0.1)
Matrix products: default BLAS/LAPACK: /opt/intel/compilers_and_libraries_2020.3.279/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so
locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages: [1] clustree_0.4.3 ggraph_2.0.4 RCurl_1.98-1.2
[4] SeuratWrappers_0.3.0 forcats_0.5.0 stringr_1.4.0
[7] purrr_0.3.4 readr_1.4.0 tidyr_1.1.2
[10] tibble_3.0.6 tidyverse_1.3.0 dplyr_1.0.4
[13] dbplyr_2.0.0 harmony_1.0 Rcpp_1.0.6
[16] SingleR_1.4.0 SummarizedExperiment_1.20.0 Biobase_2.50.0
[19] GenomicRanges_1.42.0 GenomeInfoDb_1.26.2 IRanges_2.24.1
[22] S4Vectors_0.28.1 BiocGenerics_0.36.0 MatrixGenerics_1.2.0
[25] matrixStats_0.58.0 liger_0.5.0 Matrix_1.2-18
[28] cowplot_1.1.1 future_1.21.0 patchwork_1.1.1
[31] sctransform_0.3.2 ggplot2_3.3.3 SeuratObject_4.0.0
[34] Seurat_4.0.0
loaded via a namespace (and not attached): [1] readxl_1.3.1 snow_0.4-3 backports_1.2.1
[4] plyr_1.8.6 igraph_1.2.6 lazyeval_0.2.2
[7] splines_4.0.3 BiocParallel_1.24.1 listenv_0.8.0
[10] scattermore_0.7 digest_0.6.27 foreach_1.5.1
[13] htmltools_0.5.1.1 viridis_0.5.1 checkmate_2.0.0
[16] magrittr_2.0.1 tensor_1.5 cluster_2.1.0
[19] ROCR_1.0-11 limma_3.46.0 remotes_2.2.0
[22] graphlayouts_0.7.1 globals_0.14.0 modelr_0.1.8
[25] colorspace_2.0-0 rvest_0.3.6 ggrepel_0.9.1
[28] haven_2.3.1 crayon_1.4.1 jsonlite_1.7.2
[31] spatstat_1.64-1 spatstat.data_1.7-0 survival_3.2-7
[34] zoo_1.8-8 iterators_1.0.13 glue_1.4.2
[37] polyclip_1.10-0 gtable_0.3.0 zlibbioc_1.36.0
[40] XVector_0.30.0 leiden_0.3.7 DelayedArray_0.16.0
[43] BiocSingular_1.6.0 future.apply_1.7.0 abind_1.4-5
[46] scales_1.1.1 pheatmap_1.0.12 DBI_1.1.0
[49] miniUI_0.1.1.1 viridisLite_0.3.0 xtable_1.8-4
[52] riverplot_0.10 reticulate_1.18 bit_4.0.4
[55] rsvd_1.0.3 mclust_5.4.7 htmlwidgets_1.5.3
[58] httr_1.4.2 FNN_1.1.3 RColorBrewer_1.1-2
[61] ellipsis_0.3.1 ica_1.0-2 farver_2.0.3
[64] pkgconfig_2.0.3 uwot_0.1.10 deldir_0.2-9
[67] labeling_0.4.2 tidyselect_1.1.0 rlang_0.4.10
[70] reshape2_1.4.4 later_1.1.0.1 cellranger_1.1.0
[73] munsell_0.5.0 tools_4.0.3 cli_2.3.0
[76] generics_0.1.0 broom_0.7.2 ggridges_0.5.3
[79] fastmap_1.1.0 goftest_1.2-2 fs_1.5.0
[82] bit64_4.0.5 tidygraph_1.2.0 fitdistrplus_1.1-3
[85] RANN_2.6.1 pbapply_1.4-3 nlme_3.1-151
[88] sparseMatrixStats_1.2.0 mime_0.10 xml2_1.3.2
[91] rstudioapi_0.13 hdf5r_1.3.3 compiler_4.0.3
[94] plotly_4.9.3 png_0.1-7 spatstat.utils_2.0-0
[97] reprex_0.3.0 tweenr_1.0.1 stringi_1.5.3
[100] RSpectra_0.16-0 lattice_0.20-41 vctrs_0.3.6
[103] pillar_1.4.7 lifecycle_1.0.0 BiocManager_1.30.10
[106] lmtest_0.9-38 RcppAnnoy_0.0.18 BiocNeighbors_1.8.2
[109] data.table_1.13.6 bitops_1.0-6 irlba_2.3.3
[112] httpuv_1.5.5 R6_2.5.0 promises_1.1.1
[115] KernSmooth_2.23-18 gridExtra_2.3 parallelly_1.23.0
[118] codetools_0.2-18 MASS_7.3-53 assertthat_0.2.1
[121] withr_2.4.1 GenomeInfoDbData_1.2.4 hms_0.5.3
[124] mgcv_1.8-33 doSNOW_1.0.19 grid_4.0.3
[127] rpart_4.1-15 beachmat_2.6.2 DelayedMatrixStats_1.12.1 [130] Rtsne_0.15 ggforce_0.3.2 lubridate_1.7.9.2
[133] shiny_1.6.0