zdebruine / singlet

Single-cell analysis with non-negative matrix factorization
39 stars 12 forks source link

Singlet modifies the original data matrix in the Seurat object #41

Closed TaopengWang closed 6 months ago

TaopengWang commented 7 months ago

Hi Zach,

Thanks a lot for the great package and the RcppML package. I've mainly been using the RcppML package and have just started with the Singlet package. Unfortunately, I ran into several issues in my analysis. It would be great if you could shed some light on the causes and how to fix them.

1. I experienced the same error message posted before:

Error in c_nmf(A, At, tol, maxit, verbose > 2, L1, L2, threads, w_init_this) : 
  argument "threads" is missing, with no default

I tried the fixes in one of the closed issues #38 , but it didn't work for me. Also, the original code looks all good. 🤷

2. I noticed the RunNMF function edits the original data slot. (1) I have some data from multiple batches normalised using sctransform. There are some batch effects related to the scRNA-seq chemistry and sequencing depth. But for now, sctransform appears to have retained the gene expression features of the cell types

DefaultAssay(merged_sct_obj) <- 'SCT'
VlnPlot(merged_sct_obj, features = 'PECAM1', group.by = 'working_annot')

PECAM1.pdf

(2) I then thought to run NMF on the data matrix. If I understand correctly, the RunNMF function will only normalise the data when the data is not normalised. In my case, the data has been normalised. It's not clear why the function would still change it. Also looks like this can impact the NMF analysis and downstream dimension reduction. As you can see, the 3 different tissue types are mostly non-overlapping with altered expression patterns of PECAM1.

# testing if the data slot has been normalised
> v <- merged_sct_obj@assays[['SCT']]@data@x
> sum(as.integer(v)) == sum(v)
[1] FALSE
# so it has been normalised

# run NMF
# This failed due to the error mentioned above, but the k optimisation was completed.
merged_sct_obj_nmf <- RunNMF(merged_sct_obj, assay = 'SCT')                    
# redo NMF using the selected k
merged_sct_obj_nmf <- RunNMF(merged_sct_obj, assay = 'SCT', k = 24)

# plot marker
DefaultAssay(merged_sct_obj_nmf) <- 'SCT'
VlnPlot(merged_sct_obj_nmf, features = 'PECAM1', group.by = 'working_annot')

# same when plotting gene expression on UMAP
FeaturePlot(merged_sct_obj, slot = 'data', features = 'PECAM1', order = T, reduction = 'jnmf_all')

PECAM1_after_NMF_SCT_assay.pdf PECAM1_UMAP_after_NMF_SCT_Assay.pdf UMAP_NMF_tissue_type.pdf

Any insight would be much appreciated!

Tony

> sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /directflow/TumourProgressionGroupTemp/projects/config/conda/envs/R_4.3/lib/libopenblasp-r0.3.21.so;  LAPACK version 3.9.0

locale:
 [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8    
 [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8   
 [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       

time zone: Australia/Sydney
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] singlet_0.99.38     RcppEigen_0.3.3.9.4 RcppML_0.5.6       
 [4] SeuratObject_4.1.3  Seurat_4.3.0.1      lubridate_1.9.2    
 [7] forcats_1.0.0       stringr_1.5.0       dplyr_1.1.3        
[10] purrr_1.0.2         readr_2.1.4         tidyr_1.3.0        
[13] tibble_3.2.1        ggplot2_3.4.3       tidyverse_2.0.0    

loaded via a namespace (and not attached):
  [1] RColorBrewer_1.1-3          jsonlite_1.8.7             
  [3] magrittr_2.0.3              ggbeeswarm_0.7.2           
  [5] spatstat.utils_3.0-3        farver_2.1.1               
  [7] zlibbioc_1.46.0             vctrs_0.6.3                
  [9] ROCR_1.0-11                 DelayedMatrixStats_1.22.6  
 [11] spatstat.explore_3.2-3      RCurl_1.98-1.12            
 [13] S4Arrays_1.2.0              htmltools_0.5.6            
 [15] SparseArray_1.2.0           sctransform_0.4.1          
 [17] parallelly_1.36.0           KernSmooth_2.23-22         
 [19] htmlwidgets_1.6.2           ica_1.0-3                  
 [21] plyr_1.8.8                  plotly_4.10.2              
 [23] zoo_1.8-12                  igraph_1.5.1               
 [25] mime_0.12                   lifecycle_1.0.3            
 [27] pkgconfig_2.0.3             Matrix_1.6-1               
 [29] R6_2.5.1                    fastmap_1.1.1              
 [31] GenomeInfoDbData_1.2.10     MatrixGenerics_1.12.3      
 [33] fitdistrplus_1.1-11         future_1.33.0              
 [35] shiny_1.7.5                 digest_0.6.33              
 [37] colorspace_2.1-0            S4Vectors_0.38.2           
 [39] patchwork_1.1.3             tensor_1.5                 
 [41] irlba_2.3.5.1               GenomicRanges_1.52.0       
 [43] labeling_0.4.3              progressr_0.14.0           
 [45] fansi_1.0.4                 spatstat.sparse_3.0-2      
 [47] timechange_0.2.0            httr_1.4.7                 
 [49] polyclip_1.10-4             abind_1.4-5                
 [51] compiler_4.3.0              withr_2.5.1                
 [53] BiocParallel_1.34.2         MASS_7.3-60                
 [55] DelayedArray_0.28.0         tools_4.3.0                
 [57] vipor_0.4.5                 lmtest_0.9-40              
 [59] beeswarm_0.4.0              httpuv_1.6.11              
 [61] msigdbr_7.5.1               future.apply_1.11.0        
 [63] goftest_1.2-3               glmGamPoi_1.12.2           
 [65] glue_1.6.2                  nlme_3.1-163               
 [67] promises_1.2.1              grid_4.3.0                 
 [69] Rtsne_0.16                  cluster_2.1.4              
 [71] reshape2_1.4.4              fgsea_1.26.0               
 [73] generics_0.1.3              gtable_0.3.4               
 [75] spatstat.data_3.0-1         tzdb_0.4.0                 
 [77] data.table_1.14.8           hms_1.1.3                  
 [79] XVector_0.40.0              sp_2.0-0                   
 [81] utf8_1.2.3                  BiocGenerics_0.46.0        
 [83] spatstat.geom_3.2-5         RcppAnnoy_0.0.21           
 [85] ggrepel_0.9.3               RANN_2.6.1                 
 [87] pillar_1.9.0                babelgene_22.9             
 [89] limma_3.56.2                later_1.3.1                
 [91] splines_4.3.0               lattice_0.21-8             
 [93] renv_1.0.2                  survival_3.5-7             
 [95] deldir_1.0-9                tidyselect_1.2.0           
 [97] SingleCellExperiment_1.22.0 miniUI_0.1.1.1             
 [99] pbapply_1.7-2               knitr_1.44                 
[101] gridExtra_2.3               IRanges_2.34.1             
[103] SummarizedExperiment_1.30.2 scattermore_1.2            
[105] stats4_4.3.0                xfun_0.40                  
[107] Biobase_2.60.0              matrixStats_1.0.0          
[109] stringi_1.7.12              lazyeval_0.2.2             
[111] codetools_0.2-19            cli_3.6.1                  
[113] uwot_0.1.16                 xtable_1.8-4               
[115] reticulate_1.32.0           munsell_0.5.0              
[117] GenomeInfoDb_1.36.3         Rcpp_1.0.11                
[119] globals_0.16.2              spatstat.random_3.1-6      
[121] png_0.1-8                   ggrastr_1.0.2              
[123] parallel_4.3.0              ellipsis_0.3.2             
[125] sparseMatrixStats_1.12.2    bitops_1.0-7               
[127] listenv_0.9.0               viridisLite_0.4.2          
[129] scales_1.2.1                ggridges_0.5.4             
[131] crayon_1.5.2                leiden_0.4.3               
[133] rlang_1.1.1                 cowplot_1.1.1              
[135] fastmatch_1.1-4            
TaopengWang commented 7 months ago

A quick update: in the examples above, I've been using the SCT assay. This time, I created a new Seurat object with the SCT normalised counts and data. The gene expression data remains unchanged after NMF analysis. However, the 1st error complaining no threads specified still exists.

zdebruine commented 7 months ago

@TaopengWang Got it. So in the case of your first issue, yes I haven't gotten to investigating this yet but it definitely needs attention -- I am not sure what is going on with the threads argument in those functions.

You are trying to use input from SCT to NMF? In practice, this is not ideal because SCTransform adds back residuals from a regression on the data, which makes it very attractive for PCA but horrible for NMF. NMF as a method inherently solves some of the problems that SCTransform tries to address, such as denoising, dealing with heterogeneity of patterns across samples, and suppression of some types of technical artifacts in a way that they won't be detected by PCA. In short, SCTransform was purpose-built for PCA, not NMF. Just use a standard log-normalization, and let NMF pull out batch effect factors. You can identify those batch factors by looking for mean factor loadings in each batch and ignore those in downstream analysis (i.e. graph-based clustering, DE analysis, etc.).

Are you saying that your original Seurat object is changed just by calling RunNMF? I don't think that's the case. What behavior is to be expected right now is that if the data in @assays$RNA@data is integral, it will be standard log-normalized, otherwise it will be used as is. See lines 66-70 of RunNMF.R. What we could do (and then this would be a feature request) is allow the user to specify the assay AND slot which they wish to use, and we don't check for normalization, we just throw a console warning if it does not appear to be properly normalized. Would that fix your issue? For now, just move your SCT Assay to the $RNA@data slot and you should get the behavior you expect.

Also just a cautionary note that tissues are different and you should expect differences between them -- just find the NMF factor that captures tissue-specific variation and exclude it from the UMAP (e.g. set RunUMAP(..., dims = c(factors_that_arenot_tissue_specific)).

IsaacDiaz026 commented 7 months ago

@TaopengWang Got it. So in the case of your first issue, yes I haven't gotten to investigating this yet but it definitely needs attention -- I am not sure what is going on with the threads argument in those functions.

You are trying to use input from SCT to NMF? In practice, this is not ideal because SCTransform adds back residuals from a regression on the data, which makes it very attractive for PCA but horrible for NMF. NMF as a method inherently solves some of the problems that SCTransform tries to address, such as denoising, dealing with heterogeneity of patterns across samples, and suppression of some types of technical artifacts in a way that they won't be detected by PCA. In short, SCTransform was purpose-built for PCA, not NMF. Just use a standard log-normalization, and let NMF pull out batch effect factors. You can identify those batch factors by looking for mean factor loadings in each batch and ignore those in downstream analysis (i.e. graph-based clustering, DE analysis, etc.).

Are you saying that your original Seurat object is changed just by calling RunNMF? I don't think that's the case. What behavior is to be expected right now is that if the data in @assays$RNA@data is integral, it will be standard log-normalized, otherwise it will be used as is. See lines 66-70 of RunNMF.R. What we could do (and then this would be a feature request) is allow the user to specify the assay AND slot which they wish to use, and we don't check for normalization, we just throw a console warning if it does not appear to be properly normalized. Would that fix your issue? For now, just move your SCT Assay to the $RNA@data slot and you should get the behavior you expect.

Also just a cautionary note that tissues are different and you should expect differences between them -- just find the NMF factor that captures tissue-specific variation and exclude it from the UMAP (e.g. set RunUMAP(..., dims = c(factors_that_arenot_tissue_specific)).

Does this mean we should also not use TF-IDF normalization before NMF?

TaopengWang commented 7 months ago

Hi @zdebruine ,

Thank you very much for the fast response and the detailed explanation about using SCT data for NMF.

With regards to the issue about calling RunNMF modifying the original data, I did some more testing and noticed it was actually related to the split.by argument which makes sense. On the other hand, calling RunNMF without this argument will return the expected NMF results without any modification to the count matrix.

Thanks a lot! Tony

zdebruine commented 7 months ago

@TaopengWang Aha, I see. I can fix that issue with modify-in-place behavior when setting split.by in the next version.

IsaacDiaz026 commented 7 months ago

Hi @zdebruine ,

Thank you very much for the fast response and the detailed explanation about using SCT data for NMF.

With regards to the issue about calling RunNMF modifying the original data, I did some more testing and noticed it was actually related to the split.by argument which makes sense. On the other hand, calling RunNMF without this argument will return the expected NMF results without any modification to the count matrix.

Thanks a lot! Tony

Hi Tony, were you able to get past the argument "threads" is missing, with no default - error?

zdebruine commented 7 months ago

@IsaacDiaz026 I'll try to fix this early next week, hopefully.

ryanhchung commented 7 months ago

Hi @zdebruine, I also experienced the 'argument "threads" is missing, with no default'. Any updates? Thank you!

pankomah commented 6 months ago

Hi @zdebruine, Thanks for this excellent package. Adding that I also got the same error others have described above: "Error in c_nmf(A, At, tol, maxit, verbose > 2, L1, L2, threads, w_init_this) : argument "threads" is missing, with no default". Would love to know if there are any updates. Thanks very much, Pierre

sessionInfo() R version 4.3.2 (2023-10-31) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 22.04.2 LTS

Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0

locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
[6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: Etc/UTC tzcode source: system (glibc)

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] ggplot2_3.4.4 singlet_0.99.38 RcppEigen_0.3.3.9.4 dplyr_1.1.4 Seurat_5.0.1 RcppML_0.5.6 SeuratObject_5.0.1 [8] sp_2.1-3

loaded via a namespace (and not attached): [1] RcppAnnoy_0.0.22 splines_4.3.2 later_1.3.2 bitops_1.0-7 tibble_3.2.1
[6] polyclip_1.10-6 fastDummies_1.7.3 lifecycle_1.0.4 doParallel_1.0.17 globals_0.16.2
[11] lattice_0.22-5 MASS_7.3-60 magrittr_2.0.3 limma_3.58.1 plotly_4.10.4
[16] remotes_2.4.2.1 httpuv_1.6.14 sctransform_0.4.1 spam_2.10-0 sessioninfo_1.2.2
[21] pkgbuild_1.4.3 spatstat.sparse_3.0-3 reticulate_1.35.0 cowplot_1.1.3 pbapply_1.7-2
[26] RColorBrewer_1.1-3 abind_1.4-5 pkgload_1.3.4 zlibbioc_1.48.0 Rtsne_0.17
[31] GenomicRanges_1.54.1 purrr_1.0.2 BiocGenerics_0.48.1 msigdbr_7.5.1 RCurl_1.98-1.14
[36] circlize_0.4.16 GenomeInfoDbData_1.2.11 IRanges_2.36.0 S4Vectors_0.40.2 ggrepel_0.9.5
[41] irlba_2.3.5.1 listenv_0.9.1 spatstat.utils_3.0-4 goftest_1.2-3 RSpectra_0.16-1
[46] spatstat.random_3.2-2 fitdistrplus_1.1-11 parallelly_1.37.0 leiden_0.4.3.1 codetools_0.2-19
[51] DelayedArray_0.28.0 tidyselect_1.2.0 shape_1.4.6 matrixStats_1.2.0 stats4_4.3.2
[56] spatstat.explore_3.2-6 jsonlite_1.8.8 GetoptLong_1.0.5 ellipsis_0.3.2 progressr_0.14.0
[61] ggridges_0.5.6 survival_3.5-7 iterators_1.0.14 foreach_1.5.2 tools_4.3.2
[66] ica_1.0-3 Rcpp_1.0.12 glue_1.7.0 gridExtra_2.3 SparseArray_1.2.4
[71] xfun_0.42 usethis_2.2.2 MatrixGenerics_1.14.0 GenomeInfoDb_1.38.6 withr_3.0.0
[76] BiocManager_1.30.22 fastmap_1.1.1 fansi_1.0.6 digest_0.6.34 R6_2.5.1
[81] mime_0.12 colorspace_2.1-0 scattermore_1.2 tensor_1.5 spatstat.data_3.0-4
[86] utf8_1.2.4 tidyr_1.3.1 generics_0.1.3 data.table_1.15.0 httr_1.4.7
[91] htmlwidgets_1.6.4 S4Arrays_1.2.0 uwot_0.1.16 pkgconfig_2.0.3 gtable_0.3.4
[96] ComplexHeatmap_2.18.0 lmtest_0.9-40 SingleCellExperiment_1.24.0 XVector_0.42.0 htmltools_0.5.7
[101] profvis_0.3.8 dotCall64_1.1-1 fgsea_1.28.0 clue_0.3-65 scales_1.3.0
[106] Biobase_2.62.0 png_0.1-8 knitr_1.45 rstudioapi_0.15.0 reshape2_1.4.4
[111] rjson_0.2.21 curl_5.2.0 nlme_3.1-163 zoo_1.8-12 cachem_1.0.8
[116] GlobalOptions_0.1.2 stringr_1.5.1 KernSmooth_2.23-22 parallel_4.3.2 miniUI_0.1.1.1
[121] pillar_1.9.0 grid_4.3.2 vctrs_0.6.5 RANN_2.6.1 urlchecker_1.0.1
[126] promises_1.2.1 xtable_1.8-4 cluster_2.1.4 cli_3.6.2 compiler_4.3.2
[131] rlang_1.1.3 crayon_1.5.2 future.apply_1.11.1 plyr_1.8.9 fs_1.6.3
[136] stringi_1.8.3 viridisLite_0.4.2 deldir_2.0-2 BiocParallel_1.36.0 babelgene_22.9
[141] munsell_0.5.0 lazyeval_0.2.2 devtools_2.4.5 spatstat.geom_3.2-8 Matrix_1.6-5
[146] RcppHNSW_0.6.0 patchwork_1.2.0 future_1.33.1 statmod_1.5.0 shiny_1.8.0
[151] SummarizedExperiment_1.32.0 ROCR_1.0-11 igraph_2.0.2 memoise_2.0.1 fastmatch_1.1-4

zdebruine commented 6 months ago

Yes, I see these issues. I intend to fix the issue about threads ASAP. Will edit this and close with comment once I can do that (hopefully Friday).

massonix commented 6 months ago

Hi Zach,

Thanks for this excellent package! I am also experiencing the same issue. This is the code I run:

seurat <- singlet::RunNMF(object = seurat, assay = "ATAC_tumoral", threads = 8)

And this is the error message

Unmasking test set
Fitting final model at k = 3 
Error in c_nmf(A, At, tol, maxit, verbose > 2, L1, L2, threads, w_init_this) : 
  argument "threads" is missing, with no default

For reference, I am running Seurat v5 in a SLURM cluster.

I also have the same question as @IsaacDiaz026 : should we use TF-IDF as normalization for scATAC-seq data prior to running NMF?

Thanks a lot!

Ramon

zdebruine commented 6 months ago

Threads issue resolved with https://github.com/zdebruine/singlet/commit/c98b871cf5540693be49e37bebc80416ff266a37.

@massonix @IsaacDiaz026 TF-IDF of ATAC data prior to NMF is an appropriate pre-processing step, as it simply scales data based on marginal occurrences/frequencies of values. TF-IDF does not add back residuals from regression, it does not make assumptions about the distribution of the data, and is not an attempt to de-noise or "intelligently clean" the data for dimension reduction. It works well for PCA and NMF, and preprocessing for dimension reduction in general. Just my perspective, doesn't mean TF-IDF is the best preprocessing approach for ATAC.

zdebruine commented 6 months ago

Modify-in-place behavior when using RunNMF with split.by is resolved with https://github.com/zdebruine/singlet/commit/9901661692860b9ec247c218a731a0b4752bf749. This was due to the weird copy-on-modify behavior of R objects.