Closed TaopengWang closed 6 months ago
A quick update: in the examples above, I've been using the SCT assay. This time, I created a new Seurat object with the SCT-normalised counts and data. The gene expression data remains unchanged after NMF analysis. However, the first error, complaining that no `threads` was specified, still exists.
@TaopengWang Got it. Regarding your first issue: I haven't gotten to investigating it yet, but it definitely needs attention. I am not sure what is going on with the `threads` argument in those functions.
You are trying to use input from `SCT` for NMF? In practice, this is not ideal because SCTransform adds back residuals from a regression on the data, which makes it very attractive for PCA but horrible for NMF. NMF as a method inherently solves some of the problems that SCTransform tries to address, such as denoising, dealing with heterogeneity of patterns across samples, and suppression of some types of technical artifacts in a way that they won't be detected by PCA. In short, SCTransform was purpose-built for PCA, not NMF. Just use a standard log-normalization, and let NMF pull out batch effect factors. You can identify those batch factors by looking at the mean factor loadings in each batch, and then ignore them in downstream analysis (e.g. graph-based clustering, DE analysis, etc.).
Are you saying that your original Seurat object is changed just by calling RunNMF? I don't think that's the case. The expected behavior right now is that if the data in `@assays$RNA@data` is integral, it will be standard log-normalized; otherwise it will be used as is. See lines 66-70 of RunNMF.R. What we could do (and this would be a feature request) is allow the user to specify the assay AND slot they wish to use, skip the normalization check, and just throw a console warning if the data does not appear to be properly normalized. Would that fix your issue? For now, just move your SCT assay data to the `$RNA@data` slot and you should get the behavior you expect.
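To make the check described above concrete, here is a minimal base-R sketch of "log-normalize only if the data is integral". The helper names `looks_integral` and `normalize_if_counts` and the scale factor of 1e4 are assumptions for illustration; this is not the code at lines 66-70 of RunNMF.R, and it uses a dense matrix rather than the sparse matrices Seurat stores.

```r
# Hypothetical helpers, not singlet's actual implementation.

# TRUE when every value is a whole number, i.e. the matrix looks like raw counts
looks_integral <- function(x) {
  all(x == round(x))
}

# Log-normalize count-like input; return anything else untouched
normalize_if_counts <- function(x, scale_factor = 1e4) {
  if (!looks_integral(x)) {
    return(x)  # assumed to be normalized already, so use as is
  }
  # Scale each cell (column) to sum to scale_factor, then take log(1 + x)
  log1p(sweep(x, 2, colSums(x), "/") * scale_factor)
}

# Toy genes-by-cells count matrix: 3 genes x 2 cells
counts <- matrix(c(0, 2, 1, 3, 0, 5), nrow = 3)

normalized <- normalize_if_counts(counts)      # integral input gets normalized
unchanged  <- normalize_if_counts(normalized)  # non-integral input passes through
```

A check like this is also why SCT-corrected values placed in the data slot would be passed through unchanged, provided they are not all whole numbers.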
Also, a cautionary note: tissues are different and you should expect differences between them. Just find the NMF factor that captures tissue-specific variation and exclude it from the UMAP (e.g. set `RunUMAP(..., dims = c(factors_that_arenot_tissue_specific))`).
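The suggestion above (compare mean factor loadings per batch or tissue, then pass only the remaining factors as `dims`) can be sketched in base R. The embedding below is made-up toy data standing in for a cells-by-factors NMF embedding, and the ratio threshold of 3 is an arbitrary assumption, not a recommended value.

```r
# Toy stand-in for an NMF cell embedding: 6 cells x 3 factors (hypothetical data)
embedding <- matrix(
  c(0.9, 0.8, 1.0, 0.1, 0.0, 0.1,   # factor 1: high only in batch "a"
    0.5, 0.4, 0.6, 0.5, 0.6, 0.4,   # factor 2: similar across batches
    0.2, 0.3, 0.1, 0.3, 0.2, 0.2),  # factor 3: similar across batches
  nrow = 6,
  dimnames = list(paste0("cell", 1:6), paste0("NMF_", 1:3))
)
batch <- factor(c("a", "a", "a", "b", "b", "b"))

# Mean loading of every factor within each batch (batches x factors)
batch_means <- rowsum(embedding, batch) / as.vector(table(batch))

# Ratio of the largest to the smallest batch mean for each factor
ratio <- apply(batch_means, 2, max) / (apply(batch_means, 2, min) + 1e-9)

# Factors dominated by one batch (threshold is an arbitrary choice)
batch_factors <- which(ratio > 3)

# Factors to keep for downstream steps such as UMAP or clustering
keep_dims <- setdiff(seq_len(ncol(embedding)), batch_factors)
```

On a real object, the embedding could be pulled with Seurat's `Embeddings()` and `keep_dims` passed as `RunUMAP(object, reduction = "nmf", dims = keep_dims)`, assuming the NMF reduction is stored under the name "nmf".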
Does this mean we should also not use TF-IDF normalization before NMF?
Hi @zdebruine ,
Thank you very much for the fast response and the detailed explanation about using SCT data for NMF.
With regard to the issue about calling RunNMF modifying the original data, I did some more testing and noticed it was actually related to the `split.by` argument, which makes sense. On the other hand, calling RunNMF without this argument returns the expected NMF results without any modification to the count matrix.
Thanks a lot! Tony
@TaopengWang Aha, I see. I can fix that issue with modify-in-place behavior when setting `split.by` in the next version.
Hi Tony, were you able to get past the 'argument "threads" is missing, with no default' error?
@IsaacDiaz026 I'll try to fix this early next week, hopefully.
Hi @zdebruine, I also experienced the 'argument "threads" is missing, with no default'. Any updates? Thank you!
Hi @zdebruine, Thanks for this excellent package. Adding that I also got the same error others have described above: "Error in c_nmf(A, At, tol, maxit, verbose > 2, L1, L2, threads, w_init_this) : argument "threads" is missing, with no default". Would love to know if there are any updates. Thanks very much, Pierre
sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.2 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
[6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggplot2_3.4.4 singlet_0.99.38 RcppEigen_0.3.3.9.4 dplyr_1.1.4 Seurat_5.0.1 RcppML_0.5.6 SeuratObject_5.0.1
[8] sp_2.1-3
loaded via a namespace (and not attached):
[1] RcppAnnoy_0.0.22 splines_4.3.2 later_1.3.2 bitops_1.0-7 tibble_3.2.1
[6] polyclip_1.10-6 fastDummies_1.7.3 lifecycle_1.0.4 doParallel_1.0.17 globals_0.16.2
[11] lattice_0.22-5 MASS_7.3-60 magrittr_2.0.3 limma_3.58.1 plotly_4.10.4
[16] remotes_2.4.2.1 httpuv_1.6.14 sctransform_0.4.1 spam_2.10-0 sessioninfo_1.2.2
[21] pkgbuild_1.4.3 spatstat.sparse_3.0-3 reticulate_1.35.0 cowplot_1.1.3 pbapply_1.7-2
[26] RColorBrewer_1.1-3 abind_1.4-5 pkgload_1.3.4 zlibbioc_1.48.0 Rtsne_0.17
[31] GenomicRanges_1.54.1 purrr_1.0.2 BiocGenerics_0.48.1 msigdbr_7.5.1 RCurl_1.98-1.14
[36] circlize_0.4.16 GenomeInfoDbData_1.2.11 IRanges_2.36.0 S4Vectors_0.40.2 ggrepel_0.9.5
[41] irlba_2.3.5.1 listenv_0.9.1 spatstat.utils_3.0-4 goftest_1.2-3 RSpectra_0.16-1
[46] spatstat.random_3.2-2 fitdistrplus_1.1-11 parallelly_1.37.0 leiden_0.4.3.1 codetools_0.2-19
[51] DelayedArray_0.28.0 tidyselect_1.2.0 shape_1.4.6 matrixStats_1.2.0 stats4_4.3.2
[56] spatstat.explore_3.2-6 jsonlite_1.8.8 GetoptLong_1.0.5 ellipsis_0.3.2 progressr_0.14.0
[61] ggridges_0.5.6 survival_3.5-7 iterators_1.0.14 foreach_1.5.2 tools_4.3.2
[66] ica_1.0-3 Rcpp_1.0.12 glue_1.7.0 gridExtra_2.3 SparseArray_1.2.4
[71] xfun_0.42 usethis_2.2.2 MatrixGenerics_1.14.0 GenomeInfoDb_1.38.6 withr_3.0.0
[76] BiocManager_1.30.22 fastmap_1.1.1 fansi_1.0.6 digest_0.6.34 R6_2.5.1
[81] mime_0.12 colorspace_2.1-0 scattermore_1.2 tensor_1.5 spatstat.data_3.0-4
[86] utf8_1.2.4 tidyr_1.3.1 generics_0.1.3 data.table_1.15.0 httr_1.4.7
[91] htmlwidgets_1.6.4 S4Arrays_1.2.0 uwot_0.1.16 pkgconfig_2.0.3 gtable_0.3.4
[96] ComplexHeatmap_2.18.0 lmtest_0.9-40 SingleCellExperiment_1.24.0 XVector_0.42.0 htmltools_0.5.7
[101] profvis_0.3.8 dotCall64_1.1-1 fgsea_1.28.0 clue_0.3-65 scales_1.3.0
[106] Biobase_2.62.0 png_0.1-8 knitr_1.45 rstudioapi_0.15.0 reshape2_1.4.4
[111] rjson_0.2.21 curl_5.2.0 nlme_3.1-163 zoo_1.8-12 cachem_1.0.8
[116] GlobalOptions_0.1.2 stringr_1.5.1 KernSmooth_2.23-22 parallel_4.3.2 miniUI_0.1.1.1
[121] pillar_1.9.0 grid_4.3.2 vctrs_0.6.5 RANN_2.6.1 urlchecker_1.0.1
[126] promises_1.2.1 xtable_1.8-4 cluster_2.1.4 cli_3.6.2 compiler_4.3.2
[131] rlang_1.1.3 crayon_1.5.2 future.apply_1.11.1 plyr_1.8.9 fs_1.6.3
[136] stringi_1.8.3 viridisLite_0.4.2 deldir_2.0-2 BiocParallel_1.36.0 babelgene_22.9
[141] munsell_0.5.0 lazyeval_0.2.2 devtools_2.4.5 spatstat.geom_3.2-8 Matrix_1.6-5
[146] RcppHNSW_0.6.0 patchwork_1.2.0 future_1.33.1 statmod_1.5.0 shiny_1.8.0
[151] SummarizedExperiment_1.32.0 ROCR_1.0-11 igraph_2.0.2 memoise_2.0.1 fastmatch_1.1-4
Yes, I see these issues. I intend to fix the issue about `threads` ASAP. Will edit this and close with a comment once I can do that (hopefully Friday).
Hi Zach,
Thanks for this excellent package! I am also experiencing the same issue. This is the code I run:
seurat <- singlet::RunNMF(object = seurat, assay = "ATAC_tumoral", threads = 8)
And this is the error message
Unmasking test set
Fitting final model at k = 3
Error in c_nmf(A, At, tol, maxit, verbose > 2, L1, L2, threads, w_init_this) :
argument "threads" is missing, with no default
For reference, I am running Seurat v5 in a SLURM cluster.
I also have the same question as @IsaacDiaz026 : should we use TF-IDF as normalization for scATAC-seq data prior to running NMF?
Thanks a lot!
Ramon
Threads issue resolved with https://github.com/zdebruine/singlet/commit/c98b871cf5540693be49e37bebc80416ff266a37.
@massonix @IsaacDiaz026 TF-IDF of ATAC data prior to NMF is an appropriate pre-processing step, as it simply scales data based on marginal occurrences/frequencies of values. TF-IDF does not add back residuals from regression, it does not make assumptions about the distribution of the data, and is not an attempt to de-noise or "intelligently clean" the data for dimension reduction. It works well for PCA and NMF, and preprocessing for dimension reduction in general. Just my perspective, doesn't mean TF-IDF is the best preprocessing approach for ATAC.
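As an illustration of the point that TF-IDF only rescales values by marginal frequencies, here is a minimal base-R sketch of one common TF-IDF variant. The function name `tf_idf`, the toy matrix, and the 1e4 scale factor are assumptions; this is not taken from singlet or from any specific ATAC toolkit, and other variants weight the terms differently.

```r
# Toy peak-by-cell count matrix (hypothetical data): 4 peaks x 3 cells
counts <- matrix(
  c(1, 0, 2, 0,
    0, 1, 1, 0,
    3, 0, 0, 1),
  nrow = 4,
  dimnames = list(paste0("peak", 1:4), paste0("cell", 1:3))
)

# One TF-IDF variant: log(1 + TF * IDF * scale_factor)
tf_idf <- function(x, scale_factor = 1e4) {
  tf  <- sweep(x, 2, colSums(x), "/")  # term frequency: each cell's counts as fractions
  idf <- ncol(x) / rowSums(x)          # inverse document frequency, one value per peak
  log1p(tf * idf * scale_factor)       # idf is recycled down the rows (peaks)
}

weighted <- tf_idf(counts)
```

The output is still non-negative and zeros stay zero, so the matrix remains a valid NMF input; no regression residuals or distributional assumptions are involved.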
Modify-in-place behavior when using `RunNMF` with `split.by` is resolved with https://github.com/zdebruine/singlet/commit/9901661692860b9ec247c218a731a0b4752bf749. This was due to the weird copy-on-modify behavior of R objects.
Hi Zach,
Thanks a lot for the great package and the RcppML package. I've mainly been using the RcppML package and have just started with the Singlet package. Unfortunately, I ran into several issues in my analysis. It would be great if you could shed some light on the causes and how to fix them.
1. I experienced the same error message posted before:
I tried the fixes in one of the closed issues, #38, but they didn't work for me. Also, the original code looks all good. 🤷
2. I noticed the `RunNMF` function edits the original data slot.
(1) I have some data from multiple batches normalised using `sctransform`. There are some batch effects related to the scRNA-seq chemistry and sequencing depth. But for now, `sctransform` appears to have retained the gene expression features of the cell types. Attachment: PECAM1.pdf
(2) I then thought to run NMF on the data matrix. If I understand correctly, the `RunNMF` function will only normalise the data when the data is not normalised. In my case, the data has been normalised. It's not clear why the function would still change it. It also looks like this can impact the NMF analysis and downstream dimension reduction. As you can see, the 3 different tissue types are mostly non-overlapping, with altered expression patterns of PECAM1. Attachments: PECAM1_after_NMF_SCT_assay.pdf, PECAM1_UMAP_after_NMF_SCT_Assay.pdf, UMAP_NMF_tissue_type.pdf
Any insight would be much appreciated!
Tony