welch-lab / liger

R package for integrating and analyzing multiple single-cell datasets
GNU General Public License v3.0
389 stars 78 forks source link

Memory issue with the format of input data #211

Closed tkisss closed 3 years ago

tkisss commented 3 years ago

Hi,

I noticed that online iNMF uses about 70GB on a 100000x20000 raw cells x genes matrix. How can I convert a genes X cells matrix to a .h5 file as input of createLiger for a memory efficient implementation?

Thanks!

cgao90 commented 3 years ago

Hi,

You can either use rhdf5 (https://bioconductor.org/packages/release/bioc/html/rhdf5.html) or hdf5r (https://cran.r-project.org/web/packages/hdf5r/vignettes/hdf5r.html) packages for the conversion. Here is an example using rhdf5:

library(rhdf5)
library(Matrix)

# load dense data matrices (to be integrated)
data1 = readRDS("data1.RDS") # dataset 1
data2 = readRDS("data2.RDS") # dataset 2

# convert them to sparse data matrices 
data1 = as(data1, "CsparseMatrix")
data2 = as(data2, "CsparseMatrix")

# file names for the h5 files to be generated
fname1 = "data1.h5"
fname2 = "data2.h5"

hdf5_files = list(fname1,fname2) 
dgCmat_list = c("data1", "data2") # names of sparse data matrices

# Initialize HDF5 files
h5closeAll()
for (i in 1:length(hdf5_files)){
  h5createFile(hdf5_files[[i]])
  h5createGroup(hdf5_files[[i]], "matrix")
  h5write(get(dgCmat_list[i])@Dimnames[[2]], file=hdf5_files[[i]], name="matrix/barcodes") # cell barcodes
  h5createGroup(hdf5_files[[i]], file.path("matrix", "features"))
  h5write(get(dgCmat_list[i])@Dimnames[[1]], file=hdf5_files[[i]], name="matrix/features/name")
  h5write(dim(get(dgCmat_list[i])), file=hdf5_files[[i]], name="matrix/shape")
  # save x, i, p into h5 file
  h5createDataset(hdf5_files[[i]],"matrix/data",dims=length(get(dgCmat_list[i])@x),storage.mode="double", chunk = 1000)
  h5write(get(dgCmat_list[i])@x, file=hdf5_files[[i]], name="matrix/data", index = list(1:length(get(dgCmat_list[i])@x)))
  h5createDataset(hdf5_files[[i]],"matrix/indices",dims=length(get(dgCmat_list[i])@i),storage.mode="integer", chunk = 1000)
  h5write(get(dgCmat_list[i])@i, file=hdf5_files[[i]], name="matrix/indices",index = list(1:length(get(dgCmat_list[i])@i))) # already zero-indexed.
  h5createDataset(hdf5_files[[i]],"matrix/indptr",dims=length(get(dgCmat_list[i])@p),storage.mode="integer", chunk = 1000)
  h5write(get(dgCmat_list[i])@p, file=hdf5_files[[i]], name="matrix/indptr",index = list(1:length(get(dgCmat_list[i])@p)))
  h5closeAll()
}
tkisss commented 3 years ago

Hi,

Thanks for your help, and I successfully converted the genes X cells matrix to a .h5 file according to your suggestion, but I did not find a reduction of memory usage. Online iNMF took about 70GB even I input a .h5 file.

My code:

data = list(batch1=paste0(input_path, '/batch1.h5', sep=''), batch2=paste0(input_path, '/batch2.h5', sep='')) 
adata = createLiger(data,remove.missing=FALSE) 
adata = normalize(adata,remove.missing = FALSE) 
adata = selectGenes(adata, var.thresh = 0.2, do.plot = F) 
adata = scaleNotCenter(adata) 
Rprof(tf <- paste(output_path, "/rprof.log",sep=''), memory.profiling=TRUE) 
adata = online_iNMF(adata, k = 20, miniBatch_size = 5000, max.epochs = 5) 
Rprof(NULL) 
adata = quantile_norm(adata)

I used Rprof to record memory usage in R and used max(summaryRprof(tf, memory="both")$by.total$mem.total) to get the maximum memory.

I noticed the peak memory usage (Fig 4.) in your paper Iterative Re nement of Cellular Identity from Single-Cell Data Using Online Learning from bioRxiv. My test results are very different from yours. I don't know where the problem is, can you give me some suggestions?

Thanks! Kang Tian

cgao90 commented 3 years ago

Hello Kang,

We used peakRAM package for tracking the peak RAM consumed.

mem <- peakRAM({
  adata = online_iNMF(adata, k = 20, miniBatch_size = 5000, max.epochs = 5) 
})
peak_mem = mem$Peak_RAM_Used_MiB

For better accuracy, you could restart the PC or start a new session on a computer cluster for this process. As for the Rprof approach, here is a discussion related to the interpretation of the output (https://stackoverflow.com/questions/58250126/interpretation-of-memory-profiling-output-of-rprof).