Closed TdzBAS closed 9 months ago
Hi Tolga,
Admittedly this step is time-consuming. I typically run it on my scientific computing platform (SCP). In this case, I submit one job per cell via the Slurm Workload Manager (v20.02.0) on CentOS Linux 7 (Core), so all cells are processed in parallel and the step typically takes ~24 hours to complete.
Do you have access to an SCP?
Sean
Hi Sean,
thanks for your reply!
I have access to SGE and can submit array jobs, but unfortunately only one job gets processed at a time, which is the cause of the massive runtimes. For now I am a bit stuck on how to handle the runtime so that I can get the results within the next two weeks.
Best Tolga
Hi Tolga,
I have written some scripts to tabulate and compute the intron counts "more efficiently".
The idea is that instead of processing one cell at a time, we first tabulate the per-base counts across all samples into a sparse matrix. Example scripts (script_01_create_example_file.txt), example input (sample1_1.txt, sample1_2.txt), and example output (example_input_sparse.rdata) on Google Drive (link expires this Sunday: https://drive.google.com/file/d/1OMf5C4NjKAE7Ut2e_B65D7BsbWXpJW0R/view?usp=sharing).
We then compute the intron counts for each intron by summing up the per-base counts using script_02_tabulate_intron_counts.txt. This should return the final intron count matrix ready to be used by MARVEL as per "Intron count matrix" of the tutorial.
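To illustrate the second step, here is a toy example of the summation (a minimal sketch with made-up coordinates and counts; the real scripts below operate on the full per-base matrix):

```r
library(Matrix)

# Toy per-base count matrix: rows are chr:position coordinates,
# columns are samples (values invented for illustration only)
m <- Matrix(matrix(c(2, 3, 0,
                     1, 0, 4,
                     0, 5, 1), nrow=3, byrow=TRUE), sparse=TRUE)
rownames(m) <- c("chr1:101", "chr1:102", "chr1:103")
colnames(m) <- c("sample1", "sample2", "sample3")

# The count for an intron spanning bases 101-103 is the column sum
# over those rows, i.e. the total per-base coverage per sample
Matrix::colSums(m[c("chr1:101", "chr1:102", "chr1:103"), ])
# sample1: 3, sample2: 8, sample3: 5
```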
The process will still take time, ~72 hours or more, but it circumvents the need to process one cell at a time.
Sean
library(data.table)
library(Matrix)

#################################################################
################## RETRIEVE INTRON COORDINATES ##################
#################################################################

path <- "/Users/seanwen/Documents/MARVEL/troubleshoot/tolga/"
file <- "sample1_1.txt"
df <- as.data.frame(fread(paste(path, file, sep=""), sep="\t", header=FALSE, stringsAsFactors=FALSE))

names(df) <- c("chr", "intron_start", "intron_end", "base", "count")

coord.intron <- df$intron_start + df$base
coord.intron <- paste(df$chr, coord.intron, sep=":")

################################################################
################### TABULATE PER BASE COUNTS ###################
################################################################
files <- c("sample1_1.txt", "sample1_2.txt")
.list <- list()
pb <- txtProgressBar(1, length(files), style=3)
for(i in 1:length(files)) {
# Read example file for 1 sample
path <- "/Users/seanwen/Documents/MARVEL/troubleshoot/tolga/"
file <- files[i]
df <- as.data.frame(fread(paste(path, file, sep=""), sep="\t", header=FALSE, stringsAsFactors=FALSE))
# Retrieve counts
df <- df[,"V5",drop=FALSE]
# Annotate sample id
sample.id <- gsub(".txt", "", files[i], fixed=TRUE)
names(df) <- sample.id
# Annotate intron counts
df$coord.intron <- coord.intron
# Keep unique values
df <- unique(df)
# Annotate row names
row.names(df) <- df$coord.intron
df$coord.intron <- NULL
# Save into list
.list[[i]] <- df
# Track progress
setTxtProgressBar(pb, i)
}
df <- do.call(cbind.data.frame, .list)
df.sparse <- Matrix(as.matrix(df), sparse=TRUE)
path <- "/Users/seanwen/Documents/MARVEL/troubleshoot/tolga/"
file <- "example_input_sparse.rdata"
save(df.sparse, file=paste(path, file, sep=""))
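As a quick sanity check after running script_01, the saved sparse matrix can be reloaded and inspected (a minimal sketch; the path and file name follow the example above):

```r
library(Matrix)

# Reload the sparse per-base count matrix saved by script_01
path <- "/Users/seanwen/Documents/MARVEL/troubleshoot/tolga/"
file <- "example_input_sparse.rdata"
df.sparse <- local(get(load(paste(path, file, sep=""))))

# Rows are chr:position coordinates, columns are samples
dim(df.sparse)
head(rownames(df.sparse))

# Total per-base coverage per sample
Matrix::colSums(df.sparse)
```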
library(Matrix)
# Example count matrix
path <- "/Users/seanwen/Documents/MARVEL/troubleshoot/tolga/"
file <- "example_input_sparse.rdata"
df <- local(get(load(paste(path, file, sep=""))))
# Intron coordinates
path <- "/Users/seanwen/Documents/MARVEL/troubleshoot/tolga/"
file <- "RI_Coordinates.bed"
bed <- read.table(paste(path, file, sep=""), sep="\t", header=TRUE, stringsAsFactors=FALSE)
region.counts.df.list <- list()
pb <- txtProgressBar(1, nrow(bed), style=3)
for(i in 1:nrow(bed)) {
bed.small <- bed[i, ]
range <- seq(from=bed.small$upstreamEE + 1, to=bed.small$downstreamES, by=1)
coords <- paste(bed.small$chr, range, sep=":")
counts.total <- Matrix::colSums(df[coords, ])
coord.intron <- paste(bed.small$chr, bed.small$upstreamEE+1, bed.small$downstreamES, sep=":")
results <- as.data.frame(t(as.data.frame(counts.total)))
. <- data.frame("coord.intron"=coord.intron)
results <- cbind.data.frame(., results)
region.counts.df.list[[i]] <- results
setTxtProgressBar(pb, i)
}
region.counts.df <- do.call(rbind.data.frame, region.counts.df.list)
region.counts.df <- unique(region.counts.df)

path <- "/Users/seanwen/Documents/MARVEL/troubleshoot/tolga/"
file <- "Counts_by_Region.txt"
write.table(region.counts.df, paste(path, file, sep=""), sep="\t", col.names=TRUE, row.names=FALSE, quote=FALSE)
Hi Tolga,
I have written some scripts to tabulate and compute the intron counts "more efficiently".
The idea is that instead of processing one cell at a time, we first tabulate the per-base counts across all samples into a sparse matrix. Example script (script_01_create_example_file.txt) and example input (sample1_1.txt, sample1_2.txt) are attached below.
script_01_create_example_file.txt
sample1_1.txt.zip sample1_2.txt.zip
We then compute the intron counts for each intron by summing up the per-base counts using script_02_tabulate_intron_counts.txt. This should return the final intron count matrix ready to be used by MARVEL as per "Intron count matrix" of the tutorial.
script_02_tabulate_intron_counts.txt
The process will still take time, but it circumvents the need to process one cell at a time.
Sean
Hi Sean,
Big thanks for your help! I will try it out tomorrow and will let you know how it went!
Best,
Tolga
Hi Sean,
just a quick question about the usage of script 1. Is it correct that I must adapt the files list like this when I have 400 samples, where the first 200 samples belong to cell group a and the last 200 samples belong to cell group b: files <- c("sample1.txt", "sample2.txt", .., "sample399.txt", "sample400.txt")?
thanks! Best, Tolga
Hi Tolga,
The order of the samples/files shouldn't matter at this stage.
Sean
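For a run of that size, the files vector need not be typed out by hand; it can be generated programmatically (a sketch, assuming the per-sample files follow the sample<N>.txt naming from the question; the folder path is hypothetical):

```r
# Option 1: build the vector from the known naming pattern
files <- sprintf("sample%d.txt", 1:400)

# Option 2: pick up every per-sample .txt file present in the input folder
path <- "/path/to/per_base_counts/"  # hypothetical location
files <- list.files(path, pattern="\\.txt$")
```

Since the order does not matter at this stage, either option works; group membership (cell group a vs. b) only comes into play later in the MARVEL analysis.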
Hi Sean,
thanks again for your nice tool! It has worked out pretty well since last time. But now I am encountering some heavy efficiency issues with my large dataset. I want to compute the intron counts with your script_03. I am using SGE, and for one sample it takes hours to compute. In total I have 2000 samples, so the computation would take months. Do you know how to speed things up?
This is the script I use: script3.txt
and one sample file: sample1.zip
I'd be glad for any help! Thanks, Sean!
Best, Tolga