wenweixiong / MARVEL

38 stars 9 forks

Computing intron counts with script_03 takes ages #21

Closed TdzBAS closed 9 months ago

TdzBAS commented 11 months ago

Hi Sean,

thanks again for your nice tool! It has worked out well since last time. But now I am running into serious efficiency issues with my large dataset. I want to compute the intron counts with your script_03. I am using SGE, and a single sample takes hours to process. In total I have 2000 samples, so the computation would take months. Do you know how to speed things up?

This is the script I use: script3.txt

and one sample file: sample1.zip

Glad for every help! Thanks Sean!

Best, Tolga

wenweixiong commented 11 months ago

Hi Tolga,

Admittedly this step is time-consuming. I typically run this step on my scientific computing platform (SCP). In this case, I submit one job for each cell via Slurm Workload Manager (v20.02.0) on CentOS Linux 7 (Core). Therefore all cells get processed in parallel and it typically takes ~24 hours to complete.
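For illustration, the per-cell submission described here might look something like the following sketch (the script name, sample naming, and invocation are hypothetical, not taken from MARVEL):

```shell
#!/bin/bash
# Hypothetical Slurm array job: one task per cell, submitted with e.g.
#   sbatch --array=1-2000 run_script_03.sh
# Slurm sets SLURM_ARRAY_TASK_ID for each task, so every cell is
# processed by its own task, all running in parallel.
SAMPLE="sample${SLURM_ARRAY_TASK_ID}"
echo "Processing ${SAMPLE}"
# Rscript script_03.R "${SAMPLE}.bam"   # hypothetical per-cell invocation
```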

Do you have access to a SCP?

Sean

TdzBAS commented 10 months ago

Hi Sean,

thanks for your reply!

I have access to SGE and can submit array jobs, but unfortunately only one job gets processed at a time, which is the cause of the massive runtimes. For now I am a bit stuck on how to handle the runtime so that I can get the results within the next two weeks.
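If an SGE array job's tasks run strictly one at a time, the concurrency settings may be worth checking; a minimal sketch (script and sample names are hypothetical, and the available flags and limits depend on the Grid Engine variant and site configuration):

```shell
#!/bin/bash
# Hypothetical SGE array job, submitted with e.g.
#   qsub -t 1-2000 -tc 100 run_script_03.sh
# -t defines the task index range; on Grid Engine variants that support
# it, -tc caps how many tasks run concurrently (some sites default to a
# very low limit, which makes the array effectively serial).
SAMPLE="sample${SGE_TASK_ID}"
echo "Processing ${SAMPLE}"
# Rscript script_03.R "${SAMPLE}.bam"   # hypothetical per-sample invocation
```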

Best, Tolga

wenweixiong commented 10 months ago

Hi Tolga,

I have written some scripts to tabulate and compute the intron counts "more efficiently".

The idea is that instead of processing one cell at a time, we first tabulate the per-base counts across all samples into a sparse matrix. The example script (script_01_create_example_file.txt), example input (sample1_1.txt, sample1_2.txt), and example output (example_input_sparse.rdata) are on Google Drive (the link expires this Sunday: https://drive.google.com/file/d/1OMf5C4NjKAE7Ut2e_B65D7BsbWXpJW0R/view?usp=sharing).

We then compute the intron counts for each intron by summing up the per-base counts using script_02_tabulate_intron_counts.txt. This should return the final intron count matrix ready to be used by MARVEL as per "Intron count matrix" of the tutorial.

The process will still take time, ~72 hours or more, but it circumvents the need to process one cell at a time.

Sean



# Load packages
library(data.table)
library(Matrix)

################################################################
################## RETRIEVE INTRON COORDINATES #################
################################################################

# Read example file
path <- "/Users/seanwen/Documents/MARVEL/troubleshoot/tolga/"
file <- "sample1_1.txt"
df <- as.data.frame(fread(paste(path, file, sep=""), sep="\t", header=FALSE, stringsAsFactors=FALSE))

# Provide column names
names(df) <- c("chr", "intron_start", "intron_end", "base", "count")

# Create intron coordinate ids
coord.intron <- df$intron_start + df$base
coord.intron <- paste(df$chr, coord.intron, sep=":")

################################################################
################### TABULATE PER BASE COUNTS ###################
################################################################

# Define file names
files <- c("sample1_1.txt", "sample1_2.txt")

# Retrieve counts

.list <- list()

pb <- txtProgressBar(1, length(files), style=3)

for(i in 1:length(files)) {

# Read example file for 1 sample
path <- "/Users/seanwen/Documents/MARVEL/troubleshoot/tolga/"
file <- files[i]
df <- as.data.frame(fread(paste(path, file, sep=""), sep="\t", header=FALSE, stringsAsFactors=FALSE))

# Retrieve counts
df <- df[,"V5",drop=FALSE]

# Annotate sample id
sample.id <- gsub(".txt", "", files[i], fixed=TRUE)
names(df) <- sample.id

# Annotate intron counts
df$coord.intron <- coord.intron

# Keep unique values
df <- unique(df)

# Annotate row names
row.names(df) <- df$coord.intron
df$coord.intron <- NULL

# Save into list
.list[[i]] <- df

# Track progress
setTxtProgressBar(pb, i)

}

df <- do.call(cbind.data.frame, .list)

# Convert to sparse matrix
df.sparse <- Matrix(as.matrix(df), sparse=TRUE)

# Save file
path <- "/Users/seanwen/Documents/MARVEL/troubleshoot/tolga/"
file <- "example_input_sparse.rdata"
save(df.sparse, file=paste(path, file, sep=""))

# Load packages

library(Matrix)

# Read files

# Example count matrix
path <- "/Users/seanwen/Documents/MARVEL/troubleshoot/tolga/"
file <- "example_input_sparse.rdata"
df <- local(get(load(paste(path, file, sep=""))))

# Intron coordinates
path <- "/Users/seanwen/Documents/MARVEL/troubleshoot/tolga/"
file <- "RI_Coordinates.bed"
bed <- read.table(paste(path, file, sep=""), sep="\t", header=TRUE, stringsAsFactors=FALSE)

# Tabulate

region.counts.df.list <- list()

pb <- txtProgressBar(1, nrow(bed), style=3)

for(i in 1:nrow(bed)) {

# Retrieve intron coordinates
bed.small <- bed[i, ]
range <- seq(from=bed.small$upstreamEE + 1, to=bed.small$downstreamES, by=1)
coords <- paste(bed.small$chr, range, sep=":")

# Compute total counts for each sample
counts.total <- Matrix::colSums(df[coords, ])

# Save as data frame
coord.intron <- paste(bed.small$chr, bed.small$upstreamEE+1, bed.small$downstreamES, sep=":")
results <- as.data.frame(t(as.data.frame(counts.total)))
. <- data.frame("coord.intron"=coord.intron)
results <- cbind.data.frame(., results)

# Save into list
region.counts.df.list[[i]] <- results

# Track progress
setTxtProgressBar(pb, i)

}

region.counts.df <- do.call(rbind.data.frame, region.counts.df.list)
region.counts.df <- unique(region.counts.df)

# Save file
path <- "/Users/seanwen/Documents/MARVEL/troubleshoot/tolga/"
file <- "Counts_by_Region.txt"
write.table(region.counts.df, paste(path, file, sep=""), sep="\t", col.names=TRUE, row.names=FALSE, quote=FALSE)

wenweixiong commented 10 months ago

Hi Tolga,

I have written some scripts to tabulate and compute the intron counts "more efficiently".

The idea is that instead of processing one cell at a time, we first tabulate the per-base counts across all samples into a sparse matrix. The example script (script_01_create_example_file.txt) and example input (sample1_1.txt, sample1_2.txt) are attached below (also available via the Google Drive link).

script_01_create_example_file.txt

sample1_1.txt.zip sample1_2.txt.zip

We then compute the intron counts for each intron by summing up the per-base counts using script_02_tabulate_intron_counts.txt. This should return the final intron count matrix ready to be used by MARVEL as per "Intron count matrix" of the tutorial.

script_02_tabulate_intron_counts.txt

The process will still take time, but it circumvents the need to process one cell at a time.

Sean

TdzBAS commented 10 months ago

Hi Sean,

Big thanks for your help! I will try it out tomorrow and will let you know how it went!

Best,

Tolga


TdzBAS commented 10 months ago

Hi Sean,

just a quick question about the usage of script 1: is it correct that I must adapt the files list like this when I have 400 samples, where the first 200 samples belong to cell group a and the last 200 samples belong to cell group b: files <- c("sample1.txt", "sample2.txt", ..., "sample399.txt", "sample400.txt")?

thanks! Best, Tolga
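As a side note on building such a long files vector: rather than typing 400 names, the list could be generated, for example with a small shell loop (a sketch; the sample1.txt..sample400.txt naming pattern is assumed from the question):

```shell
# Write sample1.txt .. sample400.txt, one name per line, to files.txt;
# an R script could then read the list with readLines("files.txt").
for i in $(seq 1 400); do
  echo "sample${i}.txt"
done > files.txt
```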


wenweixiong commented 10 months ago

Hi Tolga,

The order of the samples/files shouldn't matter at this stage.

Sean