yyoshiaki / VIRTUS2

A bioinformatics pipeline for viral transcriptome detection and quantification considering splicing.
Other
18 stars 7 forks source link

Questions regarding VIRTUS2 script for transcript alignment to viral reference genome + calculating num_reads #32

Closed ppark123 closed 11 months ago

ppark123 commented 1 year ago

Hello, I have a few questions regarding the script you wrote for the pipeline. Recently, I took my virusAligned.filtered.sortedByCoord.out.bam and tried to map these transcripts onto one of the reference genomes that were in my VIRTUS2 output using STAR. However, I realized that there were discrepancies between the numreads that align to the reference genome between the TSV output for that particular virus in VIRTUS2 and the output I get as the STAR output. I was confused on why this would be so I looked back at how you wrote your scripts for the alignments and I have a few questions about them.

1) Is there a reason why you chose to do a sequential alignment rather than a combined alignment for this dual RNA-seq?

2) Following up to this question, is there a reason why you use STAR instead of Bowtie to align the unmapped transcripts to the viral reference genomes?

3) You set the default --genomeSAindexNbases for the second alignment to be 12. The input is a single Fasta file of all the viral genomes, but since --genomeSAindexNbases is set to 12, wouldn't that be too big since this would apply to each viral genome separately? Or is there a code where you adjust the --genomeSAindexNbases for each of the viral reference genome based on their size?

4) Also in your count there is a code like this: num_reads = int(df_STARLog.loc[' Uniquely mapped reads number |', 1]) + \ int(df_STARLog.loc[' Number of reads mapped to multiple loci |', 1]) Is there a reason why numreads isn't equal to just the uniquely mapped reads number but the sum of the uniquely mapped reads + number of reads mapped to multiple loci?

It would be really help with understanding the overall flow of how VIRTUS2 works, if you could take the time to answer these questions. Thank you in advance :)

yyoshiaki commented 11 months ago

Sorry for my late reply,

  1. Is there a reason why you chose to do a sequential alignment rather than a combined alignment for this dual RNA-seq? To stringently remove human/host-derived reads. For viral read quantification, false-positive ia a huge problem. By this strategy, we discard ambiguous reads.

  2. Following up to this question, is there a reason why you use STAR instead of Bowtie to align the unmapped transcripts to the viral reference genomes? It is because viral transcripts are sometimes spliced.

  3. You set the default --genomeSAindexNbases for the second alignment to be 12. The input is a single Fasta file of all the viral genomes, but since --genomeSAindexNbases is set to 12, wouldn't that be too big since this would apply to each viral genome separately? Or is there a code where you adjust the --genomeSAindexNbases for each of the viral reference genome based on their size? Unfortunatly no. It is possible to edit cwl files directly.

  4. Also in your count there is a code like this: num_reads = int(df_STARLog.loc[' Uniquely mapped reads number |', 1]) + \ int(df_STARLog.loc[' Number of reads mapped to multiple loci |', 1]) Is there a reason why numreads isn't equal to just the uniquely mapped reads number but the sum of the uniquely mapped reads + number of reads mapped to multiple loci? To increase sensitivity, VIRTUS2 is counting both uniquely mapped reads and multi mapped reads on viral genomes.

ppark123 commented 11 months ago

Thank you for the taking the time to answer the questions!