williamritchie / IRFinder

Detecting intron retention from RNA-Seq experiments
53 stars 25 forks source link

IRFinder for scRNA-seq data #166

Open me37uday opened 1 year ago

me37uday commented 1 year ago

Hi @dg520 ,

Hope you are doing well :)

I have been trying to use IRFinder on BAM files generated by cellranger to quantify IR. On doing so, I get the following error :

WARN: This sample has excessive splice junctions at unannotated locations. This may indicate the experiment is not actually RNA-Seq. Or it indicates the genome fasta and annotation gtf were not compatible.
WARN: This sample appears to have been prepared with a strand-specific protocol (directional RNA-Seq), however that process was substantially non-successful. A large number of reads appear opposed to the expected direction.

The BAM files contain ~135 million reads with ~35% of the reads confidently mapping to the intronic region. Yet the output of the analysis is very sparse and I get the above warnings. Please let me know what could have gone wrong in this context and if there is anything that could be done to quantify IR appropriately.

The irfinder.stdout is as follows :

IRFinder version: 1.3.1
IRFinder start:  Mon Oct 17 17:33:07 CEST 2022
IRFinder runmode: BAM
IRFinder user@host: urangasw @ valis-01
IRFinder working dir:  /home/urangasw/data/sc_RNA_seq_microglia/cr_output/Microglia_MO_AD1_cr/outs
IRFinder reference: /home/urangasw/data/bash_scripts/IR/REF/Human-GRCh38-release106
IRFinder file 1: possorted_genome_bam.bam
IRFinder file 2:
---
IRFinder run with options:
Output Dir      ir_finder/
Main intron reference:  /home/urangasw/data/bash_scripts/IR/REF/Human-GRCh38-release106/IRFinder/ref-cover.bed
Read spans reference:   /home/urangasw/data/bash_scripts/IR/REF/Human-GRCh38-release106/IRFinder/ref-read-continues.ref
Optional ROI reference: /home/urangasw/data/bash_scripts/IR/REF/Human-GRCh38-release106/IRFinder/ref-ROI.bed
Optional BAM output:    NULL
ERR-Total reads processed: 197222455
ERR-Total nucleotides: 9035297162
ERR-Total singles processed: 165342092
ERR-Total pairs processed: 0
ERR-Short pairs: 0
ERR-Intersect pairs: 0
ERR-Long pairs: 0
ERR-Skipped reads: 31880363
ERR-Error reads: 0
Directionality: Dir evidence:   25232
Directionality: Nondir evidence:        237
Directionality: Dir evidence known junctions:   3
Directionality: Nondir evidence known junctions:        0
Directionality: Dir matches ref:        3
Directionality: Dir opposed to ref:     0
Directionality: Dir score all (0-10000):        9906
Directionality: Dir score known junctions (0-10000):    7500
RNA-Seq directionality -1/0/+1: 0

It would be great if you could shed some light on this context. Thanks in advance!

Best, Uday

dg520 commented 1 year ago

@me37uday Interesting observation from 10X scRNA. These two warnings are highly related. When you have the second one, you will also have the first one for sure. As IRFinder is designed for bulk RNASeq, there are some obvious compatibility issues on scRNA. I will go through the potential cause of the warnings you saw and try to provide a potential workaround. But no guarantee that they work out-of-the-box, as I have never tried IRFinder on scRNA so far.

Cause:

  1. The possorted_genome_bam.bam from CellRanger is actually single-end. This is because only one end from FASTQ is used for mapping while the other end, containing cell-barcode and UMI, is ignored. According to your library preparation, all the mapped reads are mapped to the opposite strand of the actual transcripts, which is totally normal and expected.
  2. However, this unnatural "single-end" BAM confuses IRFinder's directionality determination process. By default, IRFinder thinks all the reads should be mapped to the same strand of the actual transcripts when it sees a single-end input.
  3. This leads to the second warning you saw and further triggers the first warning.

Workaround Try to manually curate a new BAM from possorted_genome_bam.bam, where you want to minus 16 from every SAM flag to pretend the mapped reads are from the forward stand. And use this new BAM to re-run IRFinder.

Finally, something thoughts about the sparsity of the output matrix. I think this could be due to:

  1. the problem above
  2. the fact that 10X libraries only cover either the very 5' or 3' end of genes, depending on which chemistry is used. On the other hand, IRFinder always generates the matrix for all introns built from the reference. It won't be surprising to me that most lines are purely zeros for scRNA, even if IRFinder runs properly.

Please post back the irfinder.stdout again if things don't work out.

me37uday commented 1 year ago

Hi @dg520 ,

Thanks for your valuable insight!

I have a follow up question, should I minus 16 from every read or simply replace 16 with 0 to pretend the mapped reads are from the forward strand?

The following are the flags and the number of reads having those flags :

flag    occurrences
0       22776239
4       31880363
16      22305740
1024    70268918
1040    49991195

Let me know what you think of the approach. Thanks!

Uday

dg520 commented 1 year ago

@me37uday Totally. You are right and I was wrong. For the first trial, let's replate 16 with 0 and 1040 with 1024. You might also want to compare this result to the second trial, where you would flip the original 0 and 16, and flip the original 1024 with 1040.

me37uday commented 1 year ago

Hi @dg520 ,

So, I replaced 16 with 0 and 1040 with 1024 and created a new BAM file. I get the following numbers :

flag    occurrences
0       31763100
4       5428585
1024    99550691

I get the following when I run infer_experiment.py :

This is SingleEnd Data
Fraction of reads failed to determine: 0.0000
Fraction of reads explained by "++,--": 0.0304
Fraction of reads explained by "+-,-+": 0.9696

Indicating that the change of 16 to 0 seemed to have worked. However, on running IRFinder on this new BAM file, I still get the same error. Here's the content of irfinder.stdout :

IRFinder version: 1.3.1
IRFinder start:  Wed Oct 26 17:23:09 CEST 2022
IRFinder runmode: BAM
IRFinder user@host: urangasw @ valis-02
IRFinder working dir:  /home/urangasw/data/sc_RNA_seq_microglia/cr_output/Microglia_MO_AD3_cr/outs
IRFinder reference: /home/urangasw/data/bash_scripts/IR/REF/Human-GRCh38-release106
IRFinder file 1: possorted_genome_bam_1.bam
IRFinder file 2:
---
IRFinder run with options:
Output Dir      ir_finder_1/
Main intron reference:  /home/urangasw/data/bash_scripts/IR/REF/Human-GRCh38-release106/IRFinder/ref-cover.bed
Read spans reference:   /home/urangasw/data/bash_scripts/IR/REF/Human-GRCh38-release106/IRFinder/ref-read-continues.ref
Optional ROI reference: /home/urangasw/data/bash_scripts/IR/REF/Human-GRCh38-release106/IRFinder/ref-ROI.bed
Optional BAM output:    NULL
ERR-Total reads processed: 136742376
ERR-Total nucleotides: 12682686954
ERR-Total singles processed: 131313791
ERR-Total pairs processed: 0
ERR-Short pairs: 0
ERR-Intersect pairs: 0
ERR-Long pairs: 0
ERR-Skipped reads: 5428585
ERR-Error reads: 0
Directionality: Dir evidence:   36969
Directionality: Nondir evidence:        0
Directionality: Dir evidence known junctions:   8
Directionality: Nondir evidence known junctions:        0
Directionality: Dir matches ref:        2
Directionality: Dir opposed to ref:     6
Directionality: Dir score all (0-10000):        9999
Directionality: Dir score known junctions (0-10000):    8888
RNA-Seq directionality -1/0/+1: 0

I was not expecting those warnings having done this modification of the flags. What could have gone wrong? Do you think I should have set the flag to 16 instead of 0? Please let me know.

Cheers, Uday

dg520 commented 1 year ago

@me37uday I am out of town for a conference. Will be back next week and take a closer look into this. Please just hold on.

me37uday commented 1 year ago

Have a good one!

dg520 commented 1 year ago

@me37uday After switching the SAM flags, we did have much more reads to support the directionality, indicated by the number at the end of the line "Directionality: Dir evidence:" substantially increased. However, the number at the end of the line "Directionality: Dir evidence known junctions:" has only had a neglectable change. This obviously suggests the junction annotation provided to IRFinder (i.e. GTF used to prepare IRFinder reference) does NOT match with what STAR alignment sees.

Please double check the GTF used for CellRanger and IRFinder reference building are the same, including: 1) they must refer to the same species (e.g. both human, or both mouse) 2) they must have the same chromosome nomenclatures. (e.g. Both have chromosome names starting with "chr", or neither should start with "chr"); 3) they should also be under the same version/release (e.g. both according to Ensemble GRCh38.100).

me37uday commented 1 year ago

Hi @dg520 ,

Thanks for your suggestion! Turns out, that was the problem. I was using a different genome version for cellranger and IRFinder. And the nomenclatures were different as well.

Now when I run IRFinder without altering the flags, I don't get any of those warnings. However, the output remains the same even when I modify the flags. Any clarity on this please?

Uday

ChristyGault commented 6 months ago

Was there ever a solution to using IRFinder on scRNA-seq data? I would also like to do this.

dg520 commented 6 months ago

@ChristyGault Sorry but no, we currently have no plan to implement IRFinder on scRNA. Partially because we mainly work on 10X scRNA, which is unsuitable for IR measurement.

ChristyGault commented 1 month ago

Just a note to future users- IRFinder works on scRNA-seq data generated with SMART-seq kits. I had to run it in FASTQ mode because there was a mismatch between the STAR reference that was used to generate the BAM files and the STAR reference that IRFinder generated.