yarden / MISO

MISO: Mixture of Isoforms model for RNA-Seq isoform quantitation
http://genes.mit.edu/burgelab/miso/index.html
132 stars 74 forks source link

Reads thrown out? #62

Open olgabot opened 10 years ago

olgabot commented 10 years ago

Hello there, It's unclear to me why some reads get thrown out of the MISO calculation, and where the counts are coming from. We already did filtering on what reads we think are good at the mapping step, so when I look at the junction on IGV, there's more reads than are reported. Can you describe the criteria for throwing out reads? Thanks, Olga

P.S. Chris Burge suggested I ask - he's visiting us in SD now :)

yarden commented 10 years ago

Hi Olga,

It's hard to say without an example. In general, there are several reasons get thrown out, but the filtering depends on the parameters you feed MISO. Whether it's a paired- or single-end run or stranded or unstranded will change things.

PE reads that are not properly paired within locus of interest are thrown out too. Older versions enabled overhang constraints. Also keep in mind that 'counts' fields are counting read pairs in PE and single reads in SE.

Anyway, it'll be easiest if you send me a GFF file for an event of interest and the BAM reads for that file. If you let me know which reads you expect to be included I can see why they're not.

Best, --Yarden

PS Sounds like a splicing party. Say hi to Chris :)

olgabot commented 10 years ago

Here's a bam file of just the HNRNPA1 region: link

And here's the output I get for one of the events:

chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+   0.00    0.00    0.01    'chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+.A.up_chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+.A.se_chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+.A.dn','chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+.B.up_chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+.B.dn'     (0,0):739,(0,1):107,(1,1):14     0:0,1:121       chr12   +       54676584,54676584       54677751,54677751

I ran this with --no-bam-filter for both insert len, and without a threshold for minimum counts.

As you can see, the raw counts are (0,0):739,(0,1):107,(1,1):14 but when I look at the region in IGV I see 1247 (+)-strand junction reads and 457 (-)-strand reads, so there should be at least 1704 reads supporting the exclusion isoform. But where did they go?

yarden commented 10 years ago

Can you please tell me what command you ran exactly, i.e. what parameters? You mentioned insert len - if you ran it in paired-end mode, let me know what the insert len distribution parameters you fed it were, since I can't estimate that just from the hnRNPA1 reads.

I see that this event is:

chr12   SE  gene    54676584    54677751    .   +   .   Name=chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+;gid=chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+;refseq_id=NM_002136,NM_031157;ensg_id=ENSG00000135486,ENSG00000258344;gsymbol=HNRNPA1,RP11-968A15.8;ID=chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+
chr12   SE  mRNA    54676584    54677751    .   +   .   gid=chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+;ID=chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+.A;Parent=chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+
chr12   SE  exon    54676584    54676658    .   +   .   gid=chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+;ID=chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+.A.up;Parent=chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+.A
chr12   SE  exon    54676863    54677018    .   +   .   gid=chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+;ID=chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+.A.se;Parent=chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+.A
chr12   SE  exon    54677596    54677751    .   +   .   gid=chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+;ID=chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+.A.dn;Parent=chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+.A
chr12   SE  mRNA    54676584    54677751    .   +   .   gid=chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+;ID=chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+.B;Parent=chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+
chr12   SE  exon    54676584    54676658    .   +   .   gid=chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+;ID=chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+.B.up;Parent=chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+.B
chr12   SE  exon    54677596    54677751    .   +   .   gid=chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+;ID=chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+.B.dn;Parent=chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+.B

When I run this in simple single-end mode, as in:

$ miso --run ./indexed_hnrnpa1/ M1_01.hnrnpa1.bam --read-len 92 --output-dir ./se_miso

(ignore the new miso command-line syntax; that's coming up in a new release to simplify things, but not relevant to this.)

on the GFF event above, it gives me the following counts:

counts=(0,0):3200,(0,1):1365,(1,1):1022 

which seems consistent with the IGV view, see attached, where I loaded the BAM file and the GFF file with the event above.

olga_hnrnpa1

olgabot commented 10 years ago

Here's the full command. I ran it in paired-end mode, and here's the first line of the insert_len file:

#mean=254.8,sdev=83.8,dispersion=5.3,num_pairs=449856

It seems like the single-end mode uses all the reads, but the paired-end mode gets a bunch of them thrown out.

READ_LEN=$(samtools view ./M1_01.bam | head -n 1 | cut -f 10 | awk '{ print length }')

# Calculate insert size
python /home/yeo-lab/software/bin/pe_utils.py --compute-insert-len ./M1_01.bam /projects/ps-yeolab/genomes/hg19/miso_annotations/SE_constitutive/SE.hg19.min_20.const_exons.gff --no-bam-filter --output-dir ./
INSERT_LEN_MEAN=$(head -n 1 ./M1_01.bam.insert_len | sed 's:#::' | cut -d',' -f1 | cut -d'=' -f2)
INSERT_LEN_STDDEV=$(head -n 1 ./M1_01.bam.insert_len | sed 's:#::' | cut -d',' -f2 | cut -d'=' -f2)

# calculate Psi scores for all SE events
mkdir -p ./M1_01/SE
python /home/yeo-lab/software/bin/miso --run /projects/ps-yeolab/genomes/hg19/miso_annotations/SE_index ./M1_01.bam --output-dir ./miso/M1_01/SE --read-len $READ_LEN  --paired-end $INSERT_LEN_MEAN $INSERT_LEN_STDDEV  --no-filter-events -p 16  > ./miso/M1_01/SE/psi.out 2> ./miso/M1_01/SE/psi.err
yarden commented 10 years ago

My thinking is that this is caused by the insert length parameters that were fed in. Those insert length parameters are most certainly either wrong, or the library size-selection was so poor that it should not be processed in paired-end mode for MISO. The current parameters say the standard deviation is roughly %40 of the mean -- seems far too large.

Based on your filename, it looks like you passed pe_utils.py virtually all constitutive exons (only minimum size of 20 based on filename, SE.hg19.min_20.const_exons.gff), which will distort the calculation. The constitutive exon must be on average much longer than than the insert length for the calculation to be reliable, otherwise it's confounded by alternative splicing. I'd try first recalculating that by making a new GFF, using exon_utils, which takes a minimum exon size of 500 bp to be conservative (basically selects for constitutive 3' UTRs, mostly). Then see if those insert length parameters make a difference. If you run into more problems with this, please send me the BAM so I can calculate it the insert-length and then run it like you did (can't get the insert-length from just the hnRNPA1 locus reads).