Open olgabot opened 10 years ago
Hi Olga,
It's hard to say without an example. In general, there are several reasons get thrown out, but the filtering depends on the parameters you feed MISO. Whether it's a paired- or single-end run or stranded or unstranded will change things.
PE reads that are not properly paired within locus of interest are thrown out too. Older versions enabled overhang constraints. Also keep in mind that 'counts' fields are counting read pairs in PE and single reads in SE.
Anyway, it'll be easiest if you send me a GFF file for an event of interest and the BAM reads for that file. If you let me know which reads you expect to be included I can see why they're not.
Best, --Yarden
PS Sounds like a splicing party. Say hi to Chris :)
Here's a bam file of just the HNRNPA1 region: link
And here's the output I get for one of the events:
chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+ 0.00 0.00 0.01 'chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+.A.up_chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+.A.se_chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+.A.dn','chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+.B.up_chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+.B.dn' (0,0):739,(0,1):107,(1,1):14 0:0,1:121 chr12 + 54676584,54676584 54677751,54677751
I ran this with --no-bam-filter for both insert len, and without a threshold for minimum counts.
As you can see, the raw counts are (0,0):739,(0,1):107,(1,1):14
but when I look at the region in IGV I see 1247 (+)-strand junction reads and 457 (-)-strand reads, so there should be at least 1704 reads supporting the exclusion isoform. But where did they go?
Can you please tell me what command you ran exactly, i.e. what parameters? You mentioned insert len - if you ran it in paired-end mode, let me know what the insert len distribution parameters you fed it were, since I can't estimate that just from the hnRNPA1 reads.
I see that this event is:
chr12 SE gene 54676584 54677751 . + . Name=chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+;gid=chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+;refseq_id=NM_002136,NM_031157;ensg_id=ENSG00000135486,ENSG00000258344;gsymbol=HNRNPA1,RP11-968A15.8;ID=chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+
chr12 SE mRNA 54676584 54677751 . + . gid=chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+;ID=chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+.A;Parent=chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+
chr12 SE exon 54676584 54676658 . + . gid=chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+;ID=chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+.A.up;Parent=chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+.A
chr12 SE exon 54676863 54677018 . + . gid=chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+;ID=chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+.A.se;Parent=chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+.A
chr12 SE exon 54677596 54677751 . + . gid=chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+;ID=chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+.A.dn;Parent=chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+.A
chr12 SE mRNA 54676584 54677751 . + . gid=chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+;ID=chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+.B;Parent=chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+
chr12 SE exon 54676584 54676658 . + . gid=chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+;ID=chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+.B.up;Parent=chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+.B
chr12 SE exon 54677596 54677751 . + . gid=chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+;ID=chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+.B.dn;Parent=chr12:54676584:54676658:+@chr12:54676863:54677018:+@chr12:54677596:54677751:+.B
When I run this in simple single-end mode, as in:
$ miso --run ./indexed_hnrnpa1/ M1_01.hnrnpa1.bam --read-len 92 --output-dir ./se_miso
(ignore the new miso
command-line syntax; that's coming up in a new release to simplify things, but not relevant to this.)
on the GFF event above, it gives me the following counts:
counts=(0,0):3200,(0,1):1365,(1,1):1022
which seems consistent with the IGV view, see attached, where I loaded the BAM file and the GFF file with the event above.
Here's the full command. I ran it in paired-end mode, and here's the first line of the insert_len
file:
#mean=254.8,sdev=83.8,dispersion=5.3,num_pairs=449856
It seems like the single-end mode uses all the reads, but the paired-end mode gets a bunch of them thrown out.
READ_LEN=$(samtools view ./M1_01.bam | head -n 1 | cut -f 10 | awk '{ print length }')
# Calculate insert size
python /home/yeo-lab/software/bin/pe_utils.py --compute-insert-len ./M1_01.bam /projects/ps-yeolab/genomes/hg19/miso_annotations/SE_constitutive/SE.hg19.min_20.const_exons.gff --no-bam-filter --output-dir ./
INSERT_LEN_MEAN=$(head -n 1 ./M1_01.bam.insert_len | sed 's:#::' | cut -d',' -f1 | cut -d'=' -f2)
INSERT_LEN_STDDEV=$(head -n 1 ./M1_01.bam.insert_len | sed 's:#::' | cut -d',' -f2 | cut -d'=' -f2)
# calculate Psi scores for all SE events
mkdir -p ./M1_01/SE
python /home/yeo-lab/software/bin/miso --run /projects/ps-yeolab/genomes/hg19/miso_annotations/SE_index ./M1_01.bam --output-dir ./miso/M1_01/SE --read-len $READ_LEN --paired-end $INSERT_LEN_MEAN $INSERT_LEN_STDDEV --no-filter-events -p 16 > ./miso/M1_01/SE/psi.out 2> ./miso/M1_01/SE/psi.err
My thinking is that this is caused by the insert length parameters that were fed in. Those insert length parameters are most certainly either wrong, or the library size-selection was so poor that it should not be processed in paired-end mode for MISO. The current parameters say the standard deviation is roughly %40 of the mean -- seems far too large.
Based on your filename, it looks like you passed pe_utils.py
virtually all constitutive exons (only minimum size of 20 based on filename, SE.hg19.min_20.const_exons.gff
), which will distort the calculation. The constitutive exon must be on average much longer than than the insert length for the calculation to be reliable, otherwise it's confounded by alternative splicing. I'd try first recalculating that by making a new GFF, using exon_utils
, which takes a minimum exon size of 500 bp to be conservative (basically selects for constitutive 3' UTRs, mostly). Then see if those insert length parameters make a difference. If you run into more problems with this, please send me the BAM so I can calculate it the insert-length and then run it like you did (can't get the insert-length from just the hnRNPA1 locus reads).
Hello there, It's unclear to me why some reads get thrown out of the MISO calculation, and where the
counts
are coming from. We already did filtering on what reads we think are good at the mapping step, so when I look at the junction on IGV, there's more reads than are reported. Can you describe the criteria for throwing out reads? Thanks, OlgaP.S. Chris Burge suggested I ask - he's visiting us in SD now :)