yarden / MISO

MISO: Mixture of Isoforms model for RNA-Seq isoform quantitation
http://genes.mit.edu/burgelab/miso/index.html

Properly paired on tophat output #60

Closed: frontal closed this issue 10 years ago

frontal commented 11 years ago

Shalom Yarden,

I am using a BAM file generated by tophat2 (with bowtie2), and I found that only 50% of the reads are properly paired according to flagstat.

However, since the data is RNA, I'm not sure this is bad, as the mate position can be significantly far from the first read's coordinate.

I see that MISO considers this flag. Should I try to rerun tophat2 with better parameters (-r -std) to get a better percentage, or should I just disable the flag usage in the settings.txt file in MISO?

What does disabling this feature in the settings mean? What disadvantages does it have? Also, is tophat2 aware of intron spanning in the mate coordinate when assigning the properly paired flag?

yarden commented 10 years ago

Hi,

Apologies for the delayed replies to all your requests; I was defending my thesis and swamped with related things. I think 50% properly paired is quite low, so if you can fix that upstream in Tophat that would be ideal, since some of the parameters might be off, which could be causing this. Alternatively, you can preprocess your BAM to pair the reads by their read ID only (i.e. set the BAM flag yourself) if you think Tophat is incorrectly calling reads as unpaired when they are in fact paired.
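For later readers, here is a minimal sketch of what "pair the reads by their read ID only" could look like. It works on SAM text lines (for example, the output of `samtools view -h`) and only uses the standard SAM flag bits. The function name `mark_pairs_by_read_id` is hypothetical, and the rule "two mapped primary records with the same ID form a proper pair" is an assumption made for illustration, not something MISO or Tophat prescribes.

```python
from collections import defaultdict

# Standard SAM flag bits.
PAIRED, PROPER_PAIR, UNMAPPED = 0x1, 0x2, 0x4
SECONDARY, SUPPLEMENTARY = 0x100, 0x800


def mark_pairs_by_read_id(sam_lines):
    """Set the proper-pair bit when exactly two mapped primary records share a read ID.

    Hypothetical helper: header lines (starting with '@') pass through unchanged.
    """
    lines = [line.rstrip("\n") for line in sam_lines]
    mates = defaultdict(list)  # read ID -> indices of its mapped primary records
    for i, line in enumerate(lines):
        if line.startswith("@"):
            continue
        fields = line.split("\t")
        flag = int(fields[1])
        if flag & PAIRED and not flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY):
            mates[fields[0]].append(i)
    for indices in mates.values():
        if len(indices) == 2:
            for i in indices:
                fields = lines[i].split("\t")
                fields[1] = str(int(fields[1]) | PROPER_PAIR)
                lines[i] = "\t".join(fields)
    return lines
```

Note that this ignores orientation and distance entirely, so it is only reasonable if you already trust the alignments themselves.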

I'm not sure which MISO flag you're referring to in settings.txt? Can you explain?

Older versions of Tophat used to consider the insert length you feed it when calling mate pairs, but I believe the insert length is no longer required as an input parameter, if I remember correctly, and it now does something else to figure out how to properly pair the mates. --Yarden

frontal commented 10 years ago

Toda Raba for your help. I appreciate your efforts and time.

By now I have managed to figure out (more or less) how tophat calls proper pairs, and I found that the parameters still matter. It seems tophat uses the inner distance (not TLEN) to determine proper pairing. By default it is very restrictive, allowing the reads to be only a couple of dozen bases apart (right end of Read1 to left end of Read2), and any pair with a large inner distance is considered not proper even though the reads are correctly oriented and just hundreds of bases apart. If the user increases those parameters, the proper-pair percentage rises (though I think it is better for MISO to keep them low and restrictive, rather than widen the insert-length distribution).
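To make the inner-distance versus TLEN distinction concrete, here is a small sketch. It assumes ungapped reads with 1-based leftmost start coordinates. The tolerance check `within_tolerance` and its `n_sd` parameter are hypothetical and only illustrate why widening the expected distance or its deviation lets more pairs pass; this is not Tophat's actual rule, which is not spelled out in this thread.

```python
def inner_distance(read1_start, read1_len, read2_start):
    """Bases strictly between the right end of read 1 and the left end of read 2."""
    read1_end = read1_start + read1_len - 1
    return read2_start - read1_end - 1


def outer_distance(read1_start, read2_start, read2_len):
    """Span from the left end of read 1 to the right end of read 2 (TLEN-like)."""
    return read2_start + read2_len - read1_start


def within_tolerance(inner, expected, sd, n_sd=3):
    """Hypothetical check: is the inner distance within n_sd deviations of the expected value?"""
    return abs(inner - expected) <= n_sd * sd


# Two 50 bp reads whose left ends are 250 bases apart.
inner = inner_distance(read1_start=100, read1_len=50, read2_start=350)
outer = outer_distance(read1_start=100, read2_start=350, read2_len=50)
print(inner, outer)

# Example parameter values only: a narrow setting rejects the pair, a wider one accepts it.
print(within_tolerance(inner, expected=50, sd=20))
print(within_tolerance(inner, expected=150, sd=50))
```

For spliced reads the genomic distance also includes any intron between the mates, so a genomic inner distance of several hundred bases can correspond to a much shorter fragment.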

*Regarding settings.txt, I saw a post of yours saying that removing filtering causes reads to be identified by ID only. I was able to raise the proper-pair percentage on my BAM by removing multi-hit reads and keeping only primary alignments. I guess my problem lay there, so please ignore that question.
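A rough sketch of why dropping non-primary records can change the proper-pair percentage. It operates on a plain list of SAM flag integers. The helper names and the toy flag values are made up for illustration, and the fraction computed here is a simplification rather than a reimplementation of flagstat.

```python
# Standard SAM flag bits.
PAIRED, PROPER_PAIR = 0x1, 0x2
SECONDARY, SUPPLEMENTARY = 0x100, 0x800


def keep_primary(flags):
    """Drop secondary and supplementary records, keeping primary alignments only."""
    return [f for f in flags if not f & (SECONDARY | SUPPLEMENTARY)]


def proper_pair_fraction(flags):
    """Fraction of paired records that carry the proper-pair bit."""
    paired = [f for f in flags if f & PAIRED]
    if not paired:
        return 0.0
    return sum(1 for f in paired if f & PROPER_PAIR) / len(paired)


# Toy data: one proper pair, one improper pair, and two secondary hits.
flags = [99, 147, 65, 129, 321, 385]
print(proper_pair_fraction(flags))
print(proper_pair_fraction(keep_primary(flags)))
```

In this toy input the secondary records are paired but not flagged as proper, so removing them raises the fraction; whether that holds for a real BAM depends on how the aligner flags its multi-hits.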

That said, I was hoping you could assist me with a bigger problem I'm facing: in our condition we believe not only Cassette Exons are affected, but CS as well.

I have created a new GFF3 file holding 140k events covering all known possible exon trios (I used TXdbgen by SpliceTrap), and then I did the same with constitutive exons only (by choosing some rules for exons to be CS). I fed MISO both annotations.

MISO's output for Wild-Type showed many more alternative splicing events than expected: around 40k alternative out of 140k trios (with the all-possible-trios file), and the same proportion appeared with the 'CS-Exons-only' file.

Example: PSI=0.18 ; Counts=(0,0):33,(1,0):1,(1,1):30 ; Assigned Counts = 0:2,1:29

My conclusion was that MISO cannot handle CS exons, since it was meant to deal with Cassette Exons only (?). Perhaps MISO expects an exclusion isoform to be present in all cases and therefore computes PSI for CS incorrectly.

Can you share your thoughts? Is there any way to run MISO on all exons, not only alternative ones, to calculate PSI for all of them?

Thanks again for your time and patience.

yarden commented 10 years ago

Hi,

I'm not sure I follow the question. What is 'CS'?

MISO is not specific to Cassette Exons; it will estimate the expression of any set of isoforms that are in your GFF file. If you feed it a constitutive exon trio (i.e. a trio containing exons that are always spliced in, never spliced out) then the PSI values can still be not exactly 0 or 1 because of the way read densities in flanking exons are used. If you have no evidence in the data that a particular exon trio is alternative (i.e. none of the samples ever show even a single exclusion read), there is in my view little point in including that trio in your analysis, as it is most likely a constitutive exon. In the example you showed, there are no reads in the "(0,1)" category, i.e. no reads that are not consistent with the first isoform and consistent with the second, so the Psi value should be much closer to 0 than to 1, which it is. Note that the isoform that defines the Psi value is the one that appears first in the GFF file. If you reverse this order for the example you showed and re-run MISO, your Psi value would be 1-0.18=0.82

Best, --Yarden
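For readers who want to inspect read-class counts like the ones in the example above, here is a minimal parsing sketch. The function name `parse_miso_counts` is hypothetical. It reads each key as a tuple of 0/1 indicators, one per isoform in GFF order, following the description in the reply above: `(1,0)` means consistent with the first isoform only, `(0,1)` with the second only, `(1,1)` with both.

```python
import re


def parse_miso_counts(counts):
    """Parse a string like '(0,0):33,(1,0):1,(1,1):30' into {class_tuple: read_count}."""
    parsed = {}
    for classes, n in re.findall(r"\(([\d,]+)\):(\d+)", counts):
        parsed[tuple(int(c) for c in classes.split(","))] = int(n)
    return parsed


counts = parse_miso_counts("(0,0):33,(1,0):1,(1,1):30")
first_only = counts.get((1, 0), 0)   # reads consistent with the first isoform only
second_only = counts.get((0, 1), 0)  # reads consistent with the second isoform only
print(first_only, second_only)

# Swapping the isoform order in the GFF flips which isoform Psi refers to.
psi = 0.18
print(round(1 - psi, 2))
```

Checking whether the `(0,1)` or `(1,0)` class is empty across all samples is one quick way to spot trios with no isoform-specific evidence before including them in an analysis.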