yarden / MISO

MISO: Mixture of Isoforms model for RNA-Seq isoform quantitation
http://genes.mit.edu/burgelab/miso/index.html

GFF IDs don't correspond to exon start/stops #55

Open olgabot opened 11 years ago

olgabot commented 11 years ago

Hello Yarden et al, This is more of a feature request than a bug.

I'm trying to understand the ID scheme of the provided gff3 files. The documentation says, as an example, that the ID of one SE entry was "arbitrarily" chosen to be the coordinates of the 5' upstream exon, the SE itself, and its 3' downstream exon. However, I don't see this in the current SE.hg19.gff3 file:

(obot_virtualenv)[obotvinnik@tscc-login2 prettyplotlib]$ grep exon ~/genomes/miso_annotations/hg19/SE.hg19.gff3 | head
chr1    SE      exon    16854   17055   .       -       .       ID=chr1:7778:7924:-@chr1:7096:7605:-@chr1:6717:6918:-.A.dn;Parent=chr1:7778:7924:-@chr1:7096:7605:-@chr1:6717:6918:-.A;gid=chr1:7778:7924:-@chr1:7096:7605:-@chr1:6717:6918:-
chr1    SE      exon    17233   17742   .       -       .       ID=chr1:7778:7924:-@chr1:7096:7605:-@chr1:6717:6918:-.A.se;Parent=chr1:7778:7924:-@chr1:7096:7605:-@chr1:6717:6918:-.A;gid=chr1:7778:7924:-@chr1:7096:7605:-@chr1:6717:6918:-
chr1    SE      exon    17915   18061   .       -       .       ID=chr1:7778:7924:-@chr1:7096:7605:-@chr1:6717:6918:-.A.up;Parent=chr1:7778:7924:-@chr1:7096:7605:-@chr1:6717:6918:-.A;gid=chr1:7778:7924:-@chr1:7096:7605:-@chr1:6717:6918:-
chr1    SE      exon    16854   17055   .       -       .       ID=chr1:7778:7924:-@chr1:7096:7605:-@chr1:6717:6918:-.B.dn;Parent=chr1:7778:7924:-@chr1:7096:7605:-@chr1:6717:6918:-.B;gid=chr1:7778:7924:-@chr1:7096:7605:-@chr1:6717:6918:-
chr1    SE      exon    17915   18061   .       -       .       ID=chr1:7778:7924:-@chr1:7096:7605:-@chr1:6717:6918:-.B.up;Parent=chr1:7778:7924:-@chr1:7096:7605:-@chr1:6717:6918:-.B;gid=chr1:7778:7924:-@chr1:7096:7605:-@chr1:6717:6918:-
chr1    SE      exon    17233   17368   .       -       .       ID=chr1:7778:7924:-@chr1:7469:7605:-@chr1:7096:7231:-.A.dn;Parent=chr1:7778:7924:-@chr1:7469:7605:-@chr1:7096:7231:-.A;gid=chr1:7778:7924:-@chr1:7469:7605:-@chr1:7096:7231:-
chr1    SE      exon    17606   17742   .       -       .       ID=chr1:7778:7924:-@chr1:7469:7605:-@chr1:7096:7231:-.A.se;Parent=chr1:7778:7924:-@chr1:7469:7605:-@chr1:7096:7231:-.A;gid=chr1:7778:7924:-@chr1:7469:7605:-@chr1:7096:7231:-
chr1    SE      exon    17915   18061   .       -       .       ID=chr1:7778:7924:-@chr1:7469:7605:-@chr1:7096:7231:-.A.up;Parent=chr1:7778:7924:-@chr1:7469:7605:-@chr1:7096:7231:-.A;gid=chr1:7778:7924:-@chr1:7469:7605:-@chr1:7096:7231:-
chr1    SE      exon    17233   17368   .       -       .       ID=chr1:7778:7924:-@chr1:7469:7605:-@chr1:7096:7231:-.B.dn;Parent=chr1:7778:7924:-@chr1:7469:7605:-@chr1:7096:7231:-.B;gid=chr1:7778:7924:-@chr1:7469:7605:-@chr1:7096:7231:-
chr1    SE      exon    17915   18061   .       -       .       ID=chr1:7778:7924:-@chr1:7469:7605:-@chr1:7096:7231:-.B.up;Parent=chr1:7778:7924:-@chr1:7469:7605:-@chr1:7096:7231:-.B;gid=chr1:7778:7924:-@chr1:7469:7605:-@chr1:7096:7231:-

For example, the first exon is on the negative strand and has a start and stop of 16854 and 17055. However, its ID is chr1:7778:7924:-@chr1:7096:7605:-@chr1:6717:6918:-.A.dn, which doesn't include either of those numbers!
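The mismatch can be checked mechanically. Below is a minimal sketch, not part of MISO: `id_matches_coordinates` is a hypothetical helper that tests whether the start and stop columns of an exon line appear among the numbers in its ID. It splits on any whitespace because the pasted lines above are space-aligned, whereas a real GFF3 file is tab-delimited.

```python
import re

def id_matches_coordinates(gff_line):
    """Return True if both the start and end columns appear as numbers in the ID."""
    # Nine GFF3 columns; the ninth holds the ';'-separated attributes.
    fields = gff_line.split(None, 8)
    start, end, attributes = fields[3], fields[4], fields[8]
    attrs = dict(pair.split("=", 1) for pair in attributes.strip().split(";") if pair)
    # Every run of digits in the ID (this also picks up the chromosome number).
    id_numbers = set(re.findall(r"\d+", attrs["ID"]))
    return start in id_numbers and end in id_numbers

# First exon line from the SE.hg19.gff3 excerpt above.
line = ("chr1 SE exon 16854 17055 . - . "
        "ID=chr1:7778:7924:-@chr1:7096:7605:-@chr1:6717:6918:-.A.dn;"
        "Parent=chr1:7778:7924:-@chr1:7096:7605:-@chr1:6717:6918:-.A;"
        "gid=chr1:7778:7924:-@chr1:7096:7605:-@chr1:6717:6918:-")
print(id_matches_coordinates(line))  # False: neither 16854 nor 17055 is in the ID
```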

This has been especially confusing when interpreting MISO output: I go to the middle chromosome location in the ID and find no reads there. The location given by the start and stop columns in the original .gff3 file is correct (which is comforting), but it's kind of a pain to have to grep for this arbitrary ID every time.

Is it possible for these .gff3 files to be updated such that the ID matches the chromosome location?
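Until the files are updated, one workaround for the repeated grep is to build a lookup table once. This is only a sketch under a couple of assumptions: `index_exons_by_event` is a hypothetical helper (not a MISO function), every exon line carries a `gid` attribute, and each exon ID has the form `<gid>.<isoform>.<part>`, as in the excerpts in this report.

```python
from collections import defaultdict

def index_exons_by_event(gff_lines):
    """Map each event ID (gid) to {exon label: (chrom, start, end, strand)}."""
    index = defaultdict(dict)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and GFF3 comments/directives
        fields = line.split(None, 8)
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only exon records carry the coordinates of interest
        chrom, start, end, strand = fields[0], int(fields[3]), int(fields[4]), fields[6]
        attrs = dict(p.split("=", 1) for p in fields[8].strip().split(";") if p)
        # Assumed ID layout: "<gid>.<isoform>.<part>", so strip the gid and the dot.
        label = attrs["ID"][len(attrs["gid"]) + 1:]
        index[attrs["gid"]][label] = (chrom, start, end, strand)
    return index

gid = "chr1:7778:7924:-@chr1:7096:7605:-@chr1:6717:6918:-"
sample = [
    f"chr1 SE exon 16854 17055 . - . ID={gid}.A.dn;Parent={gid}.A;gid={gid}",
    f"chr1 SE exon 17233 17742 . - . ID={gid}.A.se;Parent={gid}.A;gid={gid}",
]
index = index_exons_by_event(sample)
print(index[gid]["A.se"])  # ('chr1', 17233, 17742, '-')
```

With a real file, passing the open file handle instead of `sample` should work the same way, since the function only iterates over lines.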

FWIW, this seems to also be an issue in A3SS.hg19.gff3:

(obot_virtualenv)[obotvinnik@tscc-login2 prettyplotlib]$ grep exon ~/genomes/miso_annotations/hg19/A3SS.hg19.gff3 | head
chr1    A3SS    exon    15796   15947   .       -       .       ID=chr1:6470:6628:-@chr1:5805|5810:5659:-.A.coreAndExt;Parent=chr1:6470:6628:-@chr1:5805|5810:5659:-.A;gid=chr1:6470:6628:-@chr1:5805|5810:5659:-
chr1    A3SS    exon    16607   16765   .       -       .       ID=chr1:6470:6628:-@chr1:5805|5810:5659:-.A.up;Parent=chr1:6470:6628:-@chr1:5805|5810:5659:-.A;gid=chr1:6470:6628:-@chr1:5805|5810:5659:-
chr1    A3SS    exon    15796   15942   .       -       .       ID=chr1:6470:6628:-@chr1:5805|5810:5659:-.B.core;Parent=chr1:6470:6628:-@chr1:5805|5810:5659:-.B;gid=chr1:6470:6628:-@chr1:5805|5810:5659:-
chr1    A3SS    exon    16607   16765   .       -       .       ID=chr1:6470:6628:-@chr1:5805|5810:5659:-.B.up;Parent=chr1:6470:6628:-@chr1:5805|5810:5659:-.B;gid=chr1:6470:6628:-@chr1:5805|5810:5659:-
chr1    A3SS    exon    17233   17742   .       -       .       ID=chr1:7778:7924:-@chr1:7231|7605:7096:-.A.coreAndExt;Parent=chr1:7778:7924:-@chr1:7231|7605:7096:-.A;gid=chr1:7778:7924:-@chr1:7231|7605:7096:-
chr1    A3SS    exon    17915   18061   .       -       .       ID=chr1:7778:7924:-@chr1:7231|7605:7096:-.A.up;Parent=chr1:7778:7924:-@chr1:7231|7605:7096:-.A;gid=chr1:7778:7924:-@chr1:7231|7605:7096:-
chr1    A3SS    exon    17233   17368   .       -       .       ID=chr1:7778:7924:-@chr1:7231|7605:7096:-.B.core;Parent=chr1:7778:7924:-@chr1:7231|7605:7096:-.B;gid=chr1:7778:7924:-@chr1:7231|7605:7096:-
chr1    A3SS    exon    17915   18061   .       -       .       ID=chr1:7778:7924:-@chr1:7231|7605:7096:-.B.up;Parent=chr1:7778:7924:-@chr1:7231|7605:7096:-.B;gid=chr1:7778:7924:-@chr1:7231|7605:7096:-
chr1    A3SS    exon    18268   18379   .       -       .       ID=chr1:8776:14754:-@chr1:8232|8242:8131:-.A.coreAndExt;Parent=chr1:8776:14754:-@chr1:8232|8242:8131:-.A;gid=chr1:8776:14754:-@chr1:8232|8242:8131:-
chr1    A3SS    exon    18913   24891   .       -       .       ID=chr1:8776:14754:-@chr1:8232|8242:8131:-.A.up;Parent=chr1:8776:14754:-@chr1:8232|8242:8131:-.A;gid=chr1:8776:14754:-@chr1:8232|8242:8131:-

A5SS.hg19.gff3:

(obot_virtualenv)[obotvinnik@tscc-login2 prettyplotlib]$ grep exon ~/genomes/miso_annotations/hg19/A5SS.hg19.gff3 | head
chr1    A5SS    exon    17233   17368   .       -       .       ID=chr1:7605:7469|7389:-@chr1:7096:7231:-.A.dn;Parent=chr1:7605:7469|7389:-@chr1:7096:7231:-.A;gid=chr1:7605:7469|7389:-@chr1:7096:7231:-
chr1    A5SS    exon    17526   17742   .       -       .       ID=chr1:7605:7469|7389:-@chr1:7096:7231:-.A.coreAndExt;Parent=chr1:7605:7469|7389:-@chr1:7096:7231:-.A;gid=chr1:7605:7469|7389:-@chr1:7096:7231:-
chr1    A5SS    exon    17233   17368   .       -       .       ID=chr1:7605:7469|7389:-@chr1:7096:7231:-.B.dn;Parent=chr1:7605:7469|7389:-@chr1:7096:7231:-.B;gid=chr1:7605:7469|7389:-@chr1:7096:7231:-
chr1    A5SS    exon    17606   17742   .       -       .       ID=chr1:7605:7469|7389:-@chr1:7096:7231:-.B.core;Parent=chr1:7605:7469|7389:-@chr1:7096:7231:-.B;gid=chr1:7605:7469|7389:-@chr1:7096:7231:-
chr1    A5SS    exon    16854   17055   .       -       .       ID=chr1:7924:7778|7469:-@chr1:6717:6918:-.A.dn;Parent=chr1:7924:7778|7469:-@chr1:6717:6918:-.A;gid=chr1:7924:7778|7469:-@chr1:6717:6918:-
chr1    A5SS    exon    17606   18061   .       -       .       ID=chr1:7924:7778|7469:-@chr1:6717:6918:-.A.coreAndExt;Parent=chr1:7924:7778|7469:-@chr1:6717:6918:-.A;gid=chr1:7924:7778|7469:-@chr1:6717:6918:-
chr1    A5SS    exon    16854   17055   .       -       .       ID=chr1:7924:7778|7469:-@chr1:6717:6918:-.B.dn;Parent=chr1:7924:7778|7469:-@chr1:6717:6918:-.B;gid=chr1:7924:7778|7469:-@chr1:6717:6918:-
chr1    A5SS    exon    17915   18061   .       -       .       ID=chr1:7924:7778|7469:-@chr1:6717:6918:-.B.core;Parent=chr1:7924:7778|7469:-@chr1:6717:6918:-.B;gid=chr1:7924:7778|7469:-@chr1:6717:6918:-
chr1    A5SS    exon    14363   16765   .       -       .       ID=chr1:6918:6739|6721:-@chr1:4226:6628:-.A.dn;Parent=chr1:6918:6739|6721:-@chr1:4226:6628:-.A;gid=chr1:6918:6739|6721:-@chr1:4226:6628:-
chr1    A5SS    exon    16858   17055   .       -       .       ID=chr1:6918:6739|6721:-@chr1:4226:6628:-.A.coreAndExt;Parent=chr1:6918:6739|6721:-@chr1:4226:6628:-.A;gid=chr1:6918:6739|6721:-@chr1:4226:6628:-

MXE.hg19.gff3:

(obot_virtualenv)[obotvinnik@tscc-login2 prettyplotlib]$ grep exon ~/genomes/miso_annotations/hg19/MXE.hg19.gff3 | head
chr1    MXE     exon    764383  764484  .       +       .       ID=chr1:754246:754347:+@chr1:766443:766616:+@chr1:772897:773049:+@chr1:777170:777953:+.A.up;Parent=chr1:754246:754347:+@chr1:766443:766616:+@chr1:772897:773049:+@chr1:777170:777953:+.A;gid=chr1:754246:754347:+@chr1:766443:766616:+@chr1:772897:773049:+@chr1:777170:777953:+
chr1    MXE     exon    776580  776753  .       +       .       ID=chr1:754246:754347:+@chr1:766443:766616:+@chr1:772897:773049:+@chr1:777170:777953:+.A.mxe1;Parent=chr1:754246:754347:+@chr1:766443:766616:+@chr1:772897:773049:+@chr1:777170:777953:+.A;gid=chr1:754246:754347:+@chr1:766443:766616:+@chr1:772897:773049:+@chr1:777170:777953:+
chr1    MXE     exon    787307  788090  .       +       .       ID=chr1:754246:754347:+@chr1:766443:766616:+@chr1:772897:773049:+@chr1:777170:777953:+.A.dn;Parent=chr1:754246:754347:+@chr1:766443:766616:+@chr1:772897:773049:+@chr1:777170:777953:+.A;gid=chr1:754246:754347:+@chr1:766443:766616:+@chr1:772897:773049:+@chr1:777170:777953:+
chr1    MXE     exon    764383  764484  .       +       .       ID=chr1:754246:754347:+@chr1:766443:766616:+@chr1:772897:773049:+@chr1:777170:777953:+.B.up;Parent=chr1:754246:754347:+@chr1:766443:766616:+@chr1:772897:773049:+@chr1:777170:777953:+.B;gid=chr1:754246:754347:+@chr1:766443:766616:+@chr1:772897:773049:+@chr1:777170:777953:+
chr1    MXE     exon    783034  783186  .       +       .       ID=chr1:754246:754347:+@chr1:766443:766616:+@chr1:772897:773049:+@chr1:777170:777953:+.B.mxe2;Parent=chr1:754246:754347:+@chr1:766443:766616:+@chr1:772897:773049:+@chr1:777170:777953:+.B;gid=chr1:754246:754347:+@chr1:766443:766616:+@chr1:772897:773049:+@chr1:777170:777953:+
chr1    MXE     exon    787307  788090  .       +       .       ID=chr1:754246:754347:+@chr1:766443:766616:+@chr1:772897:773049:+@chr1:777170:777953:+.B.dn;Parent=chr1:754246:754347:+@chr1:766443:766616:+@chr1:772897:773049:+@chr1:777170:777953:+.B;gid=chr1:754246:754347:+@chr1:766443:766616:+@chr1:772897:773049:+@chr1:777170:777953:+
chr1    MXE     exon    1027371 1027483 .       -       .       ID=chr1:1041303:1041599:-@chr1:1040265:1040318:-@chr1:1031199:1031292:-@chr1:1017234:1017346:-.B.dn;Parent=chr1:1041303:1041599:-@chr1:1040265:1040318:-@chr1:1031199:1031292:-@chr1:1017234:1017346:-.B;gid=chr1:1041303:1041599:-@chr1:1040265:1040318:-@chr1:1031199:1031292:-@chr1:1017234:1017346:-
chr1    MXE     exon    1041336 1041429 .       -       .       ID=chr1:1041303:1041599:-@chr1:1040265:1040318:-@chr1:1031199:1031292:-@chr1:1017234:1017346:-.B.mxe2;Parent=chr1:1041303:1041599:-@chr1:1040265:1040318:-@chr1:1031199:1031292:-@chr1:1017234:1017346:-.B;gid=chr1:1041303:1041599:-@chr1:1040265:1040318:-@chr1:1031199:1031292:-@chr1:1017234:1017346:-
chr1    MXE     exon    1051440 1051736 .       -       .       ID=chr1:1041303:1041599:-@chr1:1040265:1040318:-@chr1:1031199:1031292:-@chr1:1017234:1017346:-.B.up;Parent=chr1:1041303:1041599:-@chr1:1040265:1040318:-@chr1:1031199:1031292:-@chr1:1017234:1017346:-.B;gid=chr1:1041303:1041599:-@chr1:1040265:1040318:-@chr1:1031199:1031292:-@chr1:1017234:1017346:-
chr1    MXE     exon    1027371 1027483 .       -       .       ID=chr1:1041303:1041599:-@chr1:1040265:1040318:-@chr1:1031199:1031292:-@chr1:1017234:1017346:-.A.dn;Parent=chr1:1041303:1041599:-@chr1:1040265:1040318:-@chr1:1031199:1031292:-@chr1:1017234:1017346:-.A;gid=chr1:1041303:1041599:-@chr1:1040265:1040318:-@chr1:1031199:1031292:-@chr1:1017234:1017346:-

RI.hg19.gff3:

(obot_virtualenv)[obotvinnik@tscc-login2 prettyplotlib]$ grep exon ~/genomes/miso_annotations/hg19/RI.hg19.gff3 | head
chr1    RI      exon    17233   17742   .       -       .       ID=chr1:7464:7605:-@chr1:7096:7227:-.A.withRI;Parent=chr1:7464:7605:-@chr1:7096:7227:-.A;gid=chr1:7464:7605:-@chr1:7096:7227:-
chr1    RI      exon    17233   17364   .       -       .       ID=chr1:7464:7605:-@chr1:7096:7227:-.B.dn;Parent=chr1:7464:7605:-@chr1:7096:7227:-.B;gid=chr1:7464:7605:-@chr1:7096:7227:-
chr1    RI      exon    17601   17742   .       -       .       ID=chr1:7464:7605:-@chr1:7096:7227:-.B.up;Parent=chr1:7464:7605:-@chr1:7096:7227:-.B;gid=chr1:7464:7605:-@chr1:7096:7227:-
chr1    RI      exon    17233   17742   .       -       .       ID=chr1:7469:7605:-@chr1:7096:7231:-.A.withRI;Parent=chr1:7469:7605:-@chr1:7096:7231:-.A;gid=chr1:7469:7605:-@chr1:7096:7231:-
chr1    RI      exon    17233   17368   .       -       .       ID=chr1:7469:7605:-@chr1:7096:7231:-.B.dn;Parent=chr1:7469:7605:-@chr1:7096:7231:-.B;gid=chr1:7469:7605:-@chr1:7096:7231:-
chr1    RI      exon    17606   17742   .       -       .       ID=chr1:7469:7605:-@chr1:7096:7231:-.B.up;Parent=chr1:7469:7605:-@chr1:7096:7231:-.B;gid=chr1:7469:7605:-@chr1:7096:7231:-
chr1    RI      exon    17606   18061   .       -       .       ID=chr1:7778:7924:-@chr1:7469:7605:-.A.withRI;Parent=chr1:7778:7924:-@chr1:7469:7605:-.A;gid=chr1:7778:7924:-@chr1:7469:7605:-
chr1    RI      exon    17606   17742   .       -       .       ID=chr1:7778:7924:-@chr1:7469:7605:-.B.dn;Parent=chr1:7778:7924:-@chr1:7469:7605:-.B;gid=chr1:7778:7924:-@chr1:7469:7605:-
chr1    RI      exon    17915   18061   .       -       .       ID=chr1:7778:7924:-@chr1:7469:7605:-.B.up;Parent=chr1:7778:7924:-@chr1:7469:7605:-.B;gid=chr1:7778:7924:-@chr1:7469:7605:-
chr1    RI      exon    14407   16765   .       -       .       ID=chr1:4833:6628:-@chr1:4270:4692:-.A.withRI;Parent=chr1:4833:6628:-@chr1:4270:4692:-.A;gid=chr1:4833:6628:-@chr1:4270:4692:-

And TandemUTR.hg19.gff3:

(obot_virtualenv)[obotvinnik@tscc-login2 prettyplotlib]$ grep exon ~/genomes/miso_annotations/hg19/TandemUTR.hg19.gff3 | head
chr19   TandemUTR       exon    10663759        10664625        .       -       .       ID=chr19:10525223:10525625:-@chr19:10524759:10525222:-.A.coreAndExt;Parent=chr19:10525223:10525625:-@chr19:10524759:10525222:-.A;gid=chr19:10525223:10525625:-@chr19:10524759:10525222:-
chr19   TandemUTR       exon    10664223        10664625        .       -       .       ID=chr19:10525223:10525625:-@chr19:10524759:10525222:-.B.core;Parent=chr19:10525223:10525625:-@chr19:10524759:10525222:-.B;gid=chr19:10525223:10525625:-@chr19:10524759:10525222:-
chr1    TandemUTR       exon    227918925       227920091       .       -       .       ID=chr1:225986625:225986714:-@chr1:225985548:225986624:-.A.coreAndExt;Parent=chr1:225986625:225986714:-@chr1:225985548:225986624:-.A;gid=chr1:225986625:225986714:-@chr1:225985548:225986624:-
chr1    TandemUTR       exon    227920002       227920091       .       -       .       ID=chr1:225986625:225986714:-@chr1:225985548:225986624:-.B.core;Parent=chr1:225986625:225986714:-@chr1:225985548:225986624:-.B;gid=chr1:225986625:225986714:-@chr1:225985548:225986624:-
chr7    TandemUTR       exon    89861938        89866931        .       +       .       ID=chr7:89699874:89703121:+@chr7:89703122:89704867:+.A.coreAndExt;Parent=chr7:89699874:89703121:+@chr7:89703122:89704867:+.A;gid=chr7:89699874:89703121:+@chr7:89703122:89704867:+
chr7    TandemUTR       exon    89861938        89865185        .       +       .       ID=chr7:89699874:89703121:+@chr7:89703122:89704867:+.B.core;Parent=chr7:89699874:89703121:+@chr7:89703122:89704867:+.B;gid=chr7:89699874:89703121:+@chr7:89703122:89704867:+
chr3    TandemUTR       exon    111710568       111712215       .       +       .       ID=chr3:113193258:113193388:+@chr3:113193389:113194905:+.A.coreAndExt;Parent=chr3:113193258:113193388:+@chr3:113193389:113194905:+.A;gid=chr3:113193258:113193388:+@chr3:113193389:113194905:+
chr3    TandemUTR       exon    111710568       111710698       .       +       .       ID=chr3:113193258:113193388:+@chr3:113193389:113194905:+.B.core;Parent=chr3:113193258:113193388:+@chr3:113193389:113194905:+.B;gid=chr3:113193258:113193388:+@chr3:113193389:113194905:+
chr20   TandemUTR       exon    6055491 6057818 .       -       .       ID=chr20:6004667:6005818:-@chr20:6003491:6004666:-.A.coreAndExt;Parent=chr20:6004667:6005818:-@chr20:6003491:6004666:-.A;gid=chr20:6004667:6005818:-@chr20:6003491:6004666:-
chr20   TandemUTR       exon    6056667 6057818 .       -       .       ID=chr20:6004667:6005818:-@chr20:6003491:6004666:-.B.core;Parent=chr20:6004667:6005818:-@chr20:6003491:6004666:-.B;gid=chr20:6004667:6005818:-@chr20:6003491:6004666:-

yarden commented 10 years ago

Hi Olga, thank you for the report, and sorry for the much delayed reply. The short answer is that the older version of the annotations (generated in Wang et al. (2008), labeled 'Ver 1' on the MISO site) was made for older genomes, like mm9 and hg18. These were converted to hg19 by liftOver, but their old names (in the IDs) were kept. I completely agree that this is very confusing, so I'll fix it and upload a new version of the annotations. This bug should not occur in annotations that were made using the hg19 genome to start with, labeled 'Ver 2' on the MISO website, since these did not involve liftOver.

Best, --Yarden
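For illustration only, here is a rough sketch of how an SE ID could be rebuilt from the lifted exon coordinates. It is not the actual fix to the annotations: `rebuild_se_id` is a hypothetical helper, and the upstream@skipped@downstream order is assumed from the existing IDs. In the first SE event quoted in the report, every lifted coordinate is exactly 10137 larger than the matching number in the ID, which fits IDs that kept coordinates from an older assembly.

```python
def rebuild_se_id(chrom, strand, up, se, dn):
    """Build an SE event ID of the form up@se@dn from (start, end) pairs."""
    parts = (f"{chrom}:{start}:{end}:{strand}" for start, end in (up, se, dn))
    return "@".join(parts)

# Coordinates taken from the start/stop columns of the first SE event above.
new_id = rebuild_se_id("chr1", "-",
                       up=(17915, 18061), se=(17233, 17742), dn=(16854, 17055))
print(new_id)  # chr1:17915:18061:-@chr1:17233:17742:-@chr1:16854:17055:-
```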

yarden commented 10 years ago

By the way, there's a notice describing this on the annotations page:

[image: warning]