Open bitsyamagu opened 5 years ago
Not a transvar author, but chiming in anyways as it's an easy one for me:
At codon 175/176 of NM_001164277.1 there is a difference in the transcript mapping between UCSC and NCBI. You can see an insertion in the genome relative to the transcript. After this position, the numbering of the codons changes between NCBI and UCSC. So this may possibly be due to transvar using the UCSC Refseq mapping and the other tools using the the NCBI mapping. I don't know which mapping is correct or what's exactly happening at the basepair level, it may be interesting for you to investigate the details manually, but all variants after codon 175 will show this -1 difference between different tools.
Here is a UCSC Session link that shows the various tracks (mostly the "NCBI refseq > UCSC Refseq" subtrack) and zoomed around this position: http://genome.ucsc.edu/s/Max/variantPosDiff%2Dtransvar
To the transvar authors: UCSC provides the NCBI mapping in the identical format, so it should be very easy to change the RefSeq/genome mapping in Transvar. The NCBI mapping is the "ncbiRefseq" track, the UCSC mapping is the "refGene" track. Both text files are available on hgdownload.soe.ucsc.edu. As I said, I don't know which one is "correct", but the NCBI mapping is more consistent with other tools.
Thank you for your detailed explanation. I've got the circumstance.
Now I'm trying to replace the original refGene by NCBI mapping on a local copy of TransVar which I installed to an local server. But it doesn't work yet.
You can tell me which table you downloaded and from where and/or tell me if there is any difference in the file format. I believe the file format is exactly identical but of course I could be wrong.
On Wed, May 8, 2019 at 12:36 PM bitsyamagu notifications@github.com wrote:
Thank you for your detailed explanation. I've got the circumstance.
Now I'm trying to replace the original refGene by NCBI mapping on a local copy of TransVar which I installed to an local server. But it doesn't work yet.
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/zwdzwd/transvar/issues/34#issuecomment-490436712, or mute the thread https://github.com/notifications/unsubscribe-auth/AACL4TJJKZU6WYJ4TRV6CVLPUKUJ3ANCNFSM4HLFIMTQ .
Actually, I was wrong, the screenshot I sent you shows that this is not a problem of NCBI vs. UCSC mapping but rather a difference between UCSC genes and RefSeq genes. Not sure what really happened here, it seems like the transvar authors should comment on this case since only they know how the original mapping was made and how they count the nucleotides in deletion cases like this that result in a micro-intron with a split codon over it.
On Wed, May 8, 2019 at 12:36 PM bitsyamagu notifications@github.com wrote:
Thank you for your detailed explanation. I've got the circumstance.
Now I'm trying to replace the original refGene by NCBI mapping on a local copy of TransVar which I installed to an local server. But it doesn't work yet.
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/zwdzwd/transvar/issues/34#issuecomment-490436712, or mute the thread https://github.com/notifications/unsubscribe-auth/AACL4TJJKZU6WYJ4TRV6CVLPUKUJ3ANCNFSM4HLFIMTQ .
Sorry for being silent until now. I have taken a look at this case. It seems this is brought about by the Patch 13 of GRCh37/hg19 (happened in 2018), which includes a deletion in the reference genome upstream the SLC37A4 locus. See the following for the update info and screenshot
In fact if you look at the original RefSeq annotation gtf, there is a Note that seems to pointing to this discrepancy.
The problem is that there is no specification of the alignment to the patch included in RefSeq gene. I would suggest you either switch to hg38 or use Ensembl annotation which treats the gap as an intron. I am not sure how from the transvar perspective whether we can completely solve this problem since the refseq gene doesn't include this particular gap and patches aren't meant to be applied (https://www.biostars.org/p/45788/). But certainly let me know if you have better suggestion.
In short, this is an extremely rare situation where the genome reference is patched to introduce a deletion in the exon sequence of a gene causing the gene to "frameshift". I have not previously encountered a situation like this before.
There are two errors that many software in this space gets wrong.
Packages incorrectly assume that exon genomic coordinates from NCBI, UCSC, and Ensembl are identical. NCBI aligns transcripts with splign, UCSC realigns with blat, Ensembl by direct genome annotation. Exon coordinates by these methods agree in most cases, but not all. When NCBI and UCSC don't agree, the most common case is the shuffling of a few nucleotides from one side of an intron to the other (thereby preserving the CDS), but I've also seen much larger moves. In 2014 at least, some paralogs were swapped. I summarized a bunch of these differences in 2014 here: https://www.slideshare.net/reecehart/hvp-2014-clinical-significance-of-transcript-alignment-discrepancies. There is good reason to believe that coordinates from splign are much better than those from blat. The problem is that these alignments are like optical lenses: If someone publishes a variant in one of these transcripts using blat alignments, then only blat alignments are guaranteed to project it back to the same genomic location.
Packages incorrectly assume that exons align perfectly or contain only substitution variants. In fact, there are many transcript alignments that contain indels due to polymorphisms (nearly all modulo 3) and sequencing errors (less frequently). Packages that fail to account for these indels when projecting variants from one sequence to another will result in locations that are off by a corresponding amount. See Garla V et al., Bioinformatics 2011. NCBI's tools, Alamut, GeneInsight, and the Python hgvs package deal with genome-transcript indels. I think the Allele Registry also handles such cases, but I've not verified. I know for certain that Mutalyzer (as of last October) and snpEff (as of several years ago) do NOT handle indels correctly. For details about issues with Mutalyzer, see Table 5 in our recent paper Wang, M. et al. Hum. Mutat. (2018). doi:10.1002/humu.23615.
You might be interested in UTA (Universal Transcript Archive), which contains transcripts from NCBI, UCSC, and Ensembl, and their alignments to multiple assemblies using coordinates from the source databases (I don't realign them myself). A public instance of UTA is available at uta.biocommons.org:5432 (postgresql) and as a docker image. Start here: https://hgvs.readthedocs.io/en/stable/installation.html#local-installation-of-uta. UTA is the only comprehensive source of this information that I know of (and it's about 9 months out of date :-( ).
For this specific case, UTA shows the following exon coordinate differences:
anonymous@uta/uta=> select * from tx_exon_aln_v where tx_ac='NM_001164277.1' and alt_ac='NC_000011.9' and alt_aln_method in ('blat','splign') order by ord;
┌─────────┬────────────────┬─────────────┬────────────────┬────────────┬─────┬────────────┬──────────┬─────────────┬───────────┬────────────────────────────────────────────────────────────────
│ hgnc │ tx_ac │ alt_ac │ alt_aln_method │ alt_strand │ ord │ tx_start_i │ tx_end_i │ alt_start_i │ alt_end_i │
├─────────┼────────────────┼─────────────┼────────────────┼────────────┼─────┼────────────┼──────────┼─────────────┼───────────┼────────────────────────────────────────────────────────────────
│ SLC37A4 │ NM_001164277.1 │ NC_000011.9 │ splign │ -1 │ 0 │ 0 │ 58 │ 118901558 │ 118901616 │ 58=
│ SLC37A4 │ NM_001164277.1 │ NC_000011.9 │ blat │ -1 │ 0 │ 0 │ 58 │ 118901558 │ 118901616 │ 58=
│ SLC37A4 │ NM_001164277.1 │ NC_000011.9 │ splign │ -1 │ 1 │ 58 │ 562 │ 118900941 │ 118901445 │ 504=
│ SLC37A4 │ NM_001164277.1 │ NC_000011.9 │ blat │ -1 │ 1 │ 58 │ 562 │ 118900941 │ 118901445 │ 504=
│ SLC37A4 │ NM_001164277.1 │ NC_000011.9 │ splign │ -1 │ 2 │ 562 │ 905 │ 118899931 │ 118900274 │ 343=
│ SLC37A4 │ NM_001164277.1 │ NC_000011.9 │ blat │ -1 │ 2 │ 562 │ 905 │ 118899931 │ 118900274 │ 343=
│ SLC37A4 │ NM_001164277.1 │ NC_000011.9 │ splign │ -1 │ 3 │ 905 │ 1138 │ 118898903 │ 118899136 │ 233=
│ SLC37A4 │ NM_001164277.1 │ NC_000011.9 │ blat │ -1 │ 3 │ 905 │ 1138 │ 118898903 │ 118899136 │ 233=
│ SLC37A4 │ NM_001164277.1 │ NC_000011.9 │ splign │ -1 │ 4 │ 1138 │ 1382 │ 118898337 │ 118898582 │ 145=1I99=
│ SLC37A4 │ NM_001164277.1 │ NC_000011.9 │ blat │ -1 │ 4 │ 1138 │ 1382 │ 118898436 │ 118898582 │ 145=98D1=
│ SLC37A4 │ NM_001164277.1 │ NC_000011.9 │ splign │ -1 │ 5 │ 1382 │ 1541 │ 118897646 │ 118897805 │ 159=
│ SLC37A4 │ NM_001164277.1 │ NC_000011.9 │ blat │ -1 │ 5 │ 1382 │ 1541 │ 118898337 │ 118898435 │ 6D2=1D1X2=2X1=3D1X1=1D1X3=1X1=5D2=4D4=3D2=2D2=6D5=1I5=2I2=2D1X4
│ SLC37A4 │ NM_001164277.1 │ NC_000011.9 │ splign │ -1 │ 6 │ 1541 │ 1627 │ 118897312 │ 118897398 │ 86=
│ SLC37A4 │ NM_001164277.1 │ NC_000011.9 │ blat │ -1 │ 6 │ 1541 │ 1627 │ 118897646 │ 118897805 │ 3D6=2I1=2I1X1=1X3=1X1=2I5=2I1=1X2=1X2=2I1=2X1=3I3=4I3=1D2=2X2=1
│ SLC37A4 │ NM_001164277.1 │ NC_000011.9 │ splign │ -1 │ 7 │ 1627 │ 1741 │ 118896676 │ 118896790 │ 114=
│ SLC37A4 │ NM_001164277.1 │ NC_000011.9 │ blat │ -1 │ 7 │ 1627 │ 1741 │ 118897312 │ 118897398 │ 1=2X1=2D1=2D2=3D4=2X2=3D1X1=1D3=1X3=1D2X4=2X1=2X1=1X2=2X3=1D1X4
│ SLC37A4 │ NM_001164277.1 │ NC_000011.9 │ splign │ -1 │ 8 │ 1741 │ 1880 │ 118895900 │ 118896039 │ 139=
│ SLC37A4 │ NM_001164277.1 │ NC_000011.9 │ blat │ -1 │ 8 │ 1741 │ 1880 │ 118896676 │ 118896790 │ 1D1X1=1X3=2D2=1D2=4D2X1=1X1=2D2=2D1X2=3X1=1D5=1I1X4=1D2=2X3=2D5
│ SLC37A4 │ NM_001164277.1 │ NC_000011.9 │ splign │ -1 │ 9 │ 1880 │ 2606 │ 118895060 │ 118895786 │ 726=
│ SLC37A4 │ NM_001164277.1 │ NC_000011.9 │ blat │ -1 │ 9 │ 1880 │ 2606 │ 118895900 │ 118896039 │ 4D1=5D5=1X1=7D3=3D1=7D1=8D2=3D4=1D1=3D2=1D2=5D2=10D5=1D1=13D1=3
└─────────┴────────────────┴─────────────┴────────────────┴────────────┴─────┴────────────┴──────────┴─────────────┴───────────┴────────────────────────────────────────────────────────────────
ord is the exon ordinal number, 0-based coordinates are interbase (_i) Coordinates shown are those directly from NCBI and UCSC (not realigned by me).
Note that starting in exon 5 (ord 4) that the alignment starts going belly-up. Variants up to n.1283 (c.526, I think) should map consistently (1283=1138+145).
When splign and BLAT disagree on exon coordinates, splign's alignment is nearly always more parsimonious with the genomic sequence than BLAT's and therefore more likely correct. Such is the case above: after the discrepancy in exon 5 above, splign's alignments are more plausible that BLAT's as judged by the cigar strings.
P.S. VariantValidator is a web interface to the hgvs package, which provides most of the algorithmic support. hgvs uses UTA for exon structures.
I don't think the fact that this location was later patched is relevant here, we're looking at the un-patched genome sequence. I think it's enough to say that it's a difference between the Refseq sequence and the genome sequence. I may be wrong, but this is not extremely rare, we found on the order of hundreds of transcripts with genomic indels in hg19 (Reece probably knows the exact number by h(e)art... :-)
@maximilianh @reece I agree with you both. I was not aware there are many cases with indel in the alignments. It has just dawned on me that it makes a lot of sense when there are polymorphism in the dna that affects the structure of the transcript. I will take a closer look at VariantValidator/hgvs/UTA. It seems there is a lot that can be learned. Unfortunately with the current implementation of TransVar, the coordinates are inferred from the gtf. Hence if there are unannotated indels (such as in the case of RefSeq), there is yet a way to capture. But (correct me if wrong) Ensembl annotation doesn't seem to be affected by this issue since they explicitly annotate the indels?
Thanks for the insight from you both!
Thanks everyone for your replies and considerations.
I haven't been aware of these kind of cases, so I think our local repository which we are making also has similar problems, and should be checked and be improved.
The gff files should work fine as a source. The cigar string appears at the end of the line only for imperfect matches, like this:
NC_000001.10 RefSeq match /.../ ...;pct_identity_ungap=99.9308;Gap=M5230 D1 M3574
Ensembl isn't affected by this issue, but it has a different challenge: Because transcripts are derived from spans in the genomic sequence, it's harder to represent transcripts in which the genomic sequence contains a sequencing error. It's been a while since I looked at this, but I'm pretty sure that transcripts in NEFL and ALMS1 had this challenge. My information here is stale, so you should consult with Ensembl folks.
Thanks again for the insight @reece. Makes a lot of sense!
When I submitted a mutation: NM_001164277:c.925_928delinsTC
to Transvar with these options: Reverse annotation cDNA, GRCh37, RefSeq
I got "p.G309Sfs*3" and "chr11:g.118896734_118896737delinsGA" for this mutation.
Although I checked this mutation on the UCSC Genome Browser, the position chr11:g.118896734_118896737delinsGA is 308 of
NM_001164277(attached png).
Other similar services, Mutalyzer and VariantVaridator, returns "309" , and it is consistent with the genome browser.