yatisht / usher

Ultrafast Sample Placement on Existing Trees
MIT License
121 stars 41 forks source link

Inconsistent usage of ORF1ab vs ORF1a/b in Usher's Auspice visualization #174

Open corneliusroemer opened 2 years ago

corneliusroemer commented 2 years ago

I love Usher, it's so useful. I'm not sure whether this is the right repo to open this issue, so please refer me on if this is not the right one.

One usability issue I have with Usher is that I can't colour by ORF1ab mutations, because you label your AA muts with ORF1ab whereas the metadata that you pass to Auspice in the auspice.json seems to expect ORF1a / ORF1b.

A quickfix would be to adjust auspice.json's metadata so that colour by ORF1ab works

A more wide ranging change (that would help me, but may not be desired) would be to change from labelling ORF1ab to ORF1a and ORF1b. All the software I use regularly: Nextstrain, Nextclade, covSPECTRUM, outbreak.info uses the ORF1a / ORF1b convention.

I now know how to mentally convert from ORF1ab positions to ORF1b but it is a bit of a headache always subtracting 4401 (luckily it's close to a hundred).

Here you can see the issue visually.

image image

corneliusroemer commented 2 years ago

I dug into your auspice.json and you'd have to change the genome annotation here, add ORF1ab:

"genome_annotations":{
   "E":{
      "end":26472,
      "start":26245,
      "strand":"+",
      "type":"CDS"
   },
   "M":{
      "end":27191,
      "start":26523,
      "strand":"+",
      "type":"CDS"
   },
   "N":{
      "end":29533,
      "start":28274,
      "strand":"+",
      "type":"CDS"
   },
   "ORF1a":{
      "end":13468,
      "start":266,
      "strand":"+",
      "type":"CDS"
   },
   "ORF1b":{
      "end":21555,
      "start":13468,
      "strand":"+",
      "type":"CDS"
   },
   "ORF3a":{
      "end":26220,
      "start":25393,
      "strand":"+",
      "type":"CDS"
   },
   "ORF6":{
      "end":27387,
      "start":27202,
      "strand":"+",
      "type":"CDS"
   },
   "ORF7a":{
      "end":27759,
      "start":27394,
      "strand":"+",
      "type":"CDS"
   },
   "ORF7b":{
      "end":27887,
      "start":27756,
      "strand":"+",
      "type":"CDS"
   },
   "ORF8":{
      "end":28259,
      "start":27894,
      "strand":"+",
      "type":"CDS"
   },
   "ORF9b":{
      "end":28577,
      "start":28284,
      "strand":"+",
      "type":"CDS"
   },
   "S":{
      "end":25384,
      "start":21563,
      "strand":"+",
      "type":"CDS"
   },
   "nuc":{
      "end":29903,
      "start":1,
      "strand":"+",
      "type":"source"
   }
}

Add this:

   "ORF1ab":{
      "end":21555,
      "start":266,
      "strand":"+",
      "type":"CDS"
   }
russcd commented 2 years ago

Thanks @corneliusroemer !

@amkram can you look into this? I think it might be as easy as pulling a different gtf file for translation.

corneliusroemer commented 2 years ago

@russcd Thanks for your quick reply! It depends on what you want to do: switch to Orf1a Orf1b then you don't need to do anything to auspice.json or change auspice.json to do the genotype search by Orf1ab.

AngieHinrichs commented 2 years ago

@corneliusroemer Thanks so much for pointing out that inconsistency! I have changed the JSON produced by the web interface on our test server: https://hgwdev.gi.ucsc.edu/cgi-bin/hgPhyloPlace and the coloring seems to work, please let me know if you have any problems with it. The change won't take effect on the main site (genome.ucsc.edu) until the afternoon (Pacific time) of Oct. 26.

corneliusroemer commented 2 years ago

Thanks! I confirm the coloring works now.

Have you made a conscious decision to use ORF1ab as opposed to the somewhat more conventional ORF1a and ORF1b? At least a/b are used by the projects mentioned above. I don't know of any other than Usher using ab. But then I'm new to the field... Just something I noticed.

On Fri, Oct 15, 2021, 19:27 Angie Hinrichs @.***> wrote:

@corneliusroemer https://github.com/corneliusroemer Thanks so much for pointing out that inconsistency! I have changed the JSON produced by the web interface on our test server: https://hgwdev.gi.ucsc.edu/cgi-bin/hgPhyloPlace and the coloring seems to work, please let me know if you have any problems with it. The change won't take effect on the main site (genome.ucsc.edu) until the afternoon (Pacific time) of Oct. 26.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/yatisht/usher/issues/174#issuecomment-944472088, or unsubscribe https://github.com/notifications/unsubscribe-auth/AF77AQOAEZYTJ4EU72JS7VTUHBP63ANCNFSM5GCDTQPA .

AngieHinrichs commented 2 years ago

Nextstrain uses ORF1a and ORF1b, and Nextstrain is the 900lb gorilla of viral phylogenetics & visualization. :) However, the genes annotated on RefSeq NC_045512.2 include only ORF1ab, and that is what the UCSC Genome Browser and UShER web interface use for gene annotations. I think the reason for the difference is the ribosomal slippage frameshift which the RefSeq record describes like this:

     CDS             join(266..13468,13468..21555)
                     /gene="ORF1ab"
                     /locus_tag="GU280_gp01"
                     /ribosomal_slippage
                     /note="pp1ab; translated by -1 ribosomal frameshift"

(detailed explanation in Bhatt et al.).

Ribosomal slippage throws a monkeywrench into the usual simple translation of nucleotide coordinates to protein coordinates. One way around that difficulty is to define separate transcripts "ORF1a" and "ORF1b".

AngieHinrichs commented 2 years ago

Biologically, as far as I know, there is no "ORF1b" transcript, only ORF1a (when transcription terminates instead of slipping & continuing) or ORF1ab (when slippage happens and transcription continues). In RefSeq NC_045512.2, there are actually two CDS's defined for gene ORF1ab, one with slippage (making "ORF1ab polyprotein" pp1ab) and one without (making "ORF1a polyprotein" pp1a). The mat_peptide stanzas for the nsp* proteins note whether each one is produced by pp1a, pp1ab or both.

corneliusroemer commented 2 years ago

I let you decide! It's just a pain to convert. I created a Stackoverflow Q&A to make it easy for anyone to convert just by googling "How to convert from ORF1ab to ORF1b" ;)

The answer is not 42 but 44001.

https://bioinformatics.stackexchange.com/questions/17868/what-do-i-need-to-subtract-from-sars-cov-2-orf1ab-positions-to-get-orf1b-positio/17869#17869