zwdzwd / transvar

TransVar - multiway annotator for precision medicine
Other
115 stars 34 forks source link

Problem in EnsemblDB for gene_ids with lowercase characters and no gene name #59

Open manulera opened 11 months ago

manulera commented 11 months ago

Hello, first let me say I think transvar is great, and we are planning to use it in PomBase, the model organism database for fission yeast.

I think there is a bug / inconsistency that arises like this. Let's say we have a feature in a GTF file (real example from pombe genome), that has lowercase in the gene_id and does not have a gene name:

II  PomBase gene    177469  180628  .   -   .   gene_id "SPBC1198.04c"; gene_biotype "protein_coding";

I can create a database and retreive the gene:

db = AnnoDB(args, read_config())
print('in',list(db.get_gene('SPBC1198.04c')))

However, because of this:

https://github.com/zwdzwd/transvar/blob/f7c17a8def902c00c403066fce791c3ad4eeb355/transvar/anno.py#L193-L195

Or this:

https://github.com/zwdzwd/transvar/blob/f7c17a8def902c00c403066fce791c3ad4eeb355/transvar/anno.py#L153-L155

It's impossible to use it for a protein annotation such as SPBC1198.04c:p.N3A, because the q.tok is made uppercase and it does not exist in db.

A possible fix for this particular case would be to do gene_id.upper() here, although I am not sure it would break something else.

https://github.com/zwdzwd/transvar/blob/f7c17a8def902c00c403066fce791c3ad4eeb355/transvar/localdb.py#L574-L577

Minimal example to show the problem is below, and the files to reproduce:

Then, to set up the transvar config:

# env vars
export TRANSVAR_CFG="$(pwd)/data/transvar.cfg"
export TRANSVAR_DOWNLOAD_DIR="$(pwd)/data/transvar_download"

transvar config -k reference -v "data/pombe_genome.fa" --refversion pombe_genome
transvar index --ensembl data/pombe_genome.gtf --reference data/pombe_genome.fa

Then run the code that gives the error (transvar_main_script below is bin/transvar from which I import the parser functions)

from transvar.annodb import AnnoDB
from transvar.config import read_config
import argparse
from transvar_main_script import parser_add_annotation, parser_add_mutation, parser_add_general
from transvar.anno import main_anno
from functools import partial

parser = argparse.ArgumentParser(description=__doc__)
subparsers = parser.add_subparsers()

p = subparsers.add_parser("panno", help='annotate protein element')
parser_add_annotation(p)
parser_add_mutation(p)
parser_add_general(p)
p.set_defaults(func=partial(main_anno, at='p'))

variant_type = 'panno'
variant_description = 'SPBC1198.04c:p.N3A'
args = parser.parse_args([variant_type, '-i', variant_description, '--ensembl', 'data/pombe_genome.gtf.transvardb', '--reference', 'data/pombe_genome.fa', '-v', '2'])

db = AnnoDB(args, read_config())

print('the gene is found in the db: ',list(db.get_gene('SPBC1198.04c')))

print('the gene is not found in the panno call')
args.func(args)
manulera commented 11 months ago

That solution is not great though, because then when printing the line, it will print the gene_id in uppercase, which is not ideal.

Instead, one could make get_gene try with and without uppercase?