Hello, first let me say I think transvar is great, and we are planning to use it in PomBase, the model organism database for fission yeast.
I think there is a bug / inconsistency that arises like this. Let's say we have a feature in a GTF file (real example from pombe genome), that has lowercase in the gene_id and does not have a gene name:
Then run the code that gives the error (transvar_main_script below is bin/transvar from which I import the parser functions)
from transvar.annodb import AnnoDB
from transvar.config import read_config
import argparse
from transvar_main_script import parser_add_annotation, parser_add_mutation, parser_add_general
from transvar.anno import main_anno
from functools import partial
parser = argparse.ArgumentParser(description=__doc__)
subparsers = parser.add_subparsers()
p = subparsers.add_parser("panno", help='annotate protein element')
parser_add_annotation(p)
parser_add_mutation(p)
parser_add_general(p)
p.set_defaults(func=partial(main_anno, at='p'))
variant_type = 'panno'
variant_description = 'SPBC1198.04c:p.N3A'
args = parser.parse_args([variant_type, '-i', variant_description, '--ensembl', 'data/pombe_genome.gtf.transvardb', '--reference', 'data/pombe_genome.fa', '-v', '2'])
db = AnnoDB(args, read_config())
print('the gene is found in the db: ',list(db.get_gene('SPBC1198.04c')))
print('the gene is not found in the panno call')
args.func(args)
Hello, first let me say I think transvar is great, and we are planning to use it in PomBase, the model organism database for fission yeast.
I think there is a bug / inconsistency that arises like this. Let's say we have a feature in a GTF file (real example from pombe genome), that has lowercase in the
gene_id
and does not have agene name
:I can create a database and retreive the gene:
However, because of this:
https://github.com/zwdzwd/transvar/blob/f7c17a8def902c00c403066fce791c3ad4eeb355/transvar/anno.py#L193-L195
Or this:
https://github.com/zwdzwd/transvar/blob/f7c17a8def902c00c403066fce791c3ad4eeb355/transvar/anno.py#L153-L155
It's impossible to use it for a protein annotation such as
SPBC1198.04c:p.N3A
, because theq.tok
is made uppercase and it does not exist indb
.A possible fix for this particular case would be to do
gene_id.upper()
here, although I am not sure it would break something else.https://github.com/zwdzwd/transvar/blob/f7c17a8def902c00c403066fce791c3ad4eeb355/transvar/localdb.py#L574-L577
Minimal example to show the problem is below, and the files to reproduce:
data/pombe_genome.fa
: https://curation.pombase.org/dumps/latest_build/fasta/chromosomes/Schizosaccharomyces_pombe_all_chromosomes.fa.gz -o data/pombe_genome.fa.gzdata/pombe_genome.gtf
: pombe_genome.gtf.zipThen, to set up the transvar config:
Then run the code that gives the error (
transvar_main_script
below is bin/transvar from which I import the parser functions)