zellerlab / GECCO

GEne Cluster prediction with COnditional random fields.
https://gecco.embl.de
GNU General Public License v3.0
60 stars 7 forks source link

CDS mode reports wrong coordinate end #9

Closed xvazquezc closed 2 years ago

xvazquezc commented 2 years ago

Hi there,

I've been running GECCO on a few annotated genomes with --cds-feature CDS and sideloaded them in antiSMASH without problems. However, there is one genome with multiple CDS in several genes so I had to add the --locus-tag protein_id so GECCO doesn't crash with something like this:

$ gecco run -g genomic.gbk -o gecco_test --antismash-sideload --cds-feature CDS  -j $NCPUS
x An unexpected error occurred. Consider opening a new issue on the bug tracker ( https://github.com/zellerlab/GECCO/issues/new ) if it persists, including the traceback below:
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/z3382651/miniconda3/envs/gecco/lib/python3.9/site-packages/gecco/cli/commands/_main.py:153 │
│                                                                                                  │
│   150 │   │   │   │   subcmd.quiet = self.quiet                                                  │
│   151 │   │   │   │   subcmd.progress.disable = self.quiet > 0                                   │
│   152 │   │   │   # run the subcommand                                                           │
│ ❱ 153 │   │   │   return subcmd.execute(ctx)                                                     │
│   154 │   │   except CommandExit as sysexit:                                                     │
│   155 │   │   │   return sysexit.code                                                            │
│   156 │   │   except KeyboardInterrupt:                                                          │
│                                                                                                  │
│ /home/z3382651/miniconda3/envs/gecco/lib/python3.9/site-packages/gecco/cli/commands/run.py:332 i │
│                                                                                                  │
│   329 │   │   │   self._make_output_directory(extensions)                                        │
│   330 │   │   │   # load sequences and extract genes                                             │
│   331 │   │   │   sequences = self._load_sequences()                                             │
│ ❱ 332 │   │   │   genes = self._extract_genes(sequences)                                         │
│   333 │   │   │   if genes:                                                                      │
│   334 │   │   │   │   self.success("Found", "a total of", len(genes), "genes", level=1)          │
│   335 │   │   │   else:                                                                          │
│                                                                                                  │
│ /home/z3382651/miniconda3/envs/gecco/lib/python3.9/site-packages/gecco/cli/commands/annotate.py: │
│                                                                                                  │
│   233 │   │   │   self.success("Found", found, "genes in record", repr(record.id), level=2)      │
│   234 │   │   │   self.progress.update(task, advance=1)                                          │
│   235 │   │                                                                                      │
│ ❱ 236 │   │   return list(orf_finder.find_genes(sequences, progress=callback))                   │
│   237 │                                                                                          │
│   238 │   def _annotate_domains(self, genes: List["Gene"], whitelist: Optional[Container[str]] = │
│   239 │   │   from ...hmmer import PyHMMER, embedded_hmms                                        │
│                                                                                                  │
│ /home/z3382651/miniconda3/envs/gecco/lib/python3.9/site-packages/gecco/orf.py:187 in find_genes  │
│                                                                                                  │
│   184 │   │   │   │   │   protein = Protein(id=f"{record.id}_{i+1}", seq=prot_seq)               │
│   185 │   │   │   │   # check IDs are unique                                                     │
│   186 │   │   │   │   if protein.id in ids:                                                      │
│ ❱ 187 │   │   │   │   │   raise ValueError(f"Duplicate gene identifier found in {record.id!r}: { │
│   188 │   │   │   │   ids.add(protein.id)                                                        │
│   189 │   │   │   │   # fix coordinates (using 1-based, leftmost start in `Gene`, no STOP codon) │
│   190 │   │   │   │   if feature.location.strand == -1:                                          │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: Duplicate gene identifier found in 'QVQW01000001.1': 'DL546_001492'

Once I add --locus-tag protein_id, it runs without problems. However, it made antiSMASH crash while sideloading:

ERROR    10/06 06:01:10   sideloaded area contains no complete CDS features in QVQW01000070.1: Subregion(GECCO, 45922-53301, RiPP)

I took a look at the GECCO output and those are indeed the coordinates reported in the *.sideload.json and the *.features.tsv. I then inspected the source gbk file and I found out that the coordinates for the reported proteins are actually 3 nt before the actual protein coordinates, i.e. 53301 instead of 53304.

These are the features reported in the range that antiSMASH complains:

QVQW01000070.1  RKU41646.1      45922   48397   -       PF00172 Pfam    2.1697438388095284e-07  7.84433781203734e-11    17      50      0.9998662132334699
QVQW01000070.1  RKU41647.1      49277   53301   +       PF00664 Pfam    8.584690299391551e-43   3.1036479751957886e-46  12      285     0.9990239121540062
QVQW01000070.1  RKU41647.1      49277   53301   +       PF00005 Pfam    1.019473891851838e-32   3.685733520794786e-36   356     514     0.9990239121540062
QVQW01000070.1  RKU41647.1      49277   53301   +       PF00664 Pfam    4.819150001206918e-48   1.7422812730321468e-51  653     922     0.9990239121540062
QVQW01000070.1  RKU41647.1      49277   53301   +       PF00005 Pfam    1.9178124750314499e-31  6.933523047836044e-35   994     1151    0.9990239121540062
QVQW01000070.1  RKU41651.1      49557   53301   +       PF00664 Pfam    9.83159809886755e-32    3.5544461673418474e-35  8       210     0.9980173661507098
QVQW01000070.1  RKU41651.1      49557   53301   +       PF00005 Pfam    9.31573975313726e-33    3.3679464038818724e-36  281     439     0.9980173661507098
QVQW01000070.1  RKU41651.1      49557   53301   +       PF00664 Pfam    4.3291279121459376e-48  1.565122166357895e-51   578     847     0.9980173661507098
QVQW01000070.1  RKU41651.1      49557   53301   +       PF00005 Pfam    1.7526661696993695e-31  6.336464821762001e-35   919     1076    0.9980173661507098
QVQW01000070.1  RKU41650.1      49930   53301   +       PF00664 Pfam    2.573268801701548e-08   9.303213310562356e-12   6       107     0.9965288403076623
QVQW01000070.1  RKU41650.1      49930   53301   +       PF00005 Pfam    8.123409368073151e-33   2.9368797426150224e-36  178     336     0.9965288403076623
QVQW01000070.1  RKU41650.1      49930   53301   +       PF00664 Pfam    3.673538032748807e-48   1.3281048563806245e-51  475     744     0.9965288403076623
QVQW01000070.1  RKU41650.1      49930   53301   +       PF00005 Pfam    1.5286475222431794e-31  5.526563710206723e-35   816     973     0.9965288403076623
QVQW01000070.1  RKU41648.1      50343   53301   +       PF00005 Pfam    6.760321537712605e-33   2.4440786470399873e-36  58      216     0.9943062519881032
QVQW01000070.1  RKU41648.1      50343   53301   +       PF00664 Pfam    2.9386520726827317e-48  1.0624194044406115e-51  355     624     0.9943062519881032
QVQW01000070.1  RKU41648.1      50343   53301   +       PF00005 Pfam    1.272544564498185e-31   4.600667261381724e-35   696     853     0.9943062519881032
QVQW01000070.1  RKU41649.1      50880   53301   +       PF00005 Pfam    2.1871313608716245e-11  7.90719942469857e-15    6       55      0.9909732046876351
QVQW01000070.1  RKU41649.1      50880   53301   +       PF00664 Pfam    2.017067394657362e-48   7.2923622366499e-52     194     463     0.9909732046876351
QVQW01000070.1  RKU41649.1      50880   53301   +       PF00005 Pfam    9.401288594063467e-32   3.398875124390263e-35   535     692     0.9909732046876351

And these are the corresponding annotations from the source gbk:

     gene            complement(45660..48678)
                     /locus_tag="DL546_002159"
     mRNA            complement(join(45660..46357,46417..47485,47549..48027,
                     48079..48203,48265..48288,48344..48678))
                     /locus_tag="DL546_002159"
                     /product="hypothetical protein"
     CDS             complement(join(45919..46357,46417..47485,47549..48027,
                     48079..48203,48265..48288,48344..48397))
                     /locus_tag="DL546_002159"
                     /codon_start=1
                     /product="hypothetical protein"
                     /protein_id="RKU41646.1"
                     /translation="MEGVGETSDEPHGSLRRACDQCRFRKIRCDKVTGTPCYHCRAAK
                     RECTSTGGQKRKDGRQRVSVSHSYERKIELVGQRLADIEKTLSNLTHLTISLGSSGTL
                     GRMSMTATAPSHASTGMESSTAYVTPSVDEETDETFEGNSSMTAHTVFASDFMEQAVT
                     SPLFNEKLSHDMKKALASLRQMVHLQSRRQAVHESIFVHQKPIFEGGLNQLPLPPTDI
                     VLRVLRETKNAPSETFMRYCVFITIRGFVDYCQRVFFATEEYSPTAFVIVNAGLYFLF
                     QERSLLADGSVNKAYRDLQNLCRDNLETALANLPLLLPPKRETVEALLLGVSLLVTLS
                     CCVFLAQCAQVLYSIDISKFTIAWQLNSAAASMCQALGWHRIQLTETEDTDDTRLASF
                     WFCYMQDKSLALRFGRTSVIRDLEITAPRCFGNMTDLSESWKHITALWIQTGSILGDM
                     YDHLYSPEALARPPEGRIETARRLADRKKQLFHGLEETSAALKQDADLVTAPSVGDAT
                     DSATRSKMVDMTVKSAEVSHLACLTLIYRALPPSPSFPSSFNVECLEAARLAFKRHAE
                     CMELSSDNFFARVGYLRWTLLYGTFSPLIVLFCHVIETSNQDDLQQLANFTASLEPLV
                     SLSLAMEKFYRLSRTLCQVATLYVEVKTQGQDQHDQDTSSISNDFDVYLNQLGFISSG
                     QHNSSGDMAIPGSNAAPDLETWFSGNSYIMGLMEEDLLDFDTHLNPP"
     gene            49277..53608
                     /locus_tag="DL546_002293"
     mRNA            join(49277..50231,50285..50711,50766..50921,50978..51542,
                     51596..53608)
                     /locus_tag="DL546_002293"
                     /product="hypothetical protein"
     mRNA            join(49277..49565,49630..50231,50285..50711,50766..50921,
                     50978..51542,51596..53608)
                     /locus_tag="DL546_002293"
                     /product="hypothetical protein"
     mRNA            join(49277..49455,49511..49565,49630..50711,50766..50921,
                     50978..51542,51596..53608)
                     /locus_tag="DL546_002293"
                     /product="hypothetical protein"
     mRNA            join(49277..49455,49511..49565,49630..50231,50285..50711,
                     50766..50921,50978..51542,51596..53608)
                     /locus_tag="DL546_002293"
                     /product="hypothetical protein"
     CDS             join(49277..49455,49511..49565,49630..50231,50285..50711,
                     50766..50921,50978..51542,51596..53304)
                     /locus_tag="DL546_002293"
                     /codon_start=1
                     /product="hypothetical protein"
                     /protein_id="RKU41647.1"
                     /translation="MYNDTIGWITTVLGIICMVAAGVLLPVMNFVFGKFVTVFNDFII
                     GKKSPEDFRSSINHYTLYFVYLFVAKFVLSYMWTTIVSINAIRLTRSLRIDFLKQTLR
                     QEIPYFDSAEAGSIAGNINRGGNLVNQGISERFGLTVQATTTFFSAFIVAFAVQWKLT
                     LICLSIVAANLIVVTVCVMIDSGIENKLNATWGEADKLAEEVFASIRNVHAFWAYGKL
                     SAKFEGLMQSTRHLAQRKPPIYAILFSVQFFCIYAGYGLAFWQGIRMYHRGEIDQPGG
                     VVTVILAVLLAAQGLTQIAPQIMVVSKAVGAADGLFKTIDRESKIDSLSTRGTTPQDC
                     HGEILLDKVQFAYPSRPSVQVLNGLSLVIPANKTTAIVGASGSGKSTIVSLLERWFEP
                     TSGTITFDGQPIQTLHISWLRINMRLVQQEPVLFSGTVYQNVVYGLSGTPQAELADDI
                     KLRLVEQACMAAFAHDFIEKLPDGYHTEIGERGRMLSGGQKQRLAIARAIISNPRVLL
                     LDEATSALDANAEHVVQQALNHVAAGRTTVVIAHRLSTVRGADNIVVMAKGTIVEQGT
                     HEELMRHGGAYFRLVRAQQLGRDDMGEDAPLHDDAEQPTTAPKTLSANALETNPEQAA
                     VQADIHYNLMKCLAIIIKEQRNLWFPCAIVGLAAVIGGGMYPALAVLFSRVLDAFALT
                     GDAMLKRGDFYALMFFVMALGNLVAYAAMGWMSSLVSQEIARSYRLDIFNNLIRQEMT
                     FFDDADNGTGALVSRLSTEPTAIQELLSSNIALLLTISVNLTSSCVLALAYGWKLGLT
                     LTFGALPPLVAAGYVRIQLESRLDKETASRFANSASVAAEAVSAIRTVASLTMENEVL
                     AKYEDSLRYVTRASAKSLVRTMFWYALSQSISFLSMALGFWYGGRLISFNEYTSQQFY
                     TVFVAVIFSGEAAASLFQYTSSITQAQGAANYVFNLRRQVDKDMRDNYPPRDGHSHSG
                     AAQVECKDLVFSYPRRPGSRVLNEVTLSVQPGQFIAFVGASGCGKTTMISLLERFYEP
                     TGGTILLDGVDSISSHLGQYRRHIALVQQEPVLYQGSLRENIALGIEDLPGGGSSTVT
                     DEDILEACRQANIDTFILSLPDGLSTRCGSQGLQFSGGQRQRIALARALTRKPRLLLL
                     DEATSSLDTESERIVQAALDDAAKGENTKRTTIAVAHRLSTIRNADLIFVFSRGRITE
                     VGRHEDLVRRKGMYHQMFLAQSLDEA"
     CDS             join(49557..49565,49630..50231,50285..50711,50766..50921,
                     50978..51542,51596..53304)
                     /locus_tag="DL546_002293"
                     /codon_start=1
                     /product="hypothetical protein"
                     /protein_id="RKU41651.1"
                     /translation="MWTTIVSINAIRLTRSLRIDFLKQTLRQEIPYFDSAEAGSIAGN
                     INRGGNLVNQGISERFGLTVQATTTFFSAFIVAFAVQWKLTLICLSIVAANLIVVTVC
                     VMIDSGIENKLNATWGEADKLAEEVFASIRNVHAFWAYGKLSAKFEGLMQSTRHLAQR
                     KPPIYAILFSVQFFCIYAGYGLAFWQGIRMYHRGEIDQPGGVVTVILAVLLAAQGLTQ
                     IAPQIMVVSKAVGAADGLFKTIDRESKIDSLSTRGTTPQDCHGEILLDKVQFAYPSRP
                     SVQVLNGLSLVIPANKTTAIVGASGSGKSTIVSLLERWFEPTSGTITFDGQPIQTLHI
                     SWLRINMRLVQQEPVLFSGTVYQNVVYGLSGTPQAELADDIKLRLVEQACMAAFAHDF
                     IEKLPDGYHTEIGERGRMLSGGQKQRLAIARAIISNPRVLLLDEATSALDANAEHVVQ
                     QALNHVAAGRTTVVIAHRLSTVRGADNIVVMAKGTIVEQGTHEELMRHGGAYFRLVRA
                     QQLGRDDMGEDAPLHDDAEQPTTAPKTLSANALETNPEQAAVQADIHYNLMKCLAIII
                     KEQRNLWFPCAIVGLAAVIGGGMYPALAVLFSRVLDAFALTGDAMLKRGDFYALMFFV
                     MALGNLVAYAAMGWMSSLVSQEIARSYRLDIFNNLIRQEMTFFDDADNGTGALVSRLS
                     TEPTAIQELLSSNIALLLTISVNLTSSCVLALAYGWKLGLTLTFGALPPLVAAGYVRI
                     QLESRLDKETASRFANSASVAAEAVSAIRTVASLTMENEVLAKYEDSLRYVTRASAKS
                     LVRTMFWYALSQSISFLSMALGFWYGGRLISFNEYTSQQFYTVFVAVIFSGEAAASLF
                     QYTSSITQAQGAANYVFNLRRQVDKDMRDNYPPRDGHSHSGAAQVECKDLVFSYPRRP
                     GSRVLNEVTLSVQPGQFIAFVGASGCGKTTMISLLERFYEPTGGTILLDGVDSISSHL
                     GQYRRHIALVQQEPVLYQGSLRENIALGIEDLPGGGSSTVTDEDILEACRQANIDTFI
                     LSLPDGLSTRCGSQGLQFSGGQRQRIALARALTRKPRLLLLDEATSSLDTESERIVQA
                     ALDDAAKGENTKRTTIAVAHRLSTIRNADLIFVFSRGRITEVGRHEDLVRRKGMYHQM
                     FLAQSLDEA"
     CDS             join(49930..50231,50285..50711,50766..50921,50978..51542,
                     51596..53304)
                     /locus_tag="DL546_002293"
                     /codon_start=1
                     /product="hypothetical protein"
                     /protein_id="RKU41650.1"
                     /translation="MIDSGIENKLNATWGEADKLAEEVFASIRNVHAFWAYGKLSAKF
                     EGLMQSTRHLAQRKPPIYAILFSVQFFCIYAGYGLAFWQGIRMYHRGEIDQPGGVVTV
                     ILAVLLAAQGLTQIAPQIMVVSKAVGAADGLFKTIDRESKIDSLSTRGTTPQDCHGEI
                     LLDKVQFAYPSRPSVQVLNGLSLVIPANKTTAIVGASGSGKSTIVSLLERWFEPTSGT
                     ITFDGQPIQTLHISWLRINMRLVQQEPVLFSGTVYQNVVYGLSGTPQAELADDIKLRL
                     VEQACMAAFAHDFIEKLPDGYHTEIGERGRMLSGGQKQRLAIARAIISNPRVLLLDEA
                     TSALDANAEHVVQQALNHVAAGRTTVVIAHRLSTVRGADNIVVMAKGTIVEQGTHEEL
                     MRHGGAYFRLVRAQQLGRDDMGEDAPLHDDAEQPTTAPKTLSANALETNPEQAAVQAD
                     IHYNLMKCLAIIIKEQRNLWFPCAIVGLAAVIGGGMYPALAVLFSRVLDAFALTGDAM
                     LKRGDFYALMFFVMALGNLVAYAAMGWMSSLVSQEIARSYRLDIFNNLIRQEMTFFDD
                     ADNGTGALVSRLSTEPTAIQELLSSNIALLLTISVNLTSSCVLALAYGWKLGLTLTFG
                     ALPPLVAAGYVRIQLESRLDKETASRFANSASVAAEAVSAIRTVASLTMENEVLAKYE
                     DSLRYVTRASAKSLVRTMFWYALSQSISFLSMALGFWYGGRLISFNEYTSQQFYTVFV
                     AVIFSGEAAASLFQYTSSITQAQGAANYVFNLRRQVDKDMRDNYPPRDGHSHSGAAQV
                     ECKDLVFSYPRRPGSRVLNEVTLSVQPGQFIAFVGASGCGKTTMISLLERFYEPTGGT
                     ILLDGVDSISSHLGQYRRHIALVQQEPVLYQGSLRENIALGIEDLPGGGSSTVTDEDI
                     LEACRQANIDTFILSLPDGLSTRCGSQGLQFSGGQRQRIALARALTRKPRLLLLDEAT
                     SSLDTESERIVQAALDDAAKGENTKRTTIAVAHRLSTIRNADLIFVFSRGRITEVGRH
                     EDLVRRKGMYHQMFLAQSLDEA"

My guess is that it is an issue with L195 in orf.py, which assumes the presence of stop codon and removes it.

Regards

althonos commented 2 years ago

Yes, absolutely, I don't think it should matter that much whether a STOP codon is there or not, given that Prodigal/pyrodigal is going to emit one in the default ORF finder. I've made a patch for the next release so that GECCO trusts the GenBank annotations and doesn't try to fiddle that much with them.

The --locus-tag flag is doing exactly its job though, by default it's using locus_tag qualifiers to extract the genes, so if you have several CDS in a gene then you're bound to have a problem. However I'd rather detect it earlier while loading the data than having duplicate IDs be the cause of random bugs later in the pipeline :yum: