Open raufs opened 3 months ago
Hey @raufs !
I'm pretty sure this is coming from Pyrodigal since you get the error right after loading sequences, so probably at the time Pyrodigal is executed. There's a chance the CPU feature detection doesn't work properly and causes the wrong platform code to be executed. Would you mind trying to run the Pyrodigal CLI on a test example, and check if you get the error with the latest version? According to my changelog, GECCO v0.9.6 should be using Pyrodigal v2.0.0 while v0.9.10 is using v3.0.0, so it would be helpful if you could confirm the bug is happening on either of those. If that's indeed a Pyrodigal bug I'll transfer the issue there.
Cheers, Martin
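One quick way to confirm which versions are actually installed in a given conda environment is to query the package metadata from Python. This is only a minimal sketch: the distribution names `gecco-tool`, `pyrodigal` and `pyhmmer` are assumptions (adjust them if your installer registered different names), and `installed_versions` is a hypothetical helper, not part of GECCO.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(names=("gecco-tool", "pyrodigal", "pyhmmer")):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not installed in this environment.
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```

Running this once inside each environment makes it easy to paste the exact GECCO/Pyrodigal pairing into the issue.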
Pyrodigal versions appear the same in both conda environments with GECCO v0.9.6 and GECCO v0.9.10. Both environments have pyrodigal v3.5.1.
It seems I only tested running GECCO with a full genome GenBank file as input, and this was the basis of my initial report. However, when testing with genomes in FASTA format as input, we see the reverse scenario: v0.9.10 works as expected, while v0.9.6 fails with the following error message:
(/Users/raufs/Coding_Projects/test_gecco/gecco_0.9.6) Raufs-Mac-mini:input_genomes raufs$ gecco run -g Cutibacterium_avidum_GB_GCA_000477695_1.fasta -o test/
x An unexpected error occurred. Consider opening a new issue on the bug tracker ( https://github.com/zellerlab/GECCO/issues/new ) if it persists, including the traceback below:
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /Users/raufs/Coding_Projects/test_gecco/gecco_0.9.6/lib/python3.12/site-packages/gecco/cli/comma │
│ │
│ 156 │ │ │ │ subcmd.quiet = self.quiet │
│ 157 │ │ │ │ subcmd.progress.disable = self.args["--no-progress-bar"] │
│ 158 │ │ │ # run the subcommand │
│ ❱ 159 │ │ │ return subcmd.execute(ctx) │
│ 160 │ │ except CommandExit as sysexit: │
│ 161 │ │ │ return sysexit.code │
│ 162 │ │ except KeyboardInterrupt: │
│ │
│ /Users/raufs/Coding_Projects/test_gecco/gecco_0.9.6/lib/python3.12/site-packages/gecco/cli/comma │
│ │
│ 254 │ │ │ self._make_output_directory(outputs) │
│ 255 │ │ │ # load sequences and extract genes │
│ 256 │ │ │ sequences = list(self._load_sequences()) │
│ ❱ 257 │ │ │ genes = self._extract_genes(sequences) │
│ 258 │ │ │ if genes: │
│ 259 │ │ │ │ self.success("Found", "a total of", len(genes), "genes", level=1) │
│ 260 │ │ │ else: │
│ │
│ /Users/raufs/Coding_Projects/test_gecco/gecco_0.9.6/lib/python3.12/site-packages/gecco/cli/comma │
│ │
│ 135 │ │ self.info("Extracting", "genes from input sequences", level=1) │
│ 136 │ │ if self.cds_feature is None: │
│ 137 │ │ │ self.info("Using", "Pyrodigal in metagenomic mode", level=2) │
│ ❱ 138 │ │ │ orf_finder: ORFFinder = PyrodigalFinder(metagenome=True, mask=self.mask, cpus= │
│ 139 │ │ else: │
│ 140 │ │ │ self.info("Using", f"record features named {self.cds_feature!r}", level=2) │
│ 141 │ │ │ orf_finder = CDSFinder(feature=self.cds_feature, locus_tag=self.locus_tag) │
│ │
│ /Users/raufs/Coding_Projects/test_gecco/gecco_0.9.6/lib/python3.12/site-packages/gecco/orf.py:72 │
│ │
│ 69 │ │ self.metagenome = metagenome │
│ 70 │ │ self.mask = mask │
│ 71 │ │ self.cpus = cpus │
│ ❱ 72 │ │ self.orf_finder = pyrodigal.OrfFinder(meta=metagenome, mask=mask) │
│ 73 │ │
│ 74 │ def _train(self, records: Iterable[SeqRecord]) -> pyrodigal.TrainingInfo: │
│ 75 │ │ sequences = [] │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
AttributeError: module 'pyrodigal' has no attribute 'OrfFinder'
So perhaps the issue is related to parsing GenBank inputs in v0.9.10. And as far as I have experienced, everything works as expected on Linux setups regardless of versions used.
Hope it's helpful! Rauf
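For what it's worth, the traceback above shows `gecco/orf.py` calling `pyrodigal.OrfFinder`, which the installed Pyrodigal v3.5.1 does not expose. As far as I know the class goes by `GeneFinder` in the v3 series, but treat that as an assumption to check against the Pyrodigal changelog. A minimal sketch of picking whichever name is available (`resolve_finder_class` is a hypothetical helper, not part of GECCO or Pyrodigal):

```python
def resolve_finder_class(module, candidates=("GeneFinder", "OrfFinder")):
    """Return the first attribute of `module` found among `candidates`.

    The candidate names are assumptions: `OrfFinder` is taken from the
    traceback, `GeneFinder` is what newer Pyrodigal releases appear to use.
    """
    for name in candidates:
        cls = getattr(module, name, None)
        if cls is not None:
            return cls
    raise AttributeError(
        f"{getattr(module, '__name__', module)!r} has none of {candidates}"
    )
```

With Pyrodigal installed, this would be called as `resolve_finder_class(pyrodigal)` after `import pyrodigal`. It only papers over the class name; other API differences between major versions would still need handling, so pinning the Pyrodigal version a given GECCO release expects is probably the safer route.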
Here is an example full genome GenBank I used as input for the initial report:
Cutibacterium_avidum_RS_GCF_902375045_1.gbk.gz
I create them myself in lsaBGC-Pan, so perhaps there is something on my end to improve in how they are generated. I will definitely look into it later. They do work as input for GECCO v0.9.6 (on the M2 Mac and also on Linux) and for v0.9.10 (but only on Linux).
Ah sorry -- what I meant was to try running Pyrodigal directly on some genome with your laptop to see if it's the culprit that is crashing GECCO.
The "Illegal Instruction" interrupt happens when the CPU attempts to run, well, an illegal instruction. This usually happens when the CPU tries to run SIMD code it doesn't support, so for instance AVX2 or SSE4.1 on older computers. Since Pyrodigal and PyHMMER both use SIMD, I guess it's one of these two which crash, but I'd probably think Pyrodigal is the culprit because the crash happens when the "Loading sequences" progress bar is done.
Maybe you could just try:
$ python -m pyrodigal -i <some_genome_file.fna>
to check if this works, or if it immediately crashes?
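If it helps to make that check scriptable, the sketch below runs a command in a child process and reports whether it was killed by SIGILL, which is the signal behind the "Illegal instruction" message. It relies on the POSIX behaviour of `subprocess`, where death by signal shows up as a negative return code. `died_from_sigill` is a hypothetical helper name and `some_genome_file.fna` is a placeholder path.

```python
import signal
import subprocess
import sys


def died_from_sigill(cmd):
    """Run `cmd` and report whether the child was killed by SIGILL.

    On POSIX, `subprocess` reports death-by-signal as a negative return
    code equal to minus the signal number.
    """
    proc = subprocess.run(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return proc.returncode == -signal.SIGILL


if __name__ == "__main__":
    # Placeholder input path: replace with a real FASTA file.
    cmd = [sys.executable, "-m", "pyrodigal", "-i", "some_genome_file.fna"]
    if died_from_sigill(cmd):
        print("Child process was killed by SIGILL")
    else:
        print("No SIGILL (the command may still have failed for other reasons)")
```

The same helper can wrap a `gecco run` invocation, so both tools can be tested the same way in each environment.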
I see, so what is odd here is that pyrodigal (v3.5.1) runs great on all systems. I actually use it to create the custom GenBank files to feed into GECCO. So running the pyrodigal command in a conda environment where GECCO reports the "Illegal instruction" works great on the M2 mac.
This is related to the lsaBGC-Pan suite, which is a re-implementation of lsaBGC, and I just get around this by using GECCO v0.9.6 as a dependency, which works great - so there is no rush here!
Maybe of interest to you and the other GECCO co-authors: lsaBGC-Pan can now co-process both antiSMASH and GECCO predictions. Similar to your study, when applying this to a well-sequenced Streptomyces species, we saw a substantial increase in BGC predictions compared to using antiSMASH alone.
Hmm..... Would you mind running GECCO in verbose mode? If Pyrodigal is not the culprit I'm wondering what the problem may be... You can run gecco -vv run instead of gecco run to get the verbose output and not the progress bar.
> Maybe of interest to you and the other GECCO co-authors: lsaBGC-Pan can now co-process both antiSMASH and GECCO predictions. Similar to your study, when applying this to a well-sequenced Streptomyces species, we saw a substantial increase in BGC predictions compared to using antiSMASH alone.
I've seen your tweet about it, that's really exciting!
Sure thing, here is the more detailed output:
(/Users/raufs/Coding_Projects/test_gecco/gecco_0.9.10) Raufs-Mac-mini:Gene_Calling raufs$ gecco -vv run -g Cutibacterium_avidum_GB_GCA_000413335_1.gbk
2024-08-29 12:15:04 Raufs-Mac-mini.local gecco[3348] INFO Using output folder '.'
2024-08-29 12:15:04 Raufs-Mac-mini.local gecco[3348] INFO Detecting sequence format from file contents
2024-08-29 12:15:04 Raufs-Mac-mini.local gecco[3348] OK Detected format of input as 'genbank'
2024-08-29 12:15:04 Raufs-Mac-mini.local gecco[3348] INFO Loading sequences from genomic file 'Cutibacterium_avidum_GB_GCA_000413335_1.gbk'
2024-08-29 12:15:04 Raufs-Mac-mini.local gecco[3348] OK Found 2 sequences
✔ Loading sequences ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4.7/4.7 MiB 100% 0:00:00 0:00:00Illegal instruction: 4
Hi Martin,
Hope all is well! I just noticed on my Mac setup that versions v0.9.8 and v0.9.10 report the following and exit:
This error does not occur on Linux systems with these versions. Reverting to v0.9.6 appears to work as expected on the Mac. I tried running with the verbose flag but it just gave the same error message.
The Mac has an M2 chip if that helps. Installation was via conda. Happy to help with testing or share additional info.
Kind regards, Rauf
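Since the crash is specific to the M2 Mac, it may also be worth attaching the architecture that the Python interpreter in each conda environment actually runs as. The sketch below is only a diagnostic aid with a hypothetical helper name (`describe_runtime`). The idea that an `x86_64` environment running under Rosetta 2 could be involved is a guess worth ruling out, not something established in this thread.

```python
import platform
import subprocess


def describe_runtime():
    """Collect the platform details that are useful in a bug report."""
    info = {
        "system": platform.system(),
        "machine": platform.machine(),
        "python": platform.python_version(),
        "rosetta": None,  # stays None when it cannot be determined
    }
    if info["system"] == "Darwin":
        # sysctl.proc_translated is 1 for a process translated by Rosetta 2
        # and 0 for a native one; the key is missing on older macOS releases.
        out = subprocess.run(
            ["sysctl", "-n", "sysctl.proc_translated"],
            capture_output=True,
            text=True,
        )
        if out.returncode == 0:
            info["rosetta"] = out.stdout.strip() == "1"
    return info


if __name__ == "__main__":
    for key, value in describe_runtime().items():
        print(f"{key}: {value}")
```

A `machine` value of `arm64` means the interpreter is native; `x86_64` on an M2 would mean the environment was created for Intel and is being translated.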