wwood / singlem

Novelty-inclusive microbial community profiling of shotgun metagenomes
http://wwood.github.io/singlem/
GNU General Public License v3.0

graphm intermittent failures "extern.ExternCalledProcessError: Command mfqe" #90

Closed bfoster-lbl closed 1 year ago

bfoster-lbl commented 2 years ago

Hi, I am running into an intermittent problem: about 30% of runs fail with the error in the subject line. The failed runs succeed when re-run. Do you have any experience with this type of error?

I am running on AWS, and here is the error file:

07/18/2022 09:52:45 PM INFO: SingleM v0.13.2-dev11.a6cc1b4
07/18/2022 09:52:51 PM INFO: Loaded 83 SingleM packages
07/18/2022 09:53:00 PM INFO: Using as input 1 different pairs of sequence files e.g. /cromwell_root/bf-20190529-uswest2-s3/cromwell-execution/run_singlem/582ffe20-7b3d-43de-81a8-2cc0df1f93ff/call-split_reads/r1.fastq & /cromwell_root/bf-20190529-uswest2-s3/cromwell-execution/run_singlem/582ffe20-7b3d-43de-81a8-2cc0df1f93ff/call-split_reads/r2.fastq
07/18/2022 09:53:00 PM INFO: Filtering sequence files through DIAMOND blastx
07/18/2022 09:55:13 PM INFO: Finished DIAMOND prefilter phase
07/18/2022 09:55:13 PM INFO: Assigning sequences to SingleM packages with HMMSEARCH ..
07/18/2022 09:55:13 PM INFO: Searching with 83 SingleM package(s)
07/18/2022 09:55:13 PM INFO: Searching for reads matching 102 different protein HMM(s)
Traceback (most recent call last):
  File "/singlem/bin/singlem", line 584, in <module>
    diamond_taxonomy_assignment_performance_parameters = args.diamond_taxonomy_assignment_performance_parameters)
  File "/singlem/bin/../singlem/pipe.py", line 55, in run
    otu_table_object = self.run_to_otu_table(**kwargs)
  File "/singlem/bin/../singlem/pipe.py", line 267, in run_to_otu_table
    known_taxes, known_otu_tables, include_inserts)
  File "/singlem/bin/../singlem/pipe.py", line 325, in _find_and_extract_reads_by_hmmsearch
    search_result = self._search(hmms, forward_read_files, reverse_read_files)
  File "/singlem/bin/../singlem/pipe.py", line 847, in _search
    run(hmms, graftm_protein_search_directory, True)
  File "/singlem/bin/../singlem/pipe.py", line 835, in run
    extern.run(cmd)
  File "/opt/conda/envs/env/lib/python3.6/site-packages/extern/__init__.py", line 41, in run
    raise ExternCalledProcessError(process, command)
extern.ExternCalledProcessError: Command graftM graft --verbosity 2 --input_sequence_type nucleotide --min_orf_length 96 --filter_minimum 28 --threads 8 --forward /cromwell_root/bf-20190529-uswest2-s3/cromwell-execution/run_singlem/582ffe20-7b3d-43de-81a8-2cc0df1f93ff/call-singlem/tmp.8249a9e0/tmp7fqmo736/prefilter_forward/r1.fna --search_only --search_hmm_files /pkgs/S2.1.ribo

...

returned non-zero exit status 1. STDERR was:
Traceback (most recent call last):
  File "/opt/conda/envs/env/bin/graftM", line 415, in <module>
    Run(args).main()
  File "/opt/conda/envs/env/lib/python3.6/site-packages/graftm/run.py", line 613, in main
    self.graft()
  File "/opt/conda/envs/env/lib/python3.6/site-packages/graftm/run.py", line 388, in graft
    diamond_db
  File "/opt/conda/envs/env/lib/python3.6/site-packages/graftm/timeit.py", line 10, in timed
    result = method(*args, **kw)
  File "/opt/conda/envs/env/lib/python3.6/site-packages/graftm/sequence_searcher.py", line 851, in aa_db_search
    hit_reads_orfs_fasta)
  File "/opt/conda/envs/env/lib/python3.6/site-packages/graftm/sequence_searcher.py", line 943, in search_and_extract_orfs_matching_protein_database
    hits
  File "/opt/conda/envs/env/lib/python3.6/site-packages/graftm/sequence_searcher.py", line 534, in _extract_from_raw_reads
    extern.run(extract_cmd, stdin='\n'.join(input_reads))
  File "/opt/conda/envs/env/lib/python3.6/site-packages/extern/__init__.py", line 41, in run
    raise ExternCalledProcessError(process, command)
extern.ExternCalledProcessError: Command mfqe --output-uncompressed --fasta-read-name-lists /dev/stdin --input-fasta <(cat '/cromwell_root/bf-20190529-uswest2-s3/cromwell-execution/run_singlem/582ffe20-7b3d-43de-81a8-2cc0df1f93ff/call-singlem/tmp.8249a9e0/tmp7fqmo736/prefilter_reverse/r2.fna') --output-fasta-files '/cromwell_root/bf-20190529-uswest2-s3/cromwell-execution/run_singlem/582ffe20-7b3d-43de-81a8-2cc0df1f93ff/call-singlem/tmp.8249a9e0/_raw_extracted_reads.fa3bu162jc' returned non-zero exit status 101.
STDERR was:
[2022-07-18T21:56:43Z INFO mfqe] Read in 1997 read names from /dev/stdin
[2022-07-18T21:56:43Z INFO mfqe] Iterating input FASTQ file
thread 'main' panicked at 'called Result::unwrap() on an Err value: UnexpectedEnd { line: 7183 }', src/main.rs:316:25
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace
STDOUT was: b''
STDOUT was: b''

wwood commented 2 years ago

Hi @bfoster-lbl thanks for the report. I think the short answer here is that you are using a version that is a bit old.

I guess you are using a docker version? We have an updated one with GTDB r202 annotations right now, and anticipate having an r207 version (and a better method for assigning taxonomy to OTUs) within the next few weeks.

How would you like to proceed?

aclum commented 2 years ago

We are currently using docker image wwood/singlem:0.13.2-dev11.a6cc1b4. Is there a docker image or github release for a current stable version? A docker image we can pull from dockerhub would be preferable. We are trying to run this on all the metagenome datasets we generate (several thousand per year), but we need fewer failures before it can be implemented in production.
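Since the failed runs reportedly succeed when re-run, one stopgap while waiting for a fixed release is a bounded retry around the call. The following is only a minimal sketch: `run_with_retries` is a hypothetical helper, not part of SingleM, and the command passed to it is whatever invocation your workflow already uses.

```python
import subprocess
import sys
import time


def run_with_retries(cmd, max_attempts=3, delay_seconds=0.0):
    """Run cmd (a list of strings) until it exits 0 or attempts run out.

    Returns (attempts_used, completed_process).
    Hypothetical helper for illustration; not part of SingleM.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return attempt, result
        # Echo the stderr of each failed attempt so intermittent errors stay visible.
        sys.stderr.write(
            f"attempt {attempt} failed with exit status {result.returncode}\n"
        )
        sys.stderr.write(result.stderr)
        if attempt < max_attempts:
            time.sleep(delay_seconds)
    return max_attempts, result
```

A retry hides the symptom rather than fixing the cause, so it is worth logging how many attempts each dataset needed and tracking whether the failure rate changes after an upgrade.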

wwood commented 2 years ago

Hi, unfortunately there is no stable release yet. We anticipate having this within the next month or two, but we want to make some changes to the way OTUs are assigned taxonomy first (this will finally merge ~350 commits from the dev branch into main).

There is an updated docker image at public.ecr.aws/m5a0r7u5/singlem-wdl:0.13.2-dev37.e97d171 which I believe is publicly available. We have used that image extensively on thousands of public metagenomes. However, there is a known performance bug in it where the taxonomy assignment step takes longer (triple the time?) than it should. If you want to use this let me know and I can provide an example command line invocation.

The specific error you are seeing here is new to me, btw, but looks like an error with a fastx file being unexpectedly truncated. Hopefully an updated version will make that go away though.

Really happy to see this still in testing at JGI - apologies that things haven't stabilised as quickly as hoped.
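To test the truncation hypothesis on a failed run, the intermediate file can be checked for an incomplete final record. This is a minimal sketch under stated assumptions: `find_fastx_truncation` is a hypothetical helper (not part of SingleM, GraftM or mfqe), it handles only uncompressed FASTA and 4-line FASTQ, and it reads the whole file into memory, which is fine for small prefilter outputs but not for full read sets.

```python
def find_fastx_truncation(path):
    """Return None if the file looks complete, else a short problem description.

    Hypothetical helper: uncompressed FASTA or 4-line FASTQ only.
    """
    with open(path) as handle:
        lines = [line.rstrip("\n") for line in handle]
    # Trailing blank lines are harmless, so ignore them.
    while lines and lines[-1] == "":
        lines.pop()
    if not lines:
        return "file is empty"
    if lines[0].startswith(">"):
        # FASTA: every header must be followed by at least one sequence line.
        for i, line in enumerate(lines):
            if line.startswith(">"):
                has_seq = i + 1 < len(lines) and not lines[i + 1].startswith(">")
                if not has_seq:
                    return f"header on line {i + 1} has no sequence"
        return None
    if lines[0].startswith("@"):
        # FASTQ: records are exactly 4 lines (header, sequence, '+', quality).
        if len(lines) % 4 != 0:
            return f"{len(lines)} lines is not a multiple of 4"
        for i in range(0, len(lines), 4):
            header, seq, plus, qual = lines[i:i + 4]
            if not header.startswith("@") or not plus.startswith("+"):
                return f"malformed record starting on line {i + 1}"
            if len(seq) != len(qual):
                return f"sequence/quality length mismatch on line {i + 1}"
        return None
    return "first line is neither a FASTA nor a FASTQ header"
```

Note that this only inspects the file as it exists on disk afterwards. In the log above, mfqe read its input through a `<(cat ...)` process substitution, so a file that passes this check does not rule out a problem in the stream itself.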

bfoster-lbl commented 1 year ago

Hi Ben, Is there any progress for the new release?

wwood commented 1 year ago

Hi Brian,

It has taken a bit longer than anticipated sorry, but we are still working on this. We have updated r207 reference data and have mostly finished the new algorithm dev - now just gathering the pieces together to make it usable for others and getting tests to pass etc. Hope to have a beta release by the end of this week. After ~500 commits, will be momentous to merge dev back into main...

I wonder if it would make sense to catch up quickly over Zoom after that to discuss what specifically you are looking for out of this tool? It addresses a few related problems.

Thanks, ben

aclum commented 1 year ago

Hi Ben, Briefly, we'd like to run singlem for taxonomic analysis on all metagenomic datasets generated at JGI, so approximately 3,000 datasets a year. I believe Simon Roux has been in contact with you about this previously. Simon, Brian and I all work together and are trying to achieve the same goal. Let us know if you'd like to set up a meeting to discuss further. Thanks, Alicia

wwood commented 1 year ago

Hi Alicia,

OK, makes sense. That scale shouldn't be a problem - the tool is not particularly RAM or CPU intensive and we've run at larger scales already.

There are some perhaps more advanced use cases that might be of interest, e.g. relating recovered genomes to raw reads to see how many/which were assembled/binned, updating profiles when new taxonomic reference data emerge, or estimating microbial alpha diversity. But maybe we can leave that discussion for after you've had a chance to test out the taxonomic profiling.

I'll be in touch about a docker image you can try out.

ben

bfoster-lbl commented 1 year ago

Hi Ben, we are currently using the container with label "wwood/singlem:0.13.2-dev11.a6cc1b4" is there a newer version?

wwood commented 1 year ago

Hi @bfoster-lbl @aclum

I just pushed 1.0.0beta1 to a GitHub tag and docker

docker pull wwood/singlem:1.0.0beta1

There is also some doco at https://wwood.github.io/singlem/

Let me know how you go - certainly interested in this use-case. I don't know of any major bugs in it right now, let me know if you come across any. Nearing a 1.0.0 release but since that was a merge of 500+ commits into main just taking it slow.

Actually, just found a small one - --full-help doesn't work in the docker because man isn't installed. You can get that same info from the online doco https://wwood.github.io/singlem/ though

Thanks for your patience.

ben

wwood commented 1 year ago

Hi again,

That small bug is now fixed, and a new docker is available at wwood/singlem:1.0.0beta2

I'm going to close this issue for now, since it is (presumably) fixed in this new version. If not, or if you encounter other issues, let me know.

Thanks, ben

bfoster-lbl commented 1 year ago

Hi Ben, Is singlem production ready? Is there a non-beta version? Thanks, Brian

wwood commented 1 year ago

Hi Brian, A difficult question to ask a mere bioinformatician!

I would say that the pipe subcommand, which takes an input metagenome and spits out a taxonomic profile, is in a good place. The pipe mode inside the docker image is even tested before pushing to dockerhub.

Some of the other subcommands e.g. supplement, which was just introduced in beta7, are ready for outside testing and do fine in my hands, but have UI issues that need to be fixed (e.g. not all the dependencies for that mode are included in the docker).

I have a draft of the paper that I'm putting the finishing touches to, and SingleM does very well in the benchmarking, particularly when the species aren't currently represented in the reference database, which is a situation I imagine is seen very often at JGI. I intend to release 1.0 non-beta when I push it to biorxiv, if not earlier.

HTH - of course feedback welcome.

wwood commented 5 months ago

Hi @bfoster-lbl @aclum Version v0.16.0 I would consider stable. There is a biorxiv now too - https://www.biorxiv.org/content/10.1101/2024.01.30.578060v1

That isn't to say there won't be issues and changes around the fringes, but the main workflow is set now. Please feel free to raise further issues or get in touch directly if helpful.

bfoster-lbl commented 5 months ago

Thanks! ... I will check it out.