args_oap is failing to run on a customised db

alanxelena commented 1 year ago

Hi everyone!

I'm trying to run args_oap using a custom database (PlasmidFinder). I downloaded the db and indexed the .fasta file using: args_oap make_db -i ~/path/to/db/plasmid_db.fasta

Now I'm trying to run the first stage as follows

args_oap stage_one -i Fasta_containing_directory -o output -f fa -t 20 --database ~/path/to/db/plasmid_db.fasta

but I get the following error:

Traceback (most recent call last): File "/home/service/miniconda3/bin/args_oap", line 33, in sys.exit(load_entry_point('args-oap==3.2.2', 'console_scripts', 'args_oap')()) File "/home/service/miniconda3/lib/python3.9/site-packages/args_oap-3.2.2-py3.9.egg/args_oap/args_oap.py", line 97, in main options.func(options) File "/home/service/miniconda3/lib/python3.9/site-packages/args_oap-3.2.2-py3.9.egg/args_oap/stage_one.py", line 204, in run_stage_one StageOne(vars(options)).run() File "/home/service/miniconda3/lib/python3.9/site-packages/args_oap-3.2.2-py3.9.egg/args_oap/stage_one.py", line 184, in run self.extract_seqs(file) File "/home/service/miniconda3/lib/python3.9/site-packages/args_oap-3.2.2-py3.9.egg/args_oap/stage_one.py", line 161, in extract_seqs subp = subprocess.run(['bwa', 'mem', '-t', str(self.thread), self._db, file], check=True, capture_output=True) File "/home/service/miniconda3/lib/python3.9/subprocess.py", line 528, in run raise CalledProcessError(retcode, process.args, subprocess.CalledProcessError: Command '['bwa', 'mem', '-t', '20', '/home/service/data/Databases/plasmid_database.fasta', 'Fasta_RNA/SRR15242155_1.fa']' returned non-zero exit status 1.

I'm sure it has to do with the customised DB because if I run the same line without the --database bit it runs normally. I've already removed any "odd" characters from the fasta headers, I've only kept the underscores ( _ ). I would really appreciate if you could explain me what's going on here.

Thanks a lot!

xinehc commented 1 year ago

Hi,

Can you confirm the directory of ~/path/to/db/plasmid_db.fasta contains many auxiliary files and one file ends with .bwt?

alanxelena commented 1 year ago

Hi!

Thanks for such a quick response. In the mentioned directory I have the following files:

plasmid_database.fasta.sa plasmid_database.fasta.pac plasmid_database.fasta.nto plasmid_database.fasta.ntf plasmid_database.fasta.nsq plasmid_database.fasta.not plasmid_database.fasta.njs plasmid_database.fasta.nin plasmid_database.fasta.nhr plasmid_database.fasta.ndb plasmid_database.fasta.bwt plasmid_database.fasta.ann plasmid_database.fasta.amb plasmid_database.fasta

the .fasta and .fasta.bwt are 58 and 55 KB respectively, the rest ranges from 1 to 20 KB. I don't have .swp files.

Hope that helps and thanks a lot for the assistance.

EDIT: It seems to be working now though I don't want to jinx it, will update in a while if things are indeed working as they should. Just for future reference I had opened the original fasta file in windows to edit the weird symbols out and it seems that it left a CR LF line break in each line, I ran the file through dos2unix and built the DB again.

EDIT 2: I called victory too soon, now I'm getting

Traceback (most recent call last): File "/home/service/miniconda3/bin/args_oap", line 33, in sys.exit(load_entry_point('args-oap==3.2.2', 'console_scripts', 'args_oap')()) File "/home/service/miniconda3/lib/python3.9/site-packages/args_oap-3.2.2-py3.9.egg/args_oap/args_oap.py", line 97, in main options.func(options) File "/home/service/miniconda3/lib/python3.9/site-packages/args_oap-3.2.2-py3.9.egg/args_oap/stage_one.py", line 204, in run_stage_one StageOne(vars(options)).run() File "/home/service/miniconda3/lib/python3.9/site-packages/args_oap-3.2.2-py3.9.egg/args_oap/stage_one.py", line 184, in run self.extract_seqs(file) File "/home/service/miniconda3/lib/python3.9/site-packages/args_oap-3.2.2-py3.9.egg/args_oap/stage_one.py", line 162, in extract_seqs subsubp = subprocess.run(['samtools', 'fasta', '-F4', '-F0x900', '-'], check=True, capture_output=True, input=subp.stdout) File "/home/service/miniconda3/lib/python3.9/subprocess.py", line 528, in run raise CalledProcessError(retcode, process.args, subprocess.CalledProcessError: Command '['samtools', 'fasta', '-F4', '-F0x900', '-']' returned non-zero exit status 1.

xinehc commented 1 year ago

That's strange, do you mind sharing some sequences of your database file?

You may consider using

seqkit head -n 100 plasmid_db.fasta > test.fa

to get the top 100 sequences.

alanxelena commented 1 year ago

Hi!

Yes, it certainly is, I only downloaded the DB from the DTU website, no changes or whatsoever other than replacing special characters for _. Anyhow, here are the first few seqs. first_100.txt Thanks again!

xinehc commented 1 year ago

Hi,

I could not reproduce the issue on OSX using the following command:

args_oap make_db -i first_100.txt
echo '>level1' | cat - first_100.txt | grep '^>' | cut -d ' ' -f 1 | cut -c2- > structure.txt

args_oap stage_one -i input -o test --database first_100.txt -f fa
args_oap stage_two -i test --database first_100.txt --structure1 structure.txt

The problem may be OS related, are you working on Windows?

alanxelena commented 1 year ago

Hi!

I don't know what's going wrong here, I'm running Ubuntu 20.04.6. I copied your exact command just substituting the first_100.txt for the complete db and get the same issue.

Edit/Update: I tried running it with the first_100 as database and so far it has processed the first sample and is currently through the second one (It had never reached this point before), hopefully it keeps on running. I'm guessing there's something wrong with the rest of the file but I can't detect what, I'm attaching it in case you would like to take a look at it. plasmid_database.txt On other things, I've also noted that if I set -t to 20 instead of the default 8, even using the first_100 db fails, could this be related?

Many thanks!

xinehc commented 1 year ago

It is likely caused by caching large sam files into memory. This patch should fix the issue: https://github.com/xinehc/args_oap/releases/tag/v3.2.3

alanxelena commented 1 year ago

Hi!

Thanks for the help and the new release, it now works using -t 20 which is great. On the downside, I keep getting

Traceback (most recent call last): File "/home/service/miniconda3/bin/args_oap", line 33, in sys.exit(load_entry_point('args-oap==3.2.3', 'console_scripts', 'args_oap')()) File "/home/service/miniconda3/lib/python3.9/site-packages/args_oap-3.2.3-py3.9.egg/args_oap/args_oap.py", line 97, in main options.func(options) File "/home/service/miniconda3/lib/python3.9/site-packages/args_oap-3.2.3-py3.9.egg/args_oap/stage_one.py", line 210, in run_stage_one StageOne(vars(options)).run() File "/home/service/miniconda3/lib/python3.9/site-packages/args_oap-3.2.3-py3.9.egg/args_oap/stage_one.py", line 190, in run self.extract_seqs(file) File "/home/service/miniconda3/lib/python3.9/site-packages/args_oap-3.2.3-py3.9.egg/args_oap/stage_one.py", line 168, in extract_seqs subprocess.run(['samtools', 'fasta', '-F4', '-F0x900', _tmp_seqs_sam], check=True, stderr=subprocess.DEVNULL, stdout=f) File "/home/service/miniconda3/lib/python3.9/subprocess.py", line 528, in run raise CalledProcessError(retcode, process.args, subprocess.CalledProcessError: Command '['samtools', 'fasta', '-F4', '-F0x900', '115_test/SRR15242155_1.seqs.sam.tmp']' returned non-zero exit status 1.

I did some tests running the same command with the first 100, 110, 115 and so on. Using the first 100 and 110 entries of the database works fine (Here attached: 110_plasmid.txt ) but it starts failing with the 115 database (Here: 115_plasmid.txt ) It is definitely something with the database, and risking people having the same problems with larger databases, it would be great if you could identify what's wrong with these last 5 entries that make everything fail, maybe to add a note of things to take into consideration while building a DB. I tried looking at everything but it seems in place to me.

Thanks again!

xinehc commented 1 year ago

Hi,

it seems your database contains a duplicated entry Col_BS512__1__NC_010656. After deduplication via seqkit rmdup the pipeline should finish.

alanxelena commented 1 year ago

I'll be damned, it was that easy. Thanks for the support, it seems to be running now even though with a warning

[2023-06-28 14:52:41] [1m[33;20mWARNING[1;0m: Output folder contains and/or , they/it will be overwritten.

I know it's just a warning but there was originally no directory called "Replicon" so there's no way either extracted.fa or metadata.txt were in there. Anyhow, I'll keep you posted.

Many many thanks again.

alanxelena commented 1 year ago

Hi!

It's me again. So I'm trying to go for another custom made DB, this time IS IS_db.txt

And it fails again:

Traceback (most recent call last): File "/home/service/miniconda3/bin/args_oap", line 33, in sys.exit(load_entry_point('args-oap==3.2.3', 'console_scripts', 'args_oap')()) File "/home/service/miniconda3/lib/python3.9/site-packages/args_oap-3.2.3-py3.9.egg/args_oap/args_oap.py", line 97, in main options.func(options) File "/home/service/miniconda3/lib/python3.9/site-packages/args_oap-3.2.3-py3.9.egg/args_oap/stage_one.py", line 210, in run_stage_one StageOne(vars(options)).run() File "/home/service/miniconda3/lib/python3.9/site-packages/args_oap-3.2.3-py3.9.egg/args_oap/stage_one.py", line 190, in run self.extract_seqs(file) File "/home/service/miniconda3/lib/python3.9/site-packages/args_oap-3.2.3-py3.9.egg/args_oap/stage_one.py", line 165, in extract_seqs subprocess.run(['bwa', 'mem', '-t', str(self.thread), '-o', _tmp_seqs_sam, self._db, file], check=True, stderr=subprocess.DEVNULL) File "/home/service/miniconda3/lib/python3.9/subprocess.py", line 528, in run raise CalledProcessError(retcode, process.args, subprocess.CalledProcessError: Command '['bwa', 'mem', '-t', '20', '-o', 'IS_prediction/SRR15242155_1.seqs.sam.tmp', '/home/service/data/Databases/IS_db/IS_db.fasta', 'Fasta_RNA/SRR15242155_1.fa']' returned non-zero exit status 1.

I already removed weird characters and duplicates using seqkit. I feel that it will be something small like last time but it's difficult for me to detect the specific issue. I really appreciate all the support, many thanks!

xinehc commented 1 year ago

Your file need a trailing newline character '\n' otherwise its not a standard fasta format.

Use

seqkit seq IS_db.txt > IS_db.fa

or manually add a newline at the end of the file solves the problem.

alanxelena commented 1 year ago

Hi!

Thanks for the reply, sadly that doesn't solve the problem. I'm getting the exact same Traceback :(

xinehc commented 1 year ago

It seems your file has another issue: there is a malformed line in record ISH18_IS200_IS605_IS605. You may consider manually remove the newline char or use seqkit sana to remove the broken record.

seqkit sana -i fasta IS_db.txt > IS_db.fa
[INFO] File: IS_db.txt  Discarded line: Invalid line structure! 63819:   >ISH18_IS200_IS605_IS605
[INFO] File: IS_db.txt  Discarded line: Invalid line structure! 63820:   CTTGATTCAGCGAGAGAATCCCGCCCTTCCCGTGAGTGACGAGTGTTCCGACGCTCAGTC
[INFO] File: IS_db.txt  Discarded line: Invalid line structure! 63821:   GGAACCGAGGAGCGAACAGGGCGGGAGTGAATCGCGTAAGCTCTGTGCAACAAACCACTC
[INFO] File: IS_db.txt  Discarded line: Invalid line structure! 63822:   GGTTTCTCCCTGCTGAATACCCCACGTCACAAGAGTTATCTGAATTGGGTGTCTCGTATC
[INFO] File: IS_db.txt  Discarded line: Invalid line structure! 63823:   TGCTAAGGCCAAATGGAGTATCATCTGCAAACCGGGTCGCACACAGTGTACGCGCTTCAG
[INFO] File: IS_db.txt  Discarded line: Invalid line structure! 63824:   TACCACTTCGTGACTGTCACGAAGTACCGGGCCGATATTCTCACGTATGAGCGGCTGGAG
[INFO] File: IS_db.txt  Discarded line: Invalid line structure! 63825:   CGTGTGGCTGAAATCGCCCACGATATTGCAGACGACTTCGAGGCCGACATCAAGAACGTG
[INFO] File: IS_db.txt  Discarded line: Invalid line structure! 63826:   GACGGTGGTACCGACCACGTTCACATCCTGTTTCAGACCAAACCAACCACAGACCTCACG
[INFO] File: IS_db.txt  Discarded line: Invalid line structure! 63827:   AAGTTCATCAACTCACTCAAAGGTGTCACGTCCCGACGGATACGGTCGGAGTTTCCCGAA
[INFO] File: IS_db.txt  Discarded line: Invalid line structure! 63828:   GTAACACAGACGCTCGAAGATGCGTTCTGGCAACCGGGATACTTCCTCGCCACGACCGGC
[INFO] File: IS_db.txt  Discarded line: Invalid line structure! 63829:   CAAGTGAGCATTGACACGCTGATGGACTACGTGGACGACCAGTAGCATGACCGCGACAAC
[INFO] File: IS_db.txt  Discarded line: Invalid line structure! 63830:   CACAAAAACGCTGGAAGCTACACTCGCCCCGCCGACAGCCCACAAAGAGCGGAAACTGTG
[INFO] File: IS_db.txt  Discarded line: Invalid line structure! 63831:   TGACCTGCTCGAAACCTACCGTGAGGGGCTACACGAAGCGTTCGACGCGGGTGTGACACG
[INFO] File: IS_db.txt  Discarded line: Invalid line structure! 63832:   ATGACCGCCACAGGACGTGGTGACGCCCTACGACCTCCCGTATCAGGCGAAAGAGCTCTC
[INFO] File: IS_db.txt  Discarded line: Invalid line structure! 63833:   TGCAACTACGTCCCACAACTGCACGACACCTACAACGCACAGGAGTTAGACGACGACCAC
[INFO] File: IS_db.txt  Discarded line: Invalid line structure! 63834:   CCGGTTCGGCTTACCAACCAAGCCGCCGAGTTCGACCACTCGGCGGCGCGTGACTACGAG
[INFO] File: IS_db.txt  Discarded line: Invalid line structure! 63835:   TTCACATGGTGGGCACCGCAACCCGGTCGCGGGACGAATTTCTGGATACCGCTTCGTATC
[INFO] File: IS_db.txt  Discarded line: Invalid line structure! 63836:   AACCCCGAACAAGACGGTCTGTGGCACGACCTCGTACACGGTGAGGCGTCGGCAGGCCAA
[INFO] File: IS_db.txt  Discarded line: Invalid line structure! 63837:   CTCCGCCTGCAACGCCACCGCACGTCGTGGACGCTCCACGTCACTGTCGAGTTCCCGGTC
[INFO] File: IS_db.txt  Discarded line: Invalid line structure! 63838:   GAACAACCGGACTATGAGCCGACCGACGAGGATGTGACGCCAGTCGGCTTTGATATTGGC
CTGACCCACTGCTCATCb.txt  Discarded line: Invalid line structure! 63839:   GAAGCACCTGCTCGCGGGCTGTGTGCAAGCAGGGCACTCCGA
[INFO] File: IS_db.txt  Discarded line: Invalid line structure! 63840:   AACGGCGGCCGCGCTCGTCACCTCCGCAAAGAGATGTTCACAACGCTGAAGCGACTGCAA
[INFO] File: IS_db.txt  Discarded line: Invalid line structure! 63841:   GAGCGTGACGCCGCCCAGTGGCGGATTGACGAGCGATTCGACCACTACCAGAACGCACTC
[INFO] File: IS_db.txt  Discarded line: Invalid line structure! 63842:   ACAGACATCATCGAAAAGGCGTCTCGGCAAGCAATCGAGTACGCCTGCCGATTCGAGAAG
[INFO] File: IS_db.txt  Discarded line: Invalid line structure! 63843:   CCTGTGGTCGTTCTGGAAGACCTCTCGTACATCCGCGAAGACCTCGACTACGGCGAATGG
[INFO] File: IS_db.txt  Discarded line: Invalid line structure! 63844:   ATGAACCGACGCCTCCACGCATGGGCGTTCGCTCGCTTGCAGGAGCGTATCGAGGACAAA
[INFO] File: IS_db.txt  Discarded line: Invalid line structure! 63845:   GCACGAGAGGCTGGCATCCCGGTCGAATACATTCGCCCGGAGTACACGAGGCAGACGTGC
[INFO] File: IS_db.txt  Discarded line: Invalid line structure! 63846:   CACGAGTGCGGCCACATCGGGTATCGGGACGGCGATGAGTTCCGGTGTCAGAACGACGAG
[INFO] File: IS_db.txt  Discarded line: Invalid line structure! 63847:   TGTTGGGTATCGGAGTACCACGCAGACATCAACGCGGCGGTCAACATCGCTGACCGCCAC
[INFO] File: IS_db.txt  Discarded line: Invalid line structure! 63848:   GACCCGTGGGGTGAGAGCCTGCCGCTGAAACCGGCGGGCGATGACATCTCACGGGATGGG
[INFO] File: IS_db.txt  Discarded line: Invalid line structure! 63849:   AGCGCCTGTGACAGCGCCGCGACCCCCACCGAGCAGAGCCAACCACGGCAGATGACGCTC
[INFO] File: IS_db.txt  Discarded line: Invalid line structure! 63850:   GGAGAGGTCGGGTCGGAACCCACTGCCGGTCGTTAACCGGTATTCCCATGCAAGGAAGCC
[INFO] File: IS_db.txt  Discarded line: Invalid line structure! 63851:   GCGCCGTTCACGGCGCGGAGGATGTCAC

alanxelena commented 1 year ago

Wow, how many problems can a fasta file have? Thanks for you patience and knowledge! It's now running, hope it keeps on like this.

Many many many many many many many thanks!

alanxelena commented 1 year ago

Hey! It's me again after a while of inactivity.

Stage one ran smoothly after the last suggestions, I'm anyhow having problems with the second stage now.

This is the error I get:

[2023-07-14 12:04:54] WARNING: Not all extracted target sequences can be found in the structure file. Please check the database and/or the structure files. [2023-07-14 12:04:54] CRITICAL: No target sequence remained after merging structure files, no further normalization will be made.

The structure was done as suggested:

echo '>level1' | cat - IS_db.fasta | grep '^>' | cut -d ' ' -f 1 | cut -c2- > structure.txt

The used database for generating the structure was the one I used for stage one.

Any suggestions on what might be the cause now?

Thanks a lot!

xinehc commented 1 year ago

Using the provided STAS example files I am able to get some results with:

seqkit sana -I fasta -O fasta IS_db.txt > IS.fa
args_oap make_db -i IS.fa
echo '>level1' | cat - IS.fa | grep '^>' | cut -d ' ' -f 1 | cut -c2- > IS.txt

args_oap stage_one -i input -o test --database IS.fa -f fa
args_oap stage_two -i test --database IS.fa --structure1 IS.txt

If you get the warning it is likely your structure file is malformed. It should be like:

level1
ISLL6_IS3_IS3
IS100_IS21_unknown
IS1000A_IS110_unknown
IS1000B_IS110_unknown
IS1001_ISL3_unknown
IS1002_IS481_unknown
IS1004_IS200_IS605_IS200
IS1006_IS6_unknown
IS1007_IS6_unknown
IS1008_IS6_unknown
IS100kyp_IS21_unknown
IS100L_IS21_unknown
IS100X_IS21_unknown
IS1016_IS1595_IS1016
IS1016C_IS1595_IS1016
IS1016C2_IS1595_IS1016
...

xinehc / args_oap

args_oap is failing to run on a customised db #29