zheminzhou / PEPPAN

Phylogeny Enhanced Prediction of PAN-genome
https://doi.org/10.1101/2020.01.03.894154
GNU General Public License v3.0

Failed to mmap memory dataSize=0 File=./NS_eyo9ogxk/seq.db_h. Error 22. #21

Open ghost opened 3 years ago

ghost commented 3 years ago

Hello, I am new to using these tools, so please excuse me if I don't explain this well. I am trying to build a pangenome of Borrelia spp. so that I can map my tick microbiome reads against it and quantify Borrelia presence in my samples.

I have 11 GFF files representing 11 Borrelia species, which I downloaded from NCBI. I have the FASTA files too, but I believe the GFF files from NCBI will suffice as input. Is this correct? My files are:

GCF_000512145.1_ASM51214v2_genomic.fna.gz
GCF_002741785.1_ASM274178v1_genomic.gff.gz
GCF_000512145.1_ASM51214v2_genomic.gff.gz
GCF_003606285.1_ASM360628v1_genomic.fna.gz
GCF_000956315.1_ASM95631v1_genomic.fna.gz
GCF_003606285.1_ASM360628v1_genomic.gff.gz
GCF_000165595.2_ASM16559v2_genomic.fna.gz
GCF_000956315.1_ASM95631v1_genomic.gff.gz
GCF_003814405.1_ASM381440v1_genomic.fna.gz
GCF_000165595.2_ASM16559v2_genomic.gff.gz
GCF_001936255.1_ASM193625v1_genomic.fna.gz
GCF_003814405.1_ASM381440v1_genomic.gff.gz
GCF_000181575.2_ASM18157v2_genomic.fna.gz
GCF_001936255.1_ASM193625v1_genomic.gff.gz
GCF_014525745.1_ASM1452574v1_genomic.fna.gz
GCF_000181575.2_ASM18157v2_genomic.gff.gz
GCF_001936295.1_ASM193629v1_genomic.fna.gz
GCF_014525745.1_ASM1452574v1_genomic.gff.gz
GCF_000181895.2_ASM18189v2_genomic.fna.gz
GCF_001936295.1_ASM193629v1_genomic.gff.gz
GCF_000181895.2_ASM18189v2_genomic.gff.gz
GCF_002741785.1_ASM274178v1_genomic.fna.gz

Currently, I get a series of errors when I input the following:

Current Behavior

2021-05-07 12:25:21.570015 COMMAND: /home/sean/.local/bin/PEPPAN -p borrelia_files/BORR -t 4 --clust_identity 0.5 --clust_match_prop 0.6 --match_identity 0.4 borrelia_files/GCF_000165595.2_ASM16559v2_genomic.gff.gz borrelia_files/GCF_000181575.2_ASM18157v2_genomic.gff.gz borrelia_files/GCF_000181895.2_ASM18189v2_genomic.gff.gz borrelia_files/GCF_000512145.1_ASM51214v2_genomic.gff.gz borrelia_files/GCF_000956315.1_ASM95631v1_genomic.gff.gz borrelia_files/GCF_001936255.1_ASM193625v1_genomic.gff.gz borrelia_files/GCF_001936295.1_ASM193629v1_genomic.gff.gz borrelia_files/GCF_002741785.1_ASM274178v1_genomic.gff.gz borrelia_files/GCF_003606285.1_ASM360628v1_genomic.gff.gz borrelia_files/GCF_003814405.1_ASM381440v1_genomic.gff.gz borrelia_files/GCF_014525745.1_ASM1452574v1_genomic.gff.gz
2021-05-07 12:25:22.032943 Run MMSeqs linclust to get exemplar sequences. Params: 0.5 identities and 0.8 align ratio
Failed to mmap memory dataSize=0 File=./NS_eyo9ogxk/seq.db_h. Error 22.
Traceback (most recent call last):
  File "/home/sean/.local/bin/PEPPAN", line 8, in <module>
    sys.exit(ortho())
  File "/home/sean/.local/lib/python3.8/site-packages/PEPPAN/PEPPAN.py", line 1884, in ortho
    params['clust'] = iterClust(params['prefix'], params['genes'], groups, dict(identity=params['clust_identity'], coverage=params['clust_match_prop'], n_thread=params['n_thread'], translate=False))
  File "/home/sean/.local/lib/python3.8/site-packages/PEPPAN/PEPPAN.py", line 1784, in iterClust
    g, clust = getClust(prefix, g, params)
  File "/home/sean/.local/lib/python3.8/site-packages/PEPPAN/modules/clust.py", line 67, in getClust
    with open(tabFile) as fin :
FileNotFoundError: [Errno 2] No such file or directory: './NS_eyo9ogxk/clust.tab'

Steps to Reproduce (for bugs)

PEPPAN -p borrelia_files/BORR -t 4 --clust_identity 0.5 --clust_match_prop 0.6 --match_identity 0.4 borrelia_files/*.gff.gz

This does generate some output files with my desired prefix: BORR.encode.csv, BORR.genes and BORR.old_prediction.npz.

Context

I have been searching online for clues, which is why I changed the values of --clust_identity, --clust_match_prop and --match_identity. I also set -t to use fewer threads, in case it was a memory issue.

Environment

Details of my environment: to install, I did the following:

conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
conda install mmseqs2
conda install blast
conda install diamond
conda install rapidnj
conda install fasttree

command -v mmseqs blastn rapidnj diamond fasttree

/home/sean/miniconda3/envs/peppaninstall/bin/mmseqs
/usr/bin/blastn
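
The `command -v` output above lists paths for only two of the five tools queried. A minimal Python sketch of the same check, which also names the tools that were not found, might look like the following. The executable names are copied from the command in this post; the installed binaries may be named differently on a given system, so treat the list as an assumption.

```python
import shutil

# Executable names taken from the `command -v` line in this post.
# Assumption: these match the names of the installed binaries.
REQUIRED_TOOLS = ["mmseqs", "blastn", "rapidnj", "diamond", "fasttree"]


def find_tools(names):
    """Map each executable name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names):
    """Return the names that could not be resolved on PATH."""
    return [name for name, path in find_tools(names).items() if path is None]


if __name__ == "__main__":
    for name, path in find_tools(REQUIRED_TOOLS).items():
        print(f"{name}: {path or 'NOT FOUND'}")
```

Running this inside the same conda environment used for PEPPAN shows whether each tool resolves there, rather than only in the base shell.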

pip3 install peppan

I ran the test data and it all worked great. I hope this makes sense!

Naclist commented 3 years ago

GFF files from NCBI without pretreatment are not enough for PEPPAN to build a pangenome for you. Read the Quickstart and you will see that a FASTA file should be added. Alternatively, you can run Prokka on your FASTA files to generate GFF files that include the sequences.
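
One way to produce "GFF files with the sequences" from the NCBI downloads is to append each genome's FASTA to its annotation under a `##FASTA` directive, which is the GFF3 convention that Prokka output also follows. The sketch below is an illustration only: the helper names and the `.with_seq.gff` output suffix are hypothetical, and whether PEPPAN accepts NCBI annotations merged this way is an assumption, so the Quickstart remains the authority on the expected input.

```python
import gzip
import shutil
from pathlib import Path


def open_text(path):
    """Open a plain or gzip-compressed file in text mode."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "rt")


def has_embedded_fasta(gff_path):
    """Return True if the GFF file contains a '##FASTA' directive."""
    with open_text(gff_path) as fin:
        return any(line.startswith("##FASTA") for line in fin)


def merge_gff_and_fasta(gff_path, fasta_path, out_path):
    """Write annotations, a '##FASTA' line, then the sequences to out_path."""
    with open(out_path, "wt") as fout:
        with open_text(gff_path) as gff:
            for line in gff:
                if line.startswith("##FASTA"):
                    break  # drop any existing sequence section
                fout.write(line if line.endswith("\n") else line + "\n")
        fout.write("##FASTA\n")
        with open_text(fasta_path) as fasta:
            shutil.copyfileobj(fasta, fout)
    return out_path


def merge_directory(folder):
    """Pair each *_genomic.gff.gz with its .fna.gz sibling and merge them."""
    folder = Path(folder)
    merged = []
    for gff in sorted(folder.glob("*_genomic.gff.gz")):
        fasta = gff.with_name(gff.name.replace(".gff.gz", ".fna.gz"))
        if not fasta.exists():
            continue  # no matching sequence file; skip this genome
        # Hypothetical output naming, chosen only for this sketch.
        out = gff.with_name(gff.name.replace(".gff.gz", ".with_seq.gff"))
        merged.append(merge_gff_and_fasta(gff, fasta, out))
    return merged
```

For the files in this issue, `merge_directory("borrelia_files")` would write one uncompressed `.with_seq.gff` file per genome next to the originals, and `has_embedded_fasta` can be used beforehand to confirm that the downloaded GFF files carry no sequences.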