wshuai294 / PStrain

Pstrain profiles strains in metagenomics data. It infers strain abundance and genotype for each species. Also, it has a single species mode; where given a BAM and VCF, it can phase the variants for any species.
MIT License
8 stars 2 forks source link

Error: File truncated at line 1 Could not build fai index out/test//ref//merged_ref.fa.fai #14

Closed sahilrishav2 closed 5 months ago

sahilrishav2 commented 6 months ago

Hi Shuai WANG,

I am using this code python3 ../scripts/PStrain.py -c config.txt -o out --bowtie2db /home/rishav/.local/lib/python3.10/site-packages/metaphlan/metaphlan_databases -x mpa_vJun23_CHOCOPhlAnSGB_202307 --proc 20 --nproc 20 to run test file bu the error still persists:

Start Profiling test... run metaphlan... read_metaphlan out/test//metaphlan//metaphlan_output.txt [E::fai_build_core] File truncated at line 1 [faidx] Could not build fai index out/test//ref//merged_ref.fa.fai Settings: Output files: "out/test//ref//merged_ref..bt2" Line rate: 6 (line is 64 bytes) Lines per side: 1 (side is 64 bytes) Offset rate: 4 (one in 16) FTable chars: 10 Strings: unpacked Max bucket size: default Max bucket size, sqrt multiplier: default Max bucket size, len divisor: 4 Difference-cover sample period: 1024 Endianness: little Actual local endianness: little Sanity checking: disabled Assertions: disabled Random seed: 0 Sizeofs: void:8, int:4, long:8, size_t:8 Input files DNA, FASTA: out/test//ref//merged_ref.fa Warning: Empty fasta file: 'out/test//ref//merged_ref.fa' Warning: All fasta inputs were empty Total time for call to driver() for forward index: 00:00:00 Error: Encountered internal Bowtie 2 exception (#1) Command: /home/rishav/anaconda3/envs/pstrain/bin/bowtie2-build-s --wrapper basic-0 -f out/test//ref//merged_ref.fa out/test//ref//merged_ref (ERR): "out/test//ref//merged_ref" does not exist or is not a Bowtie 2 index Exiting now ... [main_samview] fail to read the header from "-". [W::hts_set_opt] Cannot change block size for this format samtools sort: failed to read header from "-" [Mon May 06 11:51:34 IST 2024] picard.sam.AddOrReplaceReadGroups INPUT=out/test//map/mapped.sort.bam OUTPUT=out/test/map/mapped.bam RGLB=whatever RGPL=illumina RGPU=whatever RGSM=whatever RGID=1 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json [Mon May 06 11:51:34 IST 2024] Executing as rishav@mjlab-Precision-7920-Tower on Linux 6.5.0-27-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_112-b16; Picard version: 2.1.0(25ebc07f7fbaa7c1a4a8e6c130c88c1d10681802_1454776546) JdkDeflater [Mon May 06 11:51:34 IST 2024] picard.sam.AddOrReplaceReadGroups done. Elapsed time: 0.00 minutes. Runtime.totalMemory()=2058354688 To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp Exception in thread "main" htsjdk.samtools.SAMException: Cannot read non-existent file: /home/rishav/PStrain/test/out/test/map/mapped.sort.bam at htsjdk.samtools.util.IOUtil.assertFileIsReadable(IOUtil.java:338) at htsjdk.samtools.util.IOUtil.assertFileIsReadable(IOUtil.java:325) at htsjdk.samtools.util.IOUtil.assertInputIsValid(IOUtil.java:301) at picard.sam.AddOrReplaceReadGroups.doWork(AddOrReplaceReadGroups.java:107) at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:209) at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95) at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105) rm: cannot remove 'out/test//map/mapped.sort.bam': No such file or directory [E::hts_open_format] Failed to open file "out/test//map/mapped.bam" : No such file or directory samtools index: failed to open "out/test//map/mapped.bam": No such file or directory [E::hts_open_format] Failed to open file "out/test//map/mapped.bam" : No such file or directory samtools depth: Cannot open input file "out/test//map/mapped.bam": No such file or directory INFO 11:51:37,029 HelpFormatter - -------------------------------------------------------------------------------- INFO 11:51:37,031 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.5-0-g36282e4, Compiled 2015/11/25 04:03:56 INFO 11:51:37,032 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 11:51:37,032 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 11:51:37,034 HelpFormatter - Program Args: -T HaplotypeCaller -R out/test//ref//merged_ref.fa -allowPotentiallyMisencodedQuals -I out/test//map/mapped.bam -o out/test//map/mapped.vcf.gz INFO 11:51:37,036 HelpFormatter - Executing as rishav@mjlab-Precision-7920-Tower on Linux 6.5.0-27-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_112-b16. INFO 11:51:37,037 HelpFormatter - Date/Time: 2024/05/06 11:51:37 INFO 11:51:37,037 HelpFormatter - -------------------------------------------------------------------------------- INFO 11:51:37,037 HelpFormatter - -------------------------------------------------------------------------------- INFO 11:51:37,086 GenomeAnalysisEngine - Strictness is SILENT INFO 11:51:39,399 GATKRunReport - Uploaded run statistics report to AWS S3

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 3.5-0-g36282e4):
ERROR
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR
ERROR MESSAGE: Fasta index file /home/rishav/PStrain/test/out/test/ref/merged_ref.fa.fai for reference /home/rishav/PStrain/test/out/test/ref/merged_ref.fa does not exist. Please see http://gatkforums.broadinstitute.org/discussion/1601/how-can-i-prepare-a-fasta-file-to-use-as-reference for help creating it.
ERROR ------------------------------------------------------------------------------------------

[E::hts_open_format] Failed to open file "out/test//map/mapped.vcf.gz" : No such file or directory No need to merge in case of one sample.

this is the location /home/rishav/.local/lib/python3.10/site-packages/metaphlan/metaphlan_databases

mpa_latest
mpa_vJun23_CHOCOPhlAnSGB_202307.3.bt2l mpa_vJun23_CHOCOPhlAnSGB_202307.pkl
mpa_vJun23_CHOCOPhlAnSGB_202307_VINFO.csv mpa_vJun23_CHOCOPhlAnSGB_202307.1.bt2l
mpa_vJun23_CHOCOPhlAnSGB_202307.4.bt2l mpa_vJun23_CHOCOPhlAnSGB_202307.rev.1.bt2l README.txt mpa_vJun23_CHOCOPhlAnSGB_202307.2.bt2l
mpa_vJun23_CHOCOPhlAnSGB_202307.fna mpa_vJun23_CHOCOPhlAnSGB_202307.rev.2.bt2l

this is the content of mpa_vJun23_CHOCOPhlAnSGB_202307.fna

VDB|003B-0000-0-01C2|M1-c0-c0-c0 240413_kVSG_NC_0274021_Salmonella_phage_SPN3US CTTTTTAAAACAATCACTTAGATTGTTTTTTAGATCGATCCTGTGGATCGGGTCGAAGAC CCTTTCCTTATACGCGTGCGCGTGCGCACACCCGTATCCTCCAGGATCTTTTAAAATATA TAATAATATTAAATATAATAATACCTTTAGGTATTATATTTAATTATTATAATTATTTAT ATAAATAAATATATTAATATATTAAGATCTTAAAACGCGTGCGGGCGCGCGCGTGATCCT TATTAAGAGATCGTGATCGGACAGACGTCCTGGTTTGGGTACCCTGTGTATTTTTTACGT GGTTTCCCTATCGTGTGTTCAGACCACCACTCGATGTCCAAAACCCATAGCTTCGGTGGC CGCCTCCCTGGACGGCAGACTCTGACAAGTCGACCCACACCACCAACCCGTCCGGGGACG GATTTTTATTCCAACGACAAGCTCCACGAAAATCCCCGCAGAGGGACTCACTGCTGTGTG ACTCACCGGCTCCTGGACGCCTACGCCTCGAGCCGGTAATTTTTTGATTACGTGAGTTTG

Thank you for your persistence

wshuai294 commented 6 months ago

Hello, what's the content in out/test//metaphlan//metaphlan_output.txt, is it empty?

wshuai294 commented 6 months ago

Or, could you delete the folder out/ and try it again?

sahilrishav2 commented 6 months ago

i already deleted the out dir and try again but same error shown again

sahilrishav2 commented 6 months ago

This is the output of metaphlan_output.txt file, only one species found:

cat test//metaphlan//metaphlan_output.txt

mpa_vJun23_CHOCOPhlAnSGB_202307

/home/rishav/.local/bin/metaphlan test_1.fq.gz,test_2.fq.gz --input_type fastq --bowtie2db /home/rishav/.local/lib/python3.10/site-packages/metaphlan/metaphlan_databases -x mpa_vJun23_CHOCOPhlAnSGB_202307 --tax_lev s --bowtie2_exe bowtie2 --nproc 10 --bowtie2out out/test//metaphlan//bowtie.out.bz2

515224 reads processed

SampleID Metaphlan_Analysis

clade_name NCBI_tax_id relative_abundance additional_species

s__Escherichia_coli 562 100.0

wshuai294 commented 6 months ago

The metaphlan result seems correct. Could you just run the original test script, which will download the reference which I used? Or, could you copy the reference and put it in the same folder of script/. Also, I will download the reference used by you and test it.

sahilrishav2 commented 6 months ago

Ok, I download the reference which you used. The reference you used is of Oct22 hence i was trying to use the latest version reference, i.e., jun2023 .

wshuai294 commented 6 months ago

Hello, the ref I used is mpa_vOct22_CHOCOPhlAnSGB_202403, which should be the latest as it is Mar 2024.

sahilrishav2 commented 6 months ago

Ok , sorry then, may be mistake from my mistake

sahilrishav2 commented 6 months ago

But if I see the updation date, there is not much difference, mpa_vOct22_CHOCOPhlAnSGB_202403 updated on 2024-04-05 while mpa_vJun23_CHOCOPhlAnSGB_202403 updated on 2024-03-11

wshuai294 commented 6 months ago

Ok then, I am downloading mpa_vJun23_CHOCOPhlAnSGB_202403.

sahilrishav2 commented 6 months ago

Thank you for your support. I am also downloading mpa_vOct22_CHOCOPhlAnSGB_202403 to see if the issue could be resolved.

wshuai294 commented 6 months ago

Hello, I downloaded it, and after building an index, it seems like this:

mpa_vJun23_CHOCOPhlAnSGB_202403.pkl         mpa_vJun23_CHOCOPhlAnSGB_202403_SGB.fna
mpa_vJun23_CHOCOPhlAnSGB_202403.tar         mpa_vJun23_CHOCOPhlAnSGB_202403_SGB.rev.1.bt2l
mpa_vJun23_CHOCOPhlAnSGB_202403_SGB.1.bt2l  mpa_vJun23_CHOCOPhlAnSGB_202403_SGB.rev.2.bt2l
mpa_vJun23_CHOCOPhlAnSGB_202403_SGB.2.bt2l  mpa_vJun23_CHOCOPhlAnSGB_202403_VINFO.csv
mpa_vJun23_CHOCOPhlAnSGB_202403_SGB.3.bt2l  mpa_vJun23_CHOCOPhlAnSGB_202403_VSG.fna.bz2
mpa_vJun23_CHOCOPhlAnSGB_202403_SGB.4.bt2l

The mpa_vJun23_CHOCOPhlAnSGB_202403.pkl file has different index from mpa_vJun23_CHOCOPhlAnSGB_202403_SGB.fna. This will lead to an error in running Metaphlan4. Therefore, I renamed mpa_vJun23_CHOCOPhlAnSGB_202403.pkl to mpa_vJun23_CHOCOPhlAnSGB_202403_SGB.pkl. After that, I run PStrain with the command like:

python3 ../scripts/PStrain.py -c config.txt -o out --bowtie2db  ../not_use/mpa_vJun23_CHOCOPhlAnSGB_202403 -x mpa_vJun23_CHOCOPhlAnSGB_202403_SGB

The PStrain runs successfully with this command.

sahilrishav2 commented 6 months ago

Thank you, i also be able to run successfully with this command python3 ../scripts/PStrain.py -c config.txt -o out --bowtie2db ../mpa_vOct22_CHOCOPhlAnSGB_202403/ -x mpa_vOct22_CHOCOPhlAnSGB_202403 --proc 20 --nproc 20 This is the output strain_RA.txt file:

Species Species_RA Strain_ID Strain_Freq Strain_RA

sEscherichia_coli 100.0 str-1 0.25 25.0 s__Escherichia_coli 100.0 str-2 0.321429 32.1429 sEscherichia_coli 100.0 str-3 0.428571 42.8571

I need to understand the output file. I want to know how could i get Genbank ids of these "str1", "str2", "str3". so that i could know which strain of E.coli is this.

Thank you

sahilrishav2 commented 6 months ago

No, I don't have any other query, I just wanted to know the tax ids or GenBank ids of each strain so that I could know the exact strain details. Thank you

sahilrishav2 commented 6 months ago

Hi Shuai WANG,

Sorry to disturb you again but through uniref ids it is little bit difficult to trace the strain names

# Gene  Locus   Ref Alt str-1   str-2   str-3   
UniRef90_P75933|1__4|SGB10068   318 G   A   0   0   1   
UniRef90_P75933|1__4|SGB10068   332 A   G   0   0   1   
UniRef90_P75933|1__4|SGB10068   507 T   C   0   0   1   
UniRef90_P75933|1__4|SGB10068   558 G   A   0   0   1   
UniRef90_Q0T2M6|4__9|SGB10068   88  A   G   1   1   1   
UniRef90_Q0T2M6|4__9|SGB10068   136 T   C   1   0   0   
UniRef90_Q0T2M6|4__9|SGB10068   259 T   C   1   0   1   
UniRef90_Q0T2M6|4__9|SGB10068   281 T   G   1   1   1   
UniRef90_Q0T2M6|4__9|SGB10068   304 C   T   0   1   0

Is there another simple way to do that as doing blast of sequences of some uniref genes did not give the expected output

wshuai294 commented 6 months ago

To do this, we need to edit the reference marker genes at the SNV locus. For example, for str-1, we should convert the base A to G at the pos 88 in the marker gene UniRef90_Q0T2M6|4__9|SGB10068. We should perform this step for all the SNV. Then we get the sequence of str-1 (fast format) in marker genes. Then we should map the sequence of str-1 to the Genbank database to get ids.

sahilrishav2 commented 6 months ago

Ok, thank you so much but this is the test dataset and here we have only 3 strains, so, we could do this step but when I performed analysis on real datasets, and suppose many strains got identified by the tool so there doing this step would become difficult. I think so..

wshuai294 commented 6 months ago

Of course. So my partner has built a pipeline to achieve this. But the pipeline only supports metaphlan2 now. She needs some time to edit it to support Metaphlan3/4. We will add this function before next week.

sahilrishav2 commented 6 months ago

Ok, thank you so much for your consistent support.

yiqijiang17 commented 5 months ago

@sahilrishav2 hi, please refer to this section

sahilrishav2 commented 5 months ago

Hi @yiqijiang17 , thank you

sahilrishav2 commented 5 months ago

Hi @yiqijiang17 , I am using the PStrain-tracer tool perl src/PT-07-detect.v2.pl -WDR /home/rishav/PStrain/test/out/test/result/seq -S s__Escherichia_coli -V M4 -I /home/rishav/PStrain/test/out/test/result/seq/s__Escherichia_coli_seq.txt -N 20 -DBS /home/rishav/PStrain/mpa_vOct22_CHOCOPhlAnSGB_202403/mpa_vOct22_CHOCOPhlAnSGB_202403.species_markers.txt.gz -DBM /home/rishav/PStrain/mpa_vOct22_CHOCOPhlAnSGB_202403/mpa_vOct22_CHOCOPhlAnSGB_202403.fna but it is giving some unexpected error

Script directory: /home/rishav/PStrain/PStrain-tracer/src
Working directory: /home/rishav/PStrain/test/out/test/result/seq/find_strain/s__Escherichia_coli
Use of uninitialized value $tmp_dir in concatenation (.) or string at src/PT-07-detect.v2.pl line 178.
mkdir: missing operand
Try 'mkdir --help' for more information.
Use of uninitialized value $tmp_dir in concatenation (.) or string at src/PT-07-detect.v2.pl line 45.
sh: 1: cannot create /s__Escherichia_coli.list: Permission denied
Use of uninitialized value $tmp_dir in concatenation (.) or string at src/PT-07-detect.v2.pl line 46.
readline() on closed filehandle IN at src/PT-07-detect.v2.pl line 49.
Use of uninitialized value $tmp_dir in concatenation (.) or string at src/PT-07-detect.v2.pl line 54.
Use of uninitialized value $tmp_dir in concatenation (.) or string at src/PT-07-detect.v2.pl line 80, <$fh> line 7339972.
Use of uninitialized value $tmp_dir in concatenation (.) or string at src/PT-07-detect.v2.pl line 81, <$fh> line 7339972.
mkdir: cannot create directory ‘/dl/’: Permission denied
Use of uninitialized value $tmp_dir in concatenation (.) or string at src/PT-07-detect.v2.pl line 83, <$fh> line 7339972.
Use of uninitialized value $tmp_dir in concatenation (.) or string at src/PT-07-detect.v2.pl line 84, <$fh> line 7339972.
mkdir: cannot create directory ‘/snp/’: Permission denied
Use of uninitialized value $tmp_dir in concatenation (.) or string at src/PT-07-detect.v2.pl line 96, <$fh> line 7339972.
cat: /s__Escherichia_coli.list: No such file or directory
Usage: grep [OPTION]... PATTERNS [FILE]...
Try 'grep --help' for more information.
Total genomes of s__Escherichia_coli: 0
Use of uninitialized value $tmp_dir in concatenation (.) or string at src/PT-07-detect.v2.pl line 153, <DB> line 311481.
print() on closed filehandle SH1 at src/PT-07-detect.v2.pl line 153, <DB> line 311481.
Use of uninitialized value $tmp_dir in concatenation (.) or string at src/PT-07-detect.v2.pl line 154, <DB> line 311481.
Use of uninitialized value $tmp_dir in concatenation (.) or string at src/PT-07-detect.v2.pl line 154, <DB> line 311481.
Use of uninitialized value $tmp_dir in concatenation (.) or string at src/PT-07-detect.v2.pl line 154, <DB> line 311481.
Use of uninitialized value $tmp_dir in concatenation (.) or string at src/PT-07-detect.v2.pl line 154, <DB> line 311481.
Use of uninitialized value $tmp_dir in concatenation (.) or string at src/PT-07-detect.v2.pl line 154, <DB> line 311481.
Use of uninitialized value $tmp_dir in concatenation (.) or string at src/PT-07-detect.v2.pl line 154, <DB> line 311481.
Use of uninitialized value $tmp_dir in concatenation (.) or string at src/PT-07-detect.v2.pl line 154, <DB> line 311481.
Use of uninitialized value $tmp_dir in concatenation (.) or string at src/PT-07-detect.v2.pl line 154, <DB> line 311481.
print() on closed filehandle SH2 at src/PT-07-detect.v2.pl line 154, <DB> line 311481.
Use of uninitialized value $tmp_dir in concatenation (.) or string at src/PT-07-detect.v2.pl line 158, <DB> line 311481.
Use of uninitialized value $tmp_dir in concatenation (.) or string at src/PT-07-detect.v2.pl line 159, <DB> line 311481.
Use of uninitialized value $tmp_dir in concatenation (.) or string at src/PT-07-detect.v2.pl line 159, <DB> line 311481.
print() on closed filehandle SH3 at src/PT-07-detect.v2.pl line 159, <DB> line 311481.
Use of uninitialized value $tmp_dir in concatenation (.) or string at src/PT-07-detect.v2.pl line 160, <DB> line 311481.
Use of uninitialized value $tmp_dir in concatenation (.) or string at src/PT-07-detect.v2.pl line 160, <DB> line 311481.
print() on closed filehandle SH3 at src/PT-07-detect.v2.pl line 160, <DB> line 311481.

If you could guide then it would be of great help. Thank You

sahilrishav2 commented 5 months ago

Hello,

After creating this directory find_strain/s__Escherichia_coli, it completed like this:

perl src/PT-07-detect.v2.pl -WDR /home/rishav/PStrain/test/out/test/result/seq -S s__Escherichia_coli -V M4 -I /home/rishav/PStrain/test/out/test/result/seq/s__Escherichia_coli_seq.txt -N 20 -DBS /home/rishav/PStrain/mpa_vOct22_CHOCOPhlAnSGB_202403/mpa_vOct22_CHOCOPhlAnSGB_202403.species_markers.txt.gz -DBM /home/rishav/PStrain/mpa_vOct22_CHOCOPhlAnSGB_202403/mpa_vOct22_CHOCOPhlAnSGB_202403.fna

Script directory: /home/rishav/PStrain/PStrain-tracer/src
Working directory: /home/rishav/PStrain/test/out/test/result/seq/find_strain/s__Escherichia_coli
Warning: working directory /home/rishav/PStrain/test/out/test/result/seq/find_strain/s__Escherichia_coli exists.
Total genomes of s__Escherichia_coli: 26859

and produces output like this:

1.dl.sh 2.snp.sh 3.tree.sh dl s__Escherichia_coli.list s__Escherichia_coli.marker.fa snp

but it does not produce tree.nwk and *sorted_distance.txt file

yiqijiang17 commented 5 months ago

@sahilrishav2 you should then run shell step by step, I've written this in the README, and highlighted now.

sahilrishav2 commented 5 months ago

Ok, thank you. I check.

sahilrishav2 commented 5 months ago

Thank you @yiqijiang17. Now, i get it. These 1.dl.sh, 2.snp.sh and 3.tree.sh are the shell scripts that i had to run after the above command. Yaa, now the files get generated. Thank You for your time and consideration.