Closed sahilrishav2 closed 5 months ago
Hello, what is the content of out/test//metaphlan//metaphlan_output.txt? Is it empty?
Or could you delete the folder out/ and try again?
I already deleted the out directory and tried again, but the same error appears.
This is the content of metaphlan_output.txt; only one species was found:
cat test//metaphlan//metaphlan_output.txt
mpa_vJun23_CHOCOPhlAnSGB_202307
/home/rishav/.local/bin/metaphlan test_1.fq.gz,test_2.fq.gz --input_type fastq --bowtie2db /home/rishav/.local/lib/python3.10/site-packages/metaphlan/metaphlan_databases -x mpa_vJun23_CHOCOPhlAnSGB_202307 --tax_lev s --bowtie2_exe bowtie2 --nproc 10 --bowtie2out out/test//metaphlan//bowtie.out.bz2
515224 reads processed
SampleID Metaphlan_Analysis
clade_name NCBI_tax_id relative_abundance additional_species
s__Escherichia_coli 562 100.0
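To answer the "is it empty?" question quickly, the profile can be read with a few lines of Python. This is a minimal sketch, assuming the whitespace-separated layout shown above (clade name, NCBI tax id, relative abundance); `parse_profile` is a hypothetical helper and not part of PStrain or MetaPhlAn.

```python
def parse_profile(text):
    """Return {clade_name: relative_abundance} from MetaPhlAn-style profile text.

    Assumption: data rows look like "s__Escherichia_coli 562 100.0", as in the
    output pasted above. Header and command lines are skipped.
    """
    abundances = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a rank-prefixed clade name such as "s__...".
        if len(fields) < 3 or "__" not in fields[0]:
            continue
        try:
            abundances[fields[0]] = float(fields[2])
        except ValueError:
            continue  # third column is not a number, so not a data row
    return abundances


example = """mpa_vJun23_CHOCOPhlAnSGB_202307
SampleID Metaphlan_Analysis
clade_name NCBI_tax_id relative_abundance additional_species
s__Escherichia_coli 562 100.0
"""

profile = parse_profile(example)
print(profile)  # {'s__Escherichia_coli': 100.0}
print("empty profile" if not profile else f"{len(profile)} species found")
```

An empty dictionary would mean that no species rows were written to the file.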
The MetaPhlAn result seems correct. Could you just run the original test script, which will download the reference that I used?
Or could you copy the reference and put it in the same folder as script/? Also, I will download the reference you used and test it.
OK, I will download the reference that you used. Your reference is the Oct22 version, so I was trying to use the latest reference version, i.e., Jun 2023.
Hello, the reference I used is mpa_vOct22_CHOCOPhlAnSGB_202403, which should be the latest, as it is from Mar 2024.
OK, sorry then, it may have been my mistake.
But looking at the update dates, there is not much difference: mpa_vOct22_CHOCOPhlAnSGB_202403 was updated on 2024-04-05, while mpa_vJun23_CHOCOPhlAnSGB_202403 was updated on 2024-03-11.
OK then, I am downloading mpa_vJun23_CHOCOPhlAnSGB_202403.
Thank you for your support. I am also downloading mpa_vOct22_CHOCOPhlAnSGB_202403 to see if the issue could be resolved.
Hello, I downloaded it, and after building an index, it seems like this:
mpa_vJun23_CHOCOPhlAnSGB_202403.pkl mpa_vJun23_CHOCOPhlAnSGB_202403_SGB.fna
mpa_vJun23_CHOCOPhlAnSGB_202403.tar mpa_vJun23_CHOCOPhlAnSGB_202403_SGB.rev.1.bt2l
mpa_vJun23_CHOCOPhlAnSGB_202403_SGB.1.bt2l mpa_vJun23_CHOCOPhlAnSGB_202403_SGB.rev.2.bt2l
mpa_vJun23_CHOCOPhlAnSGB_202403_SGB.2.bt2l mpa_vJun23_CHOCOPhlAnSGB_202403_VINFO.csv
mpa_vJun23_CHOCOPhlAnSGB_202403_SGB.3.bt2l mpa_vJun23_CHOCOPhlAnSGB_202403_VSG.fna.bz2
mpa_vJun23_CHOCOPhlAnSGB_202403_SGB.4.bt2l
The mpa_vJun23_CHOCOPhlAnSGB_202403.pkl file has a different index name from mpa_vJun23_CHOCOPhlAnSGB_202403_SGB.fna.
This leads to an error when running MetaPhlAn 4. Therefore, I renamed mpa_vJun23_CHOCOPhlAnSGB_202403.pkl to mpa_vJun23_CHOCOPhlAnSGB_202403_SGB.pkl. After that, I ran PStrain with a command like:
python3 ../scripts/PStrain.py -c config.txt -o out --bowtie2db ../not_use/mpa_vJun23_CHOCOPhlAnSGB_202403 -x mpa_vJun23_CHOCOPhlAnSGB_202403_SGB
PStrain runs successfully with this command.
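The naming mismatch described above can be checked before starting a run. The following is a rough sketch, assuming (based only on this thread) that the name passed to -x must match both the .pkl file and the Bowtie2 index files in the database folder; `check_index_prefix` is a hypothetical helper, and the file list is copied from the listing above.

```python
def check_index_prefix(filenames, index_name):
    """Return (has_pkl, has_bowtie2_index) for a given -x index name.

    Assumption: the database folder should contain <index_name>.pkl and
    Bowtie2 index files named <index_name>.1.bt2 or <index_name>.1.bt2l.
    """
    names = set(filenames)
    has_pkl = f"{index_name}.pkl" in names
    has_bt2 = any(f"{index_name}.1{ext}" in names for ext in (".bt2", ".bt2l"))
    return has_pkl, has_bt2


# Folder content before the rename, as listed in this thread.
listing = [
    "mpa_vJun23_CHOCOPhlAnSGB_202403.pkl",
    "mpa_vJun23_CHOCOPhlAnSGB_202403.tar",
    "mpa_vJun23_CHOCOPhlAnSGB_202403_SGB.fna",
    "mpa_vJun23_CHOCOPhlAnSGB_202403_SGB.1.bt2l",
    "mpa_vJun23_CHOCOPhlAnSGB_202403_SGB.2.bt2l",
    "mpa_vJun23_CHOCOPhlAnSGB_202403_SGB.3.bt2l",
    "mpa_vJun23_CHOCOPhlAnSGB_202403_SGB.4.bt2l",
    "mpa_vJun23_CHOCOPhlAnSGB_202403_SGB.rev.1.bt2l",
    "mpa_vJun23_CHOCOPhlAnSGB_202403_SGB.rev.2.bt2l",
]

print(check_index_prefix(listing, "mpa_vJun23_CHOCOPhlAnSGB_202403"))      # (True, False)
print(check_index_prefix(listing, "mpa_vJun23_CHOCOPhlAnSGB_202403_SGB"))  # (False, True)
```

Neither name has both files before the rename. After renaming the .pkl file, the _SGB name would give (True, True). For a real folder, `os.listdir(db_dir)` can supply the file names.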
Thank you, I was also able to run it successfully with this command:
python3 ../scripts/PStrain.py -c config.txt -o out --bowtie2db ../mpa_vOct22_CHOCOPhlAnSGB_202403/ -x mpa_vOct22_CHOCOPhlAnSGB_202403 --proc 20 --nproc 20
This is the output strain_RA.txt file:
Species Species_RA Strain_ID Strain_Freq Strain_RA
s__Escherichia_coli 100.0 str-1 0.25 25.0
s__Escherichia_coli 100.0 str-2 0.321429 32.1429
s__Escherichia_coli 100.0 str-3 0.428571 42.8571
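As a reading aid for this table: the values are consistent with Strain_RA being Species_RA multiplied by Strain_Freq, with the strain frequencies of one species summing to 1. This relationship is inferred from the numbers above, not taken from the PStrain documentation, so the sketch below only checks it against the three rows shown; `strain_ra` is a hypothetical helper.

```python
def strain_ra(species_ra, strain_freq):
    """Relative abundance of a strain, assuming Strain_RA = Species_RA * Strain_Freq."""
    return species_ra * strain_freq


# Rows copied from the strain_RA.txt output above:
# (Species, Species_RA, Strain_ID, Strain_Freq, Strain_RA)
rows = [
    ("s__Escherichia_coli", 100.0, "str-1", 0.25, 25.0),
    ("s__Escherichia_coli", 100.0, "str-2", 0.321429, 32.1429),
    ("s__Escherichia_coli", 100.0, "str-3", 0.428571, 42.8571),
]

for species, species_ra, strain_id, freq, reported in rows:
    computed = strain_ra(species_ra, freq)
    print(strain_id, round(computed, 4), "matches" if abs(computed - reported) < 1e-3 else "differs")

# Within one species, the strain frequencies should add up to about 1.
print(round(sum(row[3] for row in rows), 6))
```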
I need to understand this output file. How can I get the GenBank IDs of "str-1", "str-2", and "str-3", so that I can tell which E. coli strains these are?
Thank you
No, I don't have any other query. I just wanted to know the tax IDs or GenBank IDs of each strain so that I could know the exact strain details. Thank you
Hi Shuai WANG,
Sorry to disturb you again, but it is a little difficult to trace the strain names through the UniRef IDs:
# Gene Locus Ref Alt str-1 str-2 str-3
UniRef90_P75933|1__4|SGB10068 318 G A 0 0 1
UniRef90_P75933|1__4|SGB10068 332 A G 0 0 1
UniRef90_P75933|1__4|SGB10068 507 T C 0 0 1
UniRef90_P75933|1__4|SGB10068 558 G A 0 0 1
UniRef90_Q0T2M6|4__9|SGB10068 88 A G 1 1 1
UniRef90_Q0T2M6|4__9|SGB10068 136 T C 1 0 0
UniRef90_Q0T2M6|4__9|SGB10068 259 T C 1 0 1
UniRef90_Q0T2M6|4__9|SGB10068 281 T G 1 1 1
UniRef90_Q0T2M6|4__9|SGB10068 304 C T 0 1 0
Is there a simpler way to do this? Running BLAST on the sequences of some of the UniRef genes did not give the expected output.
To do this, we need to edit the reference marker genes at the SNV loci. For example, for str-1, we should convert the base A to G at position 88 in the marker gene UniRef90_Q0T2M6|4__9|SGB10068. We should perform this step for every SNV. Then we get the sequence of str-1 (FASTA format) in the marker genes. Then we should map the sequence of str-1 to the GenBank database to get the IDs.
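The editing step described here could look roughly like the sketch below. It is not the PStrain implementation: `apply_snvs` and `to_fasta` are hypothetical helpers, the gene name and sequence are toy data, and it assumes that loci are 1-based and that a 1 in a strain column means the strain carries the Alt base.

```python
def apply_snvs(markers, snvs, strain_index):
    """Return marker sequences edited to carry one strain's alleles.

    markers: {gene_name: reference_sequence}
    snvs: list of (gene_name, locus, ref, alt, genotypes), where genotypes
          holds one 0/1 value per strain (1 = Alt base, an assumption).
    strain_index: 0 for str-1, 1 for str-2, and so on.
    """
    edited = {gene: list(seq) for gene, seq in markers.items()}
    for gene, locus, ref, alt, genotypes in snvs:
        if genotypes[strain_index] != 1:
            continue  # strain keeps the reference base at this locus
        i = locus - 1  # assumed 1-based locus, converted to 0-based index
        if edited[gene][i] != ref:
            raise ValueError(f"{gene}:{locus} expected {ref}, found {edited[gene][i]}")
        edited[gene][i] = alt
    return {gene: "".join(bases) for gene, bases in edited.items()}


def to_fasta(sequences, strain_id):
    """Format edited marker genes as FASTA text, one record per gene."""
    return "".join(f">{strain_id}|{gene}\n{seq}\n" for gene, seq in sequences.items())


# Toy data only; real input would be the MetaPhlAn marker genes and the SNV table.
markers = {"geneA": "ACGTACGT"}
snvs = [
    ("geneA", 2, "C", "T", (1, 0, 0)),
    ("geneA", 5, "A", "G", (0, 1, 1)),
]

str1 = apply_snvs(markers, snvs, strain_index=0)
print(to_fasta(str1, "str-1"), end="")
```

The reference-base check is there to catch an off-by-one in the locus convention early. The resulting FASTA text would then be the query for a search against GenBank.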
OK, thank you so much. But this is the test dataset with only 3 strains, so this step is manageable here. When I analyze real datasets and the tool identifies many strains, doing this step would become difficult, I think.
Of course. My partner has built a pipeline to achieve this, but it only supports MetaPhlAn2 for now. She needs some time to edit it to support MetaPhlAn3/4. We will add this function before next week.
Ok, thank you so much for your consistent support.
@sahilrishav2 hi, please refer to this section
Hi @yiqijiang17 , thank you
Hi @yiqijiang17, I am using the PStrain-tracer tool:
perl src/PT-07-detect.v2.pl -WDR /home/rishav/PStrain/test/out/test/result/seq -S s__Escherichia_coli -V M4 -I /home/rishav/PStrain/test/out/test/result/seq/s__Escherichia_coli_seq.txt -N 20 -DBS /home/rishav/PStrain/mpa_vOct22_CHOCOPhlAnSGB_202403/mpa_vOct22_CHOCOPhlAnSGB_202403.species_markers.txt.gz -DBM /home/rishav/PStrain/mpa_vOct22_CHOCOPhlAnSGB_202403/mpa_vOct22_CHOCOPhlAnSGB_202403.fna
but it gives an unexpected error:
Script directory: /home/rishav/PStrain/PStrain-tracer/src
Working directory: /home/rishav/PStrain/test/out/test/result/seq/find_strain/s__Escherichia_coli
Use of uninitialized value $tmp_dir in concatenation (.) or string at src/PT-07-detect.v2.pl line 178.
mkdir: missing operand
Try 'mkdir --help' for more information.
Use of uninitialized value $tmp_dir in concatenation (.) or string at src/PT-07-detect.v2.pl line 45.
sh: 1: cannot create /s__Escherichia_coli.list: Permission denied
Use of uninitialized value $tmp_dir in concatenation (.) or string at src/PT-07-detect.v2.pl line 46.
readline() on closed filehandle IN at src/PT-07-detect.v2.pl line 49.
Use of uninitialized value $tmp_dir in concatenation (.) or string at src/PT-07-detect.v2.pl line 54.
Use of uninitialized value $tmp_dir in concatenation (.) or string at src/PT-07-detect.v2.pl line 80, <$fh> line 7339972.
Use of uninitialized value $tmp_dir in concatenation (.) or string at src/PT-07-detect.v2.pl line 81, <$fh> line 7339972.
mkdir: cannot create directory ‘/dl/’: Permission denied
Use of uninitialized value $tmp_dir in concatenation (.) or string at src/PT-07-detect.v2.pl line 83, <$fh> line 7339972.
Use of uninitialized value $tmp_dir in concatenation (.) or string at src/PT-07-detect.v2.pl line 84, <$fh> line 7339972.
mkdir: cannot create directory ‘/snp/’: Permission denied
Use of uninitialized value $tmp_dir in concatenation (.) or string at src/PT-07-detect.v2.pl line 96, <$fh> line 7339972.
cat: /s__Escherichia_coli.list: No such file or directory
Usage: grep [OPTION]... PATTERNS [FILE]...
Try 'grep --help' for more information.
Total genomes of s__Escherichia_coli: 0
Use of uninitialized value $tmp_dir in concatenation (.) or string at src/PT-07-detect.v2.pl line 153, <DB> line 311481.
print() on closed filehandle SH1 at src/PT-07-detect.v2.pl line 153, <DB> line 311481.
Use of uninitialized value $tmp_dir in concatenation (.) or string at src/PT-07-detect.v2.pl line 154, <DB> line 311481.
Use of uninitialized value $tmp_dir in concatenation (.) or string at src/PT-07-detect.v2.pl line 154, <DB> line 311481.
Use of uninitialized value $tmp_dir in concatenation (.) or string at src/PT-07-detect.v2.pl line 154, <DB> line 311481.
Use of uninitialized value $tmp_dir in concatenation (.) or string at src/PT-07-detect.v2.pl line 154, <DB> line 311481.
Use of uninitialized value $tmp_dir in concatenation (.) or string at src/PT-07-detect.v2.pl line 154, <DB> line 311481.
Use of uninitialized value $tmp_dir in concatenation (.) or string at src/PT-07-detect.v2.pl line 154, <DB> line 311481.
Use of uninitialized value $tmp_dir in concatenation (.) or string at src/PT-07-detect.v2.pl line 154, <DB> line 311481.
Use of uninitialized value $tmp_dir in concatenation (.) or string at src/PT-07-detect.v2.pl line 154, <DB> line 311481.
print() on closed filehandle SH2 at src/PT-07-detect.v2.pl line 154, <DB> line 311481.
Use of uninitialized value $tmp_dir in concatenation (.) or string at src/PT-07-detect.v2.pl line 158, <DB> line 311481.
Use of uninitialized value $tmp_dir in concatenation (.) or string at src/PT-07-detect.v2.pl line 159, <DB> line 311481.
Use of uninitialized value $tmp_dir in concatenation (.) or string at src/PT-07-detect.v2.pl line 159, <DB> line 311481.
print() on closed filehandle SH3 at src/PT-07-detect.v2.pl line 159, <DB> line 311481.
Use of uninitialized value $tmp_dir in concatenation (.) or string at src/PT-07-detect.v2.pl line 160, <DB> line 311481.
Use of uninitialized value $tmp_dir in concatenation (.) or string at src/PT-07-detect.v2.pl line 160, <DB> line 311481.
print() on closed filehandle SH3 at src/PT-07-detect.v2.pl line 160, <DB> line 311481.
If you could guide me, it would be a great help. Thank you.
Hello,
After creating the directory find_strain/s__Escherichia_coli, it completed like this:
perl src/PT-07-detect.v2.pl -WDR /home/rishav/PStrain/test/out/test/result/seq -S s__Escherichia_coli -V M4 -I /home/rishav/PStrain/test/out/test/result/seq/s__Escherichia_coli_seq.txt -N 20 -DBS /home/rishav/PStrain/mpa_vOct22_CHOCOPhlAnSGB_202403/mpa_vOct22_CHOCOPhlAnSGB_202403.species_markers.txt.gz -DBM /home/rishav/PStrain/mpa_vOct22_CHOCOPhlAnSGB_202403/mpa_vOct22_CHOCOPhlAnSGB_202403.fna
Script directory: /home/rishav/PStrain/PStrain-tracer/src
Working directory: /home/rishav/PStrain/test/out/test/result/seq/find_strain/s__Escherichia_coli
Warning: working directory /home/rishav/PStrain/test/out/test/result/seq/find_strain/s__Escherichia_coli exists.
Total genomes of s__Escherichia_coli: 26859
and produces output like this:
1.dl.sh 2.snp.sh 3.tree.sh dl s__Escherichia_coli.list s__Escherichia_coli.marker.fa snp
but it does not produce the tree.nwk and *sorted_distance.txt files.
@sahilrishav2 you should then run the shell scripts step by step. I've written this in the README and have highlighted it now.
OK, thank you. I will check.
Thank you @yiqijiang17. Now I get it. 1.dl.sh, 2.snp.sh, and 3.tree.sh are the shell scripts that I had to run after the above command. Yes, now the files are generated. Thank you for your time and consideration.
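For anyone who wants to script the step-by-step run, the three generated scripts can be launched in order from Python. This is only a sketch: `run_steps` is a hypothetical wrapper, and running the scripts with bash from inside the working directory is an assumption, so the PStrain-tracer README remains the reference for the exact invocation.

```python
import subprocess
from pathlib import Path

# Script names as they appear in the PStrain-tracer working directory above.
STEP_SCRIPTS = ("1.dl.sh", "2.snp.sh", "3.tree.sh")


def run_steps(workdir, dry_run=False):
    """Run the generated scripts in order and return the commands used.

    Assumptions: each script is run with bash, with the working directory
    set to `workdir`. With dry_run=True nothing is executed.
    """
    workdir = Path(workdir)
    commands = [["bash", str(workdir / name)] for name in STEP_SCRIPTS]
    if not dry_run:
        for command in commands:
            # check=True stops at the first failing step, since each later
            # step presumably depends on the output of the previous one.
            subprocess.run(command, check=True, cwd=workdir)
    return commands


# Dry run: only print the commands that would be executed.
for command in run_steps("find_strain/s__Escherichia_coli", dry_run=True):
    print(" ".join(command))
```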
Hi Shuai WANG,
I am using this command
python3 ../scripts/PStrain.py -c config.txt -o out --bowtie2db /home/rishav/.local/lib/python3.10/site-packages/metaphlan/metaphlan_databases -x mpa_vJun23_CHOCOPhlAnSGB_202307 --proc 20 --nproc 20
to run the test file, but the error still persists. This is the database location:
/home/rishav/.local/lib/python3.10/site-packages/metaphlan/metaphlan_databases
this is the content of
mpa_vJun23_CHOCOPhlAnSGB_202307.fna
Thank you for your persistence