widdowquinn / pyani

Application and Python module for average nucleotide identity analyses of microbes.
http://widdowquinn.github.io/pyani/
MIT License

Hi! An error occurs if there are more than 6 input FASTA files. #32

Closed jinhuiwang closed 7 years ago

jinhuiwang commented 7 years ago

average_nucleotide_identity.py -i ANI -o ANI_OUT -m ANIblastall -g

Traceback (most recent call last):
  File "/usr/local/bin/average_nucleotide_identity.py", line 772, in <module>
    results = methods[args.method][0](infiles, org_lengths)
  File "/usr/local/bin/average_nucleotide_identity.py", line 550, in unified_anib
    fraglengths=fraglengths, mode=args.method)
  File "/usr/local/lib/python3.4/dist-packages/pyani/anib.py", line 372, in process_blast
    mode)
  File "/usr/local/lib/python3.4/dist-packages/pyani/anib.py", line 417, in parse_blast_tab
    data = pd.DataFrame.from_csv(filename, header=None, sep='\t')
  File "/usr/local/lib/python3.4/dist-packages/pandas/core/frame.py", line 1189, in from_csv
    infer_datetime_format=infer_datetime_format)
  File "/usr/local/lib/python3.4/dist-packages/pandas/io/parsers.py", line 562, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.4/dist-packages/pandas/io/parsers.py", line 315, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/usr/local/lib/python3.4/dist-packages/pandas/io/parsers.py", line 645, in __init__
    self._make_engine(self.engine)
  File "/usr/local/lib/python3.4/dist-packages/pandas/io/parsers.py", line 799, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/usr/local/lib/python3.4/dist-packages/pandas/io/parsers.py", line 1213, in __init__
    self._reader = _parser.TextReader(src, **kwds)
  File "pandas/parser.pyx", line 523, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:5214)
pandas.io.common.EmptyDataError: No columns to parse from file
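The EmptyDataError at the bottom of this traceback is what pandas raises when it is asked to parse a file with no content. As a minimal sketch of how a reader could tolerate that case, the function below checks the file size first and returns an empty result instead of failing. This is not pyani's implementation: the name read_blast_tab is hypothetical, and the standard-library csv module is swapped in for pandas to keep the example self-contained.

```python
import csv
import os


def read_blast_tab(filename):
    """Read tab-separated BLAST output into a list of rows.

    Hypothetical helper, not part of pyani. A zero-byte file (a
    comparison with no BLAST hits) yields an empty list instead of
    raising an exception.
    """
    if os.path.getsize(filename) == 0:
        # Nothing to parse: BLAST found no similarity for this pair.
        return []
    with open(filename, newline="") as handle:
        # Skip blank lines so that only real result rows are returned.
        return [row for row in csv.reader(handle, delimiter="\t") if row]
```

The caller can then treat an empty list as "no hits for this pair" and decide how to record that comparison.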

widdowquinn commented 7 years ago

Hi Jinhui,

I don't think this problem is to do with the number of input files. It looks like at least one of your inputs is so distinct from another that there is no similarity being identified by BLAST. The error being thrown is that no data is being read from a BLAST output file.

It would help me diagnose this properly if you could please provide a small (minimal) dataset that reproduces your issue, and/or please inspect the intermediate BLAST output, to see if one of them lacks content.

It would also help if you could run pyani with the -v option and provide the output log file.

In the meantime, I'll look at making that error catching and reporting a bit more informative.

Thanks,

L.
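Inspecting the intermediate BLAST output for files that lack content can be scripted. The sketch below lists zero-byte files under an output directory. The function name find_empty_blast_outputs is hypothetical, and the default *.blast_tab pattern is an assumption based on the file names shown elsewhere in this thread; adjust it if your output files use a different extension.

```python
from pathlib import Path


def find_empty_blast_outputs(outdir, pattern="*.blast_tab"):
    """Return the sorted paths of zero-byte files matching pattern.

    Hypothetical helper, not part of pyani. The directory is searched
    recursively, so the top-level output directory can be passed in.
    """
    return sorted(
        path
        for path in Path(outdir).rglob(pattern)
        if path.is_file() and path.stat().st_size == 0
    )
```

For example, calling find_empty_blast_outputs("ANI_OUT") on the output directory from the original command would return the comparisons for which BLAST reported no hits.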

jinhuiwang commented 7 years ago

Yes, you are right. I excluded two sequences that have very low similarity with others, and it worked!

jinhuiwang commented 7 years ago

Hi! I have attached the dataset. The two sequences Lcr_BT1_P1 and Lcr_BT1_P2 are distinct from the other sequences. Is it possible to include these two sequences in the blastall output file? ANI.zip

widdowquinn commented 7 years ago

Hi Jinhui,

I have tried running your data with the current version on GitHub, and I don't get an error (see attached log file and output).

Can you please confirm whether you are using pyani 0.2.0.post1 or the development version on GitHub? I think the development version fixes your issue and records the empty BLAST output results correctly for Lcr_BT1_LC1.fasta and Lcr_BT1_LC2.fasta.

I note that the log file doesn't currently report the pyani version, so I'll fix that.

I would also very much recommend using ANIm, rather than ANIblastall.

If you can clone the current master branch and confirm that it works correctly on your data, I'll close the issue.

L.

jinhuiwang commented 7 years ago

Thanks for the suggestion! I am using the current version on GitHub, obtained with git clone from your repository. ANIm works fine for this dataset, but ANIblastall does not. I have attached the BLAST output and the log file; no blastall results are found in any Lcr_BT1_P1/P2vs*.tab file. I think you are right about choosing ANIm for this dataset. blastall_output.zip ANIblastall.log.zip

jinhwang@jinhwang-HP:~/bio_app/pyani$ average_nucleotide_identity.py -i prophages -o prophages_ANIblastall -m ANIblastall -f -g -v --label prophages/labels.tab

(LP: deleted log text for space, as it is present in the ANIblastall.log.zip file.)

widdowquinn commented 7 years ago

I'm a little puzzled. The command line I ran on your data was:

./average_nucleotide_identity.py -v -i tests/issue_10 -o tests/issue_10_output --method ANIblastall -g --gformat png -l issue_10.log

where tests/issue_10 contained the files you attached. This appears to be essentially identical to your command line:

/usr/local/bin/average_nucleotide_identity.py -i prophages -o prophages_ANIblastall -m ANIblastall -f -g -v --label prophages/labels.tab

and also generates a number of empty BLAST output files:

$ ls -ltrS *.blast_tab | head -n 30
-rw-r--r--  1 lpritc  staff     0B 21 Sep 16:58 Lcr_BT1_LC2_vs_CLso_ZC1_P2.blast_tab
-rw-r--r--  1 lpritc  staff     0B 21 Sep 16:58 Lcr_BT1_LC2_vs_CLso_ZC1_P1.blast_tab
-rw-r--r--  1 lpritc  staff     0B 21 Sep 16:58 Lcr_BT1_LC2_vs_CLso_NZ1_P1.blast_tab
-rw-r--r--  1 lpritc  staff     0B 21 Sep 16:58 Lcr_BT1_LC2_vs_CLso_FIN114_phaA.blast_tab
-rw-r--r--  1 lpritc  staff     0B 21 Sep 16:58 Lcr_BT1_LC2_vs_CLas_psy62_FP2.blast_tab
-rw-r--r--  1 lpritc  staff     0B 21 Sep 16:58 Lcr_BT1_LC2_vs_CLas_UF506_SC2.blast_tab
-rw-r--r--  1 lpritc  staff     0B 21 Sep 16:58 Lcr_BT1_LC2_vs_CLas_UF506_SC1.blast_tab
-rw-r--r--  1 lpritc  staff     0B 21 Sep 16:58 Lcr_BT1_LC2_vs_CLam_SaoPaulo_SP2.blast_tab
-rw-r--r--  1 lpritc  staff     0B 21 Sep 16:58 Lcr_BT1_LC2_vs_CLaf_PTSAPSY_P1.blast_tab
-rw-r--r--  1 lpritc  staff     0B 21 Sep 16:58 Lcr_BT1_LC1_vs_CLas_psy62_FP2.blast_tab
-rw-r--r--  1 lpritc  staff     0B 21 Sep 16:58 Lcr_BT1_LC1_vs_CLas_UF506_SC2.blast_tab
-rw-r--r--  1 lpritc  staff     0B 21 Sep 16:58 Lcr_BT1_LC1_vs_CLas_UF506_SC1.blast_tab
-rw-r--r--  1 lpritc  staff     0B 21 Sep 16:58 Lcr_BT1_LC1_vs_CLam_SaoPaulo_SP2.blast_tab
-rw-r--r--  1 lpritc  staff     0B 21 Sep 16:58 CLso_ZC1_P2_vs_Lcr_BT1_LC2.blast_tab
-rw-r--r--  1 lpritc  staff     0B 21 Sep 16:58 CLso_ZC1_P1_vs_Lcr_BT1_LC2.blast_tab
-rw-r--r--  1 lpritc  staff     0B 21 Sep 16:58 CLso_NZ1_P1_vs_Lcr_BT1_LC2.blast_tab
-rw-r--r--  1 lpritc  staff     0B 21 Sep 16:58 CLso_FIN114_phaA_vs_Lcr_BT1_LC2.blast_tab
-rw-r--r--  1 lpritc  staff     0B 21 Sep 16:58 CLas_psy62_FP2_vs_Lcr_BT1_LC2.blast_tab
-rw-r--r--  1 lpritc  staff     0B 21 Sep 16:58 CLas_psy62_FP2_vs_Lcr_BT1_LC1.blast_tab
-rw-r--r--  1 lpritc  staff     0B 21 Sep 16:58 CLas_UF506_SC2_vs_Lcr_BT1_LC2.blast_tab
-rw-r--r--  1 lpritc  staff     0B 21 Sep 16:58 CLas_UF506_SC2_vs_Lcr_BT1_LC1.blast_tab
-rw-r--r--  1 lpritc  staff     0B 21 Sep 16:58 CLas_UF506_SC1_vs_Lcr_BT1_LC2.blast_tab
-rw-r--r--  1 lpritc  staff     0B 21 Sep 16:58 CLas_UF506_SC1_vs_Lcr_BT1_LC1.blast_tab
-rw-r--r--  1 lpritc  staff     0B 21 Sep 16:58 CLam_SaoPaulo_SP2_vs_Lcr_BT1_LC2.blast_tab
-rw-r--r--  1 lpritc  staff     0B 21 Sep 16:58 CLam_SaoPaulo_SP2_vs_Lcr_BT1_LC1.blast_tab
-rw-r--r--  1 lpritc  staff     0B 21 Sep 16:58 CLaf_PTSAPSY_P1_vs_Lcr_BT1_LC2.blast_tab
-rw-r--r--  1 lpritc  staff    64B 21 Sep 16:58 Lcr_BT1_LC1_vs_CLso_NZ1_P1.blast_tab
-rw-r--r--  1 lpritc  staff    64B 21 Sep 16:58 CLso_ZC1_P1_vs_Lcr_BT1_LC1.blast_tab
-rw-r--r--  1 lpritc  staff    65B 21 Sep 16:58 CLso_ZC1_P2_vs_Lcr_BT1_LC1.blast_tab
-rw-r--r--  1 lpritc  staff    68B 21 Sep 16:58 CLso_NZ1_P1_vs_Lcr_BT1_LC1.blast_tab

but also gives me result output, so I think the current version should work with your data. I think it may be that the script in /usr/local/bin/average_nucleotide_identity.py (the one which is being used, according to your log file) might not be the most current version.

Please could you try running the script from the repository directory, with ./average_nucleotide_identity.py instead, and seeing if that makes a difference. If so, there might be an installation issue to get past.

L.
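One way to check which copy of the script a bare command name resolves to is to look it up on the search path and compare the result with the copy in the repository. The sketch below uses shutil.which from the standard library. The function name resolve_script and its search_path parameter are hypothetical and only for illustration.

```python
import os
import shutil


def resolve_script(name="average_nucleotide_identity.py", search_path=None):
    """Return the real path that name resolves to, or None if not found.

    Hypothetical helper, not part of pyani. With search_path=None the
    PATH environment variable is searched, as the shell would do.
    """
    found = shutil.which(name, path=search_path)
    # realpath follows symlinks, so an installed link is traced back
    # to the file it actually points at.
    return os.path.realpath(found) if found else None
```

If the returned path is under /usr/local/bin rather than the cloned repository, the installed copy is the one being run.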

jinhuiwang commented 7 years ago

Hi, I ran the script directly from the pyani repository this time. ANIblastall.log.zip

jinhwang@jinhwang-HP:~/bio_app/pyani$ ./average_nucleotide_identity.py -v -i ANI -o ANIblastall_out -m ANIblastall -f -g -l ANIblastall.log
INFO: pyani version: 0.2.0.dev
INFO: Namespace(blastall_exe='blastall', blastn_exe='blastn', classes=None, force=True, formatdb_exe='formatdb', fragsize=1020, gformat='pdf,png,eps', gmethod='mpl', graphics=True, indirname='ANI', jobprefix='ANI', labels=None, logfile='ANIblastall.log', makeblastdb_exe='makeblastdb', maxmatch=False, method='ANIblastall', noclobber=False, nocompress=False, nucmer_exe='nucmer', outdirname='ANIblastall_out', rerender=False, scheduler='multiprocessing', seed=None, sgegroupsize=10000, skip_blastn=False, skip_nucmer=False, subsample=None, verbose=True, workers=None, write_excel=False)
INFO: command-line: ./average_nucleotide_identity.py -v -i ANI -o ANIblastall_out -m ANIblastall -f -g -l ANIblastall.log
INFO: Input directory: ANI
INFO: Creating directory ANIblastall_out
INFO: Output directory: ANIblastall_out
INFO: Using ANI method: ANIblastall
INFO: Using scheduler method: multiprocessing
INFO: Identifying FASTA files in ANI
INFO: Input files:
    ANI/CLas_UF506_SC2.fasta
    ANI/CLso_FIN114_phaA.fasta
    ANI/CLso_ZC1_P2.fasta
    ANI/CLas_UF506_SC1.fasta
    ANI/CLaf_PTSAPSY_P1.fasta
    ANI/CLso_NZ1_P1.fasta
    ANI/CLas_psy62_FP2.fasta
    ANI/CLam_SaoPaulo_SP2.fasta
    ANI/CLso_ZC1_P1.fasta
INFO: Processing input sequence lengths
INFO: Sequence lengths:
    CLso_FIN114_phaA: 38325
    CLaf_PTSAPSY_P1: 40666
    CLam_SaoPaulo_SP2: 39941
    CLas_psy62_FP2: 38552
    CLso_NZ1_P1: 40403
    CLso_ZC1_P1: 40794
    CLso_ZC1_P2: 43309
    CLas_UF506_SC2: 38997
    CLas_UF506_SC1: 40048
INFO: Carrying out ANIblastall analysis
INFO: Running ANIblastall
INFO: Writing BLAST output to ANIblastall_out/blastall_output
INFO: Fragmenting input files, and writing to ANIblastall_out
INFO: Creating job dependency graph
INFO: Running jobs with multiprocessing
INFO: Running job dependency graph
INFO: Command pool now running:
INFO: formatdb -p F -i ANIblastall_out/blastall_output/CLso_FIN114_phaA.fasta -t CLso_FIN114_phaA
INFO: formatdb -p F -i ANIblastall_out/blastall_output/CLam_SaoPaulo_SP2.fasta -t CLam_SaoPaulo_SP2
INFO: formatdb -p F -i ANIblastall_out/blastall_output/CLso_ZC1_P2.fasta -t CLso_ZC1_P2
INFO: formatdb -p F -i ANIblastall_out/blastall_output/CLaf_PTSAPSY_P1.fasta -t CLaf_PTSAPSY_P1
INFO: formatdb -p F -i ANIblastall_out/blastall_output/CLso_ZC1_P1.fasta -t CLso_ZC1_P1
INFO: formatdb -p F -i ANIblastall_out/blastall_output/CLso_NZ1_P1.fasta -t CLso_NZ1_P1
INFO: formatdb -p F -i ANIblastall_out/blastall_output/CLas_UF506_SC1.fasta -t CLas_UF506_SC1
INFO: formatdb -p F -i ANIblastall_out/blastall_output/CLas_psy62_FP2.fasta -t CLas_psy62_FP2
INFO: formatdb -p F -i ANIblastall_out/blastall_output/CLas_UF506_SC2.fasta -t CLas_UF506_SC2
Traceback (most recent call last):
  File "./average_nucleotide_identity.py", line 806, in <module>
    results = methods[args.method][0](infiles, org_lengths)
  File "./average_nucleotide_identity.py", line 532, in unified_anib
    logger=logger)
  File "/home/jinhwang/bio_app/pyani/pyani/run_multiprocessing.py", line 45, in run_dependency_graph
    cumretval += multiprocessing_run(cmdset, workers, verbose)
  File "/home/jinhwang/bio_app/pyani/pyani/run_multiprocessing.py", line 86, in multiprocessing_run
    for cline in cmdlines]
  File "/home/jinhwang/bio_app/pyani/pyani/run_multiprocessing.py", line 86, in <listcomp>
    for cline in cmdlines]
AttributeError: 'module' object has no attribute 'run'

widdowquinn commented 7 years ago

The error you're getting is due to using a Python version earlier than 3.5: subprocess.run() (the function that the script is not finding) was introduced in Python 3.5 (see https://docs.python.org/3/library/subprocess.html).

Traceback (most recent call last):
  File "./average_nucleotide_identity.py", line 806, in <module>
    results = methods[args.method][0](infiles, org_lengths)
  File "./average_nucleotide_identity.py", line 532, in unified_anib
    logger=logger)
  File "/home/jinhwang/bio_app/pyani/pyani/run_multiprocessing.py", line 45, in run_dependency_graph
    cumretval += multiprocessing_run(cmdset, workers, verbose)
  File "/home/jinhwang/bio_app/pyani/pyani/run_multiprocessing.py", line 86, in multiprocessing_run
    for cline in cmdlines]
  File "/home/jinhwang/bio_app/pyani/pyani/run_multiprocessing.py", line 86, in <listcomp>
    for cline in cmdlines]
AttributeError: 'module' object has no attribute 'run'

If you upgrade your local Python to version 3.5, then the error in your last message should go away. It is not clear in the documentation that you now need version 3.5+, which is my fault. Many apologies!
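For readers who hit the same AttributeError, the sketch below shows two ways a script could guard against an older interpreter: failing early with a clear message, or falling back to subprocess.call(), which exists in earlier Python 3 releases. Both function names (require_python and run_command) are hypothetical and are not taken from pyani.

```python
import subprocess
import sys


def require_python(minimum=(3, 5)):
    """Raise RuntimeError if the interpreter is older than minimum."""
    if sys.version_info[:2] < minimum:
        raise RuntimeError(
            "Python %d.%d or later is required, found %d.%d"
            % (minimum + tuple(sys.version_info[:2]))
        )


def run_command(args):
    """Run a command given as a list of arguments; return its exit code.

    Uses subprocess.run() where it exists (Python 3.5+) and falls back
    to subprocess.call() on older interpreters.
    """
    if hasattr(subprocess, "run"):
        return subprocess.run(args).returncode
    return subprocess.call(args)
```

Calling require_python() at the top of a script turns the obscure "no attribute 'run'" failure into an explicit version message.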

jinhuiwang commented 7 years ago

I updated Python from v3.4 to v3.5.2, and now the script works fine with both the ANIblastall and ANIb options! Thank you!

widdowquinn commented 7 years ago

Fantastic! I'll close the issue then, but if the same problem recurs, we can reopen it. Otherwise, please do open another issue if you have any questions or problems.