Closed widdowquinn closed 6 months ago
Since this will affect the pyani classify
stuff, do you have a method in mind for ensuring this, or is that another thing ot think about?
One way to check is to (i) read the algorithm to understand whether it is guaranteed to be symmetrical, (ii) read the MUMmer source to check that the implementation is guaranteed to be symmetrical, and (iii) perform a series of comparisons to confirm that there are no practicl counterexamples in those comparisons.
I think MUMmer/nucmer is symmetrical in principle. But if it is not symmetrical in practice there are multiple alternative ways to force symmetry, including:
N.B. option 2 there is the way I've been thinking about for ensuring symmetry for a set of n>2 genomes:
Discussions in #340, #342, and #373 are relevant here.
One way to check is to (i) read the algorithm to understand whether it is guaranteed to be symmetrical, (ii) read the MUMmer source to check that the implementation is guaranteed to be symmetrical, and (iii) perform a series of comparisons to confirm that there are no practicl counterexamples in those comparisons.
My understanding of the issue is that regardless of the order in which the genomes are processed, we should obtain the same values for aligned bases. For example, comparing genome 1 versus genome 2 should return the same number of total aligned bases as comparing genome 2 versus genome 1.
After fixing issue 340 with IntervalTree, where overlapped regions are no longer counted twice, I decided to run an analysis to see whether nucmer is symmetrical in practice, just as @widdowquinn suggested.
Reproducibility steps:
The results of the analysis are attached in mummer_symetry_comparisions.csv. Each row in the CSV file provides the following information:
test_number
column.Genome_1
) and genome 2 (Genome_2
).cmd_param
.G<num>_vs_G<num>|Genome_<num>
, where the order of the genomes used in the analysis is given before |
, and the remaining part specifies the genome, e.g., Genome_1
for genome 1 and Genome_2
for genome 2.G1_diff
(for genome 1) and G2_diff
(for genome 2).I think MUMmer/nucmer is symmetrical in principle.
I think nucmer
is not symmetrical, at least not in practice. This is based on the following observations:
nucmer
is not symmetrical in practice.--maxmatch
parameter is used.0
for viral genomes.Next step:
I looked further into this issue to see what really happens when we swap the order of the genomes.
First, I ran dnadiff
on all test sets as before (e.g., genome 1 vs genome 2 and genome 2 vs genome 1), and I compared the number of reported AlignedBases. I updated the mummer_symmetry_comparisons.csv file. All values and differences in reported values can be found in rows where the column cmd_param
is dnadiff
.
At first glance, it looks like they have the same problem. It might seem like the order in which genomes are processed does not matter for viral comparisons, but a different value for AlignedBases
is reported when much bigger bacterial genomes are compared.
I reproduced this issue on much smaller datasets and examined the delta-filter files directly to observe differences in the alignment of genome segments. I expected to find instances where certain sequences were extended either to the left or to the right, leading to discrepancies in the reported number of AlignedBases. For instance, Genome 1 might be aligned from 1bp to 450bp when compared to Genome 2, but the alignment of Genome 2 to Genome 1 might span from 1bp to 464bp. However, this wasn't the case. It appears that the problem arises when segments of Genome 1 are aligned to different regions of Genome 2. I attempted to visualize this issue, and you can view it here: nucmer_symmetry_investigation.pdf . The raw data is provided here small_test.zip.
My understanding of the issue is that regardless of the order in which the genomes are processed, we should obtain the same values for aligned bases.
My initial thoughts - on a read of the literature - were not just that we would obtain the same number of aligned bases, but that we would obtain the same alignments between genomes A and B, regardless of which was the query and which was the reference. That seems like it may not be the case.
In mummer_symmetry_comparisons.csv
you establish that the reported AlignedBases measure from dnadiff.pl
is not symmetrical. However, as we discovered from reading the source code for show-diff
, the value reported from dnadiff.pl
is not a straight count of the total number of aligned bases, but a count of the number of bases aligned in a path that is chosen through multiple possible pairwise alignments and ordered into an alignment of the genomes.
This would not necessarily be expected to be symmetrical, so the AlignedBases reported value would not necessarily be expected to be symmetrical (A vs B giving the same value as B vs A).
The main implication of this, I think, is that we would need to modify pyani
so that we run A vs B and B vs A for all pairwise comparisons, if reporting/using this figure.
I think
nucmer
is not symmetrical, at least not in practice.
That may be true, but I would be interested to know whether the IntervalTree count of aligned bases is symmetrical, where AlignedBases (reported by dnadiff.pl
) is not. Did you happen to try that?
That may be true, but I would be interested to know whether the IntervalTree count of aligned bases is symmetrical, where AlignedBases (reported by
dnadiff.pl
) is not. Did you happen to try that?
My apologies for not being clear earlier. All counts of aligned bases provided in mummer_symmetry_comparisons.csv
were obtained with our newest code, which relies on IntervalTree
to correct for overlapped regions. So, I guess that's bad news... I think you're right that we will need to run A vs B and B vs A for all pairwise comparisons. Hopefully, this will not slow pyANI
as much.
The only thing we need to think about is what to do with the alignments that are not the same when the order in which genomes are processed is changed. Should we include them in the count of aligned bases, or not?
Earlier this week, @widdowquinn and I discussed this issue and the necessary steps to address it. As part of our documentation and to keep track of current progress, it would be beneficial to write a few notes here.
When running two comparisons (genome A vs genome B and genome B vs genome A), we are faced with choices regarding how to calculate and report the total number of aligned bases. The reported aligned bases can be:
Generally speaking, alignment results may vary depending on factors like the software and parameters used, the order of genomes etc. The returned alignment is an estimation rather than a ground truth. Therefore, there's no definitive answer on how to report the total number of aligned bases. It's important to make a consistent choice in reporting methods across all comparisons to ensure reproducibility and reliability of results.
We agreed that calculating a summary between A:B and B:A might not be the best choice, as differences in reported alignments when changing the order of genomes may come from sequences of one genome aligning to different parts of the other genome, rather than an extended alignment. This leads us to the choice of reporting aligned bases calculated for either the query or reference genome. We agreed that for pyANI
, we will report the number of aligned bases for the query.
To achieve this, we need to make the following modifications to the current code:
nucmer
and delta-filter
. To ensure backward compatibility, it was agreed that no command-line parameters would be changed (e.g., ensuring 1-to-1
alignments that allow for rearrangements, using the --mum
parameter as default, but allowing --maxmatch
as an option).IntervalTree
to correct for overlapping regions.NOTE: There are related issues to this one (see: #421), where work has already been in progress under the branch issue_421
. So, the progress can be found/tested there.
The work is still in progress, and it is possible that I have made some things worse, but here's a quick update on what we have so far.
- Obtain two comparisons (A:B and B:A) for
nucmer
anddelta-filter
. To ensure backward compatibility, it was agreed that no command-line parameters would be changed (e.g., ensuring1-to-1
alignments that allow for rearrangements, using the--mum
parameter as default, but allowing--maxmatch
as an option).- Read generated files and add them to the database.
- Calculate the values for the query using
IntervalTree
to correct for overlapping regions.
The code was modified, and pyani available on branch issue_421
, will now result in two comparisons (forward and reverse). The output of nucmer for both comparisons is now provided in nucmer_output
, which will have the following structure (example for comparisons of 2 genomes):
nucmer_output
├── Genome_A
│ ├── Genome_A_vs_Genome_B.delta
│ └── Genome_A_vs_Genome_B.filter
└── Genome_B
├── Genome_B_vs_Genome_A.delta
└── Genome_B_vs_Genome_A.filter
.filter
files from both comparisons are read and added to the database. Additionally, I have also implemented IntervalTree
to exclude overlaps when calculating genome coverage.
Here is a partially working example on two viral genomes where current pyani (0.3v) returns genome coverage of more than 1.0 using the default pyani anim
command:
[INFO] [pyani.scripts.pyani_script]: Processed arguments: Namespace(citation=False, classes=PosixPath('symmetry_data/input/test_1/classes.txt'), dbpath=PosixPath('.pyani/pyanidb'), debug=True, disable_tqdm=False, filter_exe=PosixPath('delta-filter'), func=<function subcmd_anim at 0x7fec08257d30>, indir=PosixPath('symmetry_data/input/test_1'), jobprefix='PYANI', labels=PosixPath('symmetry_data/input/test_1/labels.txt'), logfile=PosixPath('test_1.log'), maxmatch=False, name='test_1', nofilter=False, nucmer_exe=PosixPath('nucmer'), outdir=PosixPath('symmetry_data/output/test_1'), recovery=False, scheduler='multiprocessing', sgeargs=None, sgegroupsize=10000, verbose=False, version=False, workers=None)
[INFO] [pyani.scripts.pyani_script]: command-line: /opt/anaconda3/envs/pyani_issue_421/bin/pyani anim -i symmetry_data/input/test_1 -o symmetry_data/output/test_1 -l test_1.log --name test_1 --labels symmetry_data/input/test_1/labels.txt --classes symmetry_data/input/test_1/classes.txt --debug
[INFO] [pyani.scripts.pyani_script]: pyani version: 0.3.0-alpha
[INFO] [pyani.scripts.pyani_script]: CITATION INFO
[INFO] [pyani.scripts.pyani_script]: If you use pyani in your work, please cite the following publication:
[INFO] [pyani.scripts.pyani_script]: Pritchard, L., Glover, R. H., Humphris, S., Elphinstone, J. G.,
[INFO] [pyani.scripts.pyani_script]: & Toth, I.K. (2016) 'Genomics and taxonomy in diagnostics for
[INFO] [pyani.scripts.pyani_script]: food security: soft-rotting enterobacterial plant pathogens.'
[INFO] [pyani.scripts.pyani_script]: Analytical Methods, 8(1), 12–24. http://doi.org/10.1039/C5AY02550H
[INFO] [pyani.scripts.pyani_script]: DEPENDENCIES
[INFO] [pyani.scripts.pyani_script]: The authors of pyani gratefully acknowledge its dependence on
[INFO] [pyani.scripts.pyani_script]: the following bioinformatics software:
[INFO] [pyani.scripts.pyani_script]: MUMmer3: S. Kurtz, A. Phillippy, A.L. Delcher, M. Smoot, M. Shumway,
[INFO] [pyani.scripts.pyani_script]: C. Antonescu, and S.L. Salzberg (2004), 'Versatile and open software
[INFO] [pyani.scripts.pyani_script]: for comparing large genomes' Genome Biology 5:R12
[INFO] [pyani.scripts.pyani_script]: BLAST+: Camacho C., Coulouris G., Avagyan V., Ma N., Papadopoulos J.,
[INFO] [pyani.scripts.pyani_script]: Bealer K., & Madden T.L. (2008) 'BLAST+: architecture and applications.'
[INFO] [pyani.scripts.pyani_script]: BMC Bioinformatics 10:421.
[INFO] [pyani.scripts.pyani_script]: BLAST: Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J.,
[INFO] [pyani.scripts.pyani_script]: Zhang, Z., Miller, W. & Lipman, D.J. (1997) 'Gapped BLAST and PSI-BLAST:
[INFO] [pyani.scripts.pyani_script]: a new generation of protein database search programs.' Nucleic Acids Res.
[INFO] [pyani.scripts.pyani_script]: 25:3389-3402
[INFO] [pyani.scripts.pyani_script]: Biopython: Cock PA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A,
[INFO] [pyani.scripts.pyani_script]: Friedberg I, Hamelryck T, Kauff F, Wilczynski B and de Hoon MJL
[INFO] [pyani.scripts.pyani_script]: (2009) Biopython: freely available Python tools for computational
[INFO] [pyani.scripts.pyani_script]: molecular biology and bioinformatics. Bioinformatics, 25, 1422-1423
[INFO] [pyani.scripts.pyani_script]: fastANI: Jain C, Rodriguez-R LM, Phillippy AM, Konstantinidis K, and
[INFO] [pyani.scripts.pyani_script]: Aluru S (2018) 'High throughput ANI analysis of 90K prokaryotic
[INFO] [pyani.scripts.pyani_script]: genomes reveals clear species boundaries.' Nature Communications 9, 5114
[INFO] [pyani.scripts.pyani_script]: Checking for database file: .pyani/pyanidb
[INFO] [pyani.scripts.subcommands.subcmd_anim]: Running ANIm analysis
[INFO] [pyani.scripts.subcommands.subcmd_anim]: MUMMer nucmer version: Darwin_3.1 (/opt/anaconda3/envs/pyani_issue_421/bin/nucmer)
[INFO] [pyani.scripts.subcommands.subcmd_anim]: Analysis name: test_1
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: Connecting to database .pyani/pyanidb
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: Adding run info to database .pyani/pyanidb...
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: ...added run ID: 1 to the database
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: Adding genomes for run 1 to database...
[INFO] [pyani.pyani_files]: Checking for hashfile: symmetry_data/input/test_1/MGV-GENOME-0264574.fna.md5.
[INFO] [pyani.pyani_files]: Checking for hashfile: symmetry_data/input/test_1/MGV-GENOME-0266457.fna.md5.
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: ...added genome IDs: [None, None]
[INFO] [pyani.scripts.subcommands.subcmd_anim]: Generating ANIm command-lines
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: NUCmer output will be written temporarily to symmetry_data/output/test_1/nucmer_output
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: Creating output directory symmetry_data/output/test_1/nucmer_output
[INFO] [pyani.scripts.subcommands.subcmd_anim]: Compiling genomes for comparison
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: Collected 2 genomes for this run
[INFO] [pyani.scripts.subcommands.subcmd_anim]: Compiling pairwise comparisons (this can take time for large datasets)...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 22192.08it/s]
[INFO] [pyani.scripts.subcommands.subcmd_anim]: ...total pairwise comparisons to be performed: 1
[INFO] [pyani.scripts.subcommands.subcmd_anim]: Checking database for existing comparison data...
[DEBUG] [pyani.pyani_orm]: Existing comparisons
{}
[DEBUG] [pyani.pyani_orm]: Checking for existing comparisons, with unique constraints
program: nucmer
version: Darwin_3.1 (/opt/anaconda3/envs/pyani_issue_421/bin/nucmer)
fragsize: None
maxmatch: False
kmersize: None
minmatch: None
[DEBUG] [pyani.pyani_orm]: Checking for existing comparison: Genome 1: MGV_MGV-GENOME-0264574 (1) vs Genome 2: MGV_MGV-GENOME-0266457 (2)
[INFO] [pyani.scripts.subcommands.subcmd_anim]: ...after check, still need to run 1 comparisons
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: Assuming no pre-existing output files
[INFO] [pyani.scripts.subcommands.subcmd_anim]: Creating NUCmer jobs for ANIm
0%| | 0/1 [00:00<?, ?it/s][DEBUG] [pyani.scripts.subcommands.subcmd_anim]: Commands to run:
Genome 1: MGV_MGV-GENOME-0264574, Genome 2: MGV_MGV-GENOME-0266457
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: Commands to run:
nucmer --mum -p symmetry_data/output/test_1/nucmer_output/MGV-GENOME-0266457/MGV-GENOME-0266457_vs_MGV-GENOME-0264574 /Users/angelikakiepas/Desktop/pyani/issue_421/symmetry_data/input/test_1/MGV-GENOME-0266457.fna /Users/angelikakiepas/Desktop/pyani/issue_421/symmetry_data/input/test_1/MGV-GENOME-0264574.fna
delta_filter_wrapper.py delta-filter -1 symmetry_data/output/test_1/nucmer_output/MGV-GENOME-0266457/MGV-GENOME-0266457_vs_MGV-GENOME-0264574.delta symmetry_data/output/test_1/nucmer_output/MGV-GENOME-0266457/MGV-GENOME-0266457_vs_MGV-GENOME-0264574.filter
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: Expected output file for db: symmetry_data/output/test_1/nucmer_output/MGV-GENOME-0266457/MGV-GENOME-0266457_vs_MGV-GENOME-0264574.filter
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: Building job
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: Commands to run:
nucmer --mum -p symmetry_data/output/test_1/nucmer_output/MGV-GENOME-0264574/MGV-GENOME-0264574_vs_MGV-GENOME-0266457 /Users/angelikakiepas/Desktop/pyani/issue_421/symmetry_data/input/test_1/MGV-GENOME-0264574.fna /Users/angelikakiepas/Desktop/pyani/issue_421/symmetry_data/input/test_1/MGV-GENOME-0266457.fna
delta_filter_wrapper.py delta-filter -1 symmetry_data/output/test_1/nucmer_output/MGV-GENOME-0264574/MGV-GENOME-0264574_vs_MGV-GENOME-0266457.delta symmetry_data/output/test_1/nucmer_output/MGV-GENOME-0264574/MGV-GENOME-0264574_vs_MGV-GENOME-0266457.filter
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: Expected output file for db: symmetry_data/output/test_1/nucmer_output/MGV-GENOME-0264574/MGV-GENOME-0264574_vs_MGV-GENOME-0266457.filter
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: Building job
[INFO] [pyani.scripts.subcommands.subcmd_anim]: Results not found for 2 comparisons; 2 new jobs built.
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 986.20it/s]
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: Generated 2 jobs, 1 comparisons
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: Passing 2 jobs to multiprocessing...
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: Scheduler: multiprocessing
[INFO] [pyani.scripts.subcommands.subcmd_anim]: Running jobs with multiprocessing
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: (using maximum number of worker threads)
[INFO] [pyani.scripts.subcommands.subcmd_anim]: Multiprocessing run completed without error
[INFO] [pyani.scripts.subcommands.subcmd_anim]: ...jobs complete
[INFO] [pyani.scripts.subcommands.subcmd_anim]: Adding comparison results to database...
0%| | 0/2 [00:00<?, ?it/s][DEBUG] [pyani.scripts.subcommands.subcmd_anim]: MGV_MGV-GENOME-0266457 vs MGV_MGV-GENOME-0264574
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: query cov 0.997860036175579, query length: 39253, query aln length: 39169, query descripton: MGV_MGV-GENOME-0264574
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: subject cov 0.9894428448754862, subject length: 39594, subject aln length: 39176, subject descripton: MGV_MGV-GENOME-0266457
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: MGV_MGV-GENOME-0264574 vs MGV_MGV-GENOME-0266457
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: query cov 0.9894428448754862, query length: 39594, query aln length: 39176, query descripton: MGV_MGV-GENOME-0266457
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: subject cov 0.997860036175579, subject length: 39253, subject aln length: 39169, subject descripton: MGV_MGV-GENOME-0264574
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 378.51it/s]
[DEBUG] [pyani.scripts.subcommands.subcmd_anim]: Committing results to database
[INFO] [pyani.scripts.subcommands.subcmd_anim]: ...database updated.
[INFO] [pyani.scripts.pyani_script]: Completed. Time taken: 1.632
When trying to generate reports with the following command:
pyani report --runs -o <path_to_the_output> --formats=stdout --run_results 1
The reports are also generated without any errors, and the values are improved and as expected when overlaps are not counted twice. When calling the report for specific runs, we might want to add separate columns for the subject alignment length
and query alignment length
, as these are not the same. Currently, the alignment length
column holds the value for query alignment length.
TABLE: symmetry_data/output/test_1/runs
run ID name method date run command-line
0 1 test_1 ANIm 2024-02-22 17:46:55.213302 /opt/anaconda3/envs/pyani_issue_421/bin/pyani anim -i symmetry_data/input/test_1 -o symmetry_data/output/test_1 -l test_1.log --name test_1 --labels symmetry_data/input/test_1/labels.txt --classes symmetry_data/input/test_1/classes.txt
TABLE: symmetry_data/output/test_1/results_1
Comparison ID Query ID Query description Subject ID Subject description % identity % query coverage % subject coverage alignment length similarity errors program version fragment size maxmatch Run ID
0 1 1 MGV_MGV-GENOME-0264574 2 MGV_MGV-GENOME-0266457 0.994332 0.997860 0.989443 39169 222 nucmer Darwin_3.1 (/opt/anaconda3/envs/pyani_issue_421/bin/nucmer) None False 1
1 2 2 MGV_MGV-GENOME-0266457 1 MGV_MGV-GENOME-0264574 0.994333 0.989443 0.997860 39176 222 nucmer Darwin_3.1 (/opt/anaconda3/envs/pyani_issue_421/bin/nucmer) None False 1
The only issue, we have is that we cannot generate matrices (eg. coverage_matrix, identity_matrix) as it results in an error being raised:
(pyani_issue_421) angelikakiepas@Angelikas-MacBook-Pro issue_421 % pyani report -o symmetry_data/output/test_1 --formats=stdout --run_matrices 1
Traceback (most recent call last):
File "/opt/anaconda3/envs/pyani_issue_421/bin/pyani", line 33, in <module>
sys.exit(load_entry_point('pyani', 'console_scripts', 'pyani')())
File "/Users/angelikakiepas/Desktop/pyani/pyani/scripts/pyani_script.py", line 143, in run_main
returnval = args.func(args)
File "/Users/angelikakiepas/Desktop/pyani/pyani/scripts/subcommands/subcmd_report.py", line 265, in subcmd_report
matlabel_dict = get_matrix_labels_for_run(session, run_id)
File "/Users/angelikakiepas/Desktop/pyani/pyani/pyani_orm.py", line 386, in get_matrix_labels_for_run
session.query(Genome.genome_id, Label.label)
File "<string>", line 2, in join
File "/opt/anaconda3/envs/pyani_issue_421/lib/python3.8/site-packages/sqlalchemy/sql/base.py", line 280, in _generative
x = fn(self, *args, **kw)
File "<string>", line 2, in join
File "/opt/anaconda3/envs/pyani_issue_421/lib/python3.8/site-packages/sqlalchemy/orm/base.py", line 303, in generate
fn(self, *args, **kw)
File "/opt/anaconda3/envs/pyani_issue_421/lib/python3.8/site-packages/sqlalchemy/orm/query.py", line 2428, in join
onclause_element = coercions.expect(
File "/opt/anaconda3/envs/pyani_issue_421/lib/python3.8/site-packages/sqlalchemy/sql/coercions.py", line 396, in expect
resolved = impl._literal_coercion(
File "/opt/anaconda3/envs/pyani_issue_421/lib/python3.8/site-packages/sqlalchemy/sql/coercions.py", line 899, in _literal_coercion
self._raise_for_expected(element)
File "/opt/anaconda3/envs/pyani_issue_421/lib/python3.8/site-packages/sqlalchemy/sql/coercions.py", line 518, in _raise_for_expected
raise exc.ArgumentError(msg, code=code) from err
sqlalchemy.exc.ArgumentError: ON clause, typically a SQL expression or ORM relationship attribute expected, got <class 'pyani.pyani_orm.Run'>.
I understand that, I will need to update the matrices to reflect the difference between the two comparisons, but I can't get to figure out what is causing the above error.
I thought I'd fixed that error on master
- do you have a minimal reproducible example as a script and set of input files I can try? You can upload the inputs/script as a .zip
file here, if you like.
There you go: run_fail.zip.
It made me wonder if maybe I haven't rebase to master
properly.
Just a quick update on the matter. The issue turned out to be caused by incorrect rebasing on my end. The code is working as expected under the branch issue_421
.
Before we can implement the changes in the master
branch, we first need to:
anim.py
and subcmd_anim.py
pytest
(all tests should pass)After working on issue #422 and adding additional columns for the --run_results
output table, I have also worked on things relevant to this issue.
I have been working on pyANI
under branch issue_421
, and the code will now result in:
IntervalTree
Python package)For example, running pyANI
analysis on 2 troublesome viral genomes using the current code of pyANI
available under the master
branch (all code, input, and output available here test_run_pyani_current.zip) will result in:
MGV_MGV-GENOME-0264574:1 MGV_MGV-GENOME-0266457:2
MGV_MGV-GENOME-0264574:1 1.0 1.5078337961
MGV_MGV-GENOME-0266457:2 1.4948477042000001 1.0
matrix_aln_lengths.tab
are the same for A vs B, and B vs A): MGV_MGV-GENOME-0264574:1 MGV_MGV-GENOME-0266457:2
MGV_MGV-GENOME-0264574:1 39253 59187
MGV_MGV-GENOME-0266457:2 59187 39594
MGV_MGV-GENOME-0264574:1 MGV_MGV-GENOME-0266457:2
MGV_MGV-GENOME-0264574:1 1.0 0.9962491763
MGV_MGV-GENOME-0266457:2 0.9962491763 1.0
However, running the comparisons under branch issue_421
corrected for these issues (all code, input, and output data are provided here: test_run.zip). For example,
MGV_MGV-GENOME-0264574:1 MGV_MGV-GENOME-0266457:2
MGV_MGV-GENOME-0264574:1 1.0 0.9978600362000001
MGV_MGV-GENOME-0266457:2 0.9894428449 1.0
MGV_MGV-GENOME-0264574:1 MGV_MGV-GENOME-0266457:2
MGV_MGV-GENOME-0264574:1 39253 39169
MGV_MGV-GENOME-0266457:2 39176 39594
MGV_MGV-GENOME-0264574:1 MGV_MGV-GENOME-0266457:2
MGV_MGV-GENOME-0264574:1 1.0 0.9943322525
MGV_MGV-GENOME-0266457:2 0.9943332653 1.0
For reference, here is a table output for the parameter --run_results
:
Comparison ID Query ID Query description Subject ID Subject description % query identity % subject identity % query coverage % subject coverage query alignment length subject alignment length similarity errors program version fragment size maxmatch Run ID
0 1 2 MGV_MGV-GENOME-0266457 1 MGV_MGV-GENOME-0264574 0.9943332652644477 0.9943322525466568 0.9894428448754862 0.997860036175579 39176 39169 222 nucmer Darwin_3.1 (/opt/anaconda3/envs/pyani_issue_421/bin/nucmer) False 1
1 2 1 MGV_MGV-GENOME-0264574 2 MGV_MGV-GENOME-0266457 0.9943322525466568 0.9943332652644477 0.997860036175579 0.9894428448754862 39169 39176 222 nucmer Darwin_3.1 (/opt/anaconda3/envs/pyani_issue_421/bin/nucmer) False 1
The only issue I am currently facing is that no graphical output is provided when running the analysis under branch issue_421
. I'm currently trying to investigate what causes this. Interestingly, there is no error raised:
[INFO] [pyani.scripts.pyani_script]: Processed arguments: Namespace(version=False, citation=False, logfile=None, verbose=True, debug=False, disable_tqdm=False, outdir=PosixPath('output'), run_ids=['1'], dbpath=PosixPath('.pyani/pyanidb'), formats=['pdf'], method='seaborn', workers=None, func=<function subcmd_plot at 0x1137e8d30>)
[INFO] [pyani.scripts.pyani_script]: command-line: /opt/anaconda3/envs/pyani_v3/bin/pyani plot -o output --run_id 1 -v --formats pdf
[INFO] [pyani.scripts.pyani_script]: pyani version: 0.3.0-alpha
[INFO] [pyani.scripts.pyani_script]: CITATION INFO
[INFO] [pyani.scripts.pyani_script]: If you use pyani in your work, please cite the following publication:
[INFO] [pyani.scripts.pyani_script]: Pritchard, L., Glover, R. H., Humphris, S., Elphinstone, J. G.,
[INFO] [pyani.scripts.pyani_script]: & Toth, I.K. (2016) 'Genomics and taxonomy in diagnostics for
[INFO] [pyani.scripts.pyani_script]: food security: soft-rotting enterobacterial plant pathogens.'
[INFO] [pyani.scripts.pyani_script]: Analytical Methods, 8(1), 12–24. http://doi.org/10.1039/C5AY02550H
[INFO] [pyani.scripts.pyani_script]: DEPENDENCIES
[INFO] [pyani.scripts.pyani_script]: The authors of pyani gratefully acknowledge its dependence on
[INFO] [pyani.scripts.pyani_script]: the following bioinformatics software:
[INFO] [pyani.scripts.pyani_script]: MUMmer3: S. Kurtz, A. Phillippy, A.L. Delcher, M. Smoot, M. Shumway,
[INFO] [pyani.scripts.pyani_script]: C. Antonescu, and S.L. Salzberg (2004), 'Versatile and open software
[INFO] [pyani.scripts.pyani_script]: for comparing large genomes' Genome Biology 5:R12
[INFO] [pyani.scripts.pyani_script]: BLAST+: Camacho C., Coulouris G., Avagyan V., Ma N., Papadopoulos J.,
[INFO] [pyani.scripts.pyani_script]: Bealer K., & Madden T.L. (2008) 'BLAST+: architecture and applications.'
[INFO] [pyani.scripts.pyani_script]: BMC Bioinformatics 10:421.
[INFO] [pyani.scripts.pyani_script]: BLAST: Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J.,
[INFO] [pyani.scripts.pyani_script]: Zhang, Z., Miller, W. & Lipman, D.J. (1997) 'Gapped BLAST and PSI-BLAST:
[INFO] [pyani.scripts.pyani_script]: a new generation of protein database search programs.' Nucleic Acids Res.
[INFO] [pyani.scripts.pyani_script]: 25:3389-3402
[INFO] [pyani.scripts.pyani_script]: Biopython: Cock PA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A,
[INFO] [pyani.scripts.pyani_script]: Friedberg I, Hamelryck T, Kauff F, Wilczynski B and de Hoon MJL
[INFO] [pyani.scripts.pyani_script]: (2009) Biopython: freely available Python tools for computational
[INFO] [pyani.scripts.pyani_script]: molecular biology and bioinformatics. Bioinformatics, 25, 1422-1423
[INFO] [pyani.scripts.subcommands.subcmd_plot]: Generating graphical output for analyses
[INFO] [pyani.scripts.subcommands.subcmd_plot]: Writing output to: output
[INFO] [pyani.scripts.subcommands.subcmd_plot]: Rendering method: seaborn
[INFO] [pyani.scripts.pyani_script]: Completed. Time taken: 6.745
I'm marking this for closure, because my understanding is that this issue was fixed with 1f10a6c.
We've established that ANIm - as originally implemented - is not a true metric. But we still have to identify a method that does constitute an actual metric (though it may be constrained to the specific genome input set…).
Summary:
It's not precisely established that ANIm identities are true metrics - we need to demonstrate and ensure this, otherwise the clique-based classification is not as sound as we'd hope.