widdowquinn / pyani

Application and Python module for average nucleotide identity analyses of microbes.
http://widdowquinn.github.io/pyani/
MIT License
185 stars 54 forks source link

Fix alignment coverage >1.0 and aniM symmetrical behaviour #421

Closed kiepczi closed 1 month ago

kiepczi commented 4 months ago

Summary: There are two big issues that need to be prioritised to improve pyANI comparisions. These are:

Description: Alignment coverage >1.0:

Fixing issue #340 first. Let's consider running pyANI aniM on two genomes and finding that there were 2 alignments: one alignment runs from position 1 to 37636, and the other alignment runs from 17626 to 39176 (in the query). The total length of the query genome is 39594, so we cannot expect the number of aligned bases to be more than this. However, currently pyANI reports that the total number of aligned bases in the query is 59187, which is calculated as follows: (37636-1+1)+(39176-17626+1) = 59187. This discrepancy arises because pyANI does not currently take into consideration that certain regions might overlap and are unfortunately counted twice. In this case, region 17626 to 37636.

Progress:

I attempted to correct for this by using IntervalTree Python module. The code can be found here, and in a short summary it works in the following steps:

I have checked the comprehensiveness of this script by running it on 20 test sets. I also compared the values to those provided by dnadiff, specifically the AlignedBases value. As a quick reminder, dnadiff can compare genomes, returning a value for aligned bases (AlignedBases), which, as suggested in previous discussions, can be interpreted as the genome coverage. I have attached the results of this analysis (see comparisions.csv) In the table, each row represents a single pairwise comparison, with the following columns:

There are few interesting observations in this analysis:

  1. In all cases, the count of aligned bases is lower than the length of the genomes. This was especially an issue for viral genomes (see first two rows in the table).
  2. AlignedBases value provided by dnadiff and our values in IT_dnadiff_mdelta columns do not match as we would expect. And we would expect them to match, because AlignedBases is calculated based on many-to-many alignments. I investigated this discepcany, and reproduced this issue on much smaller test set ( test_1_incongruencies.zip).

Reproducible steps:

  1. Calculated AlignedBases for the attached genomes with dnadiff using:
    dnadiff test_1_incongruencies/input/GCF_003369795.1_ASM336979v1_genomic.fna GCF_015244315.1_troublesome_regions.fna
  2. Recaluclated the value with our scipt, using dnadiff mdelta file as the input.

Investigation Let's look at the mdelta file to see what is happening, or what value we would expect:

/Users/angelikakiepas/Desktop/test_1_incongruencies/input/GCF_003369795.1_ASM336979v1_genomic.fna /Users/angelikakiepas/Desktop/test_1_incongruencies/input/GCF_015244315.1_troublesome_regions.fna
NUCMER
>NZ_CP026730.1 lcl|NZ_CP043317.1_cds_WP_194276455.1_3164 7787608 426
4263072 4263462 34 424 45 45 0
0
>NZ_CP026730.1 lcl|NZ_CP043317.1_cds_WP_194276459.1_3169 7787608 336
4258377 4258701 325 1 22 22 0
108
-4
0
>NZ_CP026730.1 lcl|NZ_CP043317.1_cds_WP_194277564.1_4768 7787608 447
4263071 4263319 15 266 41 41 0
-195
-1
-1
0

The delta file reveals three alignments. One alignment spans from 4263072 to 4263462, the second from 4258377 to 4258701, and the third from 4263071 to 4263319. From this we can see that there is one region that overlaps, and the aligment looks like this:

            4263072 ============ 4263426
4258377 ==== 4258701
            4263071 ======= 4263319

As expected, after the IntervalTree merges the overlaps, it returns the following regions:

IntervalTree([Interval(4258377, 4258701), Interval(4263071, 4263462)])

This gives a count of aligned bases of 717, calculated as (4258701 - 4258377 + 1) = 325 and (4263462 - 4263071 + 1) = 392, giving us the sum of 717. From this we can conclude that the unagliged regions are: 1-4258376, 4258702-4263070, and 4263463-7787608.

However, the dnadiff AlignedBases value is 716.

TotalBases                   7787608                 1209
AlignedBases              716(0.01%)          968(80.07%)
UnalignedBases       7786892(99.99%)          241(19.93%)

To investigate the difference, @widdowquinn looked into the dnadiff source code and how it calculates the AlignedBases value. The process involves opening coordinates files containing information about aligned regions, incrementing the AlignedBases value by the length of each aligned sequence (considering all indels and no overlaps), and then adjusting for gaps by subtracting the length of unaligned regions given in rdiff (column 4) file from the AlignedBases value.

Our suspicion led us to examine the rdiff file:

NZ_CP026730.1    BRK    1    4258376    4258376
NZ_CP026730.1    SEQ    4258702    4263071    4370    lcl|NZ_CP043317.1_cds_WP_194276459.1_3169    lcl|NZ_CP043317.1_cds_WP_194276455.1_3164
NZ_CP026730.1    BRK    4263463    7787608    3524146

This shows there is a gap discrepancy, where one gap is longer by 1 bp. For example, dnadiff considers the gap to run from 4258702 to 4263071, whereas our code considers the gap to run from 4258702 to 4263070. We know that the rdiff file is generated by the show-diff program (source code available here). Unfortunately, I have no experience in C++, so understanding how exactly the gaps are identified is beyond my capability ;). @widdowquinn and I had a conversation about this last week, and I understand that show-diff uses graphs to identify unaligned regions, and this is where the differences might arise.

Next steps:

  1. Check if the current code can fix issue 340.
  2. Investigate how our AlignedBases value will change based on nucmer parameters.
  3. Determine which nucmer parameters will be most appropriate for the implementation in pyANI.
kiepczi commented 3 months ago

Should probably mention that I'm currently handling this under branch issue_421 for version 0.3+.

kiepczi commented 3 months ago

Issue #151 is also relevant here and provides some answers for #373, with progress also being made on the branch issue_421.

kiepczi commented 2 months ago

Update on the issue and its progress:

We have decided to change the way we calculate the number of aligned bases and genome coverage by correcting for overlapping regions with IntervalTree. We have also revised the method for calculating percentage identity and will now report this as weighted ANIm %ID.

We can't replicate dnadiff values of AlignedBases and AvgIdentity from delta files, as they use a graph model in show-coords, resulting in a different values being reported. For more information see #424 and #422.

Following an investigation into the behavior of nucmer, we have discovered that it is not symmetrical in practice. As a result, we have agreed to run two comparisons (reverse and forward). The values in matrices are now reported for query comparisons rather than the average between the two comparisons. Further details regarding this decision are discussed in issue #151.

kiepczi commented 2 months ago

I ran pytest on all tests, and everything looks good.

(issue_421) angelikakiepas@Angelikas-MacBook-Pro pyani % pytest -v tests
================================================================== test session starts ===================================================================
platform darwin -- Python 3.8.19, pytest-8.1.1, pluggy-1.4.0 -- /opt/anaconda3/envs/issue_421/bin/python3.8
cachedir: .pytest_cache
rootdir: /Users/angelikakiepas/Desktop/pyani
configfile: pytest.ini
plugins: cov-4.1.0, ordering-0.6
collected 120 items                                                                                                                                      

tests/test_anib.py::test_get_version_nonetype PASSED                                                                                               [  0%]
tests/test_anib.py::test_get_version_no_exe PASSED                                                                                                 [  1%]
tests/test_anib.py::test_get_version_exe_not_executable PASSED                                                                                     [  2%]
tests/test_anib.py::test_get_version_exe_no_version PASSED                                                                                         [  3%]
tests/test_anib.py::test_blastall_dbjobdict PASSED                                                                                                 [  4%]
tests/test_anib.py::test_blastall_graph PASSED                                                                                                     [  5%]
tests/test_anib.py::test_blastall_multiple PASSED                                                                                                  [  5%]
tests/test_anib.py::test_blastall_single PASSED                                                                                                    [  6%]
tests/test_anib.py::test_blastn_dbjobdict PASSED                                                                                                   [  7%]
tests/test_anib.py::test_blastn_graph PASSED                                                                                                       [  8%]
tests/test_anib.py::test_blastn_multiple PASSED                                                                                                    [  9%]
tests/test_anib.py::test_blastn_single PASSED                                                                                                      [ 10%]
tests/test_anib.py::test_formatdb_multiple PASSED                                                                                                  [ 10%]
tests/test_anib.py::test_formatdb_single PASSED                                                                                                    [ 11%]
tests/test_anib.py::test_fragment_files PASSED                                                                                                     [ 12%]
tests/test_anib.py::test_makeblastdb_multiple PASSED                                                                                               [ 13%]
tests/test_anib.py::test_makeblastdb_single PASSED                                                                                                 [ 14%]
tests/test_anib.py::test_parse_legacy_blastdir PASSED                                                                                              [ 15%]
tests/test_anib.py::test_parse_blastdir PASSED                                                                                                     [ 15%]
tests/test_anib.py::test_parse_blasttab PASSED                                                                                                     [ 16%]
tests/test_anib.py::test_parse_legacy_blasttab PASSED                                                                                              [ 17%]
tests/test_aniblastall.py::test_get_version_nonetype PASSED                                                                                        [ 18%]
tests/test_aniblastall.py::test_get_version_missing_exe PASSED                                                                                     [ 19%]
tests/test_aniblastall.py::test_get_version_not_executable PASSED                                                                                  [ 20%]
tests/test_aniblastall.py::test_get_version_no_version PASSED                                                                                      [ 20%]
tests/test_aniblastall.py::test_get_version_os_incompatible PASSED                                                                                 [ 21%]
tests/test_anim.py::test_get_version_nonetype PASSED                                                                                               [ 22%]
tests/test_anim.py::test_get_version_no_exe PASSED                                                                                                 [ 23%]
tests/test_anim.py::test_get_version_exe_not_executable PASSED                                                                                     [ 24%]
tests/test_anim.py::test_get_version_exe_no_version PASSED                                                                                         [ 25%]
tests/test_anim.py::test_get_version_nucmer_3 PASSED                                                                                               [ 25%]
tests/test_anim.py::test_get_version_nucmer_4 PASSED                                                                                               [ 26%]
tests/test_anim.py::test_deltadir_parsing PASSED                                                                                                   [ 27%]
tests/test_anim.py::test_deltafile_parsing PASSED                                                                                                  [ 28%]
tests/test_anim.py::test_maxmatch_single PASSED                                                                                                    [ 29%]
tests/test_anim.py::test_mummer_multiple PASSED                                                                                                    [ 30%]
tests/test_anim.py::test_mummer_single PASSED                                                                                                      [ 30%]
tests/test_anim.py::test_mummer_job_generation PASSED                                                                                              [ 31%]
tests/test_anim.py::test_genome_sorting SKIPPED (This test currently fails because we no longer sort genomes.      This is because we now run ...) [ 32%]
tests/test_cli_parsing.py::test_createdb PASSED                                                                                                    [ 33%]
tests/test_cli_parsing.py::test_download_single_genome PASSED                                                                                      [ 34%]
tests/test_concordance.py::test_anim_concordance PASSED                                                                                            [ 35%]
tests/test_concordance.py::test_anib_concordance PASSED                                                                                            [ 35%]
tests/test_concordance.py::test_aniblastall_concordance PASSED                                                                                     [ 36%]
tests/test_concordance.py::test_tetra_concordance PASSED                                                                                           [ 37%]
tests/test_dependencies.py::test_import_biopython PASSED                                                                                           [ 38%]
tests/test_dependencies.py::test_import_matplotlib PASSED                                                                                          [ 39%]
tests/test_dependencies.py::test_import_numpy PASSED                                                                                               [ 40%]
tests/test_dependencies.py::test_import_pandas PASSED                                                                                              [ 40%]
tests/test_dependencies.py::test_import_scipy PASSED                                                                                               [ 41%]
tests/test_dependencies.py::test_blastn_available PASSED                                                                                           [ 42%]
tests/test_dependencies.py::test_run_blastall XPASS (Optional third-party executable (blastall))                                                   [ 43%]
tests/test_dependencies.py::test_run_nucmer PASSED                                                                                                 [ 44%]
tests/test_fastani.py::test_get_version_nonetype PASSED                                                                                            [ 45%]
tests/test_fastani.py::test_get_version_no_exe PASSED                                                                                              [ 45%]
tests/test_fastani.py::test_get_version_exe_not_executable PASSED                                                                                  [ 46%]
tests/test_fastani.py::test_get_version_exe_no_version PASSED                                                                                      [ 47%]
tests/test_fastani.py::test_fastanifile_parsing PASSED                                                                                             [ 48%]
tests/test_fastani.py::test_fastani_single PASSED                                                                                                  [ 49%]
tests/test_fastani.py::test_fastani_multiple PASSED                                                                                                [ 50%]
tests/test_fastani.py::test_fastani_job_generation PASSED                                                                                          [ 50%]
tests/test_graphics.py::test_png_mpl PASSED                                                                                                        [ 51%]
tests/test_graphics.py::test_svg_mpl PASSED                                                                                                        [ 52%]
tests/test_graphics.py::test_pdf_mpl PASSED                                                                                                        [ 53%]
tests/test_graphics.py::test_png_seaborn PASSED                                                                                                    [ 54%]
tests/test_graphics.py::test_svg_seaborn PASSED                                                                                                    [ 55%]
tests/test_graphics.py::test_pdf_seaborn PASSED                                                                                                    [ 55%]
tests/test_jobs.py::test_create_job PASSED                                                                                                         [ 56%]
tests/test_jobs.py::test_create_job_with_command PASSED                                                                                            [ 57%]
tests/test_jobs.py::test_add_dependency PASSED                                                                                                     [ 58%]
tests/test_jobs.py::test_remove_dependency PASSED                                                                                                  [ 59%]
tests/test_jobs.py::test_create_jobgroup PASSED                                                                                                    [ 60%]
tests/test_jobs.py::test_1d_jobgroup PASSED                                                                                                        [ 60%]
tests/test_jobs.py::test_2d_jobgroup PASSED                                                                                                        [ 61%]
tests/test_jobs.py::test_add_group_dependency PASSED                                                                                               [ 62%]
tests/test_jobs.py::test_remove_group_dependency PASSED                                                                                            [ 63%]
tests/test_legacy_scripts.py::test_legacy_anim_sns PASSED                                                                                          [ 64%]
tests/test_legacy_scripts.py::test_legacy_anim_mpl PASSED                                                                                          [ 65%]
tests/test_legacy_scripts.py::test_legacy_anib_sns PASSED                                                                                          [ 65%]
tests/test_legacy_scripts.py::test_legacy_anib_mpl PASSED                                                                                          [ 66%]
tests/test_legacy_scripts.py::test_legacy_tetra_sns PASSED                                                                                         [ 67%]
tests/test_legacy_scripts.py::test_legacy_tetra_mpl PASSED                                                                                         [ 68%]
tests/test_legacy_scripts.py::test_legacy_genome_downloads PASSED                                                                                  [ 69%]
tests/test_multiprocessing.py::test_multiprocessing_run PASSED                                                                                     [ 70%]
tests/test_multiprocessing.py::test_cmdsets PASSED                                                                                                 [ 70%]
tests/test_multiprocessing.py::test_dependency_graph_run PASSED                                                                                    [ 71%]
tests/test_multiprocessing.py::test_failed_comparison PASSED                                                                                       [ 72%]
tests/test_parsing.py::test_anim_delta PASSED                                                                                                      [ 73%]
tests/test_subcmd_01_download.py::test_create_hash PASSED                                                                                          [ 74%]
tests/test_subcmd_01_download.py::test_download_dry_run PASSED                                                                                     [ 75%]
tests/test_subcmd_01_download.py::test_download_c_blochmannia PASSED                                                                               [ 75%]
tests/test_subcmd_01_download.py::test_download_kraken PASSED                                                                                      [ 76%]
tests/test_subcmd_02_index.py::TestIndexSubcommand::test_index PASSED                                                                              [ 77%]
tests/test_subcmd_02_index.py::TestIndexSubcommand::test_missing_index PASSED                                                                      [ 78%]
tests/test_subcmd_03_createdb.py::TestCreatedbSubcommand::test_create_newdb PASSED                                                                 [ 79%]
tests/test_subcmd_03_createdb.py::TestCreatedbSubcommand::test_create_newdb_fail PASSED                                                            [ 80%]
tests/test_subcmd_03_createdb.py::TestCreatedbSubcommand::test_create_newdb_force PASSED                                                           [ 80%]
tests/test_subcmd_04_anim.py::TestANImSubcommand::test_anim PASSED                                                                                 [ 81%]
tests/test_subcmd_05_report.py::TestReportSubcommand::test_genomes PASSED                                                                          [ 82%]
tests/test_subcmd_05_report.py::TestReportSubcommand::test_genomes_runs PASSED                                                                     [ 83%]
tests/test_subcmd_05_report.py::TestReportSubcommand::test_matrices PASSED                                                                         [ 84%]
tests/test_subcmd_05_report.py::TestReportSubcommand::test_no_matrix_labels PASSED                                                                 [ 85%]
tests/test_subcmd_05_report.py::TestReportSubcommand::test_results PASSED                                                                          [ 85%]
tests/test_subcmd_05_report.py::TestReportSubcommand::test_runs PASSED                                                                             [ 86%]
tests/test_subcmd_05_report.py::TestReportSubcommand::test_runs_genomes PASSED                                                                     [ 87%]
tests/test_subcmd_06_plot.py::TestPlotSubcommand::test_plot_mpl_jpg PASSED                                                                         [ 88%]
tests/test_subcmd_06_plot.py::TestPlotSubcommand::test_plot_mpl_pdf PASSED                                                                         [ 89%]
tests/test_subcmd_06_plot.py::TestPlotSubcommand::test_plot_mpl_png PASSED                                                                         [ 90%]
tests/test_subcmd_06_plot.py::TestPlotSubcommand::test_plot_mpl_svg PASSED                                                                         [ 90%]
tests/test_subcmd_06_plot.py::TestPlotSubcommand::test_plot_seaborn_jpg PASSED                                                                     [ 91%]
tests/test_subcmd_06_plot.py::TestPlotSubcommand::test_plot_seaborn_pdf PASSED                                                                     [ 92%]
tests/test_subcmd_06_plot.py::TestPlotSubcommand::test_plot_seaborn_png PASSED                                                                     [ 93%]
tests/test_subcmd_06_plot.py::TestPlotSubcommand::test_plot_seaborn_svg PASSED                                                                     [ 94%]
tests/test_subcmd_07_classify.py::TestClassifySubcommand::test_classify_defaults PASSED                                                            [ 95%]
tests/test_subcmd_07_classify.py::TestClassifySubcommand::test_classify_no_thresh PASSED                                                           [ 95%]
tests/test_subcmd_08_anib.py::TestANIbsubcommand::test_anib XFAIL (ANIb is not currently fully implemented)                                        [ 96%]
tests/test_subcmd_09_fastani.py::TestfastANISubcommand::test_fastani PASSED                                                                        [ 97%]
tests/test_tetra.py::test_tetraclean PASSED                                                                                                        [ 98%]
tests/test_tetra.py::test_zscore PASSED                                                                                                            [ 99%]
tests/test_tetra.py::test_correlations PASSED                                                                                                      [100%]

==================================================================== warnings summary ====================================================================
tests/test_legacy_scripts.py::test_legacy_anim_sns
  /opt/anaconda3/envs/issue_421/lib/python3.8/site-packages/seaborn/matrix.py:715: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`). Consider using `matplotlib.pyplot.close()`.
    self._figure = plt.figure(figsize=figsize)

tests/test_subcmd_01_download.py::test_download_dry_run
  /opt/anaconda3/envs/issue_421/lib/python3.8/site-packages/Bio/Entrez/Parser.py:1075: UserWarning: Failed to save uilist.dtd at /Users/angelikakiepas/.config/biopython/Bio/Entrez/DTDs/uilist.dtd
    warnings.warn(f"Failed to save {filename} at {path}")

tests/test_subcmd_01_download.py::test_download_dry_run
  /opt/anaconda3/envs/issue_421/lib/python3.8/site-packages/Bio/Entrez/Parser.py:1075: UserWarning: Failed to save esummary_assembly.dtd at /Users/angelikakiepas/.config/biopython/Bio/Entrez/DTDs/esummary_assembly.dtd
    warnings.warn(f"Failed to save {filename} at {path}")

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
====================================== 117 passed, 1 skipped, 1 xfailed, 1 xpassed, 3 warnings in 406.99s (0:06:46) ====================================

I also run the new code on examples available here. Matrices are no longer symmetrical, two comparisons are executed (reverse and forward), more sensible numbers are reported (e.g., genome coverage is no longer greater than 1), and no errors are raised. However, no graphical output is generated. Interestingly, no issue is raised either. I was looking into this this morning, and I don't know where to go with this... @widdowquinn, do you think something broke after changing pyani_orm and other scripts when you were moving everything to GitActions?

kiepczi commented 2 months ago

no graphical output is generated. Interestingly, no issue is raised either.

This has been fixed at today's meeting by @widdowquinn. It turned out that the plotting function was indexing args.method incorrectly (fixed with the 2cede3a commit).

widdowquinn commented 2 months ago

Fixed with 1f10a6c.

widdowquinn commented 2 months ago

@all-contributors please add @kiepczi for code, design, test

allcontributors[bot] commented 2 months ago

@widdowquinn

I've put up a pull request to add @kiepczi! :tada: