wejlab / MetaScope

An R-based approach for preprocessing and aligning 16S, metagenomic, and metatranscriptomic data (PathoScope version 3.0)
GNU General Public License v3.0

Is there any difference in algorithm between this and Pathoscope #33

Closed Missthepast closed 3 months ago

Missthepast commented 3 months ago

Hello, I would like to ask: is there any difference in algorithm between this and PathoScope, and is it more accurate in identifying pathogenic microorganisms? Is there an advantage or disadvantage in terms of speed compared to PathoScope? We look forward to your reply.

aubreyodom commented 3 months ago

Hi there! Yes there are a few differences, which I wrote about in Chapter 2 of my dissertation (see here, and click View/Download at the bottom). You can also see benchmarking results in comparison to PathoScope in Chapter 3.

Here's an excerpt from the conclusion:

"Previously, the Johnson lab developed PathoScope 2.0 [30], a framework for quantifying the proportions of reads from individual microbial taxa present in metagenomic sequencing data from environmental or clinical samples. PathoScope 2.0 employs a multi-step analysis pipeline that integrates read mapping, probabilistic modeling, and statistical inference to accurately detect and quantify microbial sequences in metagenomic data. MetaScope provides several benefits over PathoScope 2.0. Primarily, MetaScope in its most condensed form offers a far smaller alignment file storage footprint with improved time management and efficiency, as exhibited in the example benchmark. PathoScope 2.0 modules relied on alignment files encoded in the SAM file format, a human-readable text file containing alignment metadata [18]. These SAM files, which often span hundreds of gigabytes of disk space in the PathoScope pipeline, have been superseded by the BAM file format, which offers a compressed version of the SAM file format.

In this same vein, speed and efficiency in processing have been improved thanks to improved parallel sequencing in the MetaFilter module and simplification of the mixture modeling and expectation-maximization steps in the MetaID module. Another marked improvement is visible in the MetaRef module, which removes the dependency on the now-deprecated NCBI GenInfo Identifier (GI) numbers for sequence annotation. PathoScope 2.0 modules (PathoLib, PathoID, and PathoReport) rely on GI numbers as a means of linking sequence information. In 2016, NCBI transitioned away from including GI numbers in its FASTA records, which subsequently compromised the functionality of these modules. MetaRef overcomes this by introducing a system to pull genomes from the NCBI nucleotide database that is independent of the GI system, relying instead on a taxonomy hierarchy.

With regard to the filtering step, MetaFilter completely removes any reads that mapped to both the target and host libraries, whereas PathoScope 2.0’s filtering module (PathoMap) only filters a read if it has a higher mapping score to a filter genome than to its mapped target genome. This was evident in the example benchmark, where PathoScope retained several hundred more reads than both MetaScope and the “ground truth” number of reads belonging to the RF122 strain. This weighting of reads in the PathoScope filtering method was implemented to help prevent discarding reads that have only a low mapping score to a filter genome, but it fails in cases where a read maps equally well to both the target genome and the host genome, or where the target genome is affected by contamination.

Outside of these functional aspects, the MetaBLAST module and complementary coverage plots in the MetaID module represent a major contribution of MetaScope over the PathoScope 2.0 pipeline, allowing for additional quality checking and improved post-processing. The inclusion of a confirmatory pipeline with evaluation metrics is novel for taxonomic profiling software, and highly necessary given the current state of contamination [41, 43, 44]. The recent prevalence of high-throughput methods and the rapidly falling cost of next-generation sequencing (NGS) technologies have led to a rapid increase in published genomes available in the RefSeq libraries, although imperfect methods and protocols for sequencing data are contributing to high contamination rates. Human contamination in published genomes, while not a problem in 16S analyses, is a particularly frustrating problem when analyzing shotgun metagenomics data. While the MetaBLAST step and preparatory outputs from MetaScope do require significantly more RAM and runtime, it is expected that a user will only want or need to run this step on a few of their samples for manual inspection. Further, it is likely altogether unnecessary for genus-level identification.

Finally, MetaScope was developed as a cohesive R package, allowing integration with the animalcules R package for downstream microbiomics analysis. R is a widely used, accessible, and highly renowned statistical software environment that offers biological researchers extensive open-access package curation and community support via the Bioconductor software project. It offers powerful statistical analysis and visualization capabilities, which the MetaScope software implementation takes advantage of. Because it relies on R implementations of samtools, Rbowtie2, and Rsubread, MetaScope can be run without prior installation of external packages, requiring only a simple call to the Bioconductor installation function (although a user installation of samtools can optionally be accessed for faster runtimes). In comparison, PathoScope requires a separate installation of Bowtie 2, forms its own Python library, and cannot run on Python versions above 2.7, resulting in a mismatch of requirements. As such, MetaScope is a welcome change, offering quick and easy installation that takes only a matter of minutes."
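
To make the filtering difference from the excerpt concrete, here is a minimal sketch in Python. It is not code from MetaScope or PathoScope: the `Read` record, its score fields, and both function names are hypothetical, and it assumes that a higher score means a better alignment and that `None` means the read did not map to that library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Read:
    # Hypothetical record, for illustration only.
    name: str
    target_score: Optional[float]  # best score against the target library, None if unmapped
    filter_score: Optional[float]  # best score against the filter (host) library, None if unmapped


def keep_strict(read: Read) -> bool:
    """Strict rule (as the excerpt describes MetaFilter): drop every read that maps to the filter library."""
    return read.target_score is not None and read.filter_score is None


def keep_score_based(read: Read) -> bool:
    """Score-based rule (as the excerpt describes PathoMap): drop a read only if its filter score beats its target score."""
    if read.target_score is None:
        return False
    if read.filter_score is None:
        return True
    return read.filter_score <= read.target_score


# Made-up reads covering the cases discussed in the excerpt.
reads = [
    Read("target_only", 40.0, None),
    Read("tie", 40.0, 40.0),
    Read("host_better", 20.0, 40.0),
    Read("weak_host_hit", 40.0, 5.0),
]

strict_kept = [r.name for r in reads if keep_strict(r)]
score_kept = [r.name for r in reads if keep_score_based(r)]

print("strict rule keeps:     ", strict_kept)
print("score-based rule keeps:", score_kept)
```

The `tie` read is the case the excerpt calls out: it maps equally well to the target and the host, so the score-based rule keeps it while the strict rule removes it. The `weak_host_hit` read shows the trade-off in the other direction, since the strict rule also discards reads with only a low-scoring hit to the filter library.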