Melon: metagenomic long-read-based taxonomic identification and quantification using marker genes
Create a new conda environment and install:
conda create -n melon -c conda-forge -c bioconda melon
conda activate melon
[!NOTE] We suggest using the GTDB database for complex metagenomes, as it features less ambiguous taxonomic labels and is more comprehensive.
Download either the NCBI or the GTDB database:
## NCBI
wget -qN --show-progress https://zenodo.org/records/12571302/files/database.tar.gz
tar -zxvf database.tar.gz
## GTDB
# wget -qN --show-progress https://zenodo.org/records/12571554/files/database.tar.gz
# tar -zxvf database.tar.gz
Index the files:
## if you encounter memory issue please consider manually lowering cpu_count or simply set cpu_count=1
cpu_count=$(python -c 'import os; print(os.cpu_count())')
diamond makedb --in database/prot.fa --db database/prot --quiet
ls database/nucl.*.fa | sort | xargs -P $cpu_count -I {} bash -c '
filename=${1%.fa*};
filename=${filename##*/};
minimap2 -x map-ont -d database/$filename.mmi ${1} 2> /dev/null;
echo "Indexed <database/$filename.fa>.";' - {}
## remove unnecessary files to save space
rm -rf database/*.fa
[!NOTE] Melon takes quality-controlled long reads as input. We suggest removing low-quality raw reads before running Melon with e.g.,
nanoq -q 10 -l 1000
(min. quality score 10; min. read length 1,000 bp). If your sample is known to have a large proportion of human DNAs or other eukaryotes/viruses and you want to estimate the mean genome size of prokaryotes, please consider removing them via proper mapping, or enabling the simple pre-filtering module. See Run Melon with pre-filtering of non-prokaryotic reads for more details.
We provide an example file comprising 10,000 quality-controlled (processed with Porechop
and nanoq
) prokaryotic reads (fungal and other reads removed with minimap2
), randomly selected from the R10.3 mock sample of Loman Lab Mock Community Experiments.
wget -qN --show-progress https://zenodo.org/records/12571849/files/example.fa.gz
melon example.fa.gz -d database -o .
You should see:
INFO: Estimating genome copies ...
INFO: ... found 27.375 copies of genomes (bacteria: 27.375; archaea: 0).
INFO: Assigning taxonomy ...
INFO: Reassigning taxonomy ...
INFO: ... found 8 unique species (bacteria: 8; archaea: 0).
INFO: Done.
The output file *.tsv
contains the estimated genome copies for individual species, their corresponding relative abundances and gap-compressed/gap-uncompressed ANI (average nucleotide identity between marker-gene-containing reads and reference genome clusters) values:
... species copy abundance identity
... 287|Pseudomonas aeruginosa 2.125 7.762557e-02 0.9570/0.9473
... 96241|Bacillus spizizenii 2.875 1.050228e-01 0.9617/0.9531
... 1351|Enterococcus faecalis 3.000 1.095890e-01 0.9616/0.9534
... 28901|Salmonella enterica 3.125 1.141553e-01 0.9525/0.9433
... 562|Escherichia coli 3.500 1.278539e-01 0.9588/0.9504
... 1639|Listeria monocytogenes 3.750 1.369863e-01 0.9627/0.9546
... 1280|Staphylococcus aureus 3.875 1.415525e-01 0.9598/0.9517
... 1613|Limosilactobacillus fermentum 5.125 1.872146e-01 0.9654/0.9574
The output file *.json
contains the lineage and remark of each processed read.
{
"002617ff-697a-4cd5-8a97-1e136a792228": {
"remark": "marker-gene-containing",
"lineage": "2|Bacteria;1239|Bacillota;91061|Bacilli;186826|Lactobacillales;81852|Enterococcaceae;1350|Enterococcus;1351|Enterococcus faecalis"
},
...
"ffe73d61-55eb-4ad8-9519-e38c364fc11d": {
"remark": "marker-gene-containing",
"lineage": "2|Bacteria;1224|Pseudomonadota;1236|Gammaproteobacteria;91347|Enterobacterales;543|Enterobacteriaceae;590|Salmonella;28901|Salmonella enterica"
}
}
To enable the pre-filtering module, you need to download a database of Kraken2 that includes at least human and fungi (PlusPF, PlusPFP, or their capped versions). Using the PlusPF-8 (ver. 2023-06-05, capped at 8 GB) as an example:
## https://benlangmead.github.io/aws-indexes/k2
mkdir database_kraken
wget -q --show-progress https://genome-idx.s3.amazonaws.com/kraken/k2_pluspf_08gb_20230605.tar.gz -O database_kraken/db.tar.gz
tar -zxvf database_kraken/db.tar.gz -C database_kraken
## remove temporary files to save space
rm -rf database_kraken/db.tar.gz
Run with argument -k/-db-kraken
:
melon *.fa -d database -o . -k database_kraken
Both samples were collected from the Shatin wastewater treatment plant, sequenced with ONT SQK-NBD114 & R10.4.1 on PromethION, basecalled with Guppy v6.5.7.
no pre-filter | PlusPF-8 | PlusPF-16 | PlusPF | ||
---|---|---|---|---|---|
influent
|
genome copy | 1,809 | 1,801 | 1,800 | 1,797 |
species richness | 2,101 | 2,098 | 2,099 | 2,097 | |
number of filtered reads | - | 8,312 | 10,212 | 14,988 | |
mean genome size | 4.322 | 4.304 | 4.300 | 4.292 | |
real time (sec) | 2,056 | 2,271 | 2,279 | - | |
peak resident set size (GB) | 10.629 | 10.665 | 17.294 | - | |
effluent
|
genome copy | 1,348 | 1,336 | 1,331 | 1,315 |
species richness | 1,704 | 1,697 | 1,700 | 1,696 | |
number of filtered reads | - | 16,602 | 29,300 | 54,774 | |
mean genome size | 3.826 | 3.789 | 3.757 | 3.715 | |
real time (sec) | 1,341 | 1,496 | 1,495 | - | |
peak resident set size (GB) | 10.893 | 10.902 | 17.067 | - |
Tested on MacBook Pro 2021, Apple M1 Max, 64 GB memory, macOS Sonoma. Melon v0.1.0, NCBI database (ver. RefSeq R219) and Kraken database (ver. 2023-06-05). Mean genome size is in unit of Mb. Real time and peak resident set size are measured with time
.
Chen, X., Yin, X., Shi, X., Yan, W., Yang, Y., Liu, L., & Zhang, T. (2024). Melon: metagenomic long-read-based taxonomic identification and quantification using marker genes. Genome Biol, 25(1), 226. https://doi.org/10.1186/s13059-024-03363-y