wwood / singlem

Novelty-inclusive microbial community profiling of shotgun metagenomes
http://wwood.github.io/singlem/
GNU General Public License v3.0

Binned populations from appraise not matching numbers extracted from genomes #166

Open adityabandla opened 4 months ago

adityabandla commented 4 months ago

Hi Ben,

Apologies for the multiple questions. I ran singlem separately on my metagenome reads and my MAGs. I then separated each marker and clustered them using the default species-level identity.

When I look at a particular marker, say S3.13 Ribosomal S9, I see a total of 1434 fragments identified from my 1767 MAGs (95% gANI). Next, I ran appraise on the clustered markers from the raw reads and the MAGs using the --imperfect option with default identity settings, and separated the binned and unaccounted populations. The binned populations now contain 6402 unique fragments. I am a bit lost on how to interpret this.
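To compare counts like these on the same footing, one option is to count distinct sequences per marker in each OTU table. The sketch below is only an illustration: the function name `count_unique_per_marker` is hypothetical, and it assumes a tab-separated OTU table with a header line and the columns `gene`, `sample`, `sequence` in positions 1 to 3.

```shell
# Count distinct OTU sequences per marker gene in an OTU table.
# Assumptions (not confirmed in this thread): tab-separated input,
# one header line, gene in column 1, sequence in column 3.
count_unique_per_marker() {
    local otu_table="$1"
    awk -F'\t' '
        NR == 1 { next }                   # skip the header line
        !seen[$1 FS $3]++ { n[$1]++ }      # count each gene+sequence pair once
        END { for (g in n) print g "\t" n[g] }
    ' "$otu_table" | LC_ALL=C sort
}
```

Running `count_unique_per_marker` on the genome table and on the binned table would show, marker by marker, where the two counts diverge.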

Regards, Adi

adityabandla commented 4 months ago

Upon some digging, this seems to arise because markers from the MAGs map to spurious clusters from the metagenomes, perhaps as a consequence of running SingleM on the raw reads. Applying a prevalence filter makes the numbers more plausible. Perhaps adding this as an option to summarise, for use when combining multiple samples, would be useful.

wwood commented 4 months ago

I'm a little confused about exactly which commands and analyses you've run. Typically you wouldn't cluster the OTU tables before using appraise; instead, just specify --imperfect.

What are "spurious clusters from the metagenomes"?

adityabandla commented 4 months ago

Hi Ben,

I have 120 raw metagenomes and 1700+ MAGs, so I split them into batches and ran singlem pipe

singlem pipe \
    --forward ${R1} --reverse ${R2} \
    --output-otu-table batch_1_otu_table.csv \
    --output-archive-otu-table batch_1_otu_table_archive.csv \
    --threads 128

Then I concatenated the results using singlem summarise

singlem summarise \
    --input-otu-tables batch_1_otu_table.csv batch_2_otu_table.csv \
    --output-otu-table raw_reads_otu_table.tsv

As the otu table, especially from the raw metagenomes, was quite large, I clustered the markers at the default identity. Clustering the otu table containing all 59 markers never finished, so I split each marker and clustered them separately, for example

singlem summarise \
    --input-otu-tables S3.59.uS11_bact_otu_table.csv \
    --cluster \
    --output-otu-table S3.59.uS11_bact_otu_table_clustered.csv

singlem summarise \
    --input-otu-tables S3.59.uS11_bact_otu_table_clustered.csv S3.59.uS11_bact_otu_table_clustered.csv .. \
    --output-otu-table raw_reads_otu_table_clustered.tsv
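The per-marker split mentioned above is not shown in the commands. A minimal sketch of how it could be done with awk follows. The function name `split_by_marker` and the `<marker>_otu_table.tsv` output naming are hypothetical, and the sketch assumes a tab-separated table with a header line and the marker name in the first column.

```shell
# Split a combined OTU table into one file per marker gene.
# Assumptions (not confirmed in this thread): tab-separated input,
# one header line, marker name in column 1.
split_by_marker() {
    local otu_table="$1" out_dir="$2"
    mkdir -p "$out_dir"
    awk -F'\t' -v dir="$out_dir" '
        NR == 1 { header = $0; next }      # remember the header line
        {
            out = dir "/" $1 "_otu_table.tsv"
            # write the header once, the first time a marker is seen
            if (!(out in started)) { print header > out; started[out] = 1 }
            print > out                    # file stays open, so later rows append
        }
    ' "$otu_table"
}
```

Each resulting file keeps the original header, so it can be passed to `singlem summarise --cluster` on its own.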

Upon inspecting this `raw_reads_otu_table_clustered.tsv` file, I observed several sequences within each marker that were detected in only one sample. These are what I termed spurious clusters, perhaps singleton clusters arising from sequencing error? (I used the raw reads with no trimming.)

I followed the same steps as above for the MAGs, and then ran appraise as follows:

singlem appraise \
    --metagenome-otu-tables raw_reads_otu_table_clustered.tsv \
    --genome-otu-tables genome_otu_table.tsv \
    --imperfect 

As mentioned above, when I look at a particular marker, say S3.13 ribosomal S9, I see a total of 1434 fragments identified from my 1767 MAGs (95% gANI). When I appraise this against the clustered OTU table from the raw metagenomes, the binned OTU table reports 6402 fragments as binned. For this gene, the OTU table from the raw reads contained roughly 36,000 clusters (about 5-8x the number of unique bacteria/archaea one would expect for these samples, going by our amplicon data). When I apply a 5% prevalence filter to the clustered markers from the raw metagenomes (keeping sequences detected in at least 6 of 120 samples), the number of binned fragments reported by appraise drops to 1520.
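The prevalence filter described here is not a SingleM option as far as this thread shows, so it has to be applied outside the tool. A minimal sketch follows. The function name `prevalence_filter` is hypothetical, and the code assumes a tab-separated table with a header line and `gene`, `sample`, `sequence` in columns 1 to 3. It also assumes that, after clustering, rows belonging to the same cluster share the same value in the sequence column.

```shell
# Keep only OTUs whose sequence occurs in at least min_samples distinct samples.
# Assumptions (not confirmed in this thread): tab-separated input,
# one header line, gene in column 1, sample in column 2, sequence in column 3.
prevalence_filter() {
    local otu_table="$1" min_samples="$2"
    # The file is read twice: pass 1 counts samples per gene+sequence,
    # pass 2 prints the header and every row that meets the threshold.
    awk -F'\t' -v min="$min_samples" '
        NR == FNR {
            if (FNR > 1 && !seen[$1 FS $3 FS $2]++) prevalence[$1 FS $3]++
            next
        }
        FNR == 1 || prevalence[$1 FS $3] >= min
    ' "$otu_table" "$otu_table"
}
```

With 120 samples, a 5% threshold corresponds to `prevalence_filter raw_reads_otu_table_clustered.tsv 6`, with the output redirected to a new file before running appraise.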

While I understand that there is no requirement to cluster the raw reads (and instead just use appraise --imperfect), I did that to obtain a species-level OTU table for my community analysis. I assume clustering prior to appraise should also give similar results?