xfengnefx / hifiasm-meta

hifiasm_meta - de novo metagenome assembler, based on hifiasm, a haplotype-resolved de novo assembler for PacBio Hifi reads.
MIT License

Potential for improvement: A great test dataset here! #8

Closed GabeAl closed 2 years ago

GabeAl commented 3 years ago

This project is quite exciting, but like you mentioned in your pre-print, there is very little public training data to help optimize for this use-case.

I'd like to point the authors to a substantially larger and more representative dataset: 11 real individual human HiFi fecal metagenomes (which are NOT pooled). They have a more realistic distribution of species (some highly abundant but many lower-abundant ones).

PRJNA754443 11_sra_samples.csv

Expected differences seen in this real dataset compared to the "pooled" samples used to benchmark this:

  1. These new samples have less equitable (but arguably more realistic) distributions of microbes than the pooled samples because you aren't merging multiple non-overlapping sets of high-abundance bugs; there is more of an exponential decay in abundances.
  2. These new samples would be expected to have less tangled graphs, as they are less likely to contain mixtures of near-identical strains from different people in the same sample. Large numbers of closely related genomes are less likely to be found within a given individual, where selection limits the diversity of closely related strains competing for the same resources/niches within the gut.
  3. Overall depth is slightly lower with a median of roughly 1 million reads of 7kb length.
  4. Despite point 3, there may be more potential to capture rare microbes, because these single samples have twice the effective read depth per human subject as the pooled samples, which ostensibly have twice the volume of data in total.
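As a rough back-of-envelope sketch of points 1 and 3 (not something from the dataset itself), the snippet below converts the quoted median of roughly 1 million reads of 7 kb into total bases, then estimates per-genome coverage under a purely hypothetical geometric abundance profile. The 5 Mb genome size, the decay ratio of 0.5, and the function names are all assumptions chosen for illustration, and "relative abundance" here means share of sequenced bases.

```python
def sample_yield_bp(n_reads: int, read_len_bp: int) -> int:
    """Total sequenced bases for one sample."""
    return n_reads * read_len_bp


def expected_coverage(total_bp: float, rel_abundance: float, genome_size_bp: int) -> float:
    """Expected fold-coverage of a genome that receives `rel_abundance` of all bases."""
    return total_bp * rel_abundance / genome_size_bp


# Figures quoted in the thread: ~1 million reads of ~7 kb per sample.
total_bp = sample_yield_bp(1_000_000, 7_000)

# Hypothetical abundance profile: each species is half as abundant as the
# previous one (an assumption standing in for "exponential decay").
ratio = 0.5
raw = [ratio ** rank for rank in range(1, 11)]
abundances = [a / sum(raw) for a in raw]

GENOME_SIZE_BP = 5_000_000  # assumed typical bacterial genome size

for rank, abundance in enumerate(abundances, start=1):
    cov = expected_coverage(total_bp, abundance, GENOME_SIZE_BP)
    print(f"rank {rank:2d}: abundance {abundance:.4f}, coverage ~{cov:.1f}x")
```

Under these assumptions, coverage halves with each rank, so only the first handful of species would have enough depth to plausibly close a genome.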

I've run the latest version of this assembler on these samples already, and see substantially fewer closed genomes (and overall HQ MAGs!) per sample than in the pooled samples, as expected. I aim to run more experiments with some of the recent cleaning options and potentially other (graph-aware?) binning tweaks, but I don't expect the overall picture to change much.

I'm curious to see whether further improvements can be made given the availability of this larger corpus of individual-level human microbiome HiFi data.

xfengnefx commented 3 years ago

Thank you very much! I did not know this dataset until now, will assemble and see. That's an interesting median read length...

For the pooled samples mentioned in the preprint, my guess is that the pooling is fine and each library is more like a slightly more complex library with ok coverage, rather than 4 samples all with low coverage. We co-assembled humanO1-2 as well as humanV1-2, and found that long contigs from both were not library-specific (bottom of page 3/7 in the arxiv preprint; can't directly validate since the samples were pooled prior to sequencing it seems).

If the above guess is somewhat true, then these four libraries have better coverage and read length than PRJNA754443, although with slightly more complexity (I plan to measure this with 16S rRNA from the HiFi reads).

dportik commented 2 years ago

Glad to see that's on NCBI now! We sequenced that dataset for Siolta, as part of this study: https://www.biorxiv.org/content/10.1101/2021.08.31.458285v1

These are all 4-plex (4 samples per SMRT Cell 8M), so that is why there are fewer reads. There are 12 samples total, but 6 human donors, and two samples are from a single human. There is a pre-treatment and post-treatment per individual, and they are labeled as such (5_treatment_LRM, 5_baseline_LRM). I suppose if coverage is an issue you could combine pre and post samples for each individual and co-assemble.
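A minimal sketch of the per-individual grouping suggested above, assuming every sample label follows the `<donor>_<condition>_LRM` pattern seen in `5_treatment_LRM` and `5_baseline_LRM`. The `1_*` labels in the example and the `group_by_donor` helper are hypothetical, not taken from the SRA metadata.

```python
from collections import defaultdict


def group_by_donor(labels):
    """Group sample labels by the donor ID before the first underscore."""
    groups = defaultdict(list)
    for label in labels:
        donor = label.split("_", 1)[0]
        groups[donor].append(label)
    return dict(groups)


# Only the two "5_*" labels appear in the thread; the "1_*" labels are
# hypothetical placeholders that follow the same assumed pattern.
labels = [
    "5_treatment_LRM",
    "5_baseline_LRM",
    "1_baseline_LRM",
    "1_treatment_LRM",
]

for donor, samples in sorted(group_by_donor(labels).items()):
    print(f"donor {donor}: co-assemble {samples}")
```

Each group's read files could then be concatenated and assembled together, which doubles the data per individual at the cost of merging the two time points.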

FYI - The results in the pre-print are based on HiCanu, but I just re-ran these with hifiasm-meta. I am working on the MAG summary now. I hope to substitute these results in the revisions we are working on for the publication.

GabeAl commented 2 years ago

Nice to hear from you again, Daniel!

Of note: Subject 6 only has one of the pairs (due to library prep issues with one sample), so there should be 11 samples total.

Daniel -- could you comment on how the median read count is ~1M reads if they were sequenced 4 per SMRT Cell? That would mean each SMRT Cell produced 4M reads, meaning there would be 16M reads total in the 8M Tray, which is unprecedented. Is this expected? The posters presented earlier by PacBio showed a max of 2.4M reads per SMRT Cell for metagenomics.

Cheerio, Gabe

dportik commented 2 years ago

Hi Gabe! Correct - I think a barcode step was missed for one sample and so it unfortunately wasn't sequenced. So, five paired samples and one that is unpaired.

The HiFi read count per sample ranged from 700,000 to 1.2 million. It looks like for two of the cells the total HiFi yield was 3.2-3.5 million reads, so that makes sense for the 4-plex. We would definitely consider 2-2.5 million HiFi reads a success for metagenomics, but it is not necessarily the maximum yield for a SMRT Cell 8M. The total HiFi yield depends on many factors, and we do see variation across runs and applications. Increasing yield is a big priority and so I would expect these numbers to climb and be more consistent moving forward.
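The arithmetic behind this answer can be checked with a short sketch that uses only the figures quoted in this thread (4-plex, 700,000 to 1.2 million HiFi reads per sample, 3.2-3.5 million reads per cell). The function and variable names are illustrative.

```python
PLEX = 4  # samples multiplexed per SMRT Cell 8M, as stated in the thread


def cell_yield_bounds(per_sample_low: int, per_sample_high: int, plex: int):
    """Bounds on per-cell yield if every sample sat at the low or high extreme."""
    return per_sample_low * plex, per_sample_high * plex


low, high = cell_yield_bounds(700_000, 1_200_000, PLEX)
reported_low, reported_high = 3_200_000, 3_500_000

# The reported per-cell totals should fall inside the bounds implied by
# the per-sample range.
consistent = low <= reported_low and reported_high <= high

print(f"implied per-cell range: {low:,} to {high:,} reads")
print(f"reported per-cell range: {reported_low:,} to {reported_high:,} reads")
print(f"mean reads per sample: {reported_low // PLEX:,} to {reported_high // PLEX:,}")
print(f"consistent: {consistent}")
```

The reported per-cell totals correspond to a mean of 800,000 to 875,000 reads per sample, which sits inside the quoted per-sample range.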

xfengnefx commented 2 years ago

Hi Daniel and Gabe,

Thanks for mentioning the treatment/baseline labels. Dropping in to add something about co-assembly: most long contigs are individual-specific (baseline and post-treatment samples of the same individual are not separated, which I think is expected), but a few of them are mixed. Pooling by individual instead of pooling all 11 might give a better total yield, but I haven't tried it yet.

[Figure: gut11]

^ Sorting order: counts of reads, per donor. A bar with only one color means that all or almost all of its reads came from one individual. A bar with mixed colors might suggest the sequence was shared between individuals, if not a misassembly.
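One way to put a number on "all or almost all" is sketched below. This is not the procedure used to make the figure: it assumes per-contig read counts per donor are already available as a dictionary, and the 0.95 cutoff, the donor IDs, and the function names are hypothetical choices.

```python
def dominant_donor(counts: dict):
    """Return (donor, fraction) for the donor contributing the most reads."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("contig has no reads assigned")
    donor = max(counts, key=counts.get)
    return donor, counts[donor] / total


def classify_contig(counts: dict, threshold: float = 0.95) -> str:
    """Label a contig by how concentrated its reads are in one donor.

    The threshold is an assumed cutoff, not a value from the thread.
    """
    _, fraction = dominant_donor(counts)
    return "individual-specific" if fraction >= threshold else "mixed"


# Hypothetical read counts per donor for two contigs.
contigs = {
    "contig_a": {"donor1": 98, "donor2": 2},
    "contig_b": {"donor1": 60, "donor2": 40},
}

for name, counts in contigs.items():
    donor, fraction = dominant_donor(counts)
    print(f"{name}: {classify_contig(counts)} (top donor {donor}, {fraction:.0%})")
```

A "mixed" label from this kind of check cannot by itself distinguish a sequence genuinely shared between individuals from a misassembly, so such contigs would still need inspection.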