xfengnefx / hifiasm-meta

hifiasm_meta - de novo metagenome assembler, based on hifiasm, a haplotype-resolved de novo assembler for PacBio HiFi reads.
MIT License

redundancy of hifiasm-meta and metaflye #22

Closed ye00ye closed 1 year ago

ye00ye commented 1 year ago

hello

I tested the assembly performance of hifiasm-meta and metaFlye on a mock community (MSA 1003).

For f5bcb58692924cb7_1 (ATCC-12228, length: 2503245 bp), hifiasm-meta produced 544 contigs. The longest is 2387482 bp and all the others are shorter than 30000 bp. When I mapped these contigs to the reference genome, I found high redundancy among them; in particular, the longest contig contains many of the shorter contigs. metaFlye, on the other hand, produced a single contig of exactly the reference genome length. For 5964adb8d0df4fde_1 (ATCC-33323, length: 1854273 bp), however, hifiasm-meta produced 8 contigs with good coverage and almost no overlap among them.

So I would like to ask: (1) why do the assembly results differ so much between the two reference genomes, and (2) how should I set the parameters to get a set of contigs with low redundancy while maintaining high coverage?

The command I am currently using is: hifiasm_meta -t 36 --force-rs -o mock2 ../mock2.fastq.gz

thanks for your help

xfengnefx commented 1 year ago

Are you using SRR11606871? If not, and it's a smaller library, try hifiasm_meta without --force-rs. If yes, could you check the version (--version) to make sure it is not between r52 and r57 (both inclusive)?

ye00ye commented 1 year ago

I used PRJNA546278 (SRR9328980), as listed in pb-metagenomics-tools on GitHub. It contains 2,419,037 reads and 20.5 Gbases.

The hifiasm_meta version I used is 0.3-r061 (hifiasm code base 0.13-r308).

xfengnefx commented 1 year ago

I see. Please try assembling it without the read selection: just remove --force-rs, and don't reuse the bin files from the previous run (i.e. use -o another_name if running in the same directory). I think read selection might have thrown away useful reads, since the dataset is not that large...

ye00ye commented 1 year ago

I'm trying it now, thanks for your help

ye00ye commented 1 year ago

hello!

This time I ran hifiasm-meta without --force-rs and got 972 contigs that map to f5bcb58692924cb7_1 (ATCC-12228); with --force-rs the number was 544. I take this to mean the redundancy is even higher.

So what should I do to decrease the redundancy? What I care about is not the possibility of discarding useful reads, but that the contigs are highly redundant, meaning that some parts of f5bcb58692924cb7_1 are covered by more than one contig, as the picture below shows.

Each line represents one contig, and the colored bars (short or long) show where it maps to the reference genome.
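One way to put a number on the redundancy described above is to take the reference intervals of the contig alignments and count how many reference bases are covered at least once versus more than once. The following is only a minimal sketch under stated assumptions: the function name `coverage_summary` and the input layout (a list of half-open `(start, end)` reference intervals, one per alignment) are hypothetical and not part of hifiasm-meta or any mapper. It also does not distinguish two alignments of the same contig from alignments of two different contigs.

```python
def coverage_summary(alignments, ref_len):
    """Summarize reference coverage from alignment intervals.

    alignments: iterable of (start, end) half-open intervals on the reference
                (hypothetical input layout, e.g. parsed from a mapper's output).
    ref_len:    length of the reference genome in bp.
    """
    # Turn each interval into a +1 event at its start and a -1 event at its end.
    events = []
    for start, end in alignments:
        start, end = max(0, start), min(ref_len, end)
        if start < end:
            events.append((start, 1))
            events.append((end, -1))
    events.sort()

    covered = 0    # bases with depth >= 1
    redundant = 0  # bases with depth >= 2
    depth = 0
    prev = 0
    for pos, delta in events:
        span = pos - prev
        if depth >= 1:
            covered += span
        if depth >= 2:
            redundant += span
        depth += delta
        prev = pos

    return {
        "covered_bp": covered,
        "redundant_bp": redundant,
        "covered_fraction": covered / ref_len,
        "redundant_fraction": redundant / ref_len,
    }


if __name__ == "__main__":
    # Toy intervals, not real data from this thread.
    print(coverage_summary([(0, 100), (50, 150), (120, 130)], ref_len=200))
```

A high `covered_fraction` together with a low `redundant_fraction` would correspond to the "low redundancy, high coverage" goal from the original question.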

The main differences between the runs with and without --force-rs are as follows:

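| metric      | with --force-rs | without --force-rs |
|-------------|-----------------|--------------------|
| num_contigs | 4230            | 25319              |
| sum_contigs | 1.06E+08        | 3.8E+08            |
| min_len     | 4439            | 2203               |
| average_len | 25028.4         | 14882.1            |
| max_len     | 6.37E+06        | 6.37E+06           |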
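For anyone who wants to reproduce this kind of summary, here is a minimal sketch of how the five metrics could be computed. It assumes the contigs have already been exported to FASTA; the function names `contig_lengths` and `summarize` are hypothetical helpers, not part of hifiasm-meta.

```python
def contig_lengths(fasta_lines):
    """Return the sequence length of each record in FASTA-formatted lines."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def summarize(lengths):
    """Compute the same metrics as in the table above from a list of lengths."""
    if not lengths:
        raise ValueError("no contigs found")
    total = sum(lengths)
    return {
        "num_contigs": len(lengths),
        "sum_contigs": total,
        "min_len": min(lengths),
        "average_len": total / len(lengths),
        "max_len": max(lengths),
    }


if __name__ == "__main__":
    # Toy records, not real data from this thread.
    toy = [">ctg1", "ACGT", "AC", ">ctg2", "GG"]
    print(summarize(contig_lengths(toy)))
```

For a real file, the lines could be read with `open(path)` (or `gzip.open(path, "rt")` for compressed input) and passed to `contig_lengths` directly.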

thanks for your help.

xfengnefx commented 1 year ago

For the run without read selection, it seems the contained-reads heuristic may retain too many reads. In real datasets it seemed to do more good than harm, as I found disconnected contigs to be less useful than a tangle. I need to handle contained reads more properly.

I added a flag to turn the heuristic off in the meta_dev branch, if you want to try it (run with --noch -B norsrun -o newprefix, where norsrun is the prefix of the old bin files). Disabling both the heuristic and the read selection assembles circular contigs for ATCC-12228 (Staphylococcus epidermidis) and Helicobacter pylori, but Neisseria meningitidis is not reported as a circle (although its length seems OK).

About your original questions:

duplications among ATCC-12228 contigs

For hifiasm-meta, I suspect it is due to high coverage in the input, but I have not looked into the details. A few graph cleaning routines might also keep the dropped nodes in the primary graph (it does not matter either way...).

Do you find this concerning?

to get a set of contigs with low redundancy while maintaining high coverage.

I think the mock datasets are quite unique, and for analyzing real datasets, it seems that sequences other than genome bins or long contigs are not that useful. This is how hifiasm-meta currently behaves. Read selection with fine-tuned thresholds might do better, but I think that would be neither useful nor fair for comparison.

I will need to determine where the redundancies come from and handle them in a future release, along with the contained reads issue.

ATCC-33323

This species has an abundance of 0.18%, so its coverage is quite low and has drops (e.g. a 4 kb hole within the genome's 302,645-325,077 region). The contigs mostly terminate at those coverage drops, hence there are no overlaps between them.
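To check for such coverage drops yourself, one option is to scan a per-base read depth profile for runs below a threshold. This is only a rough sketch: the function name `low_coverage_runs` and its parameters are hypothetical, and it assumes the depths are already available as a list with one value per reference position (for example, the third column of `samtools depth -a` output for that reference).

```python
def low_coverage_runs(depths, min_depth=1, min_run=1):
    """Find runs of consecutive positions whose depth is below min_depth.

    depths:    per-base read depth, one value per reference position (0-based).
    min_depth: positions with depth < min_depth count as low coverage.
    min_run:   only report runs at least this many bases long.
    Returns a list of half-open (start, end) intervals.
    """
    runs = []
    start = None  # start of the current low-coverage run, if any
    for i, depth in enumerate(depths):
        if depth < min_depth:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_run:
                runs.append((start, i))
            start = None
    # Close a run that extends to the end of the reference.
    if start is not None and len(depths) - start >= min_run:
        runs.append((start, len(depths)))
    return runs


if __name__ == "__main__":
    # Toy depth profile, not real data from this thread.
    print(low_coverage_runs([5, 5, 0, 0, 0, 4, 1, 0], min_depth=2, min_run=2))
```

Comparing the reported intervals with the contig end positions on the reference would show whether contigs really terminate at the drops.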

ye00ye commented 1 year ago

Thanks for your patient answer. I now understand more about using hifiasm-meta, and I will try different parameters as you suggested. I will get in touch again if I find something interesting.

xfengnefx commented 1 year ago

Thank you, closing this for now but please feel free to reopen or post a new thread when needed.