xfengnefx / hifiasm-meta

hifiasm_meta - de novo metagenome assembler, based on hifiasm, a haplotype-resolved de novo assembler for PacBio Hifi reads.
MIT License
60 stars 8 forks source link

General question regarding treatment of contained reads #12

Closed cjain7 closed 2 years ago

cjain7 commented 2 years ago

The manuscript briefly mentions how Hifiasm-meta uses a new method for filtering contained reads. I'm interested in learning about the filtering mechanism here. Could you please share more details of the algorithm ; OR point me to appropriate place in the code. Pasting the text from your manuscript:

Treatment of contained reads. The standard procedure to construct a string graph discards a read contained in a longer read. This may lead to an assembly gap if the contained read and the longer read actually reside on different haplotypes10. The original hifiasm patches such gaps by rescuing contained reads after graph construction. Hifiasm-meta tries to resolve the issue before graph construction instead. It retains a contained read if other reads exactly overlapping with the read are inferred to come from different haplotypes. In other words, hifiasm-meta only drops a contained read if there are no other similar haplotypes around it. This strategy often retains extra contained reads that are actually redundant. These extra reads usually lead to bubble-like subgraphs and are later removed by the bubble popping algorithm in the original hifiasm.

I wish to understand the exact condition / threshold values which decides whether to retain the contained read.

Thank you.

cjain7 commented 2 years ago

Perhaps this is where I should look.

xfengnefx commented 2 years ago

For this paragraph you want to look at hamt_hit_contained_drop_singleton_worker_v2 (aka here). There's a comment block that explains the intention and the implementation.

This function takes one read ID and knows the record of all read overlaps. There's no explict output. By the end of the function, if need_to_protect is set to 1, the contained read will not be discarded as so (this part is handled by a wrapper routine, hamt_hit_contained_drop_singleton_multi).

The routine you linked to was tried but discarded, I should clean up the code sometime.

cjain7 commented 2 years ago

Thanks, I'm able to follow the algorithm.

If the read being considered has more than 3 parents, you remove it. Rationale: Avoid protecting too many reads. Otherwise: You iterate through the overlapping neighbours of the read in graph which are inferred to be from same haplotypes. Afterwards, you check if they have (strong) intra-haplotype suffix-prefix overlap with the parent reads. If sufficient intra-haplotype overlaps are found between each neighbour to parent reads, then you avoid protecting the read. Rationale: If need_to_protect=0, then all neighbours can use the parents to continue assembly walk even if the read is deleted.

xfengnefx commented 2 years ago

That's right, or in other words, reads that have the same haplotype as the query read (i.e. the contained read) shall not have inter-haplotype overlaps. Otherwise the example in the comment block is implied.

This neighbor check is a compromise for hifiasm-meta not knowing the exact positions of the phasing variables at this stage.

edit: typo