Closed cjain7 closed 2 years ago
For this paragraph you want to look at hamt_hit_contained_drop_singleton_worker_v2 (aka here). There's a comment block that explains the intention and the implementation.
This function takes one read ID and has access to the record of all read overlaps. There's no explicit output. By the end of the function, if need_to_protect is set to 1, the contained read will not be discarded (the discarding itself is handled by a wrapper routine, hamt_hit_contained_drop_singleton_multi).
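To make the worker/wrapper split concrete, here is a minimal Python sketch of the pattern, not the actual C code. The names drop_unprotected and worker are hypothetical, and the worker's boolean return value stands in for the need_to_protect flag that the real worker sets internally.

```python
from typing import Callable, Iterable, List


def drop_unprotected(
    contained_reads: Iterable[int],
    worker: Callable[[int], bool],  # hypothetical: returns True when the read must be protected
) -> List[int]:
    """Return the IDs of contained reads that would be discarded."""
    # First pass: ask the per-read worker for a decision on every contained read.
    protect = {read_id: worker(read_id) for read_id in contained_reads}
    # Second pass: only reads that were not flagged for protection are dropped.
    return [read_id for read_id, keep in protect.items() if not keep]
```

For example, with a toy worker that protects only read 2, calling drop_unprotected([1, 2, 3], lambda r: r == 2) leaves reads 1 and 3 to be discarded.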
The routine you linked to was tried but discarded; I should clean up the code sometime.
Thanks, I'm able to follow the algorithm.
If the read being considered has more than 3 parents, you remove it. Rationale: Avoid protecting too many reads.
Otherwise:
You iterate through the read's overlapping neighbours in the graph that are inferred to be from the same haplotype. You then check whether they have (strong) intra-haplotype suffix-prefix overlaps with the parent reads. If sufficient intra-haplotype overlaps are found between each neighbour and the parent reads, you avoid protecting the read. Rationale: if need_to_protect=0, all neighbours can use the parents to continue the assembly walk even if the read is deleted.
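The decision described above can be restated as a rough Python sketch. This is only my reading of the thread, not the hifiasm-meta implementation: the function signature, the set-of-pairs overlap representation, and the min_parent_links parameter (a stand-in for "sufficient") are all assumptions. Only the "more than 3 parents" cutoff comes from the discussion.

```python
from typing import FrozenSet, Set

MAX_PARENTS = 3  # from the discussion: more than 3 parents means the read is not protected


def need_to_protect(
    parents: Set[str],
    same_hap_neighbours: Set[str],
    intra_hap_overlaps: Set[FrozenSet[str]],  # assumed: unordered pairs with a strong intra-haplotype overlap
    min_parent_links: int = 1,  # hypothetical threshold standing in for "sufficient"
) -> bool:
    """Return True if the contained read should be kept (protected)."""
    # Too many parents: drop the read to avoid protecting too many reads.
    if len(parents) > MAX_PARENTS:
        return False
    for neighbour in same_hap_neighbours:
        # Count the parents this neighbour can reach through an intra-haplotype overlap.
        links = sum(
            1 for parent in parents
            if frozenset((neighbour, parent)) in intra_hap_overlaps
        )
        # A neighbour that cannot continue the walk through the parents
        # would be stranded if the contained read were deleted.
        if links < min_parent_links:
            return True
    # Every neighbour is covered by the parents, so the read can be dropped.
    return False
```

In this sketch a read with no same-haplotype neighbours is never protected; whether the real code behaves the same way in that case is not stated in the thread.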
That's right, or in other words, reads that have the same haplotype as the query read (i.e. the contained read) shall not have inter-haplotype overlaps. Otherwise the example in the comment block is implied.
This neighbor check is a compromise for hifiasm-meta not knowing the exact positions of the phasing variables at this stage.
The manuscript briefly mentions that hifiasm-meta uses a new method for filtering contained reads. I'm interested in learning about the filtering mechanism here. Could you please share more details of the algorithm, or point me to the appropriate place in the code? Pasting the text from your manuscript:
I wish to understand the exact conditions and threshold values that decide whether to retain a contained read.
Thank you.