xingjianleng / DBGA

The repository for the genome sequence alignment research project
BSD 3-Clause "New" or "Revised" License

identify lines where modifications for bubble mismatch can be identified #19

Open GavinHuttley opened 1 year ago

GavinHuttley commented 1 year ago

We want to collapse bubbles of size k, since those will just be single-base mismatches.

What lines in dbga are most pertinent to this?

xingjianleng commented 1 year ago

Currently, there is one attribute in debruijn_pairwise.py and debruijn_msa.py named self.expansion, which includes the merge k-mer indices and node indices in bubbles. However, as suggested in https://github.com/xingjianleng/DBGA/issues/17, we should move this calculation after alignment() is called.

By using the expansion variable, we can obtain the correspondence between bubbles from each sequence (they should appear at the same index in the expansion for each sequence, i.e., we can use [expansion[j][i] for j in range(num_seqs)] to extract bubbles for all sequences).
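To make the indexing concrete, here is a minimal sketch of that extraction. It assumes expansion is a per-sequence list whose i-th entry describes the i-th bubble; the toy data and the helper name bubbles_at are hypothetical and not taken from dbga.

```python
from typing import Any, List, Sequence


def bubbles_at(expansion: Sequence[Sequence[Any]], i: int) -> List[Any]:
    """Return the i-th bubble of every sequence (hypothetical helper).

    Assumes expansion[j][i] holds whatever describes bubble i of sequence j,
    and that bubbles line up by index across sequences, as described above.
    """
    num_seqs = len(expansion)
    return [expansion[j][i] for j in range(num_seqs)]


# Toy data only: each bubble is represented here as a list of node indices.
expansion = [
    [[3, 4], [9]],       # sequence 0: bubble 0, bubble 1
    [[3, 5], [9, 10]],   # sequence 1: bubble 0, bubble 1
]

first_bubbles = bubbles_at(expansion, 0)
print(first_bubbles)
```

The real entries of self.expansion may be richer than plain index lists; only the [j][i] access pattern is taken from the comment above.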

In debruijn_pairwise.py, the current implementation doesn't use the expansion. We should refactor the alignment() function to take advantage of the expansion (similar to alignment() in debruijn_msa.py), then change it according to the aforementioned approach.

In debruijn_msa.py, we should replace the block of code below using the approach mentioned above. We should compare the lengths of the bubbles for each sequence. If they differ by 1, we may be able to collapse the bubble rather than calling the cogent3 alignment. https://github.com/xingjianleng/DBGA/blob/0da8dca853a98168dd858ad826b126827ee322b9/src/dbga/debruijn_msa.py#L485-L509
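A rough sketch of the length check could look like the following. The function name classify_bubble, its return labels, and the exact decision rule are assumptions based on my reading of this thread (bubbles of size k are single-base mismatches, and a length difference of 1 may be collapsible); this is not the code in debruijn_msa.py.

```python
from typing import Sequence


def classify_bubble(lengths: Sequence[int], k: int) -> str:
    """Decide how a bubble might be handled (hypothetical helper).

    lengths: number of nodes in the bubble for each sequence.
    k: the k-mer size used to build the de Bruijn graph.
    """
    if all(length == k for length in lengths):
        # Every branch has exactly k nodes: treated as a single-base mismatch.
        return "mismatch"
    if max(lengths) - min(lengths) == 1:
        # Branch lengths differ by 1: possibly collapsible without alignment.
        return "length_diff_1"
    # Anything else would fall back to a full alignment (cogent3 in dbga).
    return "align"


print(classify_bubble([3, 3], k=3))
print(classify_bubble([3, 4], k=3))
print(classify_bubble([2, 6], k=3))
```

Only the "align" branch would need to call cogent3; the other two cases are the candidates for collapsing directly.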