xingjianleng / DBGA

The repository for the genome sequence alignment research project
BSD 3-Clause "New" or "Revised" License

extending dbga to use a different substitution model #20

Open GavinHuttley opened 1 year ago

GavinHuttley commented 1 year ago

At present it's hard-coded to use a nucleotide model, but it could be extended to use a codon or amino-acid model.

What lines in dbga are most pertinent to this?

GavinHuttley commented 1 year ago

Related to this, using a codon model to align bubbles means we need to ensure the sequences are "in frame". This relates to the choice of sequence segments sent to the cogent3 aligner. We also need to know which lines are pertinent to this.

xingjianleng commented 1 year ago

In the amino-acid case, we may be able to just change the algorithm used for aligning the bubbles. Currently we use a wrapper function around the cogent3 DNA alignment for both pairwise and multiple sequence alignment. If we can change the model used for the cogent3 alignment, then DBGA can easily be extended to the amino-acid case. https://github.com/xingjianleng/DBGA/blob/0da8dca853a98168dd858ad826b126827ee322b9/src/dbga/utils.py#L157-L263

However, the codon case is much more complex. I'm a little unsure what "in frame" means. If "in frame" is relative to the whole genome sequences, we might have to rewrite the alignment() function in both debruijn_pairwise.py and debruijn_msa.py, as the merged k-mers might affect these frames. However, if the frames are relative to the bubbles, we can simply extend the current alignment algorithm (the block of code referenced above) to use a codon model.

GavinHuttley commented 1 year ago

The first step is to change to using a pairwise distance, which gives a tree with its length equally divided. We can then use the conventional cogent3 algorithm and thus get access to different forms of substitution model.
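As a rough illustration of "a tree with length equally divided", here is a minimal sketch for the two-sequence case. It assumes the pairwise distance has already been estimated by some other means (the thread does not say how), and `two_taxon_newick` is a hypothetical helper name, not a DBGA or cogent3 function.

```python
def two_taxon_newick(name1: str, name2: str, distance: float) -> str:
    """Return a two-tip Newick tree whose total length equals `distance`.

    The distance is split equally, so each tip gets a branch of distance / 2.
    How `distance` is estimated is outside this sketch (assumption).
    """
    if distance < 0:
        raise ValueError("distance must be non-negative")
    half = distance / 2
    return f"({name1}:{half},{name2}:{half});"


# Hypothetical sequence names and distance, for illustration only.
tree = two_taxon_newick("seq1", "seq2", 0.5)
print(tree)
```

The resulting Newick string could then serve as the guide tree handed to whatever aligner is used for the bubble.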

xingjianleng commented 1 year ago

Divide-and-conquer is used in the de Bruijn alignment. Will this affect the substitution models in the cogent3 algorithm, since we are effectively losing information about the whole genome sequence?

GavinHuttley commented 1 year ago

I do not think so, because you are already slicing the original sequences before sending them to cogent3. For the codon model, this issue implies only that the coordinates at which we slice the original sequences must be chosen modulo 3, so that each slice starts and ends on a codon boundary.
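A minimal sketch of the modulo 3 constraint, assuming the reading frame starts at index 0 of each original sequence and that a bubble is described by half-open `(start, end)` coordinates. `snap_to_codon_boundaries` and its `frame_offset` parameter are hypothetical and are not part of DBGA or cogent3.

```python
def snap_to_codon_boundaries(start, end, seq_len, frame_offset=0):
    """Widen the half-open slice [start, end) so both ends sit on codon boundaries.

    frame_offset is the index of the first base of the first complete codon
    (assumed to be 0 unless stated otherwise).
    """
    if not (frame_offset <= start <= end <= seq_len):
        raise ValueError("invalid slice coordinates")
    # Move the start left to the first base of the codon containing it.
    new_start = start - (start - frame_offset) % 3
    # Move the end right to the next codon boundary.
    new_end = end + (-(end - frame_offset)) % 3
    # If that overshoots the sequence, drop the trailing incomplete codon.
    if new_end > seq_len:
        new_end -= 3
    return new_start, new_end


seq = "ATGGCCAAATTTGG"  # toy sequence: ATG GCC AAA TTT + 2 leftover bases
s, e = snap_to_codon_boundaries(4, 8, len(seq))
print(s, e, seq[s:e])  # 3 9 GCCAAA
```

Widening rather than shrinking the slice is a design choice in this sketch: it keeps every base of the bubble inside the segment sent to the aligner, at the cost of including a few flanking bases that were already matched.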