xingjianleng / DBGA

The repository for the genome sequence alignment research project
BSD 3-Clause "New" or "Revised" License
3 stars 1 forks source link

Project description and milestones #8

Closed xingjianleng closed 2 years ago

xingjianleng commented 2 years ago

Description:

Goals:

To provide a pairwise sequence alignment algorithm with lower time and memory complexity.

Problem to solve:

Current sequence alignment algorithms don't scale well when the input sequences are increasingly longer. One of the most popular sequence alignment software "MUSCLE" leads to Segmentation Fault when aligning two sequences with 30,000 lengths. So, I would like to experiment with whether there is an efficient way to align two closely related sequences.

Possible approaches of solutions:

Generally, we are converting the sequences into a de Bruijn graph then map to a partial order graph to do the alignment. We can first construct the de Bruijn graph for two input sequences. Then, we could use some decycling techniques to ensure the de Bruijn graph doesn't contain any circles. For the linear part of the de Bruijn graph, we can add these nodes to a partial order graph. For bubbles in the graph, we can call partial order alignment to get the solution to sub-problems and add them to the partial order graph.

Five keywords description:

Bioinformatics; Pairwise sequence alignment; de Bruijn graph, Partial order graph; Graph theory;

Milestones:

Semester 1:

Week 1 - 4: Literature review Week 5 - 7: Fast prototype implementation Week 8 - 9: Code refactoring & Adaptation to corner cases Week 10 - 12: Report introduction draft

Semester 2:

Week 1 - 3: Rewriting the algorithm with a more efficient programming language Week 4 - 8: Finishing the rest part of the report & preparing for presentation Week 9 - 11: Polishing the report and presentation

biolinyu commented 2 years ago

@xingjianleng

You might combine the description into a single paragraph, e.g.,

Project Descriprion: The goal of this project is to develop efficient graph-based sequence alignment algorithms. Current sequence alignment algorithms don't scale well with long sequences, even when these sequences are closely related. We plan to investigate whether the de Bruijn graph model enables an efficient way to align closely related sequences. More specifically, we first build a de Bruijn graph from input sequences, covert the de Bruijn graph into the partial order graph, and finally output the sequence alignment. We will investigate how to decycle the de Bruijn graph and collapse its bubbles into local partial order graphs in a divide-and-conquer way.

Comments:

  1. We might have the chance to develop a multiple sequence alignment algorithm (e.g. for hundreds/thousands of closely related virus strains) in S2 2022 abd thus I removed the word "pairwise".
  2. I removed the MUSCLE example in the project description as MUSCLE seems not to support long input sequences (not a scalibility issue), e.g., " there is no explicit limit on the length of a sequence, however if you are running a 32-bit version of muscle then the maximum will be very roughly 10,000 letters due to maximum addressable size of tables required in memory."
GavinHuttley commented 2 years ago

I endorse Yu Lin's remarks. (But you can still include it in the final report.)

xingjianleng commented 2 years ago

@biolinyu @GavinHuttley Thanks for the comments!