Testing of COVID genome sequence alignment on different algorithms and software

xingjianleng commented 2 years ago

Use the COVID genome sequence dataset to compare how different alignment algorithms behave. Pay attention to differences in the alignment results, run time, and memory usage.

[x] Obtain the COVID genome sequence dataset
[x] Trial several alignment methods (e.g. Needleman-Wunsch, Cogent3, MUSCLE) and test their performance on the COVID sequence alignment.
[x] Take extra care in choosing appropriate parameters for the MUSCLE alignment software.
[x] Record the results, run time, and memory usage of each method.
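
For reference, here is a minimal sketch of Needleman-Wunsch global alignment in pure Python. The scoring values (`match=1`, `mismatch=-1`, `gap=-2`) and the function name `needleman_wunsch` are assumptions chosen for illustration, not parameters taken from this issue. It fills the full score matrix, so time and memory grow with the product of the two sequence lengths. That makes it useful for checking behaviour on short sequences, but probably not practical as written for whole genomes.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align strings a and b; return (score, aligned_a, aligned_b).

    Scoring values are illustrative assumptions with a linear gap penalty.
    """
    n, m = len(a), len(b)

    # score[i][j] is the best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # substitution score for aligning a[i-1] with b[j-1]
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align the two characters
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # trace back from the bottom-right corner to recover one optimal alignment
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

When several alignments share the optimal score, the traceback returns only one of them, preferring a diagonal step when it is optimal.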

biolinyu commented 2 years ago

If possible, also run the same test on POA (partial order alignment).

GavinHuttley commented 2 years ago

Attached is a zip archive of 22 different coronavirus genomes from GenBank, in GenBank format. Each file contains a single genome. I suggest you extract the sequences and write them to a single file in FASTA format.

How to do that is illustrated below.

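```python
import pathlib

from cogent3 import make_unaligned_seqs
from cogent3.parse.genbank import MinimalGenbankParser

# maps each GenBank locus name to its sequence
seqs = {}
for fn in pathlib.Path("/Users/gavin/repos/corona/data/raw").glob("*.gb"):
    with fn.open() as infile:
        # each file holds a single genome, so take the first record
        d = list(MinimalGenbankParser(infile))[0]
        name = d["locus"]
        # guard against two files sharing a locus name
        assert name not in seqs
        seq = d["sequence"]
        seqs[name] = seq

# build one unaligned collection and write it out as FASTA
seqs = make_unaligned_seqs(data=seqs, moltype="dna")
seqs.write("~/Desktop/Outbox/corona-unaligned.fasta")
```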

raw.zip

GavinHuttley commented 2 years ago

As you do your benchmarking, think about how you might control it computationally: for example, with a script that spawns the specified algorithm and applies the measurement methods to it.
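
One possible shape for such a script is sketched below, assuming a Unix-like system and Python 3.9 or later. The names `RunStats` and `measure` are hypothetical, introduced only for this example. It starts the command as a child process, waits for it with `os.wait4` so that the resource usage belongs to that child alone, and reports wall-clock time, CPU time, and peak resident memory. Note that `ru_maxrss` is reported in kilobytes on Linux and in bytes on macOS.

```python
import os
import subprocess
import sys
import time
from dataclasses import dataclass


@dataclass
class RunStats:
    """Measurements for one run (hypothetical container for this sketch)."""

    command: list
    exit_code: int
    wall_seconds: float
    cpu_seconds: float
    max_rss: int  # kilobytes on Linux, bytes on macOS


def measure(command):
    """Run command as a child process and measure its time and peak memory."""
    start = time.perf_counter()
    proc = subprocess.Popen(
        command,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    # os.wait4 blocks until this child exits and returns its own
    # resource usage (Unix only)
    _, status, usage = os.wait4(proc.pid, 0)
    wall = time.perf_counter() - start

    # the child has been reaped here, so record the exit code on the
    # Popen object to stop it from trying to wait again
    proc.returncode = os.waitstatus_to_exitcode(status)

    return RunStats(
        command=list(command),
        exit_code=proc.returncode,
        wall_seconds=wall,
        cpu_seconds=usage.ru_utime + usage.ru_stime,
        max_rss=usage.ru_maxrss,
    )
```

Any aligner with a command-line interface can be passed to `measure` as a list of arguments. A quick check of the harness itself is `measure([sys.executable, "-c", "x = list(range(100000))"])`. Running each command several times and recording every result would give some idea of the variation between runs.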

xingjianleng / DBGA

Testing of COVID genome sequence alignment on different algorithms and software #2