If possible, also run the same test on POA.
Attached is a zip archive of 22 different coronavirus genomes from GenBank, in GenBank format. Each file contains a single coronavirus genome. I suggest you extract the sequences and write them out to a single file in FASTA format.
How to do that is illustrated below.
import pathlib

from cogent3 import make_unaligned_seqs
from cogent3.parse.genbank import MinimalGenbankParser

# collect one sequence per GenBank file, keyed by its locus name
seqs = {}
for fn in pathlib.Path("/Users/gavin/repos/corona/data/raw").glob("*.gb"):
    with fn.open() as infile:
        # each file holds a single genome, so take the first record
        d = list(MinimalGenbankParser(infile))[0]
    name = d["locus"]
    assert name not in seqs
    seq = d["sequence"]
    seqs[name] = seq

# write all genomes to a single unaligned FASTA file
seqs = make_unaligned_seqs(data=seqs, moltype="dna")
seqs.write("~/Desktop/Outbox/corona-unaligned.fasta")
As you do your benchmarking, think about how you might control this computationally, that is, with a script that spawns the specified algorithm and applies the measurement methods to it.
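One way such a script might look is sketched below. It is only a starting point under stated assumptions: the function name `measure` is hypothetical, the command being run is a placeholder rather than a real aligner invocation, and it relies on `os.wait4`, which is only available on Unix-like systems (Python 3.9+ for `os.waitstatus_to_exitcode`).

```python
import os
import subprocess
import sys
import time


def measure(cmd):
    """Run cmd as a child process; return (exit_code, seconds, peak_rss).

    Hypothetical helper, not part of any library.
    """
    start = time.perf_counter()
    # output is discarded here; redirect it to a file if the aligner
    # writes its result to stdout
    proc = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    # os.wait4 reaps this specific child and returns its resource usage
    _, status, usage = os.wait4(proc.pid, 0)
    elapsed = time.perf_counter() - start
    exit_code = os.waitstatus_to_exitcode(status)
    proc.returncode = exit_code  # tell Popen the child was already reaped
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS
    return exit_code, elapsed, usage.ru_maxrss


# placeholder command: replace with the aligner invocation under test
cmd = [sys.executable, "-c", "x = list(range(10**6))"]
code, seconds, peak = measure(cmd)
print(f"exit={code} time={seconds:.2f}s peak_rss={peak}")
```

Because each algorithm runs in its own child process, the peak memory figure belongs to that run alone and is not mixed with the memory of the benchmarking script itself.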
Use the COVID genome sequence dataset to test the behavior of different alignment algorithms. Pay attention to differences in the alignment results, and to time and memory usage.
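As a rough first pass at comparing alignment results, summary statistics such as alignment length and gap fraction can be computed from each algorithm's output. The sketch below assumes the aligners write aligned FASTA with `-` as the gap character; the helper names `parse_fasta` and `summarise` and the two tiny demo alignments are made up for illustration.

```python
def parse_fasta(text):
    """Return {name: sequence} from FASTA-formatted text."""
    seqs, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # use the first word of the header as the sequence name
            name = line[1:].split()[0]
            seqs[name] = []
        elif name is not None:
            seqs[name].append(line)
    return {n: "".join(parts) for n, parts in seqs.items()}


def summarise(aln):
    """Alignment length and the fraction of all positions that are gaps."""
    lengths = {len(s) for s in aln.values()}
    if len(lengths) != 1:
        raise ValueError("sequences differ in length, not an alignment")
    length = lengths.pop()
    gaps = sum(s.count("-") for s in aln.values())
    return {"length": length, "gap_fraction": gaps / (length * len(aln))}


# made-up outputs standing in for two different algorithms
demo_a = ">s1\nACGT-A\n>s2\nAC-TTA\n"
demo_b = ">s1\nACGTA\n>s2\nACTTA\n"

summary_a = summarise(parse_fasta(demo_a))
summary_b = summarise(parse_fasta(demo_b))
print(summary_a)
print(summary_b)
```

These numbers only show that two alignments differ, not which one is better, so they are best read alongside the time and memory measurements.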