Bioinformatics

Code inspired by Bioinformatics Algorithms: an Active Learning Approach and by Rosalind.

NB: functions generally use zero based indexing; Rosalind uses 1-based.

I have written this code in Python 3, usually with the most recent release at the time each module was written-- 3.12.0 (tags/v3.12.0:0fb18b0, Oct 2 2023, 13:03:39) at the time this file was last updated. I have used newly added features whenever they might make me more productive. I've been writing code since 1968, during which time computer have got faster, but my brain hasn't. You are welcome to use this code if it is useful to you, subject to the terms of the Licence, but it may need to be adapted if you use an earlier version of Python.

Bioinformatics Algorithms TextBook Track

1. Where in the Genome does DNA replication begin?

#	Location	Description
BA1A	replication.py	Compute the Number of Times a Pattern Appears in a Text
BA1B	replication.py	Find the Most Frequent Words in a String
BA1C	replication.py	Find the Reverse Complement of a String
BA1D	replication.py	Find All Occurrences of a Pattern in a String
BA1E	replication.py	Find Patterns Forming Clumps in a String
BA1F	replication.py	Find a Position in a Genome Minimizing the Skew
BA1G	replication.py	Compute the Hamming Distance Between Two Strings
BA1H	replication.py	Find All Approximate Occurrences of a Pattern in a String
BA1I	replication.py	Find the Most Frequent Words with Mismatches in a String
BA1J	replication.py	Find Frequent Words with Mismatches and Reverse Complements
BA1K	replication.py	Generate the Frequency Array of a Strings
BA1L	replication.py	Implement PatternToNumber
BA1M	replication.py	Implement NumberToPattern
BA1N	replication.py	Generate the d-Neighborhood of a String
-	count_kmers.py	Challenge Problems from Bioinformatics Algorithms: determine number of times each kmer appears in DNA, allowing for reverse complements and d-neighbourhoods.

2. Which DNA Patterns Play the Role of Molecular Clocks?

#	Location	Description
BA2A	sequence.py	Implement MotifEnumeration
BA2B	sequence.py	Find a Median String
BA2C	sequence.py	Find a Profile-most Probable k-mer in a String
BA2D	sequence.py	Implement GreedyMotifSearch
BA2E	sequence.py	Implement GreedyMotifSearch with Pseudocounts
BA2F	sequence.py	Implement RandomizedMotifSearch
BA2G	sequence.py	Implement GibbsSampler
BA2H	sequence.py	Implement DistanceBetweenPatternAndStrings

3. How Do We Assemble Genomes? See also Genome Assembly

#	Location	Description
BA3A	assemble.py	Generate the k-mer Composition of a String
BA3B	assemble.py	Reconstruct a String from its Genome Path
BA3C	assemble.py	Construct the Overlap Graph of a Collection of k-mers
BA3D	assemble.py	Construct the De Bruijn Graph of a String
BA3E	assemble.py	Construct the De Bruijn Graph of a Collection of k-mers
BA3F	assemble.py	Find an Eulerian Cycle in a Graph
BA3G	assemble.py	Find an Eulerian Path in a Graph
BA3H	assemble.py	Reconstruct a String from its k-mer Composition
BA3I	assemble.py	Find a k-Universal Circular String
BA3J	assemble.py	Reconstruct a String from its Paired Composition
asmq	ASMQ.py	Assessing Assembly Quality with N50 and N75
corr	CORR.py	Error Correction in Reads
gasm	GASM.py	Genome Assembly Using Reads
grep	GREP.py	Genome Assembly with Perfect Coverage and Repeats
KMER	assemble.py	Generalizing GC-Content
long	LONG.py	Genome Assembly as Shortest Superstring
pcov	PCOV.py	Genome Assembly with Perfect Coverage

4. How Do We Sequence Antibiotics? See also Computational Mass Spectrometry

#	Location	Description
BA4A	spectrum.py	Translate an RNA String into an Amino Acid String
BA4B	spectrum.py	Find Substrings of a Genome Encoding a Given Amino Acid String
BA4C	spectrum.py	Generate the Theoretical Spectrum of a Cyclic Peptide
BA4D	spectrum.py	Compute the Number of Peptides of Given Total Mass
BA4E	spectrum.py	Find a Cyclic Peptide with Theoretical Spectrum Matching an Ideal Spectrum
BA4F	spectrum.py	Compute the Score of a Cyclic Peptide Against a Spectrum
BA4G	spectrum.py	Implement LeaderboardCyclopeptideSequencing
BA4H	spectrum.py	Generate the Convolution of a Spectrum
BA4I	spectrum.py	Implement ConvolutionCyclopeptideSequencing
BA4J	spectrum.py	Generate the Theoretical Spectrum of a Linear Peptide
BA4K	spectrum.py	Compute the Score of a Linear Peptide
BA4L	spectrum.py	Trim a Peptide Leaderboard
BA4M	ba4m.py spectrum.py	Solve the Turnpike Problem
-	tyrocidine.py	Challenge Problems from Bioinformatics Algorithms: sequence Tyrocidine from supplied spectrum WIP
conv	conv.py+spectrum.py	Comparing Spectra with the Spectral Convolution
full	full.py+spectrum.py	Inferring Peptide from Full Spectrum
prtm	spectrum.py	Calculating Protein Mass - get_weight(...)
prsm	prsm.py+spectrum.py	Matching a Spectrum to a Protein
spec	spec.py+spectrum.py	Inferring Protein from Spectrum
sgra	sgra.py+spectrum.py	Using the Spectrum Graph to Infer Peptides

5. How do we compare Biological Sequences? See also Alignment

#	Location	Description
BA5A	align.py	Find the Minimum Number of Coins Needed to Make Change
BA5B	align.py	Find the Length of a Longest Path in a Manhattan-like Grid
BA5C	align.py	Find a Longest Common Subsequence of Two Strings
BA5D	align.py	Find the Longest Path in a DAG
BA5E	align.py	Find a Highest-Scoring Alignment of Two Strings
BA5F	align.py	Find a Highest-Scoring Local Alignment of Two Strings
BA5G	align.py	Compute the Edit Distance Between Two Strings
BA5H	BA5H.py align.py	Find a Highest-Scoring Fitting Alignment of Two Strings
BA5I	BA5I.py align.py	Find a Highest-Scoring Overlap Alignment of Two Strings
BA5J	BA5J.py align.py	Align Two Strings Using Affine Gap Penalties
BA5K	BA5K.py align.py	Find a Middle Edge in an Alignment Graph in Linear Space
BA5L	BA5L.py align.py	Align Two Strings Using Linear Space WIP
BA5M	BA5M.py align.py	Find a Highest-Scoring Multiple Sequence Alignment
BA5N	align.py align.py	Find a Topological Ordering of a DAG
ctea	ctea.py align.py	Counting Optimal Alignments
edit	edit.py align.py	Edit Distance
edta	edta.py align.py	Edit Distance Alignment
gaff	gaff.py align.py	Global Alignment with Scoring Matrix and Affine Gap Penalty
gcon	gcon.py align.py	Global Alignment with Constant Gap Penalty
GLOB	glob.py align.py	Global Alignment with Scoring Matrix
hamm	rosalind.py	Counting Point Mutations
laff	laff.py	Local Alignment with Affine Gap Penalty
loca	loca.py align.py	Local Alignment with Scoring Matrix
mgap	mgap.py align.py	Maximizing the Gap Symbols of an Optimal Alignment
mult	mult.py align.py	Multiple Alignment
oap	oap.py align.pys	Overlap Alignment
osym	osym.py align.py	Isolating Symbols in Alignments WIP
pdst	align.py	Creating a Distance Matrix
sims	sims.py align.py	Finding Mutated Motifs
smgb	smgb.py align.py	Semiglobal Alignment
tran	rosalind.py	Transitions and Transversions

6. Are there fragile regions in the human genome?

#	Location	Description
BA6A	BA6A.py fragile.py	Implement GreedySorting to Sort a Permutation by Reversals
BA6B	BA6B.py fragile.py	Compute the Number of Breakpoints in a Permutation
BA6C	BA6C.py fragile.py	Compute the 2-Break Distance Between a Pair of Genomes
BA6D	BA6D.py	Find a Shortest Transformation of One Genome into Another by 2-Breaks WIP
BA6E	fragile.py	Find All Shared k-mers of a Pair of Strings
BA6F	BA6F.py fragile.py	Implement Chromosome to Cycle
BA6G	BA6G.py fragile.py	Implement Cycle to Chromosome
BA6H	BA6H.py fragile.py	Implement Coloured Edges
BA6I	BA6I.py fragile.py	Implement GraphToGenome
BA6J	BA6J.py fragile.py	Implement 2-BreakOnGenomeGraph
BA6K	BA6K.py fragile.py	Implement 2-BreakOnGenome

7. Which animal gave us SARS? See also Phylogeny

#	Location	Description
BA7A	BA7A.py phylogeny.py	Compute Distances Between Leaves
BA7B	BA7B.py phylogeny.py	Limb Length Problem
BA7C	BA7C.py phylogeny.py	Implement Additive Phylogeny See also
BA7D	BA7D.py phylogeny.py	Implement UPGMA
BA7E	BA7E.py phylogeny.py	Implement the Neighbor Joining Algorithm
BA7F	BA7F.py phylogeny.py	Implement SmallParsimony
BA7G	BA7G.py phylogeny.py	Adapt SmallParsimony to Unrooted Trees
alph	alph.py phylogeny.py	Alignment-Based Phylogeny
chbp	chbp.py phylogeny.py	Character-Based Phylogeny
cntq	cntq.py phylogeny.py	Counting Quartets
cset	cset.py phylogeny.py	Fixing an Inconsistent Character Set
cstr	cstr.py phylogeny.py	Creating a Character Table from Genetic Strings
ctbl	ctbl.py phylogeny.py	Creating a Character Table
cunr	phylogeny.py	Counting Unrooted Binary Trees
eubt	eubt.py phylogeny.py	Enumerating Unrooted Binary Trees
inod	-	Counting Phylogenetic Ancestors No code needed
mend	mend.py phylogeny.py	Inferring a Genotype from a Pedigree
nkew	nwck.py	Weighting the Tree
nwck	nwck.py	Distances in Trees
pdst	rosalind.py	Creating a Distance Matrix
qrt	qrt.py phylogeny.py	Quartets
qrtd	qrtd.py phylogeny.py	Quartet Distance WIP
root	phylogeny.py	Counting Rooted Binary Trees
rsub	rsub.py phylogeny.py	Identifying Reversing Substitutions
sptd	sptb.py phylogeny.py	Phylogeny Comparison with Split Distance
tree	tree.py phylogeny.py	Completing a Tree

8. How did Yeast become a Winemaker?

#	Location	Description
BA8A	BA8A.py	Implement FarthestFirstTraversal
BA8B	BA8B.py	Compute the Squared Error Distortion
BA8C	BA8C.py	Implement the Lloyd Algorithm for k-Means Clustering
BA8D	BA8D.py	Implement the Soft k-Means Clustering Algorithm
BA8E	BA8E.py	Implement Hierarchical Clustering

9. How do we locate disease causing mutations?

#	Location	Description
snp.py	Code for: How do we locate disease causing mutations?
BA9A	BA9A.py snp.py	Construct a Trie from a Collection of Patterns
BA9B	BA9B.py snp.py	Implement TrieMatching
BA9C	BA9C.py snp.py	Construct the Suffix Tree of a String
BA9D	BA9D.py	Find the Longest Repeat in a String
BA9E	BA9E.py snp.py	Find the Longest Substring Shared by Two Strings
BA9F	BA9F.py	Find the Shortest Non-Shared Substring of Two Strings WIP
BA9G	BA9G.py snp.py	Construct the Suffix Array of a String
BA9H	BA9H.py	Pattern Matching with the Suffix Array
BA9I	BA9I.py snp.py	Construct the Burrows-Wheeler Transform of a String
BA9J	BA9J.py snp.py	Reconstruct a String from its Burrows-Wheeler Transform
BA9K	BA9K.py snp.py	Generate the Last-to-First Mapping of a String
BA9L	BA9L.py snp.py	Implement BWMatching
BA9M	BA9M.py snp.py	Implement BetterBWMatching
BA9N	BA9N.py snp.py	Find All Occurrences of a Collection of Patterns in a String
BA9O	BA9O.py	Find All Approximate Occurrences of a Collection of Patterns in a String WIP
BA9P	BA9P.py	Implement Tree Colouring
BA9Q	BA9Q.py snp.py	Construct the Partial Suffix Array of a String
BA9R	BA9R.py	Construct a Suffix Tree from a Suffix Array WIP

10. Why have biologists still not developed an HIV vaccine?

#	Location	Description
BA10A	BA10A.py hmm.py	Compute the Probability of a Hidden Path
BA10B	BA10B.py hmm.py	Compute the Probability of an Outcome Given a Hidden Path
BA10C	BA10C.py hmm.py	Implement the Viterbi Algorithm
BA10D	BA10D.py hmm.py	Compute the Probability of a String Emitted by an HMM
BA10E	BA10E.py hmm.py	Construct a Profile HMM
BA10F	BA10F.py hmm.py	Construct a Profile HMM with Pseudocounts
BA10G	BA10G.py hmm.py	Perform a Multiple Sequence Alignment with a Profile HMM WIP
BA10H	BA10H.py hmm.py	Estimate the Parameters of an HMM
BA10I	BA10I.py hmm.py	Implement Viterbi Learning
BA10J	BA10J.py hmm.py	Solve the Soft Decoding Problem
BA10K	BA10K.py hmm.py	Implement Baum-Welch Learning

11. Was T.rex just a big chicken?

#	Location	Description
BA11A	BA11A.py spectrum.py	Spectrum Graph Construction
BA11B	BA11B.py spectrum.py	Implement DecodingIdealSpectrum
BA11C	BA11C.py spectrum.py	Convert a Peptide into a Peptide Vector
BA11D	BA11D.py spectrum.py	Convert a Peptide Vector into a Peptide
BA11E	BA11E.py spectrum.py	Sequence a Peptide
BA11F	BA11F.py spectrum.py	Find a Highest-Scoring Peptide in a Proteome against a Spectrum WIP
BA11G	BA11G.py spectrum.py	Implement PSMSearch WIP
BA11H	BA11H.py spectrum.py	Compute the Size of a Spectral Dictionary
BA11I	BA11I.py spectrum.py	Compute the Probability of a Spectral Dictionary
BA11J	BA11J.py spectrum.py	Find a Highest-Scoring Modified Peptide against a Spectrum WIP

Combinatorics

#	Location	Description
combinatorics.py	Combinatorics
aspc	rosalind.py	Introduction to Alternative Splicing
cat	cat.py combinatorics.py	Catalan Numbers and RNA Secondary Structures
cntq	cntq.py	Counting Quartets
cunr	phylogeny.py	Counting Unrooted Binary Trees
eubt	eubt.py phylogeny.py	Enumerating Unrooted Binary Trees
fibd	rosalind.py	Mortal Fibonacci Rabbits
fib	rosalind.py	Rabbits and Recurrence Relations
motz	motz.py combinatorics.py	Motzkin Numbers and RNA Secondary Structures
mrep	mrep.py	Identifying Maximal Repeats
mrna	rosalind.py	Inferring mRNA from Protein
orf	orf.py	Open Reading Frames
pmch	pmch.py	Perfect Matchings and RNA Secondary Structures
pper	rosalind.py	Partial Permutations
rnas	rnas.py	Wobble Bonding and RNA Secondary Structures WIP
root	phylogeny.py	Counting Rooted Binary Trees

Genome Rearrangements

#	Location	Description
lgis	lgis.py	Longest Increasing Subsequence
rear	rear.py fragile.py	Reversal distance
sort	sort.py fragile.py	Sorting by Reversals

Graphs & Graph Algorithms

#	Location	Description
2sat	sat.py graphs.py	2-Satisfiability
bf	bf.py graphs.py	Bellman-Ford Algorithm
bfs	bfs.py graphs.py	Breadth First search
bins	BINS.py	Binary Search
bip	bip.py graphs.py	Testing Bipartiteness
cc	cc.py graphs.py	Connected Components
cte	cte.py graphs.py	Shortest Cycle Through a Given Edge
dag	dag.py graphs.py	Testing Acyclicity
ddeg	DDEG.py graphs.py	Double-Degree Array
deg	DEG.py graphs.py	Degree Array
dfs	graphs.py	Depth First Search ???
dij	dij.py graphs.py	Dijkstra's Algorithm
grph	graphs.py	Overlap Graphs
gs	gs.py graphs.py	General Sink
hdag	hdag.py graphs.py	Hamiltonian Path in DAG
lrep	lrep.py	Finding the Longest Multiple Repeat WIP
nwc	nwc.py graphs.py	Negative Weight Cycle
scc	scc.py graphs.py	Strongly Connected Components
sc	sc.py graphs.py	Semi-Connected Graph
sdag	sdag.py graphs.py	Shortest Paths in DAG
sq	sq.py graphs.py	Square in a Graph
suff	suff.py graphs.py	Creating a Suffix Tree
tree	tree.py phylogeny.py	Completing a Tree
trie	trie.py snp.py	Introduction to Pattern Matching
ts	ts.py align.py	Topological sort

Heredity

#	Location	Description
iev	rosalind.py	Calculating Expected Offspring
indq	rosalind.py	Independent Segregation of Chromosomes
iprb	rosalind.py	Mendel's First Law
lia	rosalind.py	Independent Alleles
mend	mend.py phylogeny.py	Inferring a Genotype from a Pedigree
sexl	rosalind.py	Sex-Linked Inheritance

Population Dynamics

#	Location	Description
afrq	rosalind.py	Counting Disease Carriers
foun	rosalind.py	The Founder Effect and Genetic Drift
wfmd	rosalind.py	The Wright-Fisher Model of Genetic Drift

Probability

#	Location	Description
ebin	rosalind.py	Wright-Fisher's Expected Behavior
eval	rosalind.py	Expected Number of Restriction Sites
mend	mend.py phylogeny.py	Inferring a Genotype from a Pedigree
rstr	rosalind.py	Matching Random Motifs

Sequence Analysis

#	Location	Description
frmt	FRMT.py	Data Formats
gbk	GBK.py	GenBank Introduction
orfr	orfr.py	Finding Genes with Open Reading Frames
ptra	PTRA.py	Protein Translation

Sorting

#	Location	Description
2sum	2sum.py	2SUM
3sum	3sum.py	3SUM
ins	INS.py	Insertion Sort
inv	INV.py	Counting Inversions
med	MED.py	Median
mer	MER.py	Merge Two Sorted Arrays
mprt	MPRT.py	Finding a Protein Motif
ms	MS.py	Mergesort
par	PAR.py	2-Way Partition
par3	PAR3.py	3-Way Partition
ps	ps.py	Partial sort
qs	QS.py	Quicksort

Strings & String Algorithms

#	Location	Description
cons	rosalind.py	Consensus and Profile
dna	dna.py rosalind.py	Counting DNA Nucleotides
gc	rosalind.py	Computing GC Content
itwv	itwv.py align.py	Finding Disjoint Motifs in a Gene
kmer	rosalind.py	Generalizing GC-Content
kmp	kmp.py	Speeding Up Motif Finding
ksim	ksim.py	Finding All Similar Motifs WIP
lcsm	rosalind.py	Finding a Shared Motif
lcsq	lcsq.py	Finding a Shared Spliced Motif
ling	ling.py	Linguistic Complexity of a Genome
ukkonen.py	µTuX's implementation of Ukkonen's algorithm
mmch	mmch.py	Maximum Matchings and RNA Secondary Structures
prot	spectrum.py	Translating RNA into Protein
revc	rosalind.py	Complementing a Strand of DNA
revp	revp.py rosalind.py	Locating Restriction Sites
rna	rna.py rosalind.py	Transcribing DNA into RNA
scsp	scsp.py	Interleaving Two Motifs
splc	spectrum.py	RNA Splicing
sseq	rosalind.py	Finding a Spliced Motif
subs	rosalind.py	Finding a Motif in DNA

Set Theory

#	Location	Description
pdpl	pdpl	Creating a Restriction Map
seto	-	Set operations Solved using Python shell
sset	-	Counting Subsets Solved with a single line in Python shell: 2**n%1000000

#	Location	Description
fibo	fibo.py	Fibonacci numbers

C++ code

Location	Header	Description
Makefile	for C++ code
newick.cpp	newick.h	Parser for files in Newick format
qrtd.cpp	qrtd.h	Quartet Distance WIP
test-newick.cpp	Test Newick Parser
test-qrtd.cpp	Test Quartet Distance WIP
test-tree.cpp	Tests for tree.cpp
tests.cpp	catch.hpp	C++ test framework
tree.cpp	tree.h	Code representing a phylogeny

Support

Location	Description
bioinformatics.bib	Bibliography
findtests.py	Generates shell script to execute all tests
helpers.py	Utilities for formatting output, parsing input, etc
LICENSE	License Agreement
newick.py	Parser for files in Newick format
README.md	This file
rosalind.py	Shared code
rosalind.wpr	WingWare Project File
template.py	Template for getting started on a problem

weka511 / bioinformatics

readme