weka511 / bioinformatics

Code inspired by Bioinformatics Algorithms: an Active Learning Approach.
GNU General Public License v3.0
115 stars 35 forks source link

Bioinformatics

Code inspired by Bioinformatics Algorithms: an Active Learning Approach and by Rosalind.

NB: functions generally use zero based indexing; Rosalind uses 1-based.

I have written this code in Python 3, usually with the most recent release at the time each module was written-- 3.12.0 (tags/v3.12.0:0fb18b0, Oct 2 2023, 13:03:39) at the time this file was last updated. I have used newly added features whenever they might make me more productive. I've been writing code since 1968, during which time computer have got faster, but my brain hasn't. You are welcome to use this code if it is useful to you, subject to the terms of the Licence, but it may need to be adapted if you use an earlier version of Python.

Bioinformatics Algorithms TextBook Track

1. Where in the Genome does DNA replication begin?

# Location Description
BA1A replication.py Compute the Number of Times a Pattern Appears in a Text
BA1B replication.py Find the Most Frequent Words in a String
BA1C replication.py Find the Reverse Complement of a String
BA1D replication.py Find All Occurrences of a Pattern in a String
BA1E replication.py Find Patterns Forming Clumps in a String
BA1F replication.py Find a Position in a Genome Minimizing the Skew
BA1G replication.py Compute the Hamming Distance Between Two Strings
BA1H replication.py Find All Approximate Occurrences of a Pattern in a String
BA1I replication.py Find the Most Frequent Words with Mismatches in a String
BA1J replication.py Find Frequent Words with Mismatches and Reverse Complements
BA1K replication.py Generate the Frequency Array of a Strings
BA1L replication.py Implement PatternToNumber
BA1M replication.py Implement NumberToPattern
BA1N replication.py Generate the d-Neighborhood of a String
- count_kmers.py Challenge Problems from Bioinformatics Algorithms: determine number of times each kmer appears in DNA, allowing for reverse complements and d-neighbourhoods.

2. Which DNA Patterns Play the Role of Molecular Clocks?

# Location Description
BA2A sequence.py Implement MotifEnumeration
BA2B sequence.py Find a Median String
BA2C sequence.py Find a Profile-most Probable k-mer in a String
BA2D sequence.py Implement GreedyMotifSearch
BA2E sequence.py Implement GreedyMotifSearch with Pseudocounts
BA2F sequence.py Implement RandomizedMotifSearch
BA2G sequence.py Implement GibbsSampler
BA2H sequence.py Implement DistanceBetweenPatternAndStrings

3. How Do We Assemble Genomes? See also Genome Assembly

# Location Description
BA3A assemble.py Generate the k-mer Composition of a String
BA3B assemble.py Reconstruct a String from its Genome Path
BA3C assemble.py Construct the Overlap Graph of a Collection of k-mers
BA3D assemble.py Construct the De Bruijn Graph of a String
BA3E assemble.py Construct the De Bruijn Graph of a Collection of k-mers
BA3F assemble.py Find an Eulerian Cycle in a Graph
BA3G assemble.py Find an Eulerian Path in a Graph
BA3H assemble.py Reconstruct a String from its k-mer Composition
BA3I assemble.py Find a k-Universal Circular String
BA3J assemble.py Reconstruct a String from its Paired Composition
asmq ASMQ.py Assessing Assembly Quality with N50 and N75
corr CORR.py Error Correction in Reads
gasm GASM.py Genome Assembly Using Reads
grep GREP.py Genome Assembly with Perfect Coverage and Repeats
KMER assemble.py Generalizing GC-Content
long LONG.py Genome Assembly as Shortest Superstring
pcov PCOV.py Genome Assembly with Perfect Coverage

4. How Do We Sequence Antibiotics? See also Computational Mass Spectrometry

# Location Description
BA4A spectrum.py Translate an RNA String into an Amino Acid String
BA4B spectrum.py Find Substrings of a Genome Encoding a Given Amino Acid String
BA4C spectrum.py Generate the Theoretical Spectrum of a Cyclic Peptide
BA4D spectrum.py Compute the Number of Peptides of Given Total Mass
BA4E spectrum.py Find a Cyclic Peptide with Theoretical Spectrum Matching an Ideal Spectrum
BA4F spectrum.py Compute the Score of a Cyclic Peptide Against a Spectrum
BA4G spectrum.py Implement LeaderboardCyclopeptideSequencing
BA4H spectrum.py Generate the Convolution of a Spectrum
BA4I spectrum.py Implement ConvolutionCyclopeptideSequencing
BA4J spectrum.py Generate the Theoretical Spectrum of a Linear Peptide
BA4K spectrum.py Compute the Score of a Linear Peptide
BA4L spectrum.py Trim a Peptide Leaderboard
BA4M ba4m.py spectrum.py Solve the Turnpike Problem
- tyrocidine.py Challenge Problems from Bioinformatics Algorithms: sequence Tyrocidine from supplied spectrum WIP
conv conv.py+spectrum.py Comparing Spectra with the Spectral Convolution
full full.py+spectrum.py Inferring Peptide from Full Spectrum
prtm spectrum.py Calculating Protein Mass - get_weight(...)
prsm prsm.py+spectrum.py Matching a Spectrum to a Protein
spec spec.py+spectrum.py Inferring Protein from Spectrum
sgra sgra.py+spectrum.py Using the Spectrum Graph to Infer Peptides

5. How do we compare Biological Sequences? See also Alignment

# Location Description
BA5A align.py Find the Minimum Number of Coins Needed to Make Change
BA5B align.py Find the Length of a Longest Path in a Manhattan-like Grid
BA5C align.py Find a Longest Common Subsequence of Two Strings
BA5D align.py Find the Longest Path in a DAG
BA5E align.py Find a Highest-Scoring Alignment of Two Strings
BA5F align.py Find a Highest-Scoring Local Alignment of Two Strings
BA5G align.py Compute the Edit Distance Between Two Strings
BA5H BA5H.py align.py Find a Highest-Scoring Fitting Alignment of Two Strings
BA5I BA5I.py align.py Find a Highest-Scoring Overlap Alignment of Two Strings
BA5J BA5J.py align.py Align Two Strings Using Affine Gap Penalties
BA5K BA5K.py align.py Find a Middle Edge in an Alignment Graph in Linear Space
BA5L BA5L.py align.py Align Two Strings Using Linear Space WIP
BA5M BA5M.py align.py Find a Highest-Scoring Multiple Sequence Alignment
BA5N align.py align.py Find a Topological Ordering of a DAG
ctea ctea.py align.py Counting Optimal Alignments
edit edit.py align.py Edit Distance
edta edta.py align.py Edit Distance Alignment
gaff gaff.py align.py Global Alignment with Scoring Matrix and Affine Gap Penalty
gcon gcon.py align.py Global Alignment with Constant Gap Penalty
GLOB glob.py align.py Global Alignment with Scoring Matrix
hamm rosalind.py Counting Point Mutations
laff laff.py Local Alignment with Affine Gap Penalty
loca loca.py align.py Local Alignment with Scoring Matrix
mgap mgap.py align.py Maximizing the Gap Symbols of an Optimal Alignment
mult mult.py align.py Multiple Alignment
oap oap.py align.pys Overlap Alignment
osym osym.py align.py Isolating Symbols in Alignments WIP
pdst align.py Creating a Distance Matrix
sims sims.py align.py Finding Mutated Motifs
smgb smgb.py align.py Semiglobal Alignment
tran rosalind.py Transitions and Transversions

6. Are there fragile regions in the human genome?

# Location Description
BA6A BA6A.py fragile.py Implement GreedySorting to Sort a Permutation by Reversals
BA6B BA6B.py fragile.py Compute the Number of Breakpoints in a Permutation
BA6C BA6C.py fragile.py Compute the 2-Break Distance Between a Pair of Genomes
BA6D BA6D.py Find a Shortest Transformation of One Genome into Another by 2-Breaks WIP
BA6E fragile.py Find All Shared k-mers of a Pair of Strings
BA6F BA6F.py fragile.py Implement Chromosome to Cycle
BA6G BA6G.py fragile.py Implement Cycle to Chromosome
BA6H BA6H.py fragile.py Implement Coloured Edges
BA6I BA6I.py fragile.py Implement GraphToGenome
BA6J BA6J.py fragile.py Implement 2-BreakOnGenomeGraph
BA6K BA6K.py fragile.py Implement 2-BreakOnGenome

7. Which animal gave us SARS? See also Phylogeny

# Location Description
BA7A BA7A.py phylogeny.py Compute Distances Between Leaves
BA7B BA7B.py phylogeny.py Limb Length Problem
BA7C BA7C.py phylogeny.py Implement Additive Phylogeny See also
BA7D BA7D.py phylogeny.py Implement UPGMA
BA7E BA7E.py phylogeny.py Implement the Neighbor Joining Algorithm
BA7F BA7F.py phylogeny.py Implement SmallParsimony
BA7G BA7G.py phylogeny.py Adapt SmallParsimony to Unrooted Trees
alph alph.py phylogeny.py Alignment-Based Phylogeny
chbp chbp.py phylogeny.py Character-Based Phylogeny
cntq cntq.py phylogeny.py Counting Quartets
cset cset.py phylogeny.py Fixing an Inconsistent Character Set
cstr cstr.py phylogeny.py Creating a Character Table from Genetic Strings
ctbl ctbl.py phylogeny.py Creating a Character Table
cunr phylogeny.py Counting Unrooted Binary Trees
eubt eubt.py phylogeny.py Enumerating Unrooted Binary Trees
inod - Counting Phylogenetic Ancestors No code needed
mend mend.py phylogeny.py Inferring a Genotype from a Pedigree
nkew nwck.py Weighting the Tree
nwck nwck.py Distances in Trees
pdst rosalind.py Creating a Distance Matrix
qrt qrt.py phylogeny.py Quartets
qrtd qrtd.py phylogeny.py Quartet Distance WIP
root phylogeny.py Counting Rooted Binary Trees
rsub rsub.py phylogeny.py Identifying Reversing Substitutions
sptd sptb.py phylogeny.py Phylogeny Comparison with Split Distance
tree tree.py phylogeny.py Completing a Tree

8. How did Yeast become a Winemaker?

# Location Description
BA8A BA8A.py Implement FarthestFirstTraversal
BA8B BA8B.py Compute the Squared Error Distortion
BA8C BA8C.py Implement the Lloyd Algorithm for k-Means Clustering
BA8D BA8D.py Implement the Soft k-Means Clustering Algorithm
BA8E BA8E.py Implement Hierarchical Clustering

9. How do we locate disease causing mutations?

# Location Description
snp.py Code for: How do we locate disease causing mutations?
BA9A BA9A.py snp.py Construct a Trie from a Collection of Patterns
BA9B BA9B.py snp.py Implement TrieMatching
BA9C BA9C.py snp.py Construct the Suffix Tree of a String
BA9D BA9D.py Find the Longest Repeat in a String
BA9E BA9E.py snp.py Find the Longest Substring Shared by Two Strings
BA9F BA9F.py Find the Shortest Non-Shared Substring of Two Strings WIP
BA9G BA9G.py snp.py Construct the Suffix Array of a String
BA9H BA9H.py Pattern Matching with the Suffix Array
BA9I BA9I.py snp.py Construct the Burrows-Wheeler Transform of a String
BA9J BA9J.py snp.py Reconstruct a String from its Burrows-Wheeler Transform
BA9K BA9K.py snp.py Generate the Last-to-First Mapping of a String
BA9L BA9L.py snp.py Implement BWMatching
BA9M BA9M.py snp.py Implement BetterBWMatching
BA9N BA9N.py snp.py Find All Occurrences of a Collection of Patterns in a String
BA9O BA9O.py Find All Approximate Occurrences of a Collection of Patterns in a String WIP
BA9P BA9P.py Implement Tree Colouring
BA9Q BA9Q.py snp.py Construct the Partial Suffix Array of a String
BA9R BA9R.py Construct a Suffix Tree from a Suffix Array WIP

10. Why have biologists still not developed an HIV vaccine?

# Location Description
BA10A BA10A.py hmm.py Compute the Probability of a Hidden Path
BA10B BA10B.py hmm.py Compute the Probability of an Outcome Given a Hidden Path
BA10C BA10C.py hmm.py Implement the Viterbi Algorithm
BA10D BA10D.py hmm.py Compute the Probability of a String Emitted by an HMM
BA10E BA10E.py hmm.py Construct a Profile HMM
BA10F BA10F.py hmm.py Construct a Profile HMM with Pseudocounts
BA10G BA10G.py hmm.py Perform a Multiple Sequence Alignment with a Profile HMM WIP
BA10H BA10H.py hmm.py Estimate the Parameters of an HMM
BA10I BA10I.py hmm.py Implement Viterbi Learning
BA10J BA10J.py hmm.py Solve the Soft Decoding Problem
BA10K BA10K.py hmm.py Implement Baum-Welch Learning

11. Was T.rex just a big chicken?

# Location Description
BA11A BA11A.py spectrum.py Spectrum Graph Construction
BA11B BA11B.py spectrum.py Implement DecodingIdealSpectrum
BA11C BA11C.py spectrum.py Convert a Peptide into a Peptide Vector
BA11D BA11D.py spectrum.py Convert a Peptide Vector into a Peptide
BA11E BA11E.py spectrum.py Sequence a Peptide
BA11F BA11F.py spectrum.py Find a Highest-Scoring Peptide in a Proteome against a Spectrum WIP
BA11G BA11G.py spectrum.py Implement PSMSearch WIP
BA11H BA11H.py spectrum.py Compute the Size of a Spectral Dictionary
BA11I BA11I.py spectrum.py Compute the Probability of a Spectral Dictionary
BA11J BA11J.py spectrum.py Find a Highest-Scoring Modified Peptide against a Spectrum WIP

Combinatorics

# Location Description
combinatorics.py Combinatorics
aspc rosalind.py Introduction to Alternative Splicing
cat cat.py combinatorics.py Catalan Numbers and RNA Secondary Structures
cntq cntq.py Counting Quartets
cunr phylogeny.py Counting Unrooted Binary Trees
eubt eubt.py phylogeny.py Enumerating Unrooted Binary Trees
fibd rosalind.py Mortal Fibonacci Rabbits
fib rosalind.py Rabbits and Recurrence Relations
motz motz.py combinatorics.py Motzkin Numbers and RNA Secondary Structures
mrep mrep.py Identifying Maximal Repeats
mrna rosalind.py Inferring mRNA from Protein
orf orf.py Open Reading Frames
pmch pmch.py Perfect Matchings and RNA Secondary Structures
pper rosalind.py Partial Permutations
rnas rnas.py Wobble Bonding and RNA Secondary Structures WIP
root phylogeny.py Counting Rooted Binary Trees

Genome Rearrangements

# Location Description
lgis lgis.py Longest Increasing Subsequence
rear rear.py fragile.py Reversal distance
sort sort.py fragile.py Sorting by Reversals

Graphs & Graph Algorithms

# Location Description
2sat sat.py graphs.py 2-Satisfiability
bf bf.py graphs.py Bellman-Ford Algorithm
bfs bfs.py graphs.py Breadth First search
bins BINS.py Binary Search
bip bip.py graphs.py Testing Bipartiteness
cc cc.py graphs.py Connected Components
cte cte.py graphs.py Shortest Cycle Through a Given Edge
dag dag.py graphs.py Testing Acyclicity
ddeg DDEG.py graphs.py Double-Degree Array
deg DEG.py graphs.py Degree Array
dfs graphs.py Depth First Search ???
dij dij.py graphs.py Dijkstra's Algorithm
grph graphs.py Overlap Graphs
gs gs.py graphs.py General Sink
hdag hdag.py graphs.py Hamiltonian Path in DAG
lrep lrep.py Finding the Longest Multiple Repeat WIP
nwc nwc.py graphs.py Negative Weight Cycle
scc scc.py graphs.py Strongly Connected Components
sc sc.py graphs.py Semi-Connected Graph
sdag sdag.py graphs.py Shortest Paths in DAG
sq sq.py graphs.py Square in a Graph
suff suff.py graphs.py Creating a Suffix Tree
tree tree.py phylogeny.py Completing a Tree
trie trie.py snp.py Introduction to Pattern Matching
ts ts.py align.py Topological sort

Heredity

# Location Description
iev rosalind.py Calculating Expected Offspring
indq rosalind.py Independent Segregation of Chromosomes
iprb rosalind.py Mendel's First Law
lia rosalind.py Independent Alleles
mend mend.py phylogeny.py Inferring a Genotype from a Pedigree
sexl rosalind.py Sex-Linked Inheritance

Population Dynamics

# Location Description
afrq rosalind.py Counting Disease Carriers
foun rosalind.py The Founder Effect and Genetic Drift
wfmd rosalind.py The Wright-Fisher Model of Genetic Drift

Probability

# Location Description
ebin rosalind.py Wright-Fisher's Expected Behavior
eval rosalind.py Expected Number of Restriction Sites
mend mend.py phylogeny.py Inferring a Genotype from a Pedigree
rstr rosalind.py Matching Random Motifs

Sequence Analysis

# Location Description
frmt FRMT.py Data Formats
gbk GBK.py GenBank Introduction
orfr orfr.py Finding Genes with Open Reading Frames
ptra PTRA.py Protein Translation

Sorting

# Location Description
2sum 2sum.py 2SUM
3sum 3sum.py 3SUM
ins INS.py Insertion Sort
inv INV.py Counting Inversions
med MED.py Median
mer MER.py Merge Two Sorted Arrays
mprt MPRT.py Finding a Protein Motif
ms MS.py Mergesort
par PAR.py 2-Way Partition
par3 PAR3.py 3-Way Partition
ps ps.py Partial sort
qs QS.py Quicksort

Strings & String Algorithms

# Location Description
cons rosalind.py Consensus and Profile
dna dna.py rosalind.py Counting DNA Nucleotides
gc rosalind.py Computing GC Content
itwv itwv.py align.py Finding Disjoint Motifs in a Gene
kmer rosalind.py Generalizing GC-Content
kmp kmp.py Speeding Up Motif Finding
ksim ksim.py Finding All Similar Motifs WIP
lcsm rosalind.py Finding a Shared Motif
lcsq lcsq.py Finding a Shared Spliced Motif
ling ling.py Linguistic Complexity of a Genome
ukkonen.py µTuX's implementation of Ukkonen's algorithm
mmch mmch.py Maximum Matchings and RNA Secondary Structures
prot spectrum.py Translating RNA into Protein
revc rosalind.py Complementing a Strand of DNA
revp revp.py rosalind.py Locating Restriction Sites
rna rna.py rosalind.py Transcribing DNA into RNA
scsp scsp.py Interleaving Two Motifs
splc spectrum.py RNA Splicing
sseq rosalind.py Finding a Spliced Motif
subs rosalind.py Finding a Motif in DNA

Set Theory

# Location Description
pdpl pdpl Creating a Restriction Map
seto - Set operations Solved using Python shell
sset - Counting Subsets Solved with a single line in Python shell: 2**n%1000000

Other Problems

# Location Description
fibo fibo.py Fibonacci numbers

C++ code

Location Header Description
Makefile for C++ code
newick.cpp newick.h Parser for files in Newick format
qrtd.cpp qrtd.h Quartet Distance WIP
test-newick.cpp Test Newick Parser
test-qrtd.cpp Test Quartet Distance WIP
test-tree.cpp Tests for tree.cpp
tests.cpp catch.hpp C++ test framework
tree.cpp tree.h Code representing a phylogeny

Support

Location Description
bioinformatics.bib Bibliography
findtests.py Generates shell script to execute all tests
helpers.py Utilities for formatting output, parsing input, etc
LICENSE License Agreement
newick.py Parser for files in Newick format
README.md This file
rosalind.py Shared code
rosalind.wpr WingWare Project File
template.py Template for getting started on a problem