wilzbach / msa

Modular BioJS compoment for a multiple sequence alignment
http://msa.biojs.net
Boost Software License 1.0
168 stars 79 forks source link

Implement sophisticated seq logo algorithms #94

Closed wilzbach closed 2 years ago

wilzbach commented 9 years ago

This issue is a follow-up to #16.

Unfortunately the seqlogo doesn't handle this natively. (just as a reminder for myself)

original paper: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC332411/pdf/nar00204-0153.pdf

follow-ups:

http://bioinformatics.oxfordjournals.org/content/28/14/1935.long http://schneider.ncifcrf.gov/paper/hawaii/latex/paper.pdf

goldbergtatyana commented 9 years ago

The most popular and highest cited seq logo implementation is WebLogo of Steven Brenner: http://weblogo.berkeley.edu/logo.cgi

Link to the paper: http://weblogo.berkeley.edu/Crooks-2004-GR-WebLogo.pdf

From: Seb Sent: Wednesday, November 19, 2014 3:20 AM To: greenify/biojs-vis-msa Subject: [biojs-vis-msa] Implement sophisticated seq logo algorithms (#94)

This issue is a follow-up to #16.

Unfortunately the seqlogo doesn't handle this natively. (just as a reminder for myself)

original paper: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC332411/pdf/nar00204-0153.pdf

follow-ups:

http://bioinformatics.oxfordjournals.org/content/28/14/1935.long http://schneider.ncifcrf.gov/paper/hawaii/latex/paper.pdf

— Reply to this email directly or view it on GitHub.

wilzbach commented 9 years ago

This is now done with the new biojs-stat-seq, globally available under <msaInstance>.g.stats

It would also support to use a custom background or calculate the background from the sequences.

wilzbach commented 9 years ago

BTW how do we handle gaps in the alignment?

1) use the gaps in the calculation & filter them out afterwards (current solution) 2) filter the gaps out before the calculation

TatyanaGoldberg commented 9 years ago

Hi Seb,

I don't see in https://github.com/greenify/biojs-stat-seqs the information on how you calculate the sequence conservation at each position in the alignment. If you havent done so, please use the formula from the WebLogo paper [1]. The height of each base (or amino acid) per side is then its frequency at a specific position times the sequence conservation at this position.

BTW how do we handle gaps in the alignment?

If we have an alignment: AWKDAA -WKDAA AKKWDD AKAWDA

Then the frequency of A at position 1 would be 0.75. The information content (i.e. the sequence conservation) would be 4 and the height of stack A then 3.

[1] http://genome.cshlp.org/content/14/6/1188.long

wilzbach commented 9 years ago

how you calculate the sequence conservation at each position in the alignment.

https://github.com/greenify/biojs-stat-seqs/blob/master/lib/index.js#L216

If we have an alignment:

I also have written some tests.

trivia

maxLength 6
gaps [ 0.25, 0, 0, 0, 0, 0 ]
frequency [ { A: 0.75, '-': 0.25 },
  { W: 0.5, K: 0.5 },
  { K: 0.75, A: 0.25 },
  { D: 0.5, W: 0.5 },
  { A: 0.5, D: 0.5 },
  { A: 0.75, D: 0.25 } ]
consensus AWKDAA
identity to consensus [ 1, 1, 0.33, 0.33 ]
identity to AAAA [ 0.17, 0, 0.17, 0.33 ]

For the identity I ignore gaps.

ic

ic [ 0.81, 1, 0.81, 1, 1, 0.81 ]
scaled ic [ 0.41, 0.5, 0.41, 0.5, 0.5, 0.41 ]

conservation [ 1.19, 1, 1.19, 1, 1, 1.19 ]
conservation scaled [ 0.59, 0.5, 0.59, 0.5, 0.5, 0.59 ]

conservation per residue [ { A: 0.89, '-': 0.3 },
  { W: 0.5, K: 0.5 },
  { K: 0.89, A: 0.3 },
  { D: 0.5, W: 0.5 },
  { A: 0.5, D: 0.5 },
  { A: 0.89, D: 0.3 } ]
conservation per residue scaled [ { A: 0.45, '-': 0.15 },
  { W: 0.25, K: 0.25 },
  { K: 0.45, A: 0.15 },
  { D: 0.25, W: 0.25 },
  { A: 0.25, D: 0.25 },
  { A: 0.45, D: 0.15 } ]

For the sequence logo I draw "conservation per residue scaled" with just omitting the gaps.

Background

Frequency of the letters

background { A: 0.38, W: 0.17, K: 0.21, D: 0.21, '-': 0.04 }

We could offer to calculate the information content against a given background e.g. Uniprot or the alignment itself.

TatyanaGoldberg commented 9 years ago

Comments:

We have following sample sequence alignment: AWKDAA -WKDAA AKKWDD AKAWDA

maxLength 6 gaps [ 0.25, 0, 0, 0, 0, 0 ] frequency [ { A: 0.75, '-': 0.25 }, { W: 0.5, K: 0.5 }, { K: 0.75, A: 0.25 }, { D: 0.5, W: 0.5 }, { A: 0.5, D: 0.5 }, { A: 0.75, D: 0.25 } ] consensus AWKDAA good identity to consensus [ 1, 1, 0.33, 0.33 ] The first sequence in the input file should be considered as consensus by default and the identity of all other sequences must be calculated according to the first sequence. If a user wants to calculate his own consensus sequence however then your calculation is right.

identity to AAAA [ 0.17, 0, 0.17, 0.33 ] Here, you cannot calculate identity without having aligned this sequence to the other four. The second sequence could align to AAAA like this: --AAAA WKDAA and so its identity to AAAA would be 0.40.

For the identity I ignore gaps. yes, the formula is simply dividing the number of identical residues by the sequence length.

ic [ 0.81, 1, 0.81, 1, 1, 0.81 ] how do you calculate the information content? Using the formula from the weblogo paper [1], ic at position 1 should be log2(20)-(-0.75log2(0.75))=4.32-(-0.75(-0.41))=4.02

scaled ic [ 0.41, 0.5, 0.41, 0.5, 0.5, 0.41 ] how is this done and why?

conservation [ 1.19, 1, 1.19, 1, 1, 1.19 ] what is this and what is the difference between conservation and information content?

[1] http://genome.cshlp.org/content/14/6/1188.long

wilzbach commented 9 years ago

identity to consensus [ 1, 1, 0.33, 0.33 ] The first sequence in the input file should be considered as consensus by default and the identity of > all other sequences must be calculated according to the first sequence. If a user wants to calculate > his own consensus sequence however then your calculation is right.

Yup but this has nothing to do with the statistics package. The MSA viewer has the task of figuring out whether the user wants to calculate the consensus are already has a consensus sequence (hence two methods).

identity to AAAA [ 0.17, 0, 0.17, 0.33 ] Here, you cannot calculate identity without having aligned this sequence to the other four. The > second sequence could align to AAAA like this: --AAAA WKDAA and so its identity to AAAA would be 0.40.

I can't align sequences (too expensive) and therefore the identity calculation stupidly expects that your sequences are in an optimal alignment (which should be the case).

what is this and what is the difference between conservation and information content?

Information content (aka Entropy)

Information content

Conservation: Max. Information content - obs. information content

how do you calculate the information content? Using the formula from the weblogo paper [1], ic at position 1 should be log2(20)-(-0.75log2(0.75))=4.32-(-0.75(-0.41))=4.02

 { A: 0.75, '-': 0.25 }

- (0.75 * log2(0.75)) - (0.25 * log2(0.25))  = 0.81

(again I count the gaps as normal letters)

scaled ic [ 0.41, 0.5, 0.41, 0.5, 0.5, 0.41 ] how is this done and why?

obs. ic / max. ic

It is a feature of the biojs-stat-seqs package and might be useful (e.g. barchart, ...) Currently we only display the conservation per residue.

TatyanaGoldberg commented 9 years ago

Hi Seb,

Two issues here:

{ A: 0.75, '-': 0.25 }

  • (0.75 * log2(0.75)) - (0.25 * log2(0.25)) = 0.81
    1. You should not consider gaps for calculating observed information content (i.e. entropy), as gap is not one of 4 for DNA/RNA or 20 for protein characters we are interested in the conservation of
    2. The max. information content is always log2(4) for DNA/RNA and log2(20) for proteins

Tatyana

wilzbach commented 9 years ago

You should not consider gaps for calculating observed information content (i.e. entropy), as gap is not one of 4 for DNA/RNA or 20 for protein characters we are interested in the conservation of

I changed the implementation to be consistent with the one from Biopython. Ignored chars like "-" are also ignored in the frequency calculation, so that

{ A: 0.75, '-': 0.25 } (1 * log2(1)) = 0

The max. information content is always log2(4) for DNA/RNA and log2(20) for proteins

Why is this an issue? The stat allows to enter any alphabet size.

TatyanaGoldberg commented 9 years ago
  1. The sequence logo is not shown for the full length of the alignment. Example file: https://rostlab.org/~goldberg/biojs/ssGP/ssgp/B2014122194A560KL7I.4.ids.clustal.
  2. Letter colors in seq logo do not always correspond to the colors in the MSA
wilzbach commented 9 years ago

Fixed 1 & 2 http://workmen.biojs.net/demo/biojs-vis-msa/ssgp

wilzbach commented 9 years ago

count gaps in the seq logo ...