zhoulab / p53-chip-seq-data

Basic machine learning on genomic data

New dataset based on MAST matches #29

Open victorlin opened 5 years ago

victorlin commented 5 years ago

Create new dataset (tentative name: motif_matches). Specifications from Dr. Lei Zhou:

Q: What differentiates the motifs that are bound by p53 from those that are not bound?

Positive samples: binding sites with a good MAST match to one of the MEME matrices.

Negative samples: for each positive, find 4 sites with similar MAST score and identical length (centered on the motif location).
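
A minimal sketch of one way to read the negative-sample rule above, not a settled implementation. It assumes that each positive is a dict with `start`, `end`, and `score`, that each unbound motif match is a dict with `chrom`, `center`, and `score`, and that "similar" means the smallest absolute score difference. All of these names and the function `pick_negatives` are hypothetical.

```python
def pick_negatives(positive, unbound_matches, n=4):
    """Pick n unbound motif matches closest in MAST score to `positive`.

    Each negative is a window with the same length as the positive
    binding site, centered on the unbound motif location.
    """
    length = positive["end"] - positive["start"]
    # Rank unbound matches by how close their score is to the positive's score.
    nearest = sorted(
        unbound_matches,
        key=lambda m: abs(m["score"] - positive["score"]),
    )[:n]
    negatives = []
    for match in nearest:
        start = match["center"] - length // 2
        negatives.append({
            "chrom": match["chrom"],
            "start": start,
            "end": start + length,  # identical length to the positive
            "score": match["score"],
        })
    return negatives
```

If the same unbound match must not be reused for several positives, the chosen matches would also need to be removed from the candidate pool after each call; the sketch does not do that.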

Features:

  1. K-mer (frequency, normalized to (0,1))
  2. average conservation score for the 20 bp motif match region (should be already in (0,1) range - you can get it from Varsha)
  3. average conservation score for the whole binding site.
  4. Location in TAD: ([location of the motif center] - [center of TAD]) / (Len(TAD)/2), normalized to (-1,1)
  5. distance to the closest TSS
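
Features 1 and 4 could be sketched as follows. This is an illustration under assumptions rather than the project's actual code: k-mer "frequency" is taken to mean count divided by the number of valid k-mer windows, TAD coordinates are assumed to be a `start`/`end` pair on the same chromosome as the motif, and the function names are hypothetical.

```python
from itertools import product


def kmer_frequencies(seq, k=2):
    """Frequency of every DNA k-mer in `seq`; each value lies in [0, 1]."""
    seq = seq.upper()
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip windows containing N or other symbols
            counts[kmer] += 1
            total += 1
    return {kmer: (c / total if total else 0.0) for kmer, c in counts.items()}


def tad_relative_location(motif_center, tad_start, tad_end):
    """(motif center - TAD center) / (Len(TAD)/2).

    Gives -1 at the TAD start, 0 at its center, and +1 at its end.
    """
    tad_center = (tad_start + tad_end) / 2
    half_length = (tad_end - tad_start) / 2
    return (motif_center - tad_center) / half_length
```

A motif centered exactly on the TAD midpoint gets 0.0, and the sign indicates which half of the TAD it falls in.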

leizhou69 commented 5 years ago

Victor, thanks for starting this. Let me know if you have any questions.

varsh1090 commented 5 years ago

The TAD data is in the folder /ufrc/zhou/share/projects/bioinformatics/p53/data/TAD

varsh1090 commented 5 years ago

The file /ufrc/zhou/share/projects/bioinformatics/p53/data/RefSeqAll.txt contains the following information for all RefSeq genes, obtained from the UCSC Table Browser:

| Field | Description |
| --- | --- |
| name | Name of gene (usually transcript_id from GTF) |
| chrom | Reference sequence chromosome or scaffold |
| strand | + or - for strand |
| txStart | Transcription start position (or end position for minus strand item) |
| txEnd | Transcription end position (or start position for minus strand item) |
| cdsStart | Coding region start (or end position for minus strand item) |
| cdsEnd | Coding region end (or start position for minus strand item) |
| name2 | Alternate name (e.g. gene_id from GTF) |
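
Given those fields, the distance to the closest TSS might be computed roughly as below. This is a sketch under assumptions: the TSS is taken as `txStart` on the plus strand and `txEnd` on the minus strand (following the field descriptions above), the distance is unsigned, coordinates are used as given without any 0-based/1-based adjustment, and the gene records are passed in as `(chrom, strand, txStart, txEnd)` tuples. The function names are hypothetical, and parsing of RefSeqAll.txt is left out because its exact column layout is not shown here.

```python
import bisect


def tss_position(strand, tx_start, tx_end):
    """TSS is txStart for '+' genes and txEnd for '-' genes."""
    return tx_start if strand == "+" else tx_end


def build_tss_index(genes):
    """Map each chromosome to a sorted list of TSS positions.

    `genes` is an iterable of (chrom, strand, txStart, txEnd) tuples.
    """
    index = {}
    for chrom, strand, tx_start, tx_end in genes:
        index.setdefault(chrom, []).append(tss_position(strand, tx_start, tx_end))
    for positions in index.values():
        positions.sort()
    return index


def distance_to_closest_tss(index, chrom, pos):
    """Unsigned distance from `pos` to the nearest TSS on `chrom`, or None."""
    positions = index.get(chrom)
    if not positions:
        return None
    i = bisect.bisect_left(positions, pos)
    # Only the TSS just before and just after `pos` can be the closest.
    neighbours = positions[max(i - 1, 0):i + 1]
    return min(abs(pos - p) for p in neighbours)
```

Sorting once per chromosome and using binary search keeps each lookup logarithmic in the number of genes, which matters when every motif in the dataset is queried.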

victorlin commented 5 years ago

@leizhou69 I am currently writing the script to calculate the location in TAD and the distance to TSS. The data is in the form of large intervals, each of which may contain multiple motifs. If the TAD location and TSS distance are computed per motif, how should we handle the case of multiple values per sample? This is similar to the previous problem of multiple MAST scores per sample, for which we used count/sum/average. However, those summaries may have a different meaning for TAD location and TSS distance.

Let me know if this doesn't make sense. It's a bit difficult to put into words.
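
To make the multiple-values-per-sample case concrete, here is a small sketch of the count/sum/average summaries applied to per-motif values within one interval, plus one alternative (the value closest to zero). It is only an illustration of the question, not a proposed answer; the function name and the `closest_to_zero` summary are hypothetical.

```python
def aggregate_motif_values(values):
    """Summarize per-motif values (e.g. TAD locations) for one interval."""
    if not values:
        return {"count": 0, "sum": 0.0, "mean": None, "closest_to_zero": None}
    return {
        "count": len(values),
        "sum": sum(values),
        "mean": sum(values) / len(values),
        # Alternative summary: keep the single value with smallest magnitude.
        "closest_to_zero": min(values, key=abs),
    }
```

For example, two motifs at TAD locations -0.5 and 0.5 average to 0.0, which would look like a motif at the TAD center even though neither motif is there. This is one way the average can mean something different for a signed location than it does for a MAST score.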