New dataset based on MAST matches

zhoulab / p53-chip-seq-data

Basic machine learning on genomic data


New dataset based on MAST matches #29

Open victorlin opened 5 years ago

victorlin commented 5 years ago

Create new dataset (tentative name: motif_matches). Specifications from Dr. Lei Zhou:

Q - What differentiates the motifs that are bound by P53 from those that are not bound?

Positive samples: binding sites with a good MAST match to one of the MEME matrices.

Negative samples: for each positive, find 4 sites with a similar MAST score and identical length (centered on the motif location).
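As a minimal sketch of the negative-sample rule above, the following picks, for one positive site, the 4 candidates of identical length whose MAST scores are nearest. The `Site` class, the `pick_negatives` name, and the reading of "similar" as "smallest absolute score difference" are assumptions for illustration, not the project's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    """Hypothetical record for a candidate site (not from the repository)."""
    name: str
    mast_score: float
    length: int


def pick_negatives(positive, candidates, n=4):
    """Return up to n candidates with the same length as `positive`,
    ordered by how close their MAST score is to the positive's score."""
    same_length = [c for c in candidates if c.length == positive.length]
    same_length.sort(key=lambda c: abs(c.mast_score - positive.mast_score))
    return same_length[:n]


positive = Site("p", 10.0, 200)
candidates = [
    Site("a", 9.5, 200),
    Site("b", 30.0, 200),
    Site("c", 10.2, 200),
    Site("d", 10.0, 150),  # same score but different length, so excluded
    Site("e", 11.0, 200),
    Site("f", 8.0, 200),
    Site("g", 12.5, 200),
]
negatives = pick_negatives(positive, candidates)
```

Fewer than 4 sites are returned when not enough same-length candidates exist; how to handle that case is left open here.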

Features:

K-mer frequency, normalized to (0,1)
Average conservation score for the 20 bp motif match region (should already be in the (0,1) range; you can get it from Varsha)
Average conservation score for the whole binding site
Location in TAD: ([location of the motif center] - [center of TAD]) / (Len(TAD)/2), normalized to (-1,1)
Distance to the closest TSS
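The k-mer, TAD-location, and TSS-distance features listed above could be sketched roughly as follows. All function and parameter names are hypothetical. The code assumes that coordinates are on the same chromosome, that k-mer "normalization" means dividing each count by the total number of valid k-mer windows, and that the TSS positions are supplied as a sorted, non-empty list.

```python
from bisect import bisect_left
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=2):
    """Fraction of k-mer windows matching each possible k-mer (values in [0, 1]).
    Windows containing characters other than A/C/G/T are skipped."""
    seq = seq.upper()
    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(w for w in windows if set(w) <= set("ACGT"))
    total = sum(counts.values())
    return {
        "".join(p): (counts["".join(p)] / total if total else 0.0)
        for p in product("ACGT", repeat=k)
    }


def tad_relative_location(motif_center, tad_start, tad_end):
    """(motif center - TAD center) / (TAD length / 2): -1 at the TAD start,
    0 at its center, +1 at its end."""
    tad_center = (tad_start + tad_end) / 2
    half_length = (tad_end - tad_start) / 2
    return (motif_center - tad_center) / half_length


def distance_to_closest_tss(motif_center, sorted_tss):
    """Absolute distance to the nearest TSS; `sorted_tss` must be sorted ascending."""
    i = bisect_left(sorted_tss, motif_center)
    # Only the TSS just before and just at/after the motif can be the nearest.
    neighbours = sorted_tss[max(i - 1, 0):i + 1]
    return min(abs(motif_center - t) for t in neighbours)
```

For example, `tad_relative_location(150, 100, 200)` gives `0.0` for a motif at the exact TAD center, and `distance_to_closest_tss(450, [100, 500, 900])` gives `50`.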

leizhou69 commented 5 years ago

Victor, thanks for starting this. Let me know if you have any questions.

varsh1090 commented 5 years ago

The TAD data is in the folder /ufrc/zhou/share/projects/bioinformatics/p53/data/TAD

varsh1090 commented 5 years ago

victorlin commented 5 years ago

@leizhou69 I am currently writing the script to calculate the location in TAD and the distance to TSS. The data is in the form of large intervals, each of which may contain multiple motifs. If the TAD location and TSS distance are computed per motif, how should we handle multiple values per sample? This is similar to the earlier problem of multiple MAST scores per sample, where we used count/sum/average. However, those aggregates may carry a different meaning for TAD location and TSS distance.

Let me know if this doesn't make sense, as it's a bit difficult to put into words.
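To make the question concrete, here is one possible way to collapse per-motif values into fixed-length features for a single interval. This is only a sketch of candidate aggregates, not a decision: the function name, the feature names, and the choice of aggregates are all assumptions. Note that averaging the signed TAD location can cancel out (motifs at -0.5 and +0.5 average to 0), which is one reason the mean may not carry the same meaning here as it did for MAST scores.

```python
from statistics import mean


def summarize_motifs(tad_locations, tss_distances):
    """Collapse per-motif values from one interval into a flat feature dict.
    Both lists must be non-empty and aligned (index i refers to the same motif)."""
    if not tad_locations or len(tad_locations) != len(tss_distances):
        raise ValueError("expected two non-empty lists of equal length")
    # Index of the motif that lies nearest to a TSS.
    closest = min(range(len(tss_distances)), key=tss_distances.__getitem__)
    return {
        "motif_count": len(tad_locations),
        "tad_location_mean": mean(tad_locations),
        "tss_distance_min": tss_distances[closest],
        "tss_distance_mean": mean(tss_distances),
        "tad_location_at_closest_tss": tad_locations[closest],
    }


features = summarize_motifs([-0.5, 0.5], [1000, 200])
```

An alternative under the same assumptions would be to keep only the values of the best-scoring motif per interval, which avoids mixing values from different motifs.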