Open victorlin opened 5 years ago
Victor, thanks for starting this. Let me know if you have any questions.
TAD data in folder - /ufrc/zhou/share/projects/bioinformatics/p53/data/TAD
/ufrc/zhou/share/projects/bioinformatics/p53/data/RefSeqAll.txt contains the following information for all RefSeq genes, obtained from the UCSC Table Browser:

| Field | Description |
| --- | --- |
| name | Name of gene (usually transcript_id from GTF) |
| chrom | Reference sequence chromosome or scaffold |
| strand | + or - for strand |
| txStart | Transcription start position (or end position for minus strand item) |
| txEnd | Transcription end position (or start position for minus strand item) |
| cdsStart | Coding region start (or end position for minus strand item) |
| cdsEnd | Coding region end (or start position for minus strand item) |
| name2 | Alternate name (e.g. gene_id from GTF) |
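A minimal sketch of how these fields could be turned into one TSS coordinate per transcript. As the descriptions above note, on the minus strand `txStart` holds the transcription end, so the TSS is taken from `txEnd` there. The sketch assumes the file is tab-separated with a header row using exactly the field names listed; the helper names `tss_position` and `load_tss` and the sample rows are hypothetical and not taken from the real file.

```python
import csv
import io


def tss_position(row):
    """Return the TSS coordinate for one row (a dict of field name -> string)."""
    # On "+" transcription starts at txStart; on "-" it starts at txEnd.
    return int(row["txStart"]) if row["strand"] == "+" else int(row["txEnd"])


def load_tss(handle):
    """Read a tab-separated table and return (name2, chrom, strand, tss) tuples."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [(r["name2"], r["chrom"], r["strand"], tss_position(r)) for r in reader]


# Made-up rows for illustration only (assumed layout, not real RefSeq records).
SAMPLE = (
    "name\tchrom\tstrand\ttxStart\ttxEnd\tcdsStart\tcdsEnd\tname2\n"
    "NM_A\tchr1\t+\t1000\t5000\t1200\t4800\tGENEA\n"
    "NM_B\tchr1\t-\t8000\t9500\t8100\t9400\tGENEB\n"
)
tss_table = load_tss(io.StringIO(SAMPLE))
```

With the real file, `load_tss(open(path, newline=""))` would be the analogous call, provided the header assumption holds.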
@leizhou69 I am currently writing the script to calculate the TAD location and the distance to TSS. The data is in the form of large intervals, each of which may contain multiple motifs. If the TAD location and TSS distance are computed per motif, how should we handle multiple values per sample? This is similar to the earlier problem of multiple MAST scores per sample, where we used count/sum/average. However, those aggregates may not carry the same meaning for TAD location or TSS distance.
Let me know if this doesn't make sense - it's a bit difficult to put in words.
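To make the question concrete, here is a rough sketch of one option: compute the distance to the nearest TSS for each motif, then collapse the per-motif values into a few per-interval summaries. It assumes all positions lie on a single chromosome, that the TSS list is sorted and non-empty, and that each interval has at least one motif. The function names and the choice of count/min/mean are illustrative only, not a decision on which aggregate is appropriate.

```python
from bisect import bisect_left
from statistics import mean


def nearest_tss_distance(pos, sorted_tss):
    """Absolute distance from pos to the closest value in sorted_tss."""
    i = bisect_left(sorted_tss, pos)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = sorted_tss[max(i - 1, 0):i + 1]
    return min(abs(pos - t) for t in candidates)


def summarize_interval(motif_positions, sorted_tss):
    """Collapse per-motif TSS distances into per-interval summary values."""
    dists = [nearest_tss_distance(p, sorted_tss) for p in motif_positions]
    return {"count": len(dists), "min": min(dists), "mean": mean(dists)}


# Hypothetical interval with three motifs and two TSS positions.
summary = summarize_interval([1200, 1500, 9000], [1000, 9500])
```

The minimum distance may be more interpretable than a sum here, since a sum of distances grows with the number of motifs in the interval.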
Create new dataset (tentative name: `motif_matches`). Specifications from Dr. Lei Zhou:

Q - What differentiates the motifs that are bound by P53 from those that are not?
Positive samples: binding sites with a good MAST match to one of the MEME matrices.
Negative samples: for each positive, find 4 sites with similar MAST score and identical length (centered on the motif location).
Features: