wilsonlabgroup / pituitary_transcriptome_analyses

The repo contains essential scripts and datasets to reproduce bioinformatics analyses in Hou & Chan et al. 2022

reformat_anno function #2

Closed Dalhte closed 1 year ago

Dalhte commented 1 year ago

Hello there, I'm trying to replicate what you did in your paper on the mouse pituitary. I would like to construct a custom annotation of the rat pituitary transcriptome, as I noticed that I have the same problems you had (truncated annotation for Prop1 and so on). I have mostly replicated your code, but I am stuck because I do not know where to get the "reformat_anno" function. Can you help me? Best, David

hy09 commented 1 year ago

Hi David, this function is defined in the script UTRseq_analysis/PA_site_identification/puberty_utr_PAidentification_functions.r. If you source this script you should have it. Sorry for the confusion!

Best, Huayun

Dalhte commented 1 year ago

Dear Huayun

Thanks a lot ! I found it :)

I may have another question... I'm working on the rat genome, so there isn't always a clear correspondence between the "mouse" files you used and the ones for rat, and I have difficulty finding some of the databases. Concerning "mm10_UCSC_refSeq_3utr_1906.gtf", for example, did you construct it from gencode.vM21.annotation.gtf by keeping only the info related to the 3' UTR? Same for "mm10_gencode_vM21_geneName.rds" and "mm10_gencode_vM21_geneType.rds": are those extracted from the original .gtf? If so, which package do you use to manipulate .gtf files? "refGenome"?

I hope I'm not asking too much!

Thanks for the great work by the way.

Best

David

hy09 commented 1 year ago

Hi David,

Glad to hear the work is helpful and sorry about the lack of detailed information there. I'll reply in line:

> Concerning "mm10_UCSC_refSeq_3utr_1906.gtf", for example, did you construct it from gencode.vM21.annotation.gtf by keeping only the info related to the 3' UTR?

Yes, for this file I essentially extracted only the 3'UTR information from the refseq gene annotation I obtained from UCSC (not gencode). I primarily used the gencode annotation but during my analysis noticed that for many genes, refseq annotation could be complementary. So I considered the refseq 3'UTRs as well. This step might not be necessary if you just want to use one primary source of gene annotation.
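The 3'UTR extraction described above can be sketched as a plain filter on the feature column (column 3) of a GTF file. This is a minimal illustration in Python, not the author's script. The function name `filter_gtf_features`, the helper `make_record`, and the toy records are hypothetical, and the feature labels are an assumption: the label for 3'UTR records differs between annotation sources, so check column 3 of your own file before relying on the defaults.

```python
def filter_gtf_features(lines, wanted=("3UTR", "three_prime_utr")):
    """Yield GTF records whose feature column (column 3) is in `wanted`.

    The default labels are assumptions; inspect column 3 of your own
    annotation file to see which label it uses for 3'UTRs.
    """
    wanted = set(wanted)
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and header/comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # A well-formed GTF record has exactly 9 tab-separated columns.
        if len(fields) == 9 and fields[2] in wanted:
            yield line


def make_record(feature, start, end):
    """Build one toy GTF line (made-up coordinates, for illustration only)."""
    return "\t".join([
        "chr1", "toy", feature, str(start), str(end),
        ".", "+", ".", 'gene_id "G1"; transcript_id "T1";',
    ])


toy_gtf = [
    "# toy header line",
    make_record("exon", 100, 900),
    make_record("CDS", 150, 700),
    make_record("3UTR", 701, 900),
]

utr_records = list(filter_gtf_features(toy_gtf))
for record in utr_records:
    print(record)
```

With a real annotation, the same function can be fed an open file handle instead of the toy list, and the yielded lines written to a new .gtf file.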

> Same for "mm10_gencode_vM21_geneName.rds" and "mm10_gencode_vM21_geneType.rds": are those extracted from the original .gtf? If so, which package do you use to manipulate .gtf files? "refGenome"?

Yes, I extracted these from the original gtf file and just saved them as individual objects for convenience. I parsed the gtf file directly, but you can use the R package "rtracklayer" to easily import gtf files and extract the information.
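For readers who prefer to parse the file directly, as mentioned above, here is a minimal Python sketch of building gene ID to gene name and gene ID to gene type lookups from the "gene" records of a GTF file. It is an illustration under stated assumptions, not the author's code. The function name `gene_tables` and the toy records are hypothetical. The attribute keys are also an assumption to verify against your file: GENCODE files use `gene_type`, while Ensembl files (a common source for rat) use `gene_biotype`, so the sketch falls back from one to the other.

```python
import re

# Matches key "value" pairs in the GTF attribute column (column 9).
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')


def gene_tables(lines):
    """Return (gene_name, gene_type) dicts keyed by gene_id."""
    gene_name, gene_type = {}, {}
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Only use records whose feature column is "gene".
        if len(fields) != 9 or fields[2] != "gene":
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        gene_id = attrs.get("gene_id")
        if gene_id is None:
            continue
        gene_name[gene_id] = attrs.get("gene_name")
        # Assumption: the type is stored under gene_type or gene_biotype.
        gene_type[gene_id] = attrs.get("gene_type") or attrs.get("gene_biotype")
    return gene_name, gene_type


# Toy records with made-up IDs and coordinates, for illustration only.
toy_gtf = [
    "# toy header line",
    "\t".join(["chr1", "toy", "gene", "1000", "5000", ".", "+", ".",
               'gene_id "G1"; gene_type "protein_coding"; gene_name "GeneA";']),
    "\t".join(["chr1", "toy", "exon", "1000", "1500", ".", "+", ".",
               'gene_id "G1"; gene_type "protein_coding"; gene_name "GeneA";']),
    "\t".join(["chr1", "toy", "gene", "9000", "12000", ".", "-", ".",
               'gene_id "G2"; gene_biotype "lncRNA"; gene_name "GeneB";']),
]

gene_name, gene_type = gene_tables(toy_gtf)
print(gene_name)
print(gene_type)
```

The two dictionaries play the role of the saved geneName and geneType objects: small lookups kept separately so the full annotation does not need to be re-parsed each time.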

I apologize that my scripts were highly customized for my analysis and might not be easily adapted for another genome. For a more streamlined version of the workflow, you could 1) identify read clusters from your data using derfinder (this would work best if you have 3'UTRseq data) and 2) annotate these read clusters to genes based on customized criteria, such as distance to gene ends etc.
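Step 2 of the streamlined workflow above (annotating read clusters to genes by distance to gene ends) could look roughly like the following Python sketch. It does not reproduce derfinder, which is an R package and covers step 1, and it is not the author's criteria. The names `Gene` and `assign_cluster`, the toy coordinates, and the 5000 bp `max_dist` threshold are all hypothetical choices made for illustration; the rule used here is simply "nearest annotated 3' end on the same chromosome and strand, within `max_dist`".

```python
from dataclasses import dataclass


@dataclass
class Gene:
    name: str
    chrom: str
    start: int  # leftmost coordinate
    end: int    # rightmost coordinate
    strand: str

    @property
    def three_prime_end(self):
        # On the minus strand, the 3' end is the leftmost coordinate.
        return self.end if self.strand == "+" else self.start


def assign_cluster(chrom, pos, strand, genes, max_dist=5000):
    """Return (gene name, distance) for the nearest same-strand 3' end.

    Returns None if no gene end lies within max_dist (a hypothetical
    threshold; choose one that suits your data).
    """
    best = None
    for gene in genes:
        if gene.chrom != chrom or gene.strand != strand:
            continue
        dist = abs(pos - gene.three_prime_end)
        if dist <= max_dist and (best is None or dist < best[1]):
            best = (gene.name, dist)
    return best


# Toy genes and cluster positions with made-up coordinates.
genes = [
    Gene("GeneA", "chr1", 1000, 5000, "+"),
    Gene("GeneB", "chr1", 9000, 12000, "-"),
]

print(assign_cluster("chr1", 5200, "+", genes))
print(assign_cluster("chr1", 8800, "-", genes))
print(assign_cluster("chr1", 50000, "+", genes))
```

A linear scan over all genes is fine for a toy example. For a whole genome, an interval-based lookup (for instance sorted gene ends per chromosome and strand) would be the more practical choice.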

Hope it helps! Huayun

Dalhte commented 1 year ago

Dear Huayun, thanks a lot, this really helps as I'm new to this kind of analysis. I will try what you are advising :) Thanks again. Best, David

hy09 commented 1 year ago

You are welcome. Good luck with your analysis!