Updating mm10 annotation files with PolyASite 2.0 release

SamBryce-Smith commented 4 years ago

Thanks for providing the mm10 annotation files for the poly(A) sites and for the clearly written documentation, I've been able to get PAQR running on our data pretty smoothly.

Judging from the commit dates I assume they were created with the initial PolyASite database release from 2016. An update was released in June 2019 and I think it would be useful to run PAQR with the latest reference sites to see if this has any effect on the reliability of the output. Would you be able to update the example mm10 annotation files to account for the new database release?

Alternatively could I also just get a little more information on how to set up the annotation files as I'd like to give this a try myself? The reply to the closed issue #5 is very useful, but can I just double-check you didn't do any filtering of the mouse_polyAclusters file containing poly(A) site annotations (i.e. the name of the file just reflects the name of the transcript annotation BED file (where same strand overlapping exons etc have been filtered out))? Apologies if that last question is a little unclear.

Thank you for your time, Sam

legbar commented 4 years ago

Would also be very interested to use the updated annotation!

koljaLanger commented 4 years ago

Hi Sam

thank you for your message. You are right, the mm10 annotation was established based on the original PolyASite database. We agree that any future analysis should be done based on the latest data release.

Here are some information what kind of filtering we included to obtain a final set of tandem poly(A) sites we used for our own analysis:

we restrict the scope of our analysis to canonical chromosomes (no contigs,...)
we only considered terminal exons from protein-coding and lincRNA genes
we only worked with exons that could be associated to a single gene unambiguously

Hopefully, this is the type of information you are looking for. Let us know if you have additional questions.

Ralf

SamBryce-Smith commented 4 years ago

Hi Ralf,

Thanks for the reply, apologies for the late response. I've had a go at generating some new annotations with the PolyASite 2.0 Atlas & the mouse Gencode vM25 release in line with your suggestions. I will make the repo public soon after doing some tidying up, but just have a few niggling questions as I seem to be getting fewer genes than in your annotation. I did apply some 'soft' filtering on transcript isoforms which could account for some differences (essentially only assessing the best supported group of transcripts for each gene (by transcript support level).

Regarding terminal exons with multiple poly(A) sites, did you only select exons that had at least two overlapping PolyASite annotations ? Or did you also consider exons that only had one overlapping PolyASite annotation (but would have two poly(A) sites if you consider the 3' end of the exon to be a poly(A) site)? I'm leaning towards the former, but to be honest I don't know how precisely 3'ends of transcripts in GTFs are defined in general.

Thanks again for your time and for your help, Sam

koljaLanger commented 4 years ago

Hi Sam

many thanks for your effort! It's great to hear you are willing to share the new annotations.

Regarding your issue: I am not surprised to hear that you get less genes than we got with the old annotation - if it's not different orders of magnitude. Moreover, most of the times people are interested in the effects on well-described genes - which would be included in your case anyway. Also, we do not consider the annotated 3' end of terminal exons as valid poly(A) site as long as it is not supported by our atlas. Indeed, we found that annotated 3' ends of transcripts with high support level coincide with a poly(A) site from our atlas in most cases. In summary, I agree with you in selecting only exons with at least two overlapping poly(A) sites from the atlas.

Hopefully this is of help for you, cheers,

Ralf

SamBryce-Smith commented 4 years ago

Hi Ralf,

Thanks for your reply & apologies for the delay again, I got caught up with other projects and squashing some bugs. I think what you said makes sense and seems to agree with what I've found with my new annotations.

Repo is here - hopefully it is easy enough to follow. Any feedback would be much appreciated (it's still a little bit work in progress so apologies for the choppiness!). For what it's worth the files seem to run through PAQR smoothly (I haven't tried KAPAC yet...). https://github.com/frattalab/paqr_annotations/

As a brief summary, this is how I satisfy outlined criteria:

Filter for transcripts that have a gene_type tag of 'protein_coding' or 'lncRNA'
Extract terminal exons for all transcripts. Terminal exons that overlap with coordinates of a different gene (gene in 'features' column of GTF) on the same strand are excluded from further analysis.
Valid terminal exons overlap with coordinates of at least two poly(A) site clusters in PolyASite BED file

To validate my workflow, I tried to replicate your annotations by running it with vM14 & PolyASite v1.0 annotations (no TSL filtering). Reassuringly I had pretty good overlap if consider genes with at least one non-overlapping,multi-poly(A) site transcript: Venn diagram comparing overlap of genes with a least one multi-poly(A) transcript

Of the 288 genes in the provided annotations but not in annotations produced with my workflow, 267 would otherwise pass my workflow but do not have 'protein_coding' or 'lncRNA' gene_type tags. I am happy to include them if there is a different way of identifying these protein-coding or lncRNA genes (my naive assumption was that these are not 'confidently annotated' genes). As for the others I'm not too sure, if you have any suggestions please do let me know!

One thing I do find is that I have a lot more transcripts with my annotations compared to your provided full_transcripts file. It may be worth filtering the 'transcript types' (e.g. for 'protein_coding' transcripts). I suspect a lot are 'redundant' (i.e. have exactly the same terminal exon coordinates, I should actually confirm this though...) but I couldn't think of a non-arbitrary way to select one of the transcripts. I'd really appreciate any input you could give on this (ideally I'd like to get my workflow as close to your approach concerning reproducibility)!

Apologies for the long reply and once again thanks for all your help!

All the best, Sam

koljaLanger commented 4 years ago

Hi Sam

many thanks for your work and your detailed description. It's good to see that your results are pretty close to what we obtain for our analysis. We are currently in the process to streamline our pipelines - therefore, your work might come in handy! Many thanks indeed! Regarding the open issues, please allow us to double check from our side. We'll come back to you asap.

Best Ralf

zavolanlab / PAQR_KAPAC

Updating mm10 annotation files with PolyASite 2.0 release #15