zavolanlab / PAQR_KAPAC

Scripts, pipelines and documentation to run PAQR and KAPAC. KAPAC infers regulatory sequence motifs implicated in 3’ end processing changes; PAQR quantifies poly(A) site usage from standard RNA-seq data.
GNU General Public License v2.0

Use custom annotation #5

Closed yuxinghai closed 5 years ago

yuxinghai commented 5 years ago

Hello, I want to use PAQR to find APA in other species, and I have some poly(A) site information. How can I get an annotation file like the ones for mm10 and hg38?

koljaLanger commented 5 years ago

Hi yuxinghai, many thanks for your interest in PAQR.

You can have a look at this Ensembl page, which lists all covered species. If your organism of interest is included, there is a fair chance of also finding an annotation file (a GTF file) for it.

Hopefully this helps. Cheers,

Ralf

yuxinghai commented 5 years ago

Sorry, you may have misunderstood what I meant. I want to create a custom poly(A) annotation file like PAQR/data/annotation/clusters.hg38.canonical_chr.tandem.noOverlap_strand_specific.bed. I have a poly(A) site file with the columns chrom, start, end, tag_number and strand, but I don't know what each column in clusters.hg38.canonical_chr.tandem.noOverlap_strand_specific.bed means. How can I build a file like this?

koljaLanger commented 5 years ago

Hi, now I understand better what you are looking for. Here is a description of what the individual columns mean.

Columns 1-6 follow the normal BED guidelines:

1. chromosome
2. start
3. end
4. ID
5. score (number of protocols that support a site; can be anything if you don't plan to filter your sites based on this score)
6. strand information

Columns 7-10 are individual entries. Since we want to look at alternative polyadenylation, we only consider "tandem poly(A) sites", which means poly(A) sites that are located on a single annotated exon:

7. consecutive numbering of the site within its set of tandem poly(A) sites
8. overall number of sites in this set
9. an identifier for the exon on which the tandem poly(A) sites are located. The id is composed of: ::::
10. identifier of the gene the exon/transcript belongs to
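To make the column layout concrete, here is a minimal Python sketch that reads such 10-column lines and checks that columns 7 and 8 are consistent within each exon. The names `TandemSite`, `parse_line` and `check_tandem_numbering` are made up for this illustration and are not part of PAQR. The sketch assumes tab-separated fields and that the numbering in column 7 starts at 1, neither of which is stated above.

```python
from collections import defaultdict
from typing import NamedTuple


class TandemSite(NamedTuple):
    chrom: str
    start: int
    end: int
    site_id: str
    score: str        # kept as text, since the score "can be anything"
    strand: str
    site_number: int  # column 7: position within the set of tandem sites
    total_sites: int  # column 8: size of that set
    exon_id: str      # column 9
    gene_id: str      # column 10


def parse_line(line: str) -> TandemSite:
    """Split one tab-separated line into the ten described columns."""
    f = line.rstrip("\n").split("\t")
    if len(f) != 10:
        raise ValueError(f"expected 10 columns, got {len(f)}")
    return TandemSite(f[0], int(f[1]), int(f[2]), f[3], f[4], f[5],
                      int(f[6]), int(f[7]), f[8], f[9])


def check_tandem_numbering(sites):
    """Return the exon ids whose columns 7/8 do not describe a complete set.

    Assumption: sites on one exon are numbered 1..N and each carries N in
    column 8.
    """
    by_exon = defaultdict(list)
    for site in sites:
        by_exon[site.exon_id].append(site)
    bad = []
    for exon_id, group in by_exon.items():
        numbers = sorted(s.site_number for s in group)
        totals = {s.total_sites for s in group}
        if numbers != list(range(1, len(group) + 1)) or totals != {len(group)}:
            bad.append(exon_id)
    return bad
```

Running `check_tandem_numbering` over a self-made file is a quick way to catch exons where a site was dropped or counted twice.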

If you have your own set of poly(A) sites, you have to intersect it with a gene annotation to infer the information above.
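The intersection step could look roughly like the following plain-Python sketch. It is not the procedure used to build the files shipped with PAQR. It assumes sites given as (chrom, start, end, tag_number, strand), as in the question, and exons already extracted from an annotation as (chrom, start, end, strand, exon_id, gene_id) in BED-style coordinates. Several choices are assumptions of this sketch: a site is kept only if it lies fully inside exactly one exon on the same strand, sites are numbered from 1 in transcript direction, tag_number is reused as the score, and the site ID is a made-up `chrom:start:end:strand` string. The exon identifier is passed through unchanged, so it has to be supplied in whatever format PAQR expects.

```python
from collections import defaultdict


def build_tandem_rows(sites, exons):
    """Assign poly(A) sites to exons and return 10-column, tab-separated rows.

    sites: iterable of (chrom, start, end, tag_number, strand)
    exons: iterable of (chrom, start, end, strand, exon_id, gene_id)
    """
    # Index exons by chromosome and strand for a simple containment lookup.
    exons_by_key = defaultdict(list)
    for chrom, start, end, strand, exon_id, gene_id in exons:
        exons_by_key[(chrom, strand)].append((start, end, exon_id, gene_id))

    # Keep only sites that fall inside exactly one exon (assumption).
    sites_by_exon = defaultdict(list)
    for chrom, start, end, tags, strand in sites:
        candidates = exons_by_key.get((chrom, strand), [])
        hits = [e for e in candidates if e[0] <= start and end <= e[1]]
        if len(hits) == 1:
            _, _, exon_id, gene_id = hits[0]
            sites_by_exon[(exon_id, gene_id)].append(
                (chrom, start, end, tags, strand))

    rows = []
    for (exon_id, gene_id), group in sites_by_exon.items():
        if len(group) < 2:
            continue  # a single site on an exon is not a tandem set
        strand = group[0][4]
        # Number sites in transcript direction (assumption): ascending
        # start on "+", descending start on "-".
        group.sort(key=lambda s: s[1], reverse=(strand == "-"))
        for number, (chrom, start, end, tags, strand) in enumerate(group, 1):
            site_id = f"{chrom}:{start}:{end}:{strand}"  # hypothetical ID
            fields = [chrom, start, end, site_id, tags, strand,
                      number, len(group), exon_id, gene_id]
            rows.append("\t".join(str(x) for x in fields))
    return rows
```

The linear scan over exons is fine for a first attempt; for genome-scale inputs a dedicated interval tool such as bedtools would be the more practical way to do the same intersection.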

Best, Ralf

yuxinghai commented 5 years ago

Thank you, that is a very detailed explanation. Now I can create this file.