Input files for KAPAC - Githubissues

zavolanlab / PAQR_KAPAC

Scripts, pipelines and documentation to run PAQR and KAPAC. KAPAC infers regulatory sequence motifs implicated in 3’ end processing changes; PAQR quantifies poly(A) site usage from standard RNA-seq data.

GNU General Public License v2.0

Input files for KAPAC #4

Open khandarius opened 5 years ago

khandarius commented 5 years ago

Hello,

I'm interested in using PAQR and KAPAC on my own samples, but I'm unsure about the KAPAC input files. Specifically, I would like to know how to obtain the site count file for KAPAC (corresponding to kmer_counts.tsv of the test data). I don't see a suitable file among the output of PAQR.

Best regards, Darius

koljaLanger commented 5 years ago

Hi Darius, the site count matrix is specific to the set of poly(A) sites that you use, which is why it is hard for us to provide a universal input file. If you use the same poly(A) sites as we do, you can of course use the site count file that we provide. However, it is not too difficult to create the file yourself: you need the genomic coordinates of your poly(A) sites and the FASTA file of the corresponding genome. You then scan the region around each poly(A) site (extended up- and downstream of the site) and count every k-mer that you encounter.

I hope this helps. Let us know if you have further questions.

Best regards, Ralf
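
As a rough illustration of the counting procedure described in the comment above, here is a minimal Python sketch. It is not the script from the model pipeline: the function names, the 0-based site coordinate convention, the window definition, and the in-memory result layout are all assumptions made for this example, and the layout is not claimed to match kmer_counts.tsv.

```python
from collections import Counter

# Translation table used to reverse-complement minus-strand windows.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def read_fasta(handle):
    """Parse a FASTA file handle into {sequence_name: sequence}."""
    genome, name, chunks = {}, None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                genome[name] = "".join(chunks)
            # Keep only the first word of the header as the sequence name.
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        genome[name] = "".join(chunks)
    return genome


def site_window(genome, chrom, pos, strand, upstream, downstream):
    """Return the strand-aware sequence around a site.

    Assumption: pos is 0-based; the window holds `upstream` bases before
    the site and `downstream` bases starting at the site, both in
    transcript orientation.
    """
    seq = genome[chrom]
    if strand == "+":
        start, end = pos - upstream, pos + downstream
    else:
        start, end = pos - downstream + 1, pos + upstream + 1
    window = seq[max(start, 0):min(end, len(seq))].upper()
    if strand == "-":
        window = window.translate(COMPLEMENT)[::-1]
    return window


def count_kmers(window, k):
    """Count every overlapping k-mer, skipping k-mers with non-ACGT bases."""
    counts = Counter()
    for i in range(len(window) - k + 1):
        kmer = window[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def site_count_matrix(genome, sites, k, upstream, downstream):
    """Map each site id to its k-mer counts.

    sites: iterable of (site_id, chrom, pos, strand) tuples (hypothetical format).
    """
    return {
        site_id: count_kmers(
            site_window(genome, chrom, pos, strand, upstream, downstream), k
        )
        for site_id, chrom, pos, strand in sites
    }


if __name__ == "__main__":
    toy_genome = {"chr1": "AAACCCGGGTTT"}
    toy_sites = [("site1", "chr1", 6, "+"), ("site2", "chr1", 6, "-")]
    for site_id, counts in site_count_matrix(toy_genome, toy_sites, 2, 3, 3).items():
        print(site_id, dict(counts))
```

For a real run you would load the genome with `read_fasta(open(path))`, build the site list from your poly(A) site coordinates, and write the counts out in whatever table layout KAPAC expects; check the test data for that layout rather than relying on this sketch.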

koljaLanger commented 5 years ago

Hi Darius, you may also want to have a look at the model pipeline we uploaded to zenodo: https://doi.org/10.5281/zenodo.1147433 If you are familiar with snakemake, this is the best way to see how we create the site count matrices. It also includes the script we use.

Best regards, Ralf

xflicsu commented 5 years ago

Hello @koljaLanger, I am also trying to use KAPAC in my project. I found the link you provided, but the archive is huge. Could you provide test data and scripts that are smaller than 100 MB?

koljaLanger commented 5 years ago

Hi, I am really sorry that the model pipeline archive became that big. It allows you to reproduce the results from our paper, which was only possible by including the BAM files we used in the archive.

Of course, I don't know what prevents you from downloading the archive. But in case it is disk space, my suggestion would be to download and unpack the archive and then use samtools to create small random samples of the BAM files. This would massively reduce the disk space needed.

Please let us know whether this would be of any value to you. Otherwise, we might find another solution.

Best, Ralf
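
The down-sampling suggested in the comment above could be scripted roughly as follows. This is a sketch, not part of the model pipeline: it assumes samtools is installed and on the PATH, uses the `-s SEED.FRACTION` option of `samtools view`, and the function names, file names, fraction, and seed are placeholders chosen for the example.

```python
import subprocess


def build_subsample_cmd(in_bam, out_bam, fraction, seed=42):
    """Build a `samtools view` command that keeps a random fraction of reads.

    samtools reads `-s SEED.FRACTION`: the integer part is the random seed,
    the part after the decimal point is the fraction of reads to keep.
    """
    frac_str = f"{fraction:.4f}"
    # Reject fractions that are not strictly between 0 and 1 after rounding.
    if not frac_str.startswith("0.") or frac_str == "0.0000":
        raise ValueError("fraction must be between 0 and 1 (exclusive)")
    subsample_arg = f"{seed}.{frac_str.split('.')[1]}"
    return ["samtools", "view", "-b", "-s", subsample_arg, "-o", out_bam, in_bam]


def subsample_bam(in_bam, out_bam, fraction, seed=42):
    """Run the down-sampling; raises CalledProcessError if samtools fails."""
    subprocess.run(build_subsample_cmd(in_bam, out_bam, fraction, seed), check=True)


if __name__ == "__main__":
    # Placeholder file names; prints the command instead of running it.
    print(" ".join(build_subsample_cmd("sample.bam", "sample.small.bam", 0.01)))
```

Using the same seed for every file keeps the run reproducible. A down-sampled BAM file still has to be indexed again (for example with `samtools index`) if a later step needs an index.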

xflicsu commented 5 years ago

Thanks for your response! I want to prepare the KAPAC input files from the PAQR output. As you suggest, small BAM files can be created with samtools. Could you provide just a small example? A small example may be more useful for new KAPAC users.

koljaLanger commented 5 years ago

Hi, I have two options in mind for how best to solve this problem:

1. I down-sampled the BAM input files (each one is now 12 MB in size) and created a new snakemake archive. It is still 1.3 GB, but it has the big advantage that it is self-contained and can run on Linux as an independent snakemake pipeline. Please let me know if this would help you; in that case I would consider uploading this pipeline to zenodo, too.
2. I created another git repo that contains only the scripts and auxiliary files of the model pipeline. If you're simply interested in using our scripts, clone the following repo: https://git.scicore.unibas.ch/zavolan_public/paqr_kapac_modelpipeline_only_scripts

I hope this allows you to use KAPAC and PAQR.

Kind regards,

Ralf