khandarius opened this issue 5 years ago
Hi Darius, the site count matrix is very specific to the set of poly(A) sites that you use. That is why it is hard for us to provide a universal input file in this case. If you use the poly(A) sites that we use, you can of course also use the site count file that we provide. However, it is not too difficult to create the file yourself: you need the genomic coordinates of your poly(A) sites and the corresponding FASTA file of the genome of interest. Then you scan over the region of each poly(A) site (extended up- and downstream of the site) and simply count every k-mer that you encounter.
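The scan-and-count step described above could be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual script: the function name, the 0-based site positions, and the toy sequences are all assumptions for the example.

```python
from collections import Counter

def kmer_counts(genome_seq, site_pos, extension, k):
    """Count all k-mers in a window of +/- `extension` nt around a
    poly(A) site at 0-based position `site_pos` on `genome_seq`."""
    start = max(0, site_pos - extension)
    end = min(len(genome_seq), site_pos + extension + 1)
    region = genome_seq[start:end].upper()
    return Counter(region[i:i + k] for i in range(len(region) - k + 1))

# Hypothetical example: two poly(A) sites on one toy chromosome.
chrom_seq = "ACGTACGTAAATTTCCGGAA"
sites = {"site_1": 5, "site_2": 14}  # site name -> 0-based position
site_count_matrix = {
    name: kmer_counts(chrom_seq, pos, extension=4, k=3)
    for name, pos in sites.items()
}
```

Each row of `site_count_matrix` would then be written out as one line of the site count file (sites as rows, k-mers as columns).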
Hopefully, this helps you. Let us know if you have further questions.
Best regards, Ralf
Hi Darius, maybe you also want to have a look at the model pipeline we uploaded to Zenodo: https://doi.org/10.5281/zenodo.1147433 If you are familiar with Snakemake, this is the best way to check our approach to creating the site count matrices. It also includes the script we use.
Best regards, Ralf
Hello @koljaLanger , I am also trying to use KAPAC in my project. I found the link you provided, but the file is huge. Could you provide test data and a script smaller than 100 MB?
Hi, I am really sorry that the model pipeline archive became that big. This is because it allows one to recapitulate the results from our paper, which was only possible by including the BAM files we used in the archive.
Of course, I don't know what prevents you from downloading the archive. But in case it is disk space, my suggestion would be: download and unpack the archive, then use samtools to create small random samples from the BAM files. This would massively reduce the disk space needed.
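The down-sampling could look roughly like this, assuming samtools is on your PATH; the file names and the 1% sampling fraction are just placeholders to adapt to the archive's actual BAM files.

```shell
# Keep ~1% of the reads in each BAM file (seed 42, fraction .01).
for bam in *.bam; do
    samtools view -b -s 42.01 "$bam" > "small_${bam}"
done
```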
Please let us know whether this would be of any value to you. Otherwise, we might find another solution.
Best, Ralf
Thanks for your response! I want to prepare the KAPAC input files from the PAQR output. As you suggest, small BAM files can be created with samtools. So, could you provide an example of small size? Such a small example might be more useful for a new user of KAPAC.
Hi, I have two options in mind for how we best solve this problem:
1. I down-sampled the BAM input files (each one is now 12 MB in size) and created a new Snakemake archive. It is still 1.3 GB, but it has the big advantage that it is self-contained and can run on Linux as an independent Snakemake pipeline. Please let me know if this would help you; in that case I would consider uploading this pipeline to Zenodo, too.
2. I created another git repo that contains only the scripts and auxiliary files of the model pipeline. If you're simply interested in using our scripts, clone the following repo: https://git.scicore.unibas.ch/zavolan_public/paqr_kapac_modelpipeline_only_scripts
I hope this allows you to use KAPAC and PAQR.
Kind regards,
Ralf
Hello,
I'm interested in using PAQR and KAPAC on my own samples, but I'm unsure about the KAPAC input files. Specifically, I would like to know how to obtain the site count file for KAPAC (corresponding to kmer_counts.tsv of the test data). I don't see a suitable file among the outputs of PAQR.
Best regards, Darius