zavolanlab / scRNAsim-toolz

A repository for the tools used by scRNAsim.
MIT License

test: #4 priming site predictor #24

Open ninsch3000 opened 8 months ago

ninsch3000 commented 8 months ago

README description

Priming Site Predictor of Transcript Sequences

Overview

Priming Site Predictor uses a seed-and-extension algorithm (RIblast: https://github.com/fukunagatsu/RIblast) to predict priming sites of oligo(dT) primers in target sequences. Binding energies are calculated and filtered against a threshold value. Each remaining binding site is then assigned a binding probability, and the sites are stored in a GTF file for further processing.

Version

Version 0.1.0 (2022/11/15)

Installation from GitLab

Priming Site Predictor requires Python 3.9 or later.

Install Priming Site Predictor from GitLab using:

git clone https://git.scicore.unibas.ch/zavolan_group/tools/priming-site-predictor.git
cd priming-site-predictor

Create and activate the scRNA-seq-simulation conda environment:

conda env create --file environment.yml
conda activate scrna-seq-sim

Usage

usage: priming-site-predictor [-h] [-f FASTA_FILE] [-p PRIMER_SEQUENCE] [-e ENERGY_CUTOFF] [-r RIBLAST_OUTPUT] [-o OUTPUT_FILENAME]

Compute potential priming sites using RIBlast.

options:
  -h, --help            show this help message and exit
  -f FASTA_FILE, --fasta-file FASTA_FILE
                        Fasta-formatted file of transcript sequences
  -p PRIMER_SEQUENCE, --primer-sequence PRIMER_SEQUENCE
                        Primer sequence
  -e ENERGY_CUTOFF, --energy-cutoff ENERGY_CUTOFF
                        Energy cutoff for interactions
  -r RIBLAST_OUTPUT, --riblast-output RIBLAST_OUTPUT
                        Path to RIBlast output file
  -o OUTPUT_FILENAME, --output-filename OUTPUT_FILENAME
                        Path where the output gtf should be written
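
As an illustration of the interface above, here is a minimal sketch of how such a command-line parser could be declared with argparse. The option names and help strings are copied from the usage text; the function name build_parser and the float type for the energy cutoff are assumptions, not taken from the tool's source.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the options in the usage text (a sketch)."""
    parser = argparse.ArgumentParser(
        prog="priming-site-predictor",
        description="Compute potential priming sites using RIBlast.",
    )
    parser.add_argument(
        "-f", "--fasta-file",
        help="Fasta-formatted file of transcript sequences",
    )
    parser.add_argument("-p", "--primer-sequence", help="Primer sequence")
    # Assumption: the cutoff is parsed as a float; the usage text gives no type.
    parser.add_argument(
        "-e", "--energy-cutoff", type=float,
        help="Energy cutoff for interactions",
    )
    parser.add_argument(
        "-r", "--riblast-output", help="Path to RIBlast output file",
    )
    parser.add_argument(
        "-o", "--output-filename",
        help="Path where the output gtf should be written",
    )
    return parser


# Parse the same arguments as the final command of the example run.
args = build_parser().parse_args(
    ["--riblast-output", "RIBlast_output_example.txt",
     "--output-filename", "priming_sites.gtf"]
)
```

Options that are not supplied default to None in this sketch; whether the real tool applies other defaults is not stated in the usage text.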

Example

Run RIblast on the test files, then pass its output to priming-site-predictor:

RIblast db -i tests/priming_site_predictor/files/riblast_test_files/dbRNA_test.fa -o tests/priming_site_predictor/files/riblast_test_files/test_db
RIblast ris -i tests/priming_site_predictor/files/riblast_test_files/queryRNA_test.fa -o tests/priming_site_predictor/files/RIBlast_output_example.txt -d tests/priming_site_predictor/files/riblast_test_files/test_db
priming-site-predictor --riblast-output tests/priming_site_predictor/files/RIBlast_output_example.txt --output-filename priming_sites.gtf

License

This software is released under the MIT License, see LICENSE.txt.

Changelog

2022/11/15 Version 0.1.0 was released.

Contributors

Max Bär, Sophie Schnider, Robin Christen (University of Basel)

Acknowledgements

We used the RIblast algorithm created by Tsukasa Fukunaga (https://github.com/fukunagatsu).

Reference

Tsukasa Fukunaga and Michiaki Hamada. "RIblast: An ultrafast RNA-RNA interaction prediction system based on a seed-and-extension approach." Bioinformatics (2017), doi:10.1093/bioinformatics/btx287

Original issue description

https://git.scicore.unibas.ch/zavolan_group/pipelines/scrna-seq-simulation/-/issues/4

Predicting priming sites within transcripts

Compute the probability of internal priming along individual transcripts, given the transcript sequences, the primer sequence, and a cutoff on the energy of interaction; only interactions with energy below the cutoff should be reported. RIblast (https://github.com/fukunagatsu/RIblast) could be used to predict interactions. Given the energy of interaction E_i at position i, the probability of hybridization at that position is p_i = exp(-E_i/kT), where k is the Boltzmann constant (1.380649 × 10^-23 J·K^-1) and T is the temperature in kelvin (we can take T = 298 K, about 25 degrees Celsius). The probability of a position being chosen as a priming site should then be computed as p_i_norm = p_i / sum_j p_j, where the sum runs over positions j.
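
The calculation described above can be sketched as follows. This is a minimal illustration, not the tool's implementation. The function name priming_probabilities and the dictionary input are hypothetical. The sketch also assumes the energies are given in kcal/mol, in which case the molar gas constant R = k × N_A takes the place of k; for energies per molecule in joules, k itself would be used.

```python
import math

# Assumption: energies are in kcal/mol, so the molar gas constant
# R = k * N_A (expressed in kcal/(mol*K)) is used in place of k.
GAS_CONSTANT_KCAL = 1.987204e-3  # kcal / (mol * K)
TEMPERATURE = 298.0              # K, as suggested in the issue text


def priming_probabilities(energies, cutoff, temperature=TEMPERATURE):
    """Return normalized priming probabilities per position.

    energies: dict mapping position -> interaction energy (hypothetical input)
    cutoff:   only interactions with energy < cutoff are kept
    """
    kept = {pos: e for pos, e in energies.items() if e < cutoff}
    if not kept:
        return {}
    # Shifting by the minimum energy avoids overflow in exp(); the common
    # factor cancels in the normalization, so the result is unchanged.
    e_min = min(kept.values())
    rt = GAS_CONSTANT_KCAL * temperature
    weights = {pos: math.exp(-(e - e_min) / rt) for pos, e in kept.items()}
    total = sum(weights.values())
    return {pos: w / total for pos, w in weights.items()}


# Two equally strong sites pass the cutoff; the weak one is discarded.
probs = priming_probabilities({10: -12.0, 50: -12.0, 90: -1.0}, cutoff=-5.0)
```

In this example the two retained positions have equal energies and therefore share the probability equally; a lower (more negative) energy always yields a higher probability.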

Input:

  1. fasta-formatted file of transcript sequences
  2. primer sequence
  3. cutoff for the energy of interaction (energy of interaction should be < cutoff)

Output: gff-formatted file of potential priming sites within each transcript, reporting for each priming site its coordinates within the corresponding transcript and the associated probability.
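
One way such an output record could be written is sketched below, using the nine tab-separated columns of the GTF/GFF layout. The function name format_gtf_line, the source and feature labels, the strand, and the choice to store the probability in the score column are all assumptions made for illustration; the actual tool may lay out its records differently.

```python
def format_gtf_line(transcript_id, start, end, probability,
                    source="RIblast", feature="priming_site", strand="+"):
    """Format one priming site as a nine-column, tab-separated GTF line.

    All labels and the use of the score column for the probability are
    illustrative assumptions, not the tool's documented output.
    """
    attributes = f'transcript_id "{transcript_id}";'
    fields = [
        transcript_id,          # seqname: here the transcript itself
        source,                 # source of the annotation
        feature,                # feature type
        str(start),             # start coordinate within the transcript
        str(end),               # end coordinate within the transcript
        f"{probability:.6f}",   # score: normalized priming probability
        strand,                 # strand
        ".",                    # frame: not applicable
        attributes,             # attribute column
    ]
    return "\t".join(fields)


line = format_gtf_line("tx1", 100, 118, 0.25)
```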

Pipeline overview description

https://git.scicore.unibas.ch/zavolan_group/pipelines/scrna-seq-simulation

Knowing that poly(A) stretches act as priming sites during cDNA synthesis, we predict how likely synthesis is to initiate at every position on each transcript. We assume that this depends on the energy of binding between the poly(T) primer and the stretch of transcript starting at the specified position. We run an external program to predict this energy and thus obtain, for each transcript, a list of positions where priming has non-zero probability (according to hybridization parameters specified by input #11).