zavolanlab / scRNAsim-toolz

A repository for the tools used by scRNAsim.
MIT License
1 stars 0 forks source link

test: #3 sequence extractor #23

Open ninsch3000 opened 8 months ago

ninsch3000 commented 8 months ago

README description

Extract transcript sequences

Project aim:

Given a gtf specification of transcript exon/intron structures and the genome sequence, construct the nucleotide sequence of the transcripts and add poly(A) tails.

Input:

Output:

For each transcript, the list of exons should be traversed from 5' to 3', the sequences of the exons need to be extracted from the genome given the coordinates and then pasted together. At the end, a tail of the specified length should be added at the 3' end of the transcript, given a vector of mono-nucleotide frequencies (of course, the frequency of A's will be much higher than of any other nucleotide).

Design plan

1- Obtain gtf file and also generate a test file for code validation (sampled transcript gtf, from Group 2) :

a. To use as main test file: Reference gtf file
b. Subset the gtf file so that only the rows labelled exon remain
c. Extract gtf file data to make a BED file (bedtools getfasta -split only works on BED12 files)
d. For the sample test file, use just one gene based on gene id. Later on, the sample test file can be expanded as required

2- To run bedtools on the genome with data from sample/final bed file and extract exon sequences for all transcripts (bedtools getfasta) :

a. To use as genome sequence: https://ftp.ensembl.org/pub/release-107/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz (Human Genome release 107 from Ensembl)
b. Using bedtools getfasta and its options, get the output c. Output will be: transcript sequences containing all exon data in 5’ to 3’ direction without poly A tail

3-Function to add poly A tail at the 3’ end (right side) of RNA sequence:

a. Input: The output transcripts from the previous step b. This is based on length of tail (sample a length of 250 nucleotides)
c. It needs to take into account the probability (weight/relative frequency) of each nucleotide (with A having the largest frequency defined)- this will be saved as a dictionary. {‘A’:freqA,‘U’:freqU,‘G’:freqG,‘C’:freqC}
d. Append via e.g. str.join(), str.ljust()
e. Output the final transcript sequences as a .fasta file. (Final Output)

Installation

TBA

Usage/Examples

sequence-extractor --mode pre_bedtools --input-gtf-file tests/sequence_extractor/files/test.gtf --output-bed-file tests/sequence_extractor/files/output.bed
gunzip -k tests/sequence_extractor/files/human.chr1.fa.gz 
bedtools getfasta -fi tests/sequence_extractor/files/human.chr1.fa -bed tests/sequence_extractor/files/output.bed -name -s -fo tests/sequence_extractor/files/output.fasta
sequence-extractor --mode post_bedtools --input-fasta-file tests/sequence_extractor/files/output.fasta --polyA-length 250 --output-file-name polyA_output.fa

sequence-extractor --mode post_bedtools --input-fasta-file tests/sequence_extractor/files/post_bedtools_test.fa --polyA-length 250 --output-file-name polyA_output.fa

Original issue description

https://git.scicore.unibas.ch/zavolan_group/pipelines/scrna-seq-simulation/-/issues/3

Extract transcript sequences

Given a gtf specification of transcript exon/intron structures and the genome sequence, construct the nucleotide sequence of the transcripts and add poly(A) tails.

Input:

  1. Gtf file with exon/intron structures of transcripts
  2. File with genome sequence
  3. Length of the poly(A) tail
  4. Dictionary of expected nucleotide frequencies in poly(A) tail

Output: fasta-formatted file of transcript sequences

For each transcript, the list of exons should be traversed from 5' to 3', the sequences of the exons need to be extracted from the genome given the coordinates and then pasted together. At the end, a tail of the specified length should be added at the 3' end of the transcript, given a vector of mono-nucleotide frequencies (of course, the frequency of A's will be much higher than of any other nucleotide).

Note, the poly(A) tail can be added by a padding function, which could be used on other contexts as well.

One could use bedtools getfasta for extracting the transcript sequences. The gene annotations would need to be filtered first to only include the features we want to extract. This could easily done with awk, sed or even grep. See this blog post on how to filter GTF files with awk.

This issue then does not require any actual coding, one would just need to add 3 steps to the Nextflow workflow (or, perhaps better, create a Nextflow subworkflow with these 3 steps, then import it in the main one):

  1. Filter gene annotations, keep only features of type transcript
  2. Extract transcript sequences with bedtools getfasta
  3. Run the script from issue #25 / merge request !8

Pipeline overview description

https://git.scicore.unibas.ch/zavolan_group/pipelines/scrna-seq-simulation To start the "sample preparation" process, the transcript sequences need to be constructe based on the structure specified in the gff/gtf file and the genome sequence. Poly(A) tails are added to transcripts accoroding to input #10.