zavolanlab / scRNAsim-toolz

A repository for the tools used by scRNAsim.
MIT License

test: #1 transcript sampler #21

Open ninsch3000 opened 8 months ago

ninsch3000 commented 8 months ago

README description

Overview

This workflow samples representative transcripts per gene, in proportion to their relative abundance levels. Transcript counts are drawn by Poisson sampling.

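This workflow takes as input:

  1. A GTF file with the genome annotation
  2. A CSV or TSV file with transcripts and their expression levels
  3. The total number of transcripts to sample

The outputs are:

  1. A GTF file of the representative transcripts
  2. A CSV file of the representative transcripts and their sampled numbers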

Installation from GitHub

Transcript sampler requires Python 3.9 or later.

Install Transcript sampler from GitHub using:

git clone https://git.scicore.unibas.ch/zavolan_group/tools/transcript-sampler.git
cd transcript-sampler
pip install . 

Usage

usage: transcript-sampler [-h] --input_gtf INPUT_GTF --input_csv INPUT_CSV --output_gtf OUTPUT_GTF --output_csv OUTPUT_CSV --n_to_sample N_TO_SAMPLE

Transcript sampler

options:
  -h, --help            show this help message and exit
  --input_gtf INPUT_GTF
                        GTF file with genome annotation (default: None)
  --input_csv INPUT_CSV
                        CSV or TSV file with transcripts and their expression level (default: None)
  --output_gtf OUTPUT_GTF
                        Output path for the new GTF file of representative transcripts (default: None)
  --output_csv OUTPUT_CSV
                        Output path for the new CSV file of representative transcripts and their sampled number (default: None)
  --n_to_sample N_TO_SAMPLE
                        Total number of transcripts to sample (default: None)

Example:

transcript-sampler --input_gtf tests/transcript_sampler/files/test.gtf --input_csv tests/transcript_sampler/files/expression.csv --output_gtf sampled.gtf --output_csv sampled.csv --n_to_sample 100
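
The help text above is consistent with a parser built on Python's argparse. The following is only a sketch of how such an interface could be declared, not the tool's actual source. The function name build_parser and the integer type for --n_to_sample are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Declare the five required options listed in the usage text."""
    parser = argparse.ArgumentParser(
        prog="transcript-sampler",
        description="Transcript sampler",
        # Appends "(default: ...)" to each help string, as in the usage text.
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument("--input_gtf", required=True,
                        help="GTF file with genome annotation")
    parser.add_argument("--input_csv", required=True,
                        help="CSV or TSV file with transcripts and their expression level")
    parser.add_argument("--output_gtf", required=True,
                        help="Output path for the new GTF file of representative transcripts")
    parser.add_argument("--output_csv", required=True,
                        help="Output path for the new CSV file of representative "
                             "transcripts and their sampled number")
    # Assumption: the total is parsed as an integer.
    parser.add_argument("--n_to_sample", required=True, type=int,
                        help="Total number of transcripts to sample")
    return parser


# Parse the same arguments as the example command above.
args = build_parser().parse_args([
    "--input_gtf", "tests/transcript_sampler/files/test.gtf",
    "--input_csv", "tests/transcript_sampler/files/expression.csv",
    "--output_gtf", "sampled.gtf",
    "--output_csv", "sampled.csv",
    "--n_to_sample", "100",
])
print(args.n_to_sample)
```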

Original issue description

https://git.scicore.unibas.ch/zavolan_group/pipelines/scrna-seq-simulation/-/issues/1

Generate the "RNA content" of a single cell by sampling transcripts in proportion to the relative expression levels of their corresponding genes (provided as input), up to a given total transcript count.

Inputs:

  1. CSV-formatted file with gene expression levels ("GeneID,Count")
  2. Total number of transcripts to sample for a single cell

Output: CSV-formatted file ("GeneID,Count") with gene expression levels in a "cell"
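
One way to read this description is as multinomial sampling: each of the N transcripts is assigned to a gene with probability proportional to that gene's expression level, so the counts add up to exactly N. The sketch below illustrates that reading with the standard library only. The function name sample_cell and the three-gene input are hypothetical and not taken from the repository.

```python
import csv
import io
import random
from collections import Counter


def sample_cell(expression, n_total, seed=None):
    """Draw n_total transcripts; each draw picks a gene with probability
    proportional to its expression level."""
    rng = random.Random(seed)
    genes = list(expression)
    weights = [expression[gene] for gene in genes]
    draws = rng.choices(genes, weights=weights, k=n_total)
    return Counter(draws)


# Hypothetical input in the "GeneID,Count" layout described above.
example_csv = "GeneA,10\nGeneB,30\nGeneC,60\n"
expression = {row[0]: float(row[1])
              for row in csv.reader(io.StringIO(example_csv))}

cell = sample_cell(expression, n_total=1000, seed=42)

# Write the simulated cell in the same "GeneID,Count" layout.
for gene_id in expression:
    print(f"{gene_id},{cell[gene_id]}")
```

With this approach the output counts always sum to the requested total, which differs from independent Poisson draws per gene, where the total matches only on average.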

Other issue description

https://git.scicore.unibas.ch/zavolan_group/pipelines/scrna-seq-simulation/-/issues/28

Sample transcript counts given average expression levels

Given a total number of transcripts, their relative abundance in a sample and the genome annotation, sample representative transcripts per gene, in proportion to their relative abundance levels.

Input:

  1. CSV-formatted file ("ID,Level") with expression levels per gene (or per transcript).
  2. Total number of transcripts to sample.
  3. GTF-formatted file with the intron/exon coordinates of the transcripts represented in the expression file.

Output:

  1. GTF-formatted file of the sampled transcripts.
  2. CSV-formatted file ("ID,Count") with the transcript copies for each representative transcript.

First, we pick a representative transcript for each gene in the annotation file. The representative transcript is the one with the highest level of experimental support (lowest transcript support level value). If there are multiple such transcripts for a gene, the one that covers the largest genomic region is chosen (based on the coordinates of the exons).
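
The selection rule can be expressed as a comparison on the pair (support level, negative genomic span). The sketch below assumes the annotation has already been parsed into simple records with an integer support level. The Transcript class, the function pick_representatives and the example records are hypothetical. Real GTF files may lack a support level for some transcripts, which this sketch does not handle.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    transcript_id: str
    gene_id: str
    tsl: int    # transcript support level, 1 = best supported
    start: int  # leftmost exon coordinate
    end: int    # rightmost exon coordinate


def sort_key(transcript):
    # Lower TSL wins; among equal TSL, the larger genomic span wins.
    return (transcript.tsl, -(transcript.end - transcript.start))


def pick_representatives(transcripts):
    """Return {gene_id: Transcript} with one representative per gene."""
    best = {}
    for transcript in transcripts:
        current = best.get(transcript.gene_id)
        # Strict comparison: on a full tie the first transcript seen is kept.
        if current is None or sort_key(transcript) < sort_key(current):
            best[transcript.gene_id] = transcript
    return best


# Hypothetical annotation records.
transcripts = [
    Transcript("T1", "G1", tsl=1, start=100, end=500),
    Transcript("T2", "G1", tsl=1, start=100, end=900),
    Transcript("T3", "G1", tsl=2, start=50, end=2000),
    Transcript("T4", "G2", tsl=3, start=10, end=400),
]
reps = pick_representatives(transcripts)
print(reps["G1"].transcript_id)  # T2: ties T1 on TSL, but spans a larger region
print(reps["G2"].transcript_id)  # T4: the only transcript of G2
```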

Then, we sample transcript counts up to a specified total, in proportion to the gene expression levels given in input 1. The expression levels can be provided either per transcript ID or per gene ID. If transcript expression levels are given, these transcripts are not guaranteed to be the representative ones, so the expression must still be reported per representative transcript. If the expression level is provided per gene, it likewise needs to be assigned to the representative transcript. A dictionary of representative transcript ID : gene ID therefore has to be built first. Then the expression of all transcripts associated with a gene is summed per gene (if the expression values are not already provided per gene), and the resulting gene expression level is transferred to the representative transcript and written out.
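
The bookkeeping described here can be sketched with plain dictionaries. The example assumes that a transcript ID to gene ID mapping for all annotated transcripts is available alongside the representative transcript ID : gene ID dictionary. The function name expression_per_representative and all IDs and values are hypothetical.

```python
from collections import defaultdict


def expression_per_representative(levels, transcript_to_gene, rep_to_gene):
    """Sum expression per gene, then assign it to the representative transcript.

    levels: {ID: level}, where each ID is a transcript ID or a gene ID.
    transcript_to_gene: {transcript ID: gene ID} for all annotated transcripts.
    rep_to_gene: {representative transcript ID: gene ID}.
    """
    per_gene = defaultdict(float)
    for identifier, level in levels.items():
        # Transcript IDs are mapped to their gene; gene IDs pass through as-is.
        gene_id = transcript_to_gene.get(identifier, identifier)
        per_gene[gene_id] += level

    # Invert the dictionary to look up the representative of each gene.
    gene_to_rep = {gene: rep for rep, gene in rep_to_gene.items()}
    return {gene_to_rep[gene]: total
            for gene, total in per_gene.items()
            if gene in gene_to_rep}


# Hypothetical mappings and expression values.
transcript_to_gene = {"T1": "G1", "T2": "G1", "T4": "G2"}
rep_to_gene = {"T2": "G1", "T4": "G2"}

# Expression given per transcript: T1 and T2 are summed for gene G1.
from_transcripts = expression_per_representative(
    {"T1": 5.0, "T2": 7.0, "T4": 3.0}, transcript_to_gene, rep_to_gene)

# Expression given per gene: values are transferred directly.
from_genes = expression_per_representative(
    {"G1": 12.0, "G2": 3.0}, transcript_to_gene, rep_to_gene)

print(from_transcripts)  # {'T2': 12.0, 'T4': 3.0}
print(from_genes)        # {'T2': 12.0, 'T4': 3.0}
```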

Pipeline overview description

https://git.scicore.unibas.ch/zavolan_group/pipelines/scrna-seq-simulation

Pick the number of transcripts coming from each gene. As input #1 we get a file with the expression levels of individual transcripts from some real sample. For simplicity, we first pick a representative transcript per gene, e.g. the one with the most annotation support (transcript support level 1, TSL=1). Then, given a total number of transcripts per cell (input #2), we generate, for each representative transcript, a Poisson sample given the average count from input #1.
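
A minimal sketch of the Poisson step, using only the standard library. It assumes that the average counts are first rescaled so that their sum equals the requested total per cell; that rescaling is an interpretation of the description, not something the source states explicitly. The names poisson and sample_counts and the example values are hypothetical.

```python
import math
import random


def poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    if lam <= 0:
        return 0
    if lam > 500:
        # exp(-lam) would approach underflow; a Poisson variate with mean lam
        # is the sum of two independent Poisson variates with mean lam / 2.
        return poisson(lam / 2, rng) + poisson(lam / 2, rng)
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def sample_counts(mean_levels, n_total, seed=None):
    """Return {transcript ID: sampled count} for one simulated cell."""
    rng = random.Random(seed)
    # Assumption: rescale so the expected total equals n_total.
    scale = n_total / sum(mean_levels.values())
    return {tid: poisson(level * scale, rng)
            for tid, level in mean_levels.items()}


# Hypothetical average expression per representative transcript.
mean_levels = {"T2": 12.0, "T4": 3.0, "T7": 85.0}
counts = sample_counts(mean_levels, n_total=1000, seed=1)
print(counts, sum(counts.values()))
```

Because each transcript is drawn independently, the sampled counts sum to the requested total only in expectation; an individual cell will usually deviate from it slightly.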