zavolanlab / scRNAsim-toolz

A repository for the tools used by scRNAsim.
MIT License

feat: Specify distributions that will be used for modelling transcription, fragmentation and sampling of reads. #29

Open magmir71 opened 8 months ago

magmir71 commented 8 months ago

1. It is currently unclear how exactly the transcription process is modeled. It seems that transcription is modeled with a Poisson distribution using just one parameter (the average expression level). However, at least two parameters are needed to specify something like a negative binomial (NB) distribution, and zero-inflation requires yet another parameter. Most popular scRNA-seq expression quantification tools assume an NB or zero-inflated NB distribution.
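To illustrate the parameter count in point 1, here is a minimal sketch of zero-inflated NB sampling using only the Python standard library. The NB component is drawn as a gamma-Poisson mixture. The function names and the parameter names `mean`, `dispersion` and `zero_prob` are hypothetical and do not come from scRNAsim.

```python
import random


def sample_poisson(lam, rng):
    """Draw a Poisson(lam) count by counting unit-rate arrivals in [0, lam].

    Simple and numerically stable, but O(lam) per draw, so it is only a sketch.
    """
    if lam <= 0:
        return 0
    count = 0
    t = rng.expovariate(1.0)
    while t <= lam:
        count += 1
        t += rng.expovariate(1.0)
    return count


def sample_zinb(mean, dispersion, zero_prob, rng):
    """Draw one count from a zero-inflated negative binomial.

    mean       -- mean of the NB component (hypothetical parameter name)
    dispersion -- NB size parameter r; variance = mean + mean**2 / r
    zero_prob  -- probability of a structural zero (zero inflation)
    """
    if mean <= 0 or rng.random() < zero_prob:
        return 0
    # Gamma-Poisson mixture: the rate is gamma distributed with
    # shape r and scale mean / r, so its expectation equals `mean`.
    lam = rng.gammavariate(dispersion, mean / dispersion)
    return sample_poisson(lam, rng)


rng = random.Random(0)
counts = [sample_zinb(5.0, 2.0, 0.3, rng) for _ in range(20000)]
print(sum(counts) / len(counts))          # sample mean of the simulated counts
print(counts.count(0) / len(counts))      # fraction of zeros
```

With `zero_prob = 0` and a very large `dispersion`, this sketch approaches the one-parameter Poisson model, so the three parameters nest the simpler case.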

2. There is also a parameter "total number of reads", which is used in the last step of the pipeline. I assume it uses a multinomial distribution whose probability vector corresponds to the simulated transcript counts from the transcript generation step. However, other multivariate distributions could reflect the actual data much better. E.g., the Dirichlet-multinomial distribution is often used to model overdispersed multinomial counts.

Moreover, "total number of reads" is quite misleading, because it seems that the number actually represents the number of reads after deduplication. I think it would be much more informative and useful if one could specify the number of PCR cycles for the two amplification steps: before fragmentation and after fragmentation.
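As a rough sketch of the Dirichlet-multinomial alternative from point 2: the read-sampling probabilities are first drawn from a Dirichlet distribution centered on the transcript proportions, and reads are then sampled multinomially from those probabilities. The function name and the `concentration` parameter are assumptions made for illustration, not part of the existing tool.

```python
import random
from collections import Counter


def sample_dirichlet_multinomial(transcript_counts, total_reads, concentration, rng):
    """Distribute total_reads over transcripts with extra-multinomial variance.

    transcript_counts -- simulated transcript abundances (non-negative numbers)
    concentration     -- hypothetical overdispersion knob; larger values
                         bring the result closer to a plain multinomial
    """
    total = sum(transcript_counts)
    if total <= 0:
        raise ValueError("at least one transcript must have a positive count")
    # Dirichlet parameters proportional to the transcript abundances.
    alphas = [concentration * c / total for c in transcript_counts]
    # A Dirichlet draw is a vector of independent gamma draws, normalized;
    # random.choices normalizes the weights itself.
    weights = [rng.gammavariate(a, 1.0) if a > 0 else 0.0 for a in alphas]
    picks = rng.choices(range(len(weights)), weights=weights, k=total_reads)
    tally = Counter(picks)
    return [tally.get(i, 0) for i in range(len(weights))]


rng = random.Random(42)
reads = sample_dirichlet_multinomial([10, 0, 30, 60], 1000, 50.0, rng)
print(reads)  # four read counts that sum to 1000
```

Transcripts with zero abundance receive zero weight and therefore never get reads, which matches the behavior of the plain multinomial model.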

3. It seems that the fragmentation step produces just one fragmented cDNA from each original full-length cDNA. However, in real 10x Genomics data, there are multiple fragmented cDNAs from the same transcript, because fragmentation is done after the first PCR amplification.
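A toy sketch of the order of operations described in point 3: a single cDNA molecule is amplified first, and every resulting copy is then fragmented independently, so one original transcript yields several distinct fragments. All names here are hypothetical, and the uniform start position with a fixed fragment length is a simplifying assumption rather than a model of the actual 10x chemistry.

```python
import random


def amplify(n_molecules, cycles, efficiency, rng):
    """Simulate PCR: in each cycle every molecule is duplicated
    with probability `efficiency` (assumed constant across cycles)."""
    for _ in range(cycles):
        duplicated = sum(1 for _ in range(n_molecules) if rng.random() < efficiency)
        n_molecules += duplicated
    return n_molecules


def fragment_copies(transcript_length, n_copies, fragment_length, rng):
    """Cut each PCR copy independently and return one (start, end)
    fragment per copy, using 0-based half-open coordinates."""
    frag_len = min(fragment_length, transcript_length)
    fragments = []
    for _ in range(n_copies):
        # Simplifying assumption: uniform start, fixed fragment length.
        start = rng.randint(0, transcript_length - frag_len)
        fragments.append((start, start + frag_len))
    return fragments


rng = random.Random(1)
copies = amplify(1, cycles=5, efficiency=0.9, rng=rng)
fragments = fragment_copies(1000, copies, 300, rng)
print(copies, len(set(fragments)))  # number of copies and of distinct fragments
```

Exposing the cycle counts of both amplification steps, as suggested in point 2, would fit naturally here: the first count feeds `amplify` before fragmentation, and a second call after fragmentation would model the library amplification.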