wmacnair / psupertime

psupertime is pseudotime ordering for single cell RNA-seq data with sequential labels
GNU General Public License v3.0

Cell number bias #19

Closed dylanmr closed 4 years ago

dylanmr commented 4 years ago

Hi Will!

Thanks for this amazing package! I have a dataset in which I collected brain tissue across multiple time points, and I am interested in understanding how similar or different the maturation of different neuronal populations is. My idea was to run psupertime on each population individually and then test how well the model learned for each population works on all the others. However, I have sampled significantly more cells from some populations than from others.

My question is: do you have any idea as to how the differences in cell number will influence the model? Would you suggest downsampling cells to the smallest population?
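
For concreteness, the downsampling option could be sketched as below. This is only a minimal illustration in plain Python, not part of psupertime: the function name `downsample_to_smallest`, the population names, and the cell IDs are all hypothetical, and the sampling ignores timepoint labels.

```python
import random


def downsample_to_smallest(populations, seed=0):
    """Randomly subsample every population to the size of the smallest one.

    `populations` maps a (hypothetical) population name to a list of cell IDs.
    Sampling is without replacement and ignores timepoint labels.
    """
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    target = min(len(cells) for cells in populations.values())
    return {
        name: rng.sample(cells, target)
        for name, cells in populations.items()
    }


# Hypothetical example: one large and one small population.
populations = {
    "pop_A": [f"A_{i}" for i in range(500)],
    "pop_B": [f"B_{i}" for i in range(120)],
}
balanced = downsample_to_smallest(populations)
```

After this step every population contributes the same number of cells, so differences in model performance are less likely to reflect sample size alone.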

Thanks! Dylan

wmacnair commented 4 years ago

Hi Dylan

Thanks for this, it's an interesting question. I've been thinking about it, and I think the answer depends on exactly how you're going to measure performance. Could you give a bit more detail about what you're planning to do? Are you thinking of measuring, e.g., accuracy or cross-entropy of timepoint prediction?
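
To make the two candidate measures concrete, here is a small sketch of how accuracy and cross-entropy could be computed from per-cell predicted label probabilities. It assumes the probabilities are already available as one dictionary per cell; the function names and the timepoint labels (`E12`, `E14`, `P0`) are hypothetical, and nothing here is psupertime's own API.

```python
import math


def accuracy(true_labels, predicted_probs):
    """Fraction of cells whose most probable label equals the true label."""
    correct = 0
    for label, probs in zip(true_labels, predicted_probs):
        predicted = max(probs, key=probs.get)  # label with highest probability
        if predicted == label:
            correct += 1
    return correct / len(true_labels)


def cross_entropy(true_labels, predicted_probs, eps=1e-12):
    """Mean negative log-probability assigned to the true label."""
    total = 0.0
    for label, probs in zip(true_labels, predicted_probs):
        # Clip at eps so a probability of zero does not give log(0).
        total -= math.log(max(probs.get(label, 0.0), eps))
    return total / len(true_labels)


# Hypothetical predictions for three cells.
true_labels = ["E12", "E14", "P0"]
predicted_probs = [
    {"E12": 0.70, "E14": 0.20, "P0": 0.10},
    {"E12": 0.40, "E14": 0.35, "P0": 0.25},
    {"E12": 0.10, "E14": 0.30, "P0": 0.60},
]
acc = accuracy(true_labels, predicted_probs)
xent = cross_entropy(true_labels, predicted_probs)
```

Accuracy only looks at the top-ranked label, while cross-entropy also penalises low confidence in the true label, so the two can rank models differently.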

Cheers Will

wmacnair commented 4 years ago

A couple of further thoughts:

Suppose you have two populations, one with many cells, one with few. If the differences between early and late timepoints are very clear, then I think for both populations you'll probably see good prediction accuracy, i.e. psupertime being able to estimate the timepoint label of a given cell. If the differences are more subtle, then prediction may be easier for the larger population, which has more cells to learn from.

A stronger difference might be in terms of the genes reported. With a large population, there's sufficient evidence (i.e. cells) to support many genes being given non-zero coefficients. With a smaller population, the gain from including a gene in the regression may not be sufficient to outweigh the regularization penalty. So I would expect large differences between the two populations in the number of genes reported.
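
A toy illustration of that trade-off, under strong simplifying assumptions: a single gene, ordinary L1-penalized least squares (not the ordinal logistic model psupertime fits), noise-free data, and a fixed penalty. In practice the penalty strength would typically be tuned, so this only shows the direction of the effect. The function name and all numbers are hypothetical.

```python
def lasso_coef_single_feature(x, y, penalty):
    """Minimiser of 0.5 * sum((y_i - b * x_i)**2) + penalty * abs(b).

    For one feature the solution is a soft-threshold: the coefficient is
    exactly zero unless |sum(x_i * y_i)| exceeds the penalty.
    """
    xy = sum(xi * yi for xi, yi in zip(x, y))
    xx = sum(xi * xi for xi in x)
    if abs(xy) <= penalty:
        return 0.0  # evidence too weak to outweigh the penalty
    shrunk = abs(xy) - penalty
    return (shrunk if xy > 0 else -shrunk) / xx


def toy_population(n_cells, effect=0.1):
    """Hypothetical data: expression alternates +1/-1, response = effect * x."""
    x = [1.0 if i % 2 == 0 else -1.0 for i in range(n_cells)]
    y = [effect * xi for xi in x]
    return x, y


penalty = 5.0
coef_small = lasso_coef_single_feature(*toy_population(20), penalty)
coef_large = lasso_coef_single_feature(*toy_population(200), penalty)
```

Here the same weak effect is dropped for the 20-cell population (`coef_small` is zero) but retained, slightly shrunk, for the 200-cell population, because the evidence term grows with the number of cells while the penalty stays fixed.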

I hope this helps. I'm closing the issue (but very happy to reopen for more discussion).

Will