wheaton5 / souporcell

Clustering scRNAseq by genotypes
MIT License

Souporcell running w/ aneuploidy sample #223

Open hypaik opened 4 months ago

hypaik commented 4 months ago

Dear Heaton

Thank you for developing this nice tool. I have really enjoyed analyzing my samples with souporcell. Recently, however, I got a very unusual sample: embryo cells at a very early stage with aneuploidy of an autosomal chromosome. When I tried to run souporcell on it (it is a mixed sample of fetal and maternal cells), souporcell reported errors and produced no clustering result at all. Based on the published souporcell paper (Nat. Methods, 2020), I guess this is an issue with the diploidy assumption. If you can share your bioinformatic insight on this issue, please let me know. Thank you.

wheaton5 commented 4 months ago

Can you give more info on the error? I don't think the diploid assumption should make any difference until the last step, which is the estimation of ambient RNA, and that comes after clustering and doublet detection.

hypaik commented 4 months ago

Thank you for your prompt response.

Here is the head of the error messages. FYI, GRCh38_cellRanger.fa is the reference genome file I used. The same reference file causes no problems in my other souporcell runs. Moreover, independent of souporcell, the *h5d file of this sample showed a low doublet rate with Scrublet.

    [proj_xxx]$ head souporcell_K2RPLEndo1.err
    /xxx/proj_Termination/GRCh38_cellRanger.fa: line 1: 1: command not found
    /blues/ngs/data/proj_Termination/GRCh38_cellRanger.fa: line 2: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN: command not found
    /xxxa/proj_Termination/GRCh38_cellRanger.fa: line 3: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN: command not found
    /xxx/proj_Termination/GRCh38_cellRanger.fa: line 4: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN: command not found
    ...

wheaton5 commented 4 months ago

What step is this coming from? Souporcell outputs a .err file for each step. Which file is this? This is pretty weird; it looks like the FASTA is being treated as an executable…

hypaik commented 4 months ago

Thank you for your fast response. I figured out that something was wrong in my own bash script. I use a SLURM-based system at my institute, and that is what caused this weird behavior. I have now fixed the bug :-). However, I am still curious why the diploidy assumption does not impact the results. Can I send you a separate e-mail about this? I found on Google Scholar that your affiliation has changed.

wheaton5 commented 4 months ago

Sure, I just updated my email on Google Scholar. You can find me at whheaton@gmail.com or haynesheaton@auburn.edu

wheaton5 commented 4 months ago

But for the short answer, let's look at the steps.

  1. remap - clearly doesn't require a ploidy assumption
  2. candidate variants (freebayes) - we don't know how many individuals are in the sample or in what ratios, so we can't assume expected allele fractions.
  3. allele assignment to cells (vartrix) - also just whatever the data is
  4. clustering - this could have a diploid assumption, but there are also doublets, ambient RNA, and false positive variants (including RNA editing sites) making this noisier. We have found that making no assumption about allele fractions is more accurate than including one.
  5. doublet detection - we treat this as a statistical urn problem and simply ask the question "was this cell more likely drawn from the alleles of 2 clusters or 1 cluster?"
  6. ambient RNA estimation and genotyping - both of these require a ploidy estimate because both rely on expectations of allele fractions. Currently we only support ploidy 1 and 2, not polyploid. Polyploid support could be added, but most polyploids are allopolyploid, not autopolyploid, and thus will have separate reference chromosomes for each parental lineage.
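
To make steps 5 and 6 concrete, here is a rough Python sketch. It is not souporcell's actual code. It assumes a simple binomial read model, assumes a doublet draws equally from both clusters, and uses hypothetical function names. It shows that the singlet-vs-doublet comparison only needs per-cluster allele fractions as observed (whatever they happen to be, so aneuploid loci are fine), whereas genotyping needs a fixed set of expected fractions that depends on ploidy.

```python
import math

def binom_loglik(alt, total, p):
    """Log-probability of seeing `alt` alt reads out of `total` at alt fraction p."""
    p = min(max(p, 1e-3), 1 - 1e-3)  # clamp so log() stays finite (assumed error floor)
    log_choose = (math.lgamma(total + 1) - math.lgamma(alt + 1)
                  - math.lgamma(total - alt + 1))
    return log_choose + alt * math.log(p) + (total - alt) * math.log(1 - p)

def cell_loglik(cell_counts, fractions):
    """Sum per-locus log-likelihoods; cell_counts is a list of (alt, total) pairs."""
    return sum(binom_loglik(alt, total, p)
               for (alt, total), p in zip(cell_counts, fractions))

def singlet_vs_doublet(cell_counts, cluster_a, cluster_b):
    """Urn-style comparison: one cluster's alleles, or an even mix of two?

    cluster_a / cluster_b are per-locus alt-allele fractions estimated from
    the data, with no ploidy constraint on their values.
    """
    mixed = [(a + b) / 2 for a, b in zip(cluster_a, cluster_b)]  # assumed 50/50 mix
    return {
        "singlet_a": cell_loglik(cell_counts, cluster_a),
        "singlet_b": cell_loglik(cell_counts, cluster_b),
        "doublet": cell_loglik(cell_counts, mixed),
    }

def expected_alt_fractions(ploidy):
    """Alt fractions a genotype can take at a given ploidy (used only in step 6)."""
    return [k / ploidy for k in range(ploidy + 1)]

# Hypothetical toy data: three loci, two clusters with opposite genotypes.
cluster_a = [0.0, 1.0, 0.5]
cluster_b = [1.0, 0.0, 0.5]
cell = [(5, 10), (5, 10), (5, 10)]  # reads look like a mix of both clusters
scores = singlet_vs_doublet(cell, cluster_a, cluster_b)
best_call = max(scores, key=scores.get)
```

Note that a cluster fraction such as 0.33 at a trisomic locus would pass through `singlet_vs_doublet` unchanged, while `expected_alt_fractions(2)` has no such value. That is the sense in which only the last step depends on ploidy.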