wheaton5 / souporcell

Clustering scRNAseq by genotypes
MIT License

Interpretation of Clustering Results and PCA Analysis in Souporcell #241

Open lllifan opened 1 month ago

lllifan commented 1 month ago

Thank you so much for making such a powerful tool. We are trying to solve a practical problem with it.

We have two batches of samples, pre-treatment and post-treatment, each containing 3 samples. Libraries for both batches were prepared with 10x Feature Barcode technology. Among these samples, we suspect that one sample from the pre-treatment batch and one from the post-treatment batch are from the same individual. The remaining 4 samples are confirmed to be two pre/post pairs. For various reasons the records are unclear, so we tried to use Souporcell to determine whether these 2 samples were collected from the same person.

We merged the BAM files from these two batches into one file and then ran Souporcell twice, with the number of clusters set to 3 and to 4. With 3 clusters, the two suspected duplicate samples were placed in the same cluster. With 4 clusters, however, these two samples were split into different clusters (cluster 0 and cluster 2), while the pre and post samples from the other two individuals remained correctly paired within the same clusters.
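To make the comparison between the two runs concrete, here is a minimal sketch of the kind of cross-tabulation described above: counting cells per (sample of origin, cluster) pair for each run. The function name `crosstab`, the barcodes, and the sample labels are all hypothetical. It also assumes that each cell's sample of origin is known independently (for example from the batch it was sequenced in) and that the cluster of each barcode has already been read from the Souporcell output. Parsing that output is not shown here.

```python
from collections import Counter

def crosstab(sample_of_barcode, cluster_of_barcode):
    """Count cells per (sample, cluster) pair for barcodes present in both maps."""
    counts = Counter()
    for barcode, sample in sample_of_barcode.items():
        cluster = cluster_of_barcode.get(barcode)
        if cluster is not None:  # skip barcodes without a cluster call
            counts[(sample, cluster)] += 1
    return counts

# Toy data for illustration only (not taken from this issue).
sample_of_barcode = {
    "bc1": "pre_X", "bc2": "pre_X",    # suspected same individual, pre
    "bc3": "post_X", "bc4": "post_X",  # suspected same individual, post
    "bc5": "pre_Y", "bc6": "post_Y",   # a confirmed pre/post pair
}
clusters_k3 = {"bc1": 0, "bc2": 0, "bc3": 0, "bc4": 0, "bc5": 1, "bc6": 1}
clusters_k4 = {"bc1": 0, "bc2": 0, "bc3": 2, "bc4": 2, "bc5": 1, "bc6": 1}

table_k3 = crosstab(sample_of_barcode, clusters_k3)
table_k4 = crosstab(sample_of_barcode, clusters_k4)

for name, table in (("k=3", table_k3), ("k=4", table_k4)):
    print(name)
    for (sample, cluster), n in sorted(table.items()):
        print(f"  {sample}\tcluster {cluster}\t{n} cells")
```

In this toy example the two `*_X` samples share cluster 0 in the first run and separate into clusters 0 and 2 in the second, which mirrors the pattern we observed. A table like this also shows whether the split follows the sample boundary cleanly or only partially.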

I have a few questions regarding this situation:

How could the same sample end up being split into two different clusters (cluster 0 and cluster 2)? I referred to issue #217 (How to calculate likelihood of unknown number of mixed samples split) but did not fully understand how to interpret the results. Could you please provide a detailed explanation?

Based on other responses, we performed PCA analysis and found that the variance explained by dimension 4 (dim4) was quite low, only 0.23%. There is a difference of opinion in our group regarding this result: some believe that dim4 can be ignored, and thus clusters 0 and 2 should be considered the same, while others disagree.

We further separated the cells from clusters 0 and 2 and analyzed them using Seurat, finding that they seem to have no significant differences.
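For reference, the sketch below shows how a figure like the 0.23% for dim4 is computed, and one way to check whether a low-variance component still separates two groups of cells. All numbers are made up for illustration, and the function names are hypothetical. It assumes the per-component variances and the per-cell dim4 scores have already been exported from whatever PCA was run. A component can explain little of the total variance and still separate two clusters along its axis, so the two quantities are worth looking at separately.

```python
from statistics import mean, pstdev

def explained_variance_ratio(component_variances):
    """Fraction of the total variance carried by each principal component."""
    total = sum(component_variances)
    return [v / total for v in component_variances]

def separation(scores_a, scores_b):
    """Absolute difference of group means along one component,
    in units of the pooled within-group standard deviation."""
    pooled_sd = ((pstdev(scores_a) ** 2 + pstdev(scores_b) ** 2) / 2) ** 0.5
    return abs(mean(scores_a) - mean(scores_b)) / pooled_sd

# Made-up values for illustration only (not from this issue).
variances = [60.0, 30.0, 9.0, 1.0]      # variance along dim1..dim4
ratios = explained_variance_ratio(variances)

dim4_cluster0 = [0.9, 1.0, 1.1, 1.0]    # dim4 scores of cells in cluster 0
dim4_cluster2 = [-1.1, -1.0, -0.9, -1.0]  # dim4 scores of cells in cluster 2
sep = separation(dim4_cluster0, dim4_cluster2)

print([round(r, 4) for r in ratios])
print(round(sep, 1))
```

Here dim4 carries only a small share of the total variance, yet the two toy groups are many standard deviations apart along it. If the real dim4 scores of clusters 0 and 2 overlap instead, that would support treating them as one cluster.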

Could you please provide your insights on these points?

Thank you so much for your help!

Li

wheaton5 commented 1 month ago

I'm not sure I completely follow what you have done. Maybe a Zoom call would be best so that we can clarify things in real time? But I'll try to give a response here as well.

"We have two batches of samples, pre and post-treatment, each containing 3 samples. Both batches were subjected to 10x feature barcode technic to create libraries. Among these samples, we suspect that one sample from the pre-treatment batch and one from the post-treatment batch are from the same individual. The remaining 4 samples are confirmed to be two pairs for pre-post comparison."

"We merged the BAM files from these two batches into one file and then separated them into 3 and 4 clusters respectively. We found that when we set the number of clusters to 3, the suspected duplicate samples were placed into the same cluster. However, when we set the number of clusters to 4, these two suspected samples were split into different clusters (cluster 0 and cluster 2), while the pre and post samples from the other two individuals remained correctly paired within the same clusters."

"How could the same sample end up being split into two different clusters (cluster 0 and cluster 2)? I referred to issue https://github.com/wheaton5/souporcell/issues/217 (How to calculate likelihood of unknown number of mixed samples split) but did not fully understand how to interpret the results. Could you please provide a detailed explanation?"

"Based on other responses, we performed PCA analysis and found that the variance for dimension 4 (dim4) was quite low, only 0.23%. There is a difference of opinion in our group regarding this result: some believe that dim4 can be ignored, and thus clusters 0 and 2 should be considered the same, while others disagree.

We further separated the cells from clusters 0 and 2 and analyzed them using Seurat, finding that they seem to have no significant differences."