Interpretation of Clustering Results and PCA Analysis in Souporcell

Thank you so much for making such a powerful tool. We are trying to solve a practical problem with it.

We have two batches of samples, pre- and post-treatment, each containing 3 samples. Libraries for both batches were prepared with 10x feature barcode technology. Among these samples, we suspect that one sample from the pre-treatment batch and one from the post-treatment batch are from the same individual. The remaining 4 samples are confirmed to be two pre/post pairs. Because the records are unclear, we tried to use Souporcell to determine whether these 2 samples were collected from the same person.

We merged the BAM files from these two batches into one file and then ran Souporcell on it twice, once with 3 clusters and once with 4. With 3 clusters, the suspected duplicate samples were placed into the same cluster. With 4 clusters, however, these two suspected samples were split into different clusters (cluster 0 and cluster 2), while the pre and post samples from the other two individuals remained correctly paired within the same clusters.

I have a few questions regarding this situation:

How could the same sample end up being split into two different clusters (cluster 0 and cluster 2)? I referred to issue #217 (How to calculate likelihood of unknown number of mixed samples split) but did not fully understand how to interpret the results. Could you please provide a detailed explanation?

Based on other responses, we performed PCA analysis and found that the variance for dimension 4 (dim4) was quite low, only 0.23%. There is a difference of opinion in our group regarding this result: some believe that dim4 can be ignored, and thus clusters 0 and 2 should be considered the same, while others disagree.

We further separated the cells from clusters 0 and 2 and analyzed them using Seurat, finding that they seem to have no significant differences.

Could you please provide your insights on these points?

Thank you so much for your help! Li

I'm not sure I completely follow what you have done. A Zoom call might be best so that we can clarify things in real time, but I'll try to give a response here as well.

"We have two batches of samples, pre- and post-treatment, each containing 3 samples. Libraries for both batches were prepared with 10x feature barcode technology. Among these samples, we suspect that one sample from the pre-treatment batch and one from the post-treatment batch are from the same individual. The remaining 4 samples are confirmed to be two pre/post pairs."

This is confusing, so let's break it down. Do you mean
Pre: individuals A, B, C and Post: individuals A, D, E
or do you mean
Pre: individuals A, B, C and Post: individuals A, B, D
or something else?

"We merged the BAM files from these two batches into one file and then ran Souporcell on it twice, once with 3 clusters and once with 4. With 3 clusters, the suspected duplicate samples were placed into the same cluster. With 4 clusters, however, these two suspected samples were split into different clusters (cluster 0 and cluster 2), while the pre and post samples from the other two individuals remained correctly paired within the same clusters."

This is a small consideration, but before merging BAMs you should change the cell barcodes in one BAM from ACCGT...-1 to ACCGT...-2, and make the same change in its barcodes.tsv file. This avoids barcode collisions, where cells from one experiment are treated as the same cell as cells in the other experiment because they happen to share a barcode sequence. Only a few percent of cells will be affected, but it is easy to avoid, so why not.
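To make the barcode renaming concrete, here is a minimal sketch in Python. It assumes the BAM has been streamed as SAM text (for example with `samtools view -h`) and that cell barcodes live in the `CB:Z:` tag with a numeric suffix, as in Cell Ranger output. The function names `retag_barcode` and `retag_sam_line` are made up for illustration and are not part of souporcell.

```python
import re

# A 10x-style cell barcode tag inside a SAM record, e.g. "\tCB:Z:ACGT...-1".
_CB_TAG = re.compile(r"(\tCB:Z:[ACGTN]+)-\d+")
# A bare barcode as it appears on one line of barcodes.tsv, e.g. "ACGT...-1".
_BARCODE = re.compile(r"^([ACGTN]+)-\d+$")


def retag_barcode(barcode: str, suffix: int) -> str:
    """Replace the numeric suffix of one barcode (for barcodes.tsv lines)."""
    match = _BARCODE.match(barcode.strip())
    if match is None:
        raise ValueError(f"unexpected barcode format: {barcode!r}")
    return f"{match.group(1)}-{suffix}"


def retag_sam_line(line: str, suffix: int) -> str:
    """Replace the CB tag suffix in one SAM record; header lines pass through."""
    if line.startswith("@"):
        return line
    return _CB_TAG.sub(lambda m: f"{m.group(1)}-{suffix}", line, count=1)
```

Apply `retag_barcode` to every line of the second experiment's barcodes.tsv and `retag_sam_line` to every line of its SAM stream, convert the result back to BAM (for example with `samtools view -b`), and only then merge. Records without a CB tag are left unchanged.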

"How could the same sample end up being split into two different clusters (cluster 0 and cluster 2)? I referred to issue https://github.com/wheaton5/souporcell/issues/217 (How to calculate likelihood of unknown number of mixed samples split) but did not fully understand how to interpret the results. Could you please provide a detailed explanation?"

When you cluster with an algorithm that takes the number of clusters as input and you give it a number higher than the "true" number of clusters, it will still find some split in the data, along some random aspect of the data that improves the overall loss function. But that split won't improve the loss function as much as splitting out a true cluster would. This is why we use the elbow plots shown in the paper. To make one, cluster with, say, k=1,2,3,4,5, take the total log likelihood for each run (the last line of souporcell.out), and plot it against k. If you see a noticeable elbow, that is the correct number of clusters. From your description I suspect that k=3 will look best, which would mean you have the same 3 individuals in the pre and post samples.
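If it helps, the elbow can also be checked numerically once you have collected the total log likelihood for each k by hand. The sketch below is only a simple heuristic (largest drop in marginal gain), not the procedure from the paper, and the function name `elbow_k` is made up. It assumes consecutive values of k, and the numbers in the usage example are invented purely for illustration.

```python
def elbow_k(loglik_by_k: dict[int, float]) -> int:
    """Return the k after which the gain in log likelihood falls off most."""
    ks = sorted(loglik_by_k)
    if len(ks) < 3:
        raise ValueError("need at least three values of k to look for an elbow")
    # gains[i] is the improvement in log likelihood going from ks[i] to ks[i + 1].
    gains = [loglik_by_k[b] - loglik_by_k[a] for a, b in zip(ks, ks[1:])]
    # drops[i] is how much smaller the next gain is than the gain into ks[i + 1].
    drops = [g0 - g1 for g0, g1 in zip(gains, gains[1:])]
    best = max(range(len(drops)), key=drops.__getitem__)
    return ks[best + 1]


# Hypothetical values, one per souporcell run (NOT real output).
example = {1: -5_000_000.0, 2: -4_200_000.0, 3: -3_600_000.0,
           4: -3_570_000.0, 5: -3_550_000.0}
print(elbow_k(example))
```

With these made-up numbers the gains are large up to k=3 and small afterwards, so the heuristic picks 3. On real data I would still look at the plot, since a weak elbow can fool a rule like this.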

"Based on other responses, we performed PCA analysis and found that the variance for dimension 4 (dim4) was quite low, only 0.23%. There is a difference of opinion in our group regarding this result: some believe that dim4 can be ignored, and thus clusters 0 and 2 should be considered the same, while others disagree.

We further separated the cells from clusters 0 and 2 and analyzed them using Seurat, finding that they seem to have no significant differences."

Maybe, maybe not. I assume you compared expression patterns rather than genetic differences? That may not be conclusive.

For the PCA, what exactly are you doing? I suggest taking the log likelihood values for clusters 0, 1, 2, 3 from clusters_tmp.tsv, normalizing each row by dividing by that row's mean value, and then running a PCA on the result. Go ahead and plot this. Slightly counterintuitively, if there are truly 3 clusters it is the 3rd dimension that will be significantly smaller than the first 2. That is because a single dimension can separate 2 clusters and 2 dimensions can separate 3 clusters. With more than 3 clusters it becomes difficult to map exactly one dimension to each additional cluster, because each dimension may mostly separate one additional cluster but also partially separate others. In figure 1 of the paper, only 2 dimensions separated 5 clusters, but that was lucky. In figure 2 you can see that I needed two plots (dim 1 vs 2 and dim 3 vs 4) to separate all 5 clusters well. So this method isn't going to be very reliable. I would focus on the elbow plot first.
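The normalize-then-PCA step can be sketched as follows. This assumes you have already loaded the per-cell log likelihood columns into a numeric array with one row per cell and one column per cluster. I am deliberately not parsing clusters_tmp.tsv here, so check its column layout yourself. The function name `pca_explained_variance` is made up, and NumPy is my choice of library.

```python
import numpy as np


def pca_explained_variance(loglik: np.ndarray) -> np.ndarray:
    """Fraction of variance per principal component, largest first.

    loglik: array of shape (n_cells, n_clusters) of per-cell log likelihoods.
    """
    loglik = np.asarray(loglik, dtype=float)
    # Normalize each row by dividing by that row's mean value.
    normalized = loglik / loglik.mean(axis=1, keepdims=True)
    # PCA via SVD of the column-centered matrix.
    centered = normalized - normalized.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()
```

One thing to keep in mind when reading the result: after dividing by the row mean, every row sums to the number of clusters, so the last component carries essentially zero variance by construction. For a k=4 run that means dim 4 is uninformative with this normalization, and the comparison that matters is how dim 3 compares with dims 1 and 2.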

wheaton5 / souporcell

Interpretation of Clustering Results and PCA Analysis in Souporcell #241