wheaton5 / souporcell

Clustering scRNAseq by genotypes

Dimensionality reduction viz based on genotypes #133

Closed · chris-rands closed this issue 2 years ago

chris-rands commented 2 years ago

Thank you for souporcell. In figure 2 of your paper, you have nice PCA plots where the embeddings are based on the cells × variants matrix. How can I generate plots like this from the output of souporcell?

wheaton5 commented 2 years ago

Well, it's a bit of a mathematical atrocity, but for the purposes of visualization it is quite nice. I took the log likelihoods from the clusters.tsv file for each cluster (so ignore the first few columns). Then I normalized each row of log likelihoods (some cells have much more data than others, so this step tries to put all cells in the same general space). By normalize, I mean divide each row by the sum of that row (the atrocity here is that this makes no sense in terms of log probabilities) and multiply by some factor (I think it was 4 or so; I just chose it based on how the end diagram looked). That factor makes the plot more or less spread out. Once these transformations were done, I just did a PCA of the resulting matrix and plotted the significant PCA dimensions against one another.
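
In code, that transform might look roughly like this (a minimal sketch; log_liks is assumed to be an n_cells × n_clusters numpy array of the cluster log-likelihood columns pulled out of clusters.tsv, and normalize_and_embed is just an illustrative name):

import numpy as np
from sklearn.decomposition import PCA

def normalize_and_embed(log_liks, factor=4):
    # Divide each row of log likelihoods by its own sum (the "atrocity"),
    # then scale by an arbitrary spread factor chosen by eye.
    row_sums = log_liks.sum(axis=1, keepdims=True)
    normed = log_liks / row_sums * factor
    # PCA on the normalized matrix; plot the significant components.
    return PCA(n_components=2).fit_transform(normed)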

chris-rands commented 2 years ago

Thank you @wheaton5. This is my attempt to implement your logic. Does it look correct? Oddly, the viz looks better without the normalisation line that I've commented out.

import pandas as pd
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def pca_log_prob(tsv_file, scaling_factor=4):
    df = pd.read_csv(tsv_file, sep="\t")
    # Sum the log likelihood columns for each barcode.
    df["sum_prob"] = df.iloc[:, 3:].sum(axis=1)
    # Row normalization as described above; the viz looked better
    # without it, so it is left commented out.
    # df.iloc[:, 3:-1] = df.iloc[:, 3:-1].div(df.sum_prob, axis=0) * scaling_factor
    x = df.iloc[:, 3:-1]
    # Standardize each column to mean 0, sd 1 before PCA.
    x = StandardScaler().fit_transform(x)
    pca = PCA(n_components=2)
    pcs = pca.fit_transform(x)
    df_pcs = pd.DataFrame(data=pcs, columns=["PC1", "PC2"])
    # Attach the souporcell cluster assignment for coloring.
    df_final = pd.concat([df_pcs, df[["assignment"]]], axis=1)
    sns.scatterplot(data=df_final, x="PC1", y="PC2", hue="assignment")
    return df_final
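
Usage would be something like the following (a sketch, assuming souporcell's clusters.tsv is in the working directory):

import matplotlib.pyplot as plt

# Run the function above on a souporcell run's output and display the plot.
df_final = pca_log_prob("clusters.tsv")
plt.show()
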
chris-rands commented 2 years ago

I'm also thinking there must be a better representation than using the log likelihoods, like doing dimensionality reduction on the variants themselves. Perhaps MCA on a binary matrix of the variant calls, or some kind of PCA using the VCF file?

wheaton5 commented 2 years ago

Variants are sparse so idk, could work but unlikely.


wheaton5 commented 2 years ago

Actually, I'm pretty sure it won't work well. All 4 methods developed to solve this problem have used a cluster-center approach. I personally tried pairwise similarity in a graph approach and a sparse projection approach prior to clustering, and while both showed some structure, neither was anywhere near as accurate as the cluster-center approach. I appreciate the thought process, but I'll save some of your time and say I've already tried the obvious things (PCA of sparse random projections, sparse PCA, tSNE/UMAP of sparse random projections).
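
For anyone who wants to see what the "PCA of sparse random projections" attempt looks like, here is a rough sketch (X is assumed to be a scipy.sparse cell-by-variant matrix of alt-allele counts; the component counts are arbitrary placeholders):

from scipy import sparse
from sklearn.decomposition import PCA
from sklearn.random_projection import SparseRandomProjection

def project_variants(X, n_random=100, n_pcs=2):
    # Randomly project the sparse cell x variant matrix down to a
    # lower-dimensional space that plain PCA can handle.
    proj = SparseRandomProjection(n_components=n_random, random_state=0)
    projected = proj.fit_transform(X)
    if sparse.issparse(projected):
        projected = projected.toarray()  # PCA requires a dense array
    # Ordinary PCA on the projected matrix for a 2D embedding.
    return PCA(n_components=n_pcs).fit_transform(projected)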


chris-rands commented 2 years ago

OK, thanks. I will stick to the code I posted above for a quick viz.

ChantKatrjian commented 1 year ago

Hello Haynes, souporcell works great, I have no computational issues. I just want to ask what the difference is between the barcode-variant outputs clusters.tsv and clusters_tmp.tsv. From my understanding, both produce log-likelihood values for each barcode under their respective clusters; however, it seems like clusters_tmp is forcing doublets or unassigned cells to fit a cluster k.

For visualizations of clusters separated by genotype, I followed your advice in this thread on normalizing with the transformation. I tried normalizing to mean 0 and sd 1, which gave my clusters an almost linear topology, almost as if I can see the axes to which PCA assigned them. Further, for transcriptome dimensionality reductions (tSNEs and UMAPs), do you suggest continuing with this transformation? In issue #154 you note that the PCA in the paper was on clusters_tmp and normalizes each row to sum to 1. Is there a reason for this change, or are you just going with whatever gives the best separation and visual?

[Image: k2_cluster_90_10 — PCA scatter, normalized to mean of 0 and sd of 1]

[Image: k2_cluster_90_10_factor — PCA scatter, each row divided by the sum of the row then multiplied by 4]

wheaton5 commented 1 year ago

@ChantKatrjian When you only have 2 clusters, they can be separated in a single dimension instead of 2D, so you might want to do a histogram or density plot of just PC1 instead of a scatter plot. I chose this normalization because it produced nice, visually appealing plots. For the transcriptome plots, I would stick with the standard tSNE/UMAP with whatever normalization your tool of choice (Seurat, for instance) does.
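
A quick sketch of that density-plot suggestion, reusing the pca_log_prob function posted earlier in this thread (which returns the PC/assignment dataframe):

import seaborn as sns
import matplotlib.pyplot as plt

# df_final has PC1, PC2, and the souporcell cluster assignment per barcode.
df_final = pca_log_prob("clusters.tsv")
plt.clf()  # discard the scatter plot drawn inside the function
sns.kdeplot(data=df_final, x="PC1", hue="assignment", fill=True)
plt.show()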

clusters_tmp is the output of the base clustering; after that, doublets are called, and cells are marked unassigned if their data does not meet some probability threshold of confidence.