Well, it's a bit of a mathematical atrocity, but for the purposes of visualization it is quite nice. I took the log likelihoods from the clusters.tsv file for each cluster (so ignoring the first few columns). Then I normalized each row to put all the rows in the same general space, since some cells will have much more data than others. By normalize, I mean divide each row by the sum of that row (the mathematical atrocity here is that this makes no sense in terms of log probabilities) and multiply by some factor (I think it was 4 or so; I just chose it based on how the final diagram looked). That factor makes the plot more or less spread out. Once these transformations were done, I just ran PCA on the resulting matrix and plotted the significant PCA dimensions against one another.
Thank you @wheaton5. This is my attempt to implement your logic; does it look correct? The visualization actually looks better without the normalisation line, which I have commented out.
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import seaborn as sns

def pca_log_prob(tsv_file, scaling_factor=4):
    df = pd.read_csv(tsv_file, sep="\t")
    # skip the leading annotation columns; the remaining columns are treated
    # as the per-cluster log likelihoods
    df["sum_prob"] = df.iloc[:, 3:].sum(axis=1)
    # row-wise normalisation as described above; commented out because the
    # plot looked better without it
    #df.iloc[:, 3:-1] = df.iloc[:, 3:-1].div(df.sum_prob, axis=0) * scaling_factor
    x = df.iloc[:, 3:-1]
    x = StandardScaler().fit_transform(x)
    pca = PCA(n_components=2)
    pcs = pca.fit_transform(x)
    df_pcs = pd.DataFrame(data=pcs, columns=["PC1", "PC2"])
    df_final = pd.concat([df_pcs, df[["assignment"]]], axis=1)
    sns.scatterplot(data=df_final, x="PC1", y="PC2", hue="assignment")
    return df_final
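Usage would then just be something like this (a minimal sketch, assuming the function above and that the souporcell output file is called clusters.tsv):

import matplotlib.pyplot as plt

df_final = pca_log_prob("clusters.tsv")
plt.show()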
I'm also thinking there must be a better representation than using the log likelihoods, like doing dimensionality reduction on the variants themselves. Perhaps MCA on a binary matrix of the variant calls, or some kind of PCA using the VCF file?
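Roughly what I have in mind is something like this untested sketch (assuming the alt.mtx MatrixMarket matrix that the souporcell/vartrix step writes; the row/column orientation is an assumption and may need transposing so that rows are cells):

from scipy.io import mmread
from sklearn.decomposition import TruncatedSVD

alt = mmread("alt.mtx").tocsr()    # alt-allele counts per variant/cell
calls = (alt > 0).astype(float)    # crude binary "alt allele observed" matrix
# if variants are rows and barcodes are columns, transpose so rows are cells
calls = calls.T
# TruncatedSVD accepts sparse input directly, unlike sklearn's PCA
svd = TruncatedSVD(n_components=2)
embedding = svd.fit_transform(calls)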
Variants are sparse so idk, could work but unlikely.
Actually I'm pretty sure it won't work well. All four methods developed to solve this problem have used a cluster-center approach. I personally tried a pairwise-similarity graph approach and a sparse projection approach prior to clustering, and while both showed some structure, neither was anywhere near as accurate as the cluster-center approach. I appreciate the thought process, but I'll save you some time and say I've already tried the obvious things (PCA of sparse random projections, sparse PCA, tSNE/UMAP of sparse random projections).
OK thanks, I will stick with the code I posted above for a quick viz.
Hello Haynes, souporcell works great and I have no computational issues. I just want to ask what the difference is between the barcode-variant outputs clusters.tsv and clusters_tmp.tsv. From my understanding, both report log likelihoods for each barcode under their respective clusters; however, it seems like clusters_tmp is forcing doublets or unassigned cells to fit cluster k. For visualizations of clusters separated by genotype, I followed your advice in this thread about normalizing with the transformation. I tried normalizing to mean 0 and SD 1, which gave my clusters an almost linear topology, almost as if I can see the axes to which PCA assigned them. Further, for transcriptome dimensionality reductions (tSNEs and UMAPs), do you suggest continuing with this transformation? In issue #154 you note that the PCA in the paper was on clusters_tmp and normalized each row to sum to 1. Is there a reason for this change, or are you just going with whatever gives the best separation and visual?
Mean of 0 and SD of 1.
Divide each row by the sum of that row, then multiply by 4.
@ChantKatrjian When you only have 2 clusters, they can be separated in a single dimension instead of 2D, so you might want to do a histogram or density plot of just PC1 instead of a scatter plot. I chose this normalization because it produced nice, visually appealing plots. For the transcriptome plots, I would stick with the standard tSNE/UMAP with whatever normalization your tool of choice (Seurat, for instance) does.
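For example, a one-liner along those lines (a rough sketch reusing the df_final data frame from the snippet earlier in this thread):

sns.histplot(data=df_final, x="PC1", hue="assignment", kde=True)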
clusters_tmp is the output of the base clustering; after that, doublets are called, and cells are marked unassigned if the data did not meet some probability threshold of confidence.
Thank you for souporcell. In Figure 2 of your paper you have nice PCA plots where the embeddings are based on the cells vs. variants matrix. How can I generate plots like this from the output of souporcell?