clustering plot (Figure 1e)

daccachejoe commented 1 year ago

Hi -

I've gotten the tool up and running and it seems to be working well. However, I am trying to understand how well my genotypes are defined/separated. How can I do so? I tried to recreate the clustering plot in the paper in figure 1 but I am not sure what is being used to create those visualizations. Is there a meaningful way to compare the genotypes inferred?

Ultimately this is a QC measure for my data. I want to ensure that my demultiplexing is robust, and having a visualization for that would be helpful.

Thanks! Joe

wheaton5 commented 1 year ago

I've been meaning to make a script for this for a long time. I take the clusters_tmp.tsv file, pull out the log likelihood columns, and normalize them row-wise by dividing by either the mean or the max (I can't remember which). Then I run a PCA on that matrix. Mathematically it doesn't make much sense, but it does provide a nice visualization.
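The steps described above can be sketched roughly as follows. This is not the original script (which is not available in this thread), only a minimal illustration on synthetic data. The matrix `loglik` stands in for the log likelihood columns of clusters_tmp.tsv, the choice of mean rather than max normalization is arbitrary, and all variable names are made up for the example. It uses base R's `prcomp` rather than any particular PCA package.

```r
# Synthetic stand-in for the per-cluster log likelihood columns;
# real values would be read from clusters_tmp.tsv instead.
set.seed(1)
n_cells <- 300
n_clusters <- 4
true_cluster <- sample(n_clusters, n_cells, replace = TRUE)

# Log likelihoods are negative; make the true cluster's value the least negative.
loglik <- matrix(-runif(n_cells * n_clusters, 200, 400), nrow = n_cells)
loglik[cbind(seq_len(n_cells), true_cluster)] <- -runif(n_cells, 50, 100)
colnames(loglik) <- paste0("cluster", seq_len(n_clusters) - 1)

# Row-wise normalization: divide each cell's values by that row's mean.
# (Dividing by the row max is the other option mentioned above.)
normalized <- loglik / rowMeans(loglik)

# PCA on the normalized matrix, then a scatter of the first two components,
# colored by the cluster with the highest log likelihood.
pca <- prcomp(normalized, center = TRUE, scale. = FALSE)
assignment <- max.col(loglik, ties.method = "first") - 1
plot(pca$x[, 1], pca$x[, 2], col = assignment + 1, pch = 16, cex = 0.5,
     xlab = "PC1", ylab = "PC2")
```

On real data, `loglik` would be replaced by the cluster columns from the file, and the coloring by the assignment column.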

daccachejoe commented 1 year ago

Would you mind sharing the code you used to generate that plot? I can try it on my data. It could be the package you used to run PCA, but when I did as you stated with min-max normalization, I got this very odd-looking PCA plot. Knowing me, I probably went astray somewhere along the way, so looking at the original code would be very helpful. Thanks!
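Since "dividing by the max" and "min-max normalization" are different operations, a small sketch of the difference on one made-up row may help when comparing results. The numbers are invented for illustration and do not come from any souporcell output.

```r
# One cell's log likelihoods for four clusters (made-up numbers).
x <- c(-60, -250, -300, -310)

# Dividing by the row max: the best-fitting cluster becomes 1, and
# worse clusters become larger than 1 because the values are negative.
div_max <- x / max(x)

# Min-max scaling: the worst cluster becomes 0 and the best becomes 1,
# so the row's overall scale is discarded.
min_max <- (x - min(x)) / (max(x) - min(x))

print(div_max)
print(min_max)
```

With row-wise min-max scaling, every row contains exactly one 0 and one 1 (barring ties), which can produce a very different PCA picture from dividing by the max or the mean.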

zheng-sc commented 1 year ago

May I have the code for the cluster visualization? Thanks a lot!

wheaton5 commented 1 year ago

Sorry, grant stuff came up and I got busy. I will try to get on this this weekend.

wheaton5 commented 1 year ago

I looked and I don't have the code anymore, so I need to recreate it.

daccachejoe commented 1 year ago

Let me see if my code can help at all. At my lab meeting, the plot above was flagged as likely incorrect, but maybe you can share your thoughts on it.

library(dplyr)
library(ggplot2)
library(readr)
library(FactoMineR)

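# Now for the barcode assignment and plots
# souporcell.dir is assumed to already hold the path to the souporcell output directory
pca.df <- read_delim(file.path(souporcell.dir, "clusters.tsv"), delim = "\t")

# PCA plotting of cells by their genotype scores
# Label unassigned cells and doublets (assignments containing "/") as "NA",
# then keep only the columns starting with "c" (cluster0, cluster1, ...) plus the assignment
pca.df <- pca.df %>%
  mutate(assignment = ifelse(status == "unassigned" | grepl("/", assignment),
                             "NA",
                             assignment)) %>%
  select(matches("^c", ignore.case = FALSE), assignment) %>%
  as.data.frame()

# max normalize the data, its crude and potentially the source of why my plot is weird
# (cluster columns are selected by name so this works for any number of clusters)
cluster.cols <- setdiff(colnames(pca.df), "assignment")
pca.df[, cluster.cols] <- t(apply(pca.df[, cluster.cols], 1, function(x) x / max(x)))
# Run the PCA with assignment as a supplementary qualitative variable
res.pca <- PCA(pca.df, scale.unit = FALSE, ncp = 20, graph = TRUE,
               quali.sup = which(colnames(pca.df) == "assignment"))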
plot.PCA(res.pca, axes=c(1, 2), choix="ind")

res.pca.ggplot.df <- as.data.frame(res.pca[["ind"]][["coord"]])
res.pca.ggplot.df$assignment <- pca.df$assignment
res.pca.ggplot.df %>%
  ggplot(aes(x = Dim.1, y = Dim.2, color = assignment)) +
  geom_point(size = 0.5) +
  scale_color_manual(values = c("red", "green", "blue", "yellow", "gray")) +
  theme_classic()
dev.off()

wheaton5 / souporcell

clustering plot (Figure 1e) #206