achamess opened this issue 5 years ago
btw, the new download link for the singularity image doesn't download a .sif but rather a .vcf file.
Sorry, that was a mistake by 2am me. The link is now updated, and I downloaded it and looked at it this time :/ The last link downloaded the right file, but named it something else...
Yes, -k is the number of expected donors in your sample. In the preprint I show several elbow plots of the total loss value vs the number of cluster centers. In these plots, the value should drop precipitously until you reach the correct number of donors and then level off. But if you know how many donors your data has, just run it with that k.
Hope this helps.
Best, Haynes
Thanks!
How can I produce that Elbow plot like you did in your paper using the outputs from souporcell? I'm just curious now. I know the ground truth for my sample - it should be 2. But I ran with K = 4 the first time. It'd be reassuring to see that drop.
Well, a larger k will always produce a smaller loss (assuming both runs reach the global minimum). There are methods that attempt to get around this with correction factors when your system is a probability model, such as the BIC, but the question of how many clusters are in your data is actually a mathematically poorly conditioned one. So people try heuristic methods, and the elbow plot is one of them. To create one, run souporcell with multiple values of k, then look at the end of the souporcell.log file for something that looks like this:
(32465.56047216546, array([[ -4.66493068, -27.46838447, -19.21818849, -23.4081172 ], [-70.95994268, -60.4645178 , -67.52086246, -15.37279821], [-43.59709557, -46.68289744, -38.16539245, -11.40446358], ..., [-30.92251152, -32.32937607, -9.12595796, -35.49927364], [-49.48622929, -55.88473019, -51.04097071, -15.64365378], [ -7.57836496, -33.01162175, -34.73509733, -33.66428497]])) [0 3 3 ... 2 3 0]
That first number (32465.56...) is the total loss. Then you just plot those values against their respective k's.
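For example, a rough sketch (untested), assuming one run directory per k named k2, k3, ... (adjust to your layout), each with a souporcell.log ending in a line like the one above:

for k in 2 3 4 5 6; do
    # the total loss is the first number inside the "(loss, array(...))" line
    loss=$(grep -oE '^\([0-9]+\.[0-9]+' k${k}/souporcell.log | tail -n 1 | tr -d '(')
    echo -e "${k}\t${loss}"
done > elbow.tsv    # two columns (k, loss), ready to plot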
I should make an option to run a range of k's and plot the elbow curve for you.
That could be useful. Thanks for the explanation. It'd take a long time though, no?
Thanks for making this software. I'm hopeful it can allow me to multiplex more human samples without needing to barcode/tag nuclei during the data acquisition process.
Incidentally, my data are tagged with MULTI-seq oligos, so I can compare the method, physical vs computational demuxing. I'll let you know if I do it.
That is great. We have had some users compare souporcell on multiplexed PBMCs that were cell hashed with antibody tags, and they got near 100% concordance. But you are the first I know of to run souporcell on nuclei.
labeling this as an enhancement: automate elbow plot creation
I'm interested in trying a run with a range of K as you describe. I'm running with the singularity container, but it doesn't produce a souporcell.log file. Where is the total loss value currently reported?
I know this is a pretty stale post, but I was able to generate an elbow plot of the total log probability, which is output to stderr when running souporcell. I admit I don't completely understand the math behind souporcell, but I think this probability would have the same relationship across a range of k that the loss would, no? (@wheaton5)
To capture stderr output across a range of k I did the following:
SIF=MY_SIF_LOCATION
OUT=OUTPUT_FOLDER_FROM_UPSTREAM_STEPS
ks=(1 2 3 4 5 6 7 8)
for k in ${ks[@]}; do
OUTDIR=${OUT}/souporcell_${k}
mkdir -p $OUTDIR
# run souporcell clustering for this k; the total log probability is written to stderr
singularity exec $SIF /opt/souporcell/souporcell/target/release/souporcell -a alt.mtx -r ref.mtx -b outbcs.tsv.gz -k $k -t 35 > $OUTDIR/clusters_tmp.tsv 2> $OUTDIR/log.tsv
# call doublets and produce the final cluster assignments
singularity exec $SIF troublet -a alt.mtx -r ref.mtx --clusters $OUTDIR/clusters_tmp.tsv > $OUTDIR/clusters.tsv
done
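As a quick sanity check before plotting, you can grep the value out of each log directly (this assumes the logs contain the "best total log probability = ..." line that the R code below parses):

for k in ${ks[@]}; do
    # each log should end with a line like "best total log probability = -12345.6"
    grep "best total log probability" ${OUT}/souporcell_${k}/log.tsv | tail -n 1
done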
Didn't take too long with 35 cores... Then to generate a plot in R, I run:
library(ggplot2)

OUT <- "OUTPUT_FOLDER_FROM_UPSTREAM_STEPS"
max_k <- 8
logprobs <- sapply(1:max_k, function(k){
logfile <- file.path(OUT, paste0("souporcell_", k), "log.tsv")
# read the second-to-last line of the log, which holds "best total log probability = ..."
string <- scan(logfile, sep="\t", what=character(), skip=length(readLines(logfile))-2)[1]
as.numeric(gsub("best total log probability = ", "", string))
})
df <- as.data.frame(logprobs)
df$k <- 1:max_k
ggplot(df, aes(x=k, y=logprobs)) + geom_point(size=4) + geom_path()
Looks like this (2 known genotypes in this sample):
@scfurl I don't understand the question. This all looks good to me. It looks like 3 or 2 clusters is correct. I'd say 3, but you seemed to indicate it was 2. Anyway, if you have a question, could you be more specific?
Thanks for the quick read @wheaton5. In my reading of the above post, you referred to an elbow plot of the loss over a range of k, but I plotted the total log probability (I also now see that you published plots of the total log probability, not the loss, in your paper...).
I agree that as a heuristic for these data the log probability looks great. So, to potentially close this issue, I was asking whether this discussion should be refocused to emphasize plots of the total log probability as the heuristic of choice (which is already implemented in the current version), and thus whether no enhancement is needed...
Yes, this is correct. The preprint version used a loss function, and in the review process we changed to a statistical model with a log likelihood. So what you are doing is correct.
Can you explain the -k parameter a bit more? That's for clusters, I see. But what does that mean? Is -k the expected number of different donors we have in our sample?
I used the default -k 4 from the code in the README. In my sample, however, I only had 2 human donors. I'm thinking this was a mistake now, because I have 0,1,2,3 assignments in my output, which doesn't jibe with the 2 donors I had.
Is that correct? -k should be number of expected donors/subjects in the sample?