wheaton5 / souporcell

Clustering scRNAseq by genotypes
MIT License

Missing cluster_genotypes.vcf #50

Closed drneavin closed 4 years ago

drneavin commented 4 years ago

Hi @wheaton5, I have noticed that on some of my souporcell pipeline runs the cluster_genotypes.vcf is missing, and when I restart souporcell for that sample, it reports that it has finished. However, if I run consensus.py manually, the cluster_genotypes.vcf is created.

I currently have a good workaround in my pipeline, which is to run consensus.py for the samples that are missing cluster_genotypes.vcf, so there's no rush on this. I just wanted to flag it for you to look into whenever you release the next version of souporcell, and in case anyone else comes up against this problem.

wheaton5 commented 4 years ago

Interesting. I have never seen this. Is it reproducible?

drneavin commented 4 years ago

Good question. Can you define reproducible in this context? Do you mean that when rerun from the start that it wouldn't create the cluster_genotypes.vcf for the same pools?

When I try to restart the souporcell pipeline for those pools, it prints the output for checking each step and then says "done". This happened for ~10 of our pools, but the other 65 pools completed without issue. Another potentially confounding factor is that I'm running this with Snakemake, but I'm not sure how that could be contributing to the issue, since I can see the souporcell output indicating that it is checking each step and is done.

Let me know if there are any additional tests that I can run to help identify what might be going on here.

wheaton5 commented 4 years ago

> Do you mean that when rerun from the start that it wouldn't create the cluster_genotypes.vcf for the same pools?

This is what I meant.

For those pools, is there any error message in stdout or stderr? I don't direct the output of the consensus step to a file; maybe I should do that. consensus.py is called under a subprocess.check_call, so it must be failing silently. The consensus.py step uses Stan, a domain-specific language for statistical models, which randomly initializes the free variables before running optimized gradient descent to find their MLE values under the model and data. This is the only part of souporcell that is unseeded random. I think it can sometimes initialize the free variables such that the log likelihood is -inf, even though I have tried to give it initialization ranges that shouldn't allow that. I guess I could check for that file and rerun that step until it generates properly.
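The check-and-rerun idea could be sketched roughly like this (a minimal sketch, not souporcell's actual code; `run_with_retry` and the file-existence check are hypothetical):

```python
import os
import subprocess

def run_with_retry(cmd, output_path, max_tries=3):
    """Re-run `cmd` until it produces `output_path`.

    Hypothetical helper: subprocess.check_call raises on a nonzero exit,
    but a step that exits 0 without writing its output would otherwise
    fail silently, so the output file is checked explicitly.
    """
    for attempt in range(1, max_tries + 1):
        subprocess.check_call(cmd)
        if os.path.exists(output_path):
            return attempt  # number of attempts it took
    raise RuntimeError(f"{output_path} still missing after {max_tries} tries")
```

With unseeded random initialization, a fresh attempt gets fresh initial values, so a retry has a real chance of succeeding even when the first run hit a bad start.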

Probably, if you deleted the consensus.done file and restarted, it would regenerate it, but that is about equivalent to your current workaround, though maybe a bit easier.
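Concretely, that would look something like the following (assuming the `.done` sentinel files sit at the top of the souporcell output directory; `demo_out` stands in for your real output directory):

```shell
out_dir=demo_out
mkdir -p "$out_dir" && touch "$out_dir/consensus.done"   # stand-in: pretend the step already ran
rm "$out_dir/consensus.done"                             # remove the sentinel for the consensus step
# re-launching the same souporcell_pipeline.py command should now redo only that step
test ! -e "$out_dir/consensus.done" && echo "sentinel removed"
```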

drneavin commented 4 years ago

Ok, I think I figured out the issue. I'm pretty sure it is Snakemake-related: I used cluster_genotypes.vcf as the test that the rule had finished, but at some point an error caused the jobs to be restarted, which deleted the file I was using as the readout that the job was done, i.e. cluster_genotypes.vcf.

But I now have another, unrelated question. Four of my pools resulted in a clusters.tsv file with no cells assigned to a given cluster. For example, in one pool there are 14 individuals and the cluster numbers span 0 to 13, but none of the droplets are assigned to cluster 2. This results in missing information for all the SNPs for cluster 2 in the cluster_genotypes.vcf. I reran those pools from scratch to see if the result was the same. I got the same result for three of the four pools, but one now had individuals assigned to all clusters. Have you come across this before? Any idea what might be going on?

wheaton5 commented 4 years ago

Ok nice that you figured that out.

As for the clustering problem, it might be that there were no cells from one individual, or very few, making the signal-to-noise too low to detect; or it could be getting stuck in a local optimum. The clustering problem begins to get very hard at around 10 individuals and gets harder the more you add. What is the median UMI/cell in your data? If it is low (under 2k for a smaller number of clusters, under 4k for more), there may not be enough information to cluster properly. Also, are any of your individuals related? That also makes clustering harder, as they are more genetically similar.

I'm not sure how you are running it right now, but sometimes using the --common_variants option with the 1k Genomes common variants file helps (assuming you are working with human data). This reduces the false-positive variants that might be confusing the clustering.
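For reference, an invocation with that option might look something like this (the file paths and `-k` value are placeholders for illustration; check `souporcell_pipeline.py --help` for the exact flags in your version):

```shell
souporcell_pipeline.py \
    -i possorted_genome_bam.bam \
    -b barcodes.tsv \
    -f reference.fasta \
    -t 8 \
    -o souporcell_out \
    -k 14 \
    --common_variants common_variants_grch38.vcf
```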

Sorry for the problems you have encountered. Hopefully we can get it working well.