wheaton5 / souporcell

Clustering scRNAseq by genotypes
MIT License
168 stars 46 forks source link

running souporcell in terminal #49

Open brekk129 opened 4 years ago

brekk129 commented 4 years ago

hello wheaton5 -

I am trying to use your program to cluster a single cell RNA seq sample that should have cells from two genotypes. I am running it on my institute's supercomputing machine (remotely) using terminal on my mac. I'm getting this error - do you know what could be going on? do I need to run this within python in my terminal?

I used this form after installing singularity - singularity exec /path/to/souporcell.sif souporcell_pipeline.py -i /path/to/possorted_genome_bam.bam -b /path/to/barcodes.tsv -f /path/to/reference.fasta -t num_threads_to_use -o output_dir_name -k num_clusters

I used num_threads = 8 (how do you choose this?) num_clusters = 2 (should be two "genotypes")

checking modules imports done checking bam for expected tags Traceback (most recent call last): File "/opt/souporcell/souporcell_pipeline.py", line 59, in for (index, line) in enumerate(barcodes): File "/usr/local/envs/py36/lib/python3.6/encodings/ascii.py", line 26, in decode return codecs.ascii_decode(input, self.errors)[0] UnicodeDecodeError: 'ascii' codec can't decode byte 0x8b in position 1: ordinal not in range(128)

thanks I would appreciate any help! I'm an immunologist trying to run my own code. could also email me - jriedl@umn.edu

wheaton5 commented 4 years ago

I probably know the issue, but not 100%.

Singularity only mounts the directories downstream of the working directory from which you run it. So if if you are running from /this/path and you are pointing at files in /this/otherpath/... or /otherpath/... it will not see that file. It will only see files in /this/path/. or /this/path/whatever/...

Also I don't know your cluster situation, but most scientific computing clusters will have head nodes which you are login to and then worker nodes on which large jobs are to be run. In order to run a large job such as souporcell you need to submit it to a cluster job scheduler such as SGE or LSF or SLURM. You may want to ask your system administrators/help desk about your particular system. Depending on your cluster settings you might be able to get away with running it on a head node but to reduce memory usage (usually limited on head nodes) and threads (also commonly limited on head nodes) you could run with --skip_remap True --common_variants <1kgenomes common variants linked in my readme> and --num_threads 4

As to your question of how I choose num threads, I normally choose 8, more will be faster but as not every stage is able to use all threads available, there will be some amount of diminishing returns with higher threads. Also in a job scheduler system you are gonna get an 8 thread job pretty fast but might have to wait a long time to even start a 32 thread job.

I hope this is helpful and I'm happy to answer any more question.

Best, Haynes

brekk129 commented 4 years ago

thanks so much for the quick reply!!! I really appreciate it.

I will try the skip_remap and I know how to set up for my supercomputing institute's cluster job scheduler.

so for my other problem with the directories - I will try to simply just set up all the files for souporcell and single cell bam files in the same directory.

brekk129 commented 4 years ago

hello! I think i've almost gotten this to work but keep seeing the same error when I run the code with skip remap and common variants. I've been trying to figure it out but would love any help! I'm sorry if this is a dumb question.

running vartrix Traceback (most recent call last): File "/opt/souporcell/souporcell_pipeline.py", line 556, in vartrix(args, final_vcf, bam) File "/opt/souporcell/souporcell_pipeline.py", line 481, in vartrix subprocess.check_call(cmd, stdout = out, stderr = err) File "/usr/local/envs/py36/lib/python3.6/subprocess.py", line 311, in check_call raise CalledProcessError(retcode, cmd) subprocess.CalledProcessError: Command '['vartrix', '--mapq', '30', '-b', '/panfs/roc/groups/1/tolarj/jriedl/.singularity/souporcell_minimap_tagged_sorted.bam', '-c', '/panfs/roc/groups/1/tolarj/jriedl/.singularity/barcodes.tsv', '--scoring-method', 'coverage', '--threads', '8', '--ref-matrix', '/panfs/roc/groups/1/tolarj/jriedl/.singularity/ref.mtx', '--out-matrix', '/panfs/roc/groups/1/tolarj/jriedl/.singularity/alt.mtx', '-v', '/panfs/roc/groups/1/tolarj/jriedl/.singularity/common_variants_covered.vcf', '--fasta', '/panfs/roc/groups/1/tolarj/jriedl/.singularity/GRCh38_latest_genomic.fna', '--umi']' returned non-zero exit status 1.

I'm using this file GRCh38_latest_genomic.fna for the ref fasta from https://www.ncbi.nlm.nih.gov/genome/guide/human/.

the error file says (vartrix.err): [W::vcf_parse] Contig '1' is not defined in the header. (Quick workaround: index the file with tabix.) [W::vcf_parse] Contig '2' is not defined in the header. (Quick workaround: index the file with tabix.) [W::vcf_parse] Contig '3' is not defined in the header. (Quick workaround: index the file with tabix.) [W::vcf_parse] Contig '4' is not defined in the header. (Quick workaround: index the file with tabix.) [W::vcf_parse] Contig '5' is not defined in the header. (Quick workaround: index the file with tabix.) [W::vcf_parse] Contig '6' is not defined in the header. (Quick workaround: index the file with tabix.) [W::vcf_parse] Contig '7' is not defined in the header. (Quick workaround: index the file with tabix.) [W::vcf_parse] Contig '8' is not defined in the header. (Quick workaround: index the file with tabix.) [W::vcf_parse] Contig '9' is not defined in the header. (Quick workaround: index the file with tabix.) [W::vcf_parse] Contig '10' is not defined in the header. (Quick workaround: index the file with tabix.) [W::vcf_parse] Contig '11' is not defined in the header. (Quick workaround: index the file with tabix.) [W::vcf_parse] Contig '12' is not defined in the header. (Quick workaround: index the file with tabix.) [W::vcf_parse] Contig '13' is not defined in the header. (Quick workaround: index the file with tabix.) [W::vcf_parse] Contig '14' is not defined in the header. (Quick workaround: index the file with tabix.) [W::vcf_parse] Contig '15' is not defined in the header. (Quick workaround: index the file with tabix.) [W::vcf_parse] Contig '16' is not defined in the header. (Quick workaround: index the file with tabix.) [W::vcf_parse] Contig '17' is not defined in the header. (Quick workaround: index the file with tabix.) [W::vcf_parse] Contig '18' is not defined in the header. (Quick workaround: index the file with tabix.) [W::vcf_parse] Contig '19' is not defined in the header. (Quick workaround: index the file with tabix.) [W::vcf_parse] Contig '20' is not defined in the header. (Quick workaround: index the file with tabix.) [W::vcf_parse] Contig '21' is not defined in the header. (Quick workaround: index the file with tabix.) [W::vcf_parse] Contig '22' is not defined in the header. (Quick workaround: index the file with tabix.) [W::vcf_parse] Contig 'X' is not defined in the header. (Quick workaround: index the file with tabix.) [W::vcf_parse] Contig '1' is not defined in the header. (Quick workaround: index the file with tabix.) [W::vcf_parse] Contig '2' is not defined in the header. (Quick workaround: index the file with tabix.) [W::vcf_parse] Contig '3' is not defined in the header. (Quick workaround: index the file with tabix.) [W::vcf_parse] Contig '4' is not defined in the header. (Quick workaround: index the file with tabix.) [W::vcf_parse] Contig '5' is not defined in the header. (Quick workaround: index the file with tabix.) [W::vcf_parse] Contig '6' is not defined in the header. (Quick workaround: index the file with tabix.) [W::vcf_parse] Contig '7' is not defined in the header. (Quick workaround: index the file with tabix.) [W::vcf_parse] Contig '8' is not defined in the header. (Quick workaround: index the file with tabix.) [W::vcf_parse] Contig '9' is not defined in the header. (Quick workaround: index the file with tabix.) [W::vcf_parse] Contig '10' is not defined in the header. (Quick workaround: index the file with tabix.) [W::vcf_parse] Contig '11' is not defined in the header. (Quick workaround: index the file with tabix.) [W::vcf_parse] Contig '12' is not defined in the header. (Quick workaround: index the file with tabix.) [W::vcf_parse] Contig '13' is not defined in the header. (Quick workaround: index the file with tabix.) [W::vcf_parse] Contig '14' is not defined in the header. (Quick workaround: index the file with tabix.) [W::vcf_parse] Contig '15' is not defined in the header. (Quick workaround: index the file with tabix.) [W::vcf_parse] Contig '16' is not defined in the header. (Quick workaround: index the file with tabix.) [W::vcf_parse] Contig '17' is not defined in the header. (Quick workaround: index the file with tabix.) [W::vcf_parse] Contig '18' is not defined in the header. (Quick workaround: index the file with tabix.) [W::vcf_parse] Contig '19' is not defined in the header. (Quick workaround: index the file with tabix.) [W::vcf_parse] Contig '20' is not defined in the header. (Quick workaround: index the file with tabix.) [W::vcf_parse] Contig '21' is not defined in the header. (Quick workaround: index the file with tabix.) [W::vcf_parse] Contig '22' is not defined in the header. (Quick workaround: index the file with tabix.) [W::vcf_parse] Contig 'X' is not defined in the header. (Quick workaround: index the file with tabix.) 17:42:23 [ERROR] Sequence 1 not seen in FASTA

do you think this is an issue with the fasta I chose or with my data files?

thanks! I really appreciate the error files as well - so helpful!!! Was already able to work around a few problems.

wheaton5 commented 4 years ago

Check your bam vs your fasta (also vs the common_variants vcf). You either may be on a different reference (hg19 vs GRCh38) or a different contig naming scheme (chr1 vs 1).

You can also remove the --skip_remap and the --common_variants and it will remap to the reference you choose and call variants from that bam so everything should be on the same fasta at that point. I've had this problem accidentally many times because I mismatched the bam reference and the fasta or the common variants vcf.

brekk129 commented 4 years ago

yes!! Thank you for that, you were totally right. I changed to a .fa file and it worked!! So exciting.

This is so helpful for my work. I really really appreciate your willingness to answer questions!

Thanks again and take care.