wheaton5 / souporcell

Clustering scRNAseq by genotypes
MIT License
164 stars, 46 forks

pipeline struggling or not finishing on large genome/supertranscriptome, possible fixes suggested #140

Open joefranchesco opened 2 years ago

joefranchesco commented 2 years ago

Hello! We have been having an issue running souporcell with either a large genome or matching supertranscriptome. Our scripts for the full pipeline run for over 7 days and then run out of time on our cluster, or they run out of memory. I saw other issues posted that may relate, but I don't think the only issue is chromosome length, because the supertranscriptome has the same problems. There might be difficulties with samtools indexing and with minimap. I have three questions:

  1. Are there any plans to address this, including options to pass modifier flags for minimap2 through the pipeline?

  2. We have found a temporary workaround where I run the individual pieces of the pipeline separately up through freebayes, and then submit a script with --skip_remap and --common_variants {freebayes produced VCF}. The output from this appears to be accurate, but I am curious whether this is how you would approach the problem. Or do you have a different suggested workaround?

  3. When we submit the souporcell_pipeline.py script after running retag/minimap/freebayes, we were not clear if the bam file submitted should be our original mapped bam file used to feed into minimap etc, or whether it should be the minimap_tagged_sorted.bam file output from minimap2. I assume the former. I am running both simultaneously and will compare the outputs, but was hoping for your input.

Thanks for any help/advice!

wheaton5 commented 2 years ago

  1. Not sure of the best way to do this; will think about it.
  2. If you can do the manual install, you can just edit souporcell_pipeline.py with your minimap2 options.
  3. It should use the minimap2 BAM file.

joefranchesco commented 2 years ago

Wow that was a fast response. That's totally reasonable. I had been using the singularity image so I will try your suggestion from point 2. Thank you for the help!

joefranchesco commented 2 years ago

Just want to confirm: the freebayes-generated VCF file from our sample is acceptable to supply to souporcell with the --common_variants flag? After getting my data through minimap2 and freebayes, I'm rerunning souporcell with these flags: -i ${minimap sorted BAM} -b ${BarcodesDir}barcodes.tsv.gz --skip_remap SKIP_REMAP --common_variants ${freebayes produced VCF}
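
For reference, the rerun described in this comment could be assembled roughly as follows. This is only a sketch: all paths are hypothetical placeholders, the `-f`, `-t`, `-o`, and `-k` arguments are the usual pipeline inputs that the comment leaves out, and passing the literal value `True` to `--skip_remap` is an assumption about how the flag is meant to be filled in.

```python
import shlex


def build_rerun_command(bam, barcodes, fasta, out_dir, clusters, threads, vcf):
    """Assemble a souporcell_pipeline.py call that reuses an existing
    minimap2 BAM and a freebayes VCF instead of redoing those steps."""
    return [
        "souporcell_pipeline.py",
        "-i", bam,                    # minimap2-remapped, tagged, sorted BAM
        "-b", barcodes,               # cell barcodes file
        "-f", fasta,                  # reference the BAM was mapped against
        "-t", str(threads),
        "-o", out_dir,
        "-k", str(clusters),          # number of genotype clusters
        "--skip_remap", "True",       # assumption: flag takes a truthy value
        "--common_variants", vcf,     # freebayes VCF from the manual run
    ]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    cmd = build_rerun_command(
        bam="minimap_sorted.bam",
        barcodes="barcodes.tsv.gz",
        fasta="reference.fasta",
        out_dir="souporcell_out",
        clusters=4,
        threads=8,
        vcf="freebayes.vcf",
    )
    print(shlex.join(cmd))  # paste into a job script, or pass cmd to subprocess.run
```

Building the command as a list keeps paths with spaces intact if it is later handed to `subprocess.run`.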

wheaton5 commented 2 years ago

The pipeline is set up to automatically reuse previous partial runs if the associated .done file exists. So you might not need the --skip_remap or --common_variants flags if those parts were completed previously with the same output directory. But to answer your direct question: yes, it should be fine to give the freebayes output as common variants. It should work exactly the same.
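
Before relying on that reuse, it may help to see which stages already left a marker behind. The snippet below is a minimal sketch that simply lists `*.done` files; it assumes the markers sit directly in the output directory, and the marker names in the demo are made up rather than taken from the pipeline.

```python
import tempfile
from pathlib import Path


def completed_stages(out_dir):
    """Return the sorted names (without the .done suffix) of all
    marker files found directly inside out_dir."""
    return sorted(p.stem for p in Path(out_dir).glob("*.done"))


if __name__ == "__main__":
    # Demo on a throwaway directory with hypothetical marker names.
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("step_a.done", "step_b.done", "notes.txt"):
            (Path(tmp) / name).touch()
        print(completed_stages(tmp))
```

Pointing `completed_stages` at the real output directory shows which steps the pipeline would likely treat as finished on the next run with the same `-o` value.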

joefranchesco commented 2 years ago

Yeah I like that aspect of the pipeline, that has been quite helpful. I will try both options, but yes thanks for the direct answer and for your help!