wheaton5 / souporcell

Clustering scRNAseq by genotypes
MIT License
166 stars 46 forks source link

No junction reads in the minimap2 output #63

Open gokceneraslan opened 4 years ago

gokceneraslan commented 4 years ago

Hi,

Thanks for this super nice method and the pipeline. In the minimap2 bam outputs, I don't see any junction reads even though minimap2 is run with the -x splice parameter. Sashimi plots are all "empty". Any idea why minimap2 cannot make spliced alignments? Maybe using --junc-bed option would help a lot but even with the current setup there should be some junction reads in the bam file, given that there are a lot of spliced transcripts in scRNA-seq datasets. This might also improve the variant calling since it means more reads.

wheaton5 commented 4 years ago

Hey,

Yes, that is correct. minimap2 isn't designed to do spliced alignments on short reads. I tried to cobble together a mixture of short read and long spliced read parameters, but it didn't work. But I think you will find those reads aren't entirely gone, they will be soft clipped on one side and often (if there is enough unique sequence left) there will be another entry which is a secondary read which is hard clipped on the other side.

Alternatively you can use HISAT2 which will do a better job with this. The reason I don't use HISAT2 is it is GPL and I wanted to maintain MIT open source license. It is pretty trivial to change if you want to do that yourself.

Best, Haynes

gokceneraslan commented 4 years ago

Hi Haynes,

Thanks a lot for the quick reply! I see. Actually, given that I am no expert of any aligners, I would love to give HISAT2 a try and see if it improves variant calling at all. Do you mind sharing how you run HISAT2? Actually, if you write it in the README under the "Hard install" section, other people can also give it a try.

You are right that you cannot redistribute GPL licensed software within your singularity image if souporcell is overall MIT licensed, but you can still add an aligner option to the souporcell.py which uses HISAT2 if it's in the $PATH. That would be the easiest for those who want to try it and doesn't violate the license, I think.

Cheers.

wheaton5 commented 4 years ago

That is a good idea. Currently the command is in there just commented out. So lines 235-237. Okay, how about this. I can add that and make a new singularity build and then you can test it? That would save me a bit of time. As you can see, I have tried this out before, and it was ever so slightly better variant calling. I have NA12878 cell line run through 10x genomics (they have done that experiment so I just requested it). So then I have the ground truth variants from the genome in a bottle. Off the top of my head I think it was just a bit more sensitive with the same specificity due to splice sites where minimap would prefer to clip a few more bases off due to the SNP or indel. But the difference wasn't dramatic at all. I'll let you know when I have the singularity build for you.

edit: okay i threw it at a standard test (not the NA12878) but ill leave the test of the singularity build to you, it takes about an hour to build.

gokceneraslan commented 4 years ago

Oh I didn't know about the --aligner option in souporcell-pipeline.py (https://github.com/wheaton5/souporcell/blob/master/souporcell_pipeline.py#L31) at all. Btw, I'm not using the singularity build at all, I'm using a custom docker image (available at https://github.com/klarman-cell-observatory/cumulus/blob/master/docker/demultiplexing/souporcell/2020.03/Dockerfile). So I can just add hisat2 there and tweak the --aligner option.

wheaton5 commented 4 years ago

Oh I didn't know about the --aligner option in souporcell-pipeline.py Lol, that is because I just added it per your suggestion

gokceneraslan commented 4 years ago

Thanks for everything. So the final questions:

wheaton5 commented 4 years ago

Do you think it makes a big difference to use the "genome_snp_tran" HISAT2 index compared to the "genome_tran"? I don't know. I haven't tested genome_snp_tran indexing. If you are very interested in snp accuracy in single cell RNAseq, it is also helpful to ignore common RNA editing sites. I have a vcf of these and a vcf of common variants with those locations removed. This also only makes a small difference.

2nd point you should index your fasta file using the base name without the extension. I actually just ran into the .fasta vs .fa issue so I'll fix that.

I'm testing this stuff now and I'll update here when it is working.

gokceneraslan commented 4 years ago

OK thanks. I was just considering existing hisat2 index files.

wheaton5 commented 4 years ago

I was just considering existing hisat2 index files. I don't understand this. Is it not okay to assume the index files are going to start with the fasta file name minus its extension?

gokceneraslan commented 4 years ago

The filenames of the existing index files are grch38/genome.1.ht2, grch38/genome.2.ht2 etc., the -x parameter should be grch38/genome.

gokceneraslan commented 4 years ago

One idea is to make -f parameter of souporcell equal to the -x parameter of hisat2 if --aligner is HISAT2.