wenweixiong / MARVEL

a big difference in cell number between cellranger and STARsolo #34

Open MengyuanLyu opened 8 months ago

MengyuanLyu commented 8 months ago

Hello,

Thanks for this package. I have a few questions about the data processing. For the droplet-based method, why do we need to run cellranger to obtain BAM files and then run STARsolo? Can we process the raw sequencing files directly with STARsolo, without cellranger? For my data, I ran cellranger to obtain the BAM files and then ran STARsolo. Comparing the number of cells reported by each tool, I found that STARsolo outputs far fewer cells in 'solo.out/gene/filtered': only 10%-20% of the number in the 'out/filtered' folder from cellranger. Is such a large difference in cell number expected?

This is my code:

    cellranger count --id "$ID" \
        --transcriptome /refdata-gex-GRCh38-2020-A \
        --fastqs "$XXX" \
        --sample "$ID" \
        --nosecondary;

    STAR --runThreadN 16 \
        --genomeDir /refdata-gex-GRCh38-2020-A \
        --soloType CB_UMI_Simple \
        --readFilesIn possorted_genome_bam.bam \
        --readFilesCommand samtools view -F 0x100 \
        --readFilesType SAM SE \
        --soloInputSAMattrBarcodeSeq CR UR \
        --soloInputSAMattrBarcodeQual CY UY \
        --soloFeatures Gene SJ \
        --soloCBwhitelist 737K-august-2016.txt \
        --outSAMtype BAM Unsorted \
        --soloBarcodeReadLength 0

Thanks a lot.

wenweixiong commented 8 months ago

Yes, you may run STARsolo directly on the raw FASTQ files instead of on cellranger's BAM file. However, the algorithms for detecting cell barcodes and UMIs differ slightly between STARsolo and cellranger. It is therefore advisable to run STARsolo on cellranger's BAM file, because the cell barcodes and UMIs have already been detected and quantified by cellranger.

One more advantage of running STARsolo on cellranger's BAM file is that the list of cell barcodes will be the same for the splice junction analysis (list of cell barcodes returned by STARsolo) and for the gene expression analysis (list of cell barcodes returned by cellranger). This enables integration of the splice junction and gene expression analyses downstream.

You may look into the raw/un-filtered output folder, instead of the filtered output folder, of STARsolo to retrieve the complete list of cells. Specifically, you may match the list of cell barcodes returned by STARsolo to that of cellranger. The former should overlap completely with the latter.
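The barcode matching described above can be sketched in shell. This is only an illustration, not part of MARVEL: the function names are hypothetical, the paths in the usage note are assumed defaults, and the suffix stripping assumes that cellranger appends a GEM-well suffix such as "-1" to each barcode while STARsolo does not.

```shell
# Print one barcode per line, sorted and de-duplicated.
# Accepts plain or gzipped files and strips a trailing "-<number>" suffix
# (assumption: cellranger writes e.g. "AAAC...-1", STARsolo writes "AAAC...").
normalize_barcodes() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac | sed 's/-[0-9][0-9]*$//' | LC_ALL=C sort -u
}

# Report how many barcodes are shared or unique to each list.
# $1 = cellranger barcode file, $2 = STARsolo barcode file
compare_barcodes() (
    tmpdir=$(mktemp -d)
    normalize_barcodes "$1" > "$tmpdir/cellranger.txt"
    normalize_barcodes "$2" > "$tmpdir/starsolo.txt"
    # comm needs both inputs sorted in the same locale as it runs in
    shared=$(LC_ALL=C comm -12 "$tmpdir/cellranger.txt" "$tmpdir/starsolo.txt" | wc -l | tr -d ' ')
    only_cr=$(LC_ALL=C comm -23 "$tmpdir/cellranger.txt" "$tmpdir/starsolo.txt" | wc -l | tr -d ' ')
    only_ss=$(LC_ALL=C comm -13 "$tmpdir/cellranger.txt" "$tmpdir/starsolo.txt" | wc -l | tr -d ' ')
    rm -rf "$tmpdir"
    printf 'shared=%s only_cellranger=%s only_starsolo=%s\n' "$shared" "$only_cr" "$only_ss"
)
```

A call might look like `compare_barcodes outs/filtered_feature_bc_matrix/barcodes.tsv.gz Solo.out/SJ/raw/barcodes.tsv`, with the paths adjusted to your own run. If the overlap is complete, `only_cellranger` should be 0.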

MengyuanLyu commented 8 months ago

Thanks for your reply. The cell list of STARsolo does overlap with the list from cellranger. However, we are very curious why so many cells were lost after STARsolo filtering. We extracted the cells that appear only in cellranger's cell list, matched them to the SJ count matrix, and found that these cells do have some SJ reads.

wenweixiong commented 8 months ago

You may look into the STARsolo output folder named "raw". This folder should contain the complete list of cell barcodes, including the ones filtered out.
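To check how many counts each barcode in the "raw" folder carries, including barcodes that were filtered out, one option is to sum the matrix per barcode. The sketch below is a hypothetical helper, not a STARsolo or MARVEL tool. It assumes the folder holds `barcodes.tsv` and a `matrix.mtx` in MatrixMarket coordinate format, where each data line is `feature_index barcode_index count`.

```shell
# Print "<barcode><TAB><total count>" for every barcode with at least one count.
# $1 = directory containing barcodes.tsv and matrix.mtx (assumed file names)
barcode_totals() {
    awk '
        FNR == NR   { barcode[FNR] = $1; next }  # first file: line number = barcode index
        /^%/        { next }                     # skip MatrixMarket comment lines
        !seen_dims  { seen_dims = 1; next }      # skip the "rows cols entries" line
                    { total[$2] += $3 }          # column 2 = barcode index, column 3 = count
        END         { for (i in total) printf "%s\t%d\n", barcode[i], total[i] }
    ' "$1/barcodes.tsv" "$1/matrix.mtx" | LC_ALL=C sort
}
```

For example, `barcode_totals Solo.out/SJ/raw > sj_totals.tsv` (path assumed) gives a table that can be checked against the barcodes kept by cellranger, to see how many splice junction counts the dropped cells have.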

Manoswini-02 commented 7 months ago

Can we pass the output from Cell Ranger directly to MARVEL, since the format and output files are similar (as mentioned on the STARsolo web page)?

wenweixiong commented 7 months ago

The output of cellranger is used by SingleCellaR to generate the normalised gene expression matrix. This matrix is used as input for MARVEL for downstream analysis (please see "Normalised gene expression" section under the "Input files" section of the tutorial: https://wenweixiong.github.io/MARVEL_Droplet.html)

STARsolo is used to obtain the raw gene count matrix and splice junction count matrix. These two matrices are used as inputs for MARVEL for computing percent spliced-in (PSI) values and other analyses downstream (please see "Gene counts" and "Splice junction counts" sections under the "Input files" section of the tutorial: https://wenweixiong.github.io/MARVEL_Droplet.html).

emmafjones commented 6 months ago

Just adding here, STARsolo has some additional arguments that make its cell filtering much closer to CellRanger. These options are --soloFeatures GeneFull_Ex50pAS SJ, --soloCellFilter EmptyDrops_CR, --soloUMIlen 12, --clipAdapterType CellRanger4, --outFilterScoreMin 30, --soloCBmatchWLtype 1MM_multi_Nbase_pseudocounts, --soloUMIfiltering MultiGeneUMI_CR, --soloUMIdedup 1MM_CR. For more information, you can refer to the STARsolo docs https://github.com/alexdobin/STAR/blob/master/docs/STARsolo.md
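As a rough sketch, these options could be combined with the BAM-input command from the first post as shown below. The wrapper function and its `DRY_RUN` switch are hypothetical and the combination has not been tested here, so please check each flag against the STARsolo docs for your STAR version. Note that `--soloUMIlen 12` corresponds to 10x v3 chemistry, while the `737K-august-2016.txt` whitelist is normally used with v2 libraries and their 10 bp UMIs, so the UMI length is left as a parameter.

```shell
# Build a CellRanger-like STARsolo command for a cellranger BAM.
# Prints the command by default; set DRY_RUN=0 to actually run it.
# $1 = STAR genome index, $2 = cellranger BAM, $3 = barcode whitelist, $4 = UMI length
starsolo_cr_like() {
    genome_dir=$1
    bam=$2
    whitelist=$3
    umi_len=$4
    set -- STAR \
        --runThreadN 16 \
        --genomeDir "$genome_dir" \
        --readFilesIn "$bam" \
        --readFilesCommand samtools view -F 0x100 \
        --readFilesType SAM SE \
        --soloType CB_UMI_Simple \
        --soloInputSAMattrBarcodeSeq CR UR \
        --soloInputSAMattrBarcodeQual CY UY \
        --soloBarcodeReadLength 0 \
        --soloCBwhitelist "$whitelist" \
        --soloUMIlen "$umi_len" \
        --soloFeatures GeneFull_Ex50pAS SJ \
        --soloCellFilter EmptyDrops_CR \
        --clipAdapterType CellRanger4 \
        --outFilterScoreMin 30 \
        --soloCBmatchWLtype 1MM_multi_Nbase_pseudocounts \
        --soloUMIfiltering MultiGeneUMI_CR \
        --soloUMIdedup 1MM_CR \
        --outSAMtype BAM Unsorted
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"      # show the command without running STAR
    else
        "$@"           # run STAR with the arguments built above
    fi
}
```

Calling `starsolo_cr_like /path/to/star_index possorted_genome_bam.bam 737K-august-2016.txt 10` prints the command for inspection. The paths and the UMI length of 10 are placeholders for your own data.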

Gabrisfon commented 3 months ago

Hi guys, my name is Gabriel. I am trying to use this same pipeline to process my data and I have been facing the same problem. I thought it was strange, but I continued with the analysis. However, after creating the MARVEL object, I have been getting various errors when using the PlotPctExprCells.Genes.10x function and in the subsequent analysis. I believe this is related to the very low number of cells that I still get, even after using the parameters mentioned above and others. Has anyone managed to solve this problem? If you have, I would be very happy if you could share it with me.

Thanks in advance!