Closed andreanuzzo closed 2 years ago
Hi,
Did you try cellranger with the custom reference? You can use full sequences as described in the section, Add a marker gene to the FASTA and GTF. https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/tutorial_mr
I also have experienced that most viral reads were from ambient RNA from empty droplets rather than droplets allocated as cells.
Thank you for your comments.
I managed to build the custom reference with cellranger
(I can send the commands if needed). I proceeded as follows:
VIRTUS2 SE
on the *_2.fastq
fileVIRTUS.output.tsv
fileGTF
data from the corresponding accession in the NCBI Assembly
database and the corresponding FASTA
genome from NCBI nucleotide
cellranger
custom reference as described*_2.fastq
and using *_1.fastq
as index file (with the appropriate UMI length, etc.)scanpy
Now, the problem is: why doesn't this workflow detect the single genes for each viruses but just detects the whole genomes? For example, my cellranger custom reference GTF is formatted as follows:
viruses_NC_001806.2 RefSeq gene 1 7569 . - . gene_id "viruses_HHV1gp00s01"; transcript_id "viruses_"; db_xref "GeneID:24271498"; gbkey "Gene"; gene "LAT"; gene_biotype "ncRNA"; locus_tag "HHV1gp00s01"; partial "true";
viruses_NC_001806.2 RefSeq transcript 1 7569 . - . gene_id "viruses_HHV1gp00s01"; transcript_id "viruses_unassigned_transcript_1"; gbkey "ncRNA"; gene "LAT"; locus_tag "HHV1gp00s01"; partial "true"; product "LAT"; transcript_biotype "ncRNA";
viruses_NC_001806.2 RefSeq exon 6910 7569 . - . gene_id "viruses_HHV1gp00s01"; transcript_id "viruses_unassigned_transcript_1"; gene "LAT"; locus_tag "HHV1gp00s01"; partial "true"; product "LAT"; transcript_biotype "ncRNA"; exon_number "1";
viruses_NC_001806.2 RefSeq exon 1 4953 . - . gene_id "viruses_HHV1gp00s01"; transcript_id "viruses_unassigned_transcript_1"; gene "LAT"; locus_tag "HHV1gp00s01"; partial "true"; product "LAT"; transcript_biotype "ncRNA"; exon_number "2";
viruses_NC_001806.2 RefSeq gene 513 1540 . + . gene_id "viruses_HHV1gp00p77"; transcript_id "viruses_"; db_xref "GeneID:2703395"; gbkey "Gene"; gene "RL1"; gene_biotype "protein_coding"; locus_tag "HHV1gp00p77";
viruses_NC_001806.2 RefSeq exon 513 1256 . + 0 gene_id "viruses_HHV1gp00p77"; transcript_id "viruses_unassigned_transcript_2"; gbkey "exon"; gene "RL1"; locus_tag "HHV1gp00p77"; product "neurovirulence protein ICP34.5"; protein_id "YP_009137073.1"; exon_number "1";
viruses_NC_001806.2 RefSeq start_codon 513 515 . + 0 gene_id "viruses_HHV1gp00p77"; transcript_id "viruses_unassigned_transcript_2"; gbkey "exon"; gene "RL1"; locus_tag "HHV1gp00p77"; product "neurovirulence protein ICP34.5"; protein_id "YP_009137073.1"; exon_number "1";
viruses_NC_001806.2 RefSeq stop_codon 1257 1259 . + 0 gene_id "viruses_HHV1gp00p77"; transcript_id "viruses_unassigned_transcript_2"; gbkey "exon"; gene "RL1"; locus_tag "HHV1gp00p77"; product "neurovirulence protein ICP34.5"; protein_id "YP_009137073.1"; exon_number "1";
viruses_NC_001806.2 RefSeq gene 2087 5699 . + . gene_id "viruses_HHV1gp00p76"; transcript_id "viruses_"; db_xref "GeneID:2703389"; gbkey "Gene"; gene "RL2"; gene_biotype "protein_coding"; locus_tag "HHV1gp00p76"; note "immediate early gene";
viruses_NC_001806.2 RefSeq transcript 2114 5699 . + . gene_id "viruses_HHV1gp00p76"; transcript_id "viruses_unassigned_transcript_3"; gbkey "mRNA"; gene "RL2"; locus_tag "HHV1gp00p76"; product "ubiquitin E3 ligase ICP0"; transcript_biotype "mRNA";
viruses_NC_001806.2 RefSeq exon 2114 2318 . + . gene_id "viruses_HHV1gp00p76"; transcript_id "viruses_unassigned_transcript_3"; gene "RL2"; locus_tag "HHV1gp00p76"; product "ubiquitin E3 ligase ICP0"; transcript_biotype "mRNA"; exon_number "1";
viruses_NC_001806.2 RefSeq exon 3084 3750 . + . gene_id "viruses_HHV1gp00p76"; transcript_id "viruses_unassigned_transcript_3"; gene "RL2"; locus_tag "HHV1gp00p76"; product "ubiquitin E3 ligase ICP0"; transcript_biotype "mRNA"; exon_number "2";
viruses_NC_001806.2 RefSeq exon 3887 5699 . + . gene_id "viruses_HHV1gp00p76"; transcript_id "viruses_unassigned_transcript_3"; gene "RL2"; locus_tag "HHV1gp00p76"; product "ubiquitin E3 ligase ICP0"; transcript_biotype "mRNA"; exon_number "3";
viruses_NC_001806.2 RefSeq exon 2262 2318 . + 0 gene_id "viruses_HHV1gp00p76"; transcript_id "viruses_unassigned_transcript_3"; gbkey "exon"; gene "RL2"; locus_tag "HHV1gp00p76"; product "ubiquitin E3 ligase ICP0"; protein_id "YP_009137074.1"; exon_number "1";
viruses_NC_001806.2 RefSeq exon 3084 3750 . + 0 gene_id "viruses_HHV1gp00p76"; transcript_id "viruses_unassigned_transcript_3"; gbkey "exon"; gene "RL2"; locus_tag "HHV1gp00p76"; product "ubiquitin E3 ligase ICP0"; protein_id "YP_009137074.1"; exon_number "2";
viruses_NC_001806.2 RefSeq exon 3887 5487 . + 2 gene_id "viruses_HHV1gp00p76"; transcript_id "viruses_unassigned_transcript_3"; gbkey "exon"; gene "RL2"; locus_tag "HHV1gp00p76"; product "ubiquitin E3 ligase ICP0"; protein_id "YP_009137074.1"; exon_number "3";
viruses_NC_001806.2 RefSeq start_codon 2262 2264 . + 0 gene_id "viruses_HHV1gp00p76"; transcript_id "viruses_unassigned_transcript_3"; gbkey "exon"; gene "RL2"; locus_tag "HHV1gp00p76"; product "ubiquitin E3 ligase ICP0"; protein_id "YP_009137074.1"; exon_number "1";
viruses_
is my delimiter for viral genes and GRCH38_
is for human genes. I have changed CDS to exon, otherwise cellranger wouldn't let me proceed. However, after running STAR Solo, the only genes that are recognised are:
'viruses_NC_001499.1',
'viruses_NC_015521.1',
'viruses_NC_004102.1',
'viruses_NC_009823.1',
'viruses_NC_009827.1',
'viruses_NC_002645.1',
'viruses_NC_001806.1',
'viruses_NC_001798.1',
'viruses_NC_009333.1',
'viruses_NC_018711.1',
'viruses_NC_001552.1',
'viruses_NC_010708.1',
'viruses_NC_001672.1',
'viruses_NC_006998.1',
'viruses_gi|9627396|lcl|HPV9REF.1|',
'viruses_gi|396940|lcl|HPV19REF.1|',
'viruses_gi|1020210|lcl|HPV29REF.1|',
'viruses_gi|12084981|lcl|HPV71REF.1|',
'viruses_gi|256807731|lcl|HPV118REF.1|'
Any idea?
I think that your gtf was not correct. The attributes look different from my gtf form for the reference.
chrX HAVANA gene 49250436 49270477 . - . gene_id "ENSG00000049768"; gene_version "15"; gene_type "protein_coding"; gene_name "FOXP3"; level 2; hgnc_id "HGNC:6106"; tag "overlapping_locus"; havana_gene "OTTHUMG00000024135.8";
chrX HAVANA transcript 49250436 49264924 . - . gene_id "ENSG00000049768"; gene_version "15"; transcript_id "ENST00000376207"; transcript_version "9"; gene_type "protein_coding"; gene_name "FOXP3"; transcript_type "protein_coding"; transcript_name "FOXP3-203"; level 2; protein_id "ENSP00000365380.4"; transcript_support_level "1"; hgnc_id "HGNC:6106"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS14323.1"; havana_gene "OTTHUMG00000024135.8"; havana_transcript "OTTHUMT00000060814.2";
chrX HAVANA exon 49264661 49264924 . - . gene_id "ENSG00000049768"; gene_version "15"; transcript_id "ENST00000376207"; transcript_version "9"; gene_type "protein_coding"; gene_name "FOXP3"; transcript_type "protein_coding"; transcript_name "FOXP3-203"; exon_number 1; exon_id "ENSE00003849333"; exon_version "1"; level 2; protein_id "ENSP00000365380.4"; transcript_support_level "1"; hgnc_id "HGNC:6106"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS14323.1"; havana_gene "OTTHUMG00000024135.8"; havana_transcript "OTTHUMT00000060814.2";
chrX HAVANA exon 49258296 49258527 . - . gene_id "ENSG00000049768"; gene_version "15"; transcript_id "ENST00000376207"; transcript_version "9"; gene_type "protein_coding"; gene_name "FOXP3"; transcript_type "protein_coding"; transcript_name "FOXP3-203"; exon_number 2; exon_id "ENSE00001316456"; exon_version "2"; level 2; protein_id "ENSP00000365380.4"; transcript_support_level "1"; hgnc_id "HGNC:6106"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS14323.1"; havana_gene "OTTHUMG00000024135.8"; havana_transcript "OTTHUMT00000060814.2";
chrX HAVANA CDS 49258296 49258505 . - 0 gene_id "ENSG00000049768"; gene_version "15"; transcript_id "ENST00000376207"; transcript_version "9"; gene_type "protein_coding"; gene_name "FOXP3"; transcript_type "protein_coding"; transcript_name "FOXP3-203"; exon_number 2; exon_id "ENSE00001316456"; exon_version "2"; level 2; protein_id "ENSP00000365380.4"; transcript_support_level "1"; hgnc_id "HGNC:6106"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS14323.1"; havana_gene "OTTHUMG00000024135.8"; havana_transcript "OTTHUMT00000060814.2";
By the way, I actually prefer taking the whole viral genome as the gene to quantifying each gene for scRNAseq to maximize the detection of viral reads. PRJNA608742 (COVID19 BALF) can be used as the confirmation of the workflow. For your information, I'm attaching my workflow for the dataset.
>SARS-CoV2
ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAA
CGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAAC
TAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTG
TTGCAGCCGATCATCAGCACATCTAGGTTTCGTCCGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTC
CCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTGCCTGTTTTACAGGTTCGCGACGTGCTCGTAC
GTGGCTTTGGAGACTCCGTGGAGGAGGTCTTATCAGAGGCACGTCAACATCTTAAAGATGGCACTTGTGG
CTTAGTAGAAGTTGAAAAAGGCGTTTTGCCTCAACTTGAACAGCCCTATGTGTTCATCAAACGTTCGGAT
GCTCGAACTGCACCTCATGGTCATGTTATGGTTGAGCTGGTAGCAGAACTCGAAGGCATTCAGTACGGTC
GTAGTGGTGAGACACTTGGTGTCCTTGTCCCTCATGTGGGCGAAATACCAGTGGCTTACCGCAAGGTTCT
TCTTCGTAAGAACGGTAATAAAGGAGCTGGTGGCCATAGTTACGGCGCCGATCTAAAGTCATTTGACTTA
GGCGACGAGCTTGGCACTGATCCTTATGAAGATTTTCAAGAAAACTGGAACACTAAACATAGCAGTGGTG
TTACCCGTGAACTCATGCGTGAGCTTAACGGAGGGGCATACACTCGCTATGTCGATAACAACTTCTGTGG
CCCTGATGGCTACCCTCTTGAGTGCATTAAAGACCTTCTAGCACGTGCTGGTAAAGCTTCATGCACTTTG
TCCGAACAACTGGACTTTATTGACACTAAGAGGGGTGTATACTGCTGCCGTGAACATGAGCATGAAATTG
CTTGGTACACGGAACGTTCTGAAAAGAGCTATGAATTGCAGACACCTTTTGAAATTAAATTGGCAAAGAA
ATTTGACACCTTCAATGGGGAATGTCCAAATTTTGTATTTCCCTTAAATTCCATAATCAAGACTATTCAA
CCAAGGGTTGAAAAGAAAAAGCTTGATGGCTTTATGGGTAGAATTCGATCTGTCTATCCAGTTGCGTCAC
CAAATGAATGCAACCAAATGTGCCTTTCAACTCTCATGAAGTGTGATCATTGTGGTGAAACTTCATGGCA
GACGGGCGATTTTGTTAAAGCCACTTGCGAATTTTGTGGCACTGAGAATTTGACTAAAGAAGGTGCCACT
ACTTGTGGTTACTTACCCCAAAATGCTGTTGTTAAAATTTATTGTCCAGCATGTCACAATTCAGAAGTAG
GACCTGAGCATAGTCTTGCCGAATACCATAATGAATCTGGCTTGAAAACCATTCTTCGTAAGGGTGGTCG
CACTATTGCCTTTGGAGGCTGTGTGTTCTCTTATGTTGGTTGCCATAACAAGTGTGCCTATTGGGTTCCA
...
SARS-CoV2 unknown exon 1 29903 . + . gene_id "SARS-CoV2"; transcript_id "SARS-CoV2"; gene_name "SARS-CoV2"; gene_biotype "protein_coding";
made a reference
cellranger mkref --genome=hg38_virus \
--fasta=genome.fa \
--genes=genes.gtf
quantification by cellranger
for SRR in `cat SRR_Acc_List.txt`
do
echo processing ${SRR} ...
if [ ! -f ${SRR}_cr/outs.web_summary.html ]; then
# fasterq-dump
mkdir -p SRR11181954
cp ${SRR}_1.fastq.gz ${SRR}/${SRR}_S1_L001_R1_001.fastq.gz
cp ${SRR}_2.fastq.gz ${SRR}/${SRR}_S1_L001_R2_001.fastq.gz
cellranger count --id=${SRR}_cr \ --fastqs=/home/yyasumizu/NGS_public/PRJNA608742_COVID_BALF_scRNAseq/${SRR} \ --sample=${SRR} \ --transcriptome=/home/yyasumizu/NGS_public/PRJNA608742_COVID_BALF_scRNAseq/cellranger_ref/hg38_virus fi done
5. downstream analysis by seurat
<img width="404" alt="image" src="https://user-images.githubusercontent.com/19543497/190951107-8bb118cb-1e36-42c3-a27a-980067f699e7.png">
<img width="376" alt="image" src="https://user-images.githubusercontent.com/19543497/190951115-27851f7f-1111-4910-ae39-c330a0e56ba8.png">
Hi Yoshi,
The file you show is the original GRCH reference, I was showing the merged reference after the cellranger mkref
command. (I used the gencode.v41.annotation.gtf
file downloaded from NCBI, not sure if it's the same reference you had, as it seems to me the positions are slightly different).
I managed to run the STARSolo command quantifying the single genes per virus, which is what we are actually interested into (the whole virus does not give us enough information). The trick was to avoid genes with the same transcript_id in the GTF file, because the cellranger mkref
command automatically takes only the first occurence of the genes and not the others.
For example, a lot of genes in the gtf files that are available in the NCBI Assembly database do have a gene name but transcript_id marked as "unassigned". I just replaced that "unassigned" with a dummy (i.e. virus accession + an unique id) and they were included in the results of the cellranger mkref
gtf file. After that, the STARSolo pipeline was able to count those genes in the samples as expected.
Hi,
I'm using the reference downloaded at cellranger website, https://support.10xgenomics.com/single-cell-gene-expression/software/downloads/latest? . Anyway, Thank you for your information! Happy to hear that you succeeded in counting single genes.
@andreanuzzo This is interesting, could you provide the code you used to get VIRTUS2 to work with single cell seq? I found the instructions a bit vague. Thanks!
Hello,
I found the instructions to run VIRTUS2 on scRNA-seq data (https://github.com/yyoshiaki/VIRTUS2#virtus2-for-single-cell-rnaseq). However I am not fully clear on how to proceed.
Basically, I was able to run the
VIRTUS.SE.cwl
on the*_2.fastq.gz
reads. Then I used theVIRTUS.output.tsv
to parse the accession numbers for the detected viruses to build the referenceFASTA
andGTF
files which were needed for thecellranger mkref
command. However, when I runSTARSolo
I don't get any viruses detected (with default values).What should I put in the custom reference? Do I need to use whole genomes per each virus or use the CDS infos from their respective GenBank files? Any reason why you didn't include a STARSolo workflow in the repo?
Thank you for your help.