yyoshiaki / VIRTUS2

A bioinformatics pipeline for viral transcriptome detection and quantification considering splicing.
Other
18 stars 7 forks source link

VIRTUS2 for scRNA #9

Closed andreanuzzo closed 2 years ago

andreanuzzo commented 2 years ago

Hello,

I found the instructions to run VIRTUS2 on scRNA-seq data (https://github.com/yyoshiaki/VIRTUS2#virtus2-for-single-cell-rnaseq). However I am not fully clear on how to proceed.

Basically, I was able to run the VIRTUS.SE.cwl on the *_2.fastq.gz reads. Then I used the VIRTUS.output.tsv to parse the accession numbers for the detected viruses to build the reference FASTA and GTF files which were needed for the cellranger mkref command. However, when I run STARSolo I don't get any viruses detected (with default values).

What should I put in the custom reference? Do I need to use whole genomes per each virus or use the CDS infos from their respective GenBank files? Any reason why you didn't include a STARSolo workflow in the repo?

Thank you for your help.

yyoshiaki commented 2 years ago

Hi,

Did you try cellranger with the custom reference? You can use full sequences as described in the section, Add a marker gene to the FASTA and GTF. https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/tutorial_mr

I also have experienced that most viral reads were from ambient RNA from empty droplets rather than droplets allocated as cells.

andreanuzzo commented 2 years ago

Thank you for your comments.

I managed to build the custom reference with cellranger (I can send the commands if needed). I proceeded as follows:

Now, the problem is: why doesn't this workflow detect the single genes for each viruses but just detects the whole genomes? For example, my cellranger custom reference GTF is formatted as follows:

viruses_NC_001806.2 RefSeq  gene    1   7569    .   -   .   gene_id "viruses_HHV1gp00s01"; transcript_id "viruses_"; db_xref "GeneID:24271498"; gbkey "Gene"; gene "LAT"; gene_biotype "ncRNA"; locus_tag "HHV1gp00s01"; partial "true";
viruses_NC_001806.2 RefSeq  transcript  1   7569    .   -   .   gene_id "viruses_HHV1gp00s01"; transcript_id "viruses_unassigned_transcript_1"; gbkey "ncRNA"; gene "LAT"; locus_tag "HHV1gp00s01"; partial "true"; product "LAT"; transcript_biotype "ncRNA";
viruses_NC_001806.2 RefSeq  exon    6910    7569    .   -   .   gene_id "viruses_HHV1gp00s01"; transcript_id "viruses_unassigned_transcript_1"; gene "LAT"; locus_tag "HHV1gp00s01"; partial "true"; product "LAT"; transcript_biotype "ncRNA"; exon_number "1";
viruses_NC_001806.2 RefSeq  exon    1   4953    .   -   .   gene_id "viruses_HHV1gp00s01"; transcript_id "viruses_unassigned_transcript_1"; gene "LAT"; locus_tag "HHV1gp00s01"; partial "true"; product "LAT"; transcript_biotype "ncRNA"; exon_number "2";
viruses_NC_001806.2 RefSeq  gene    513 1540    .   +   .   gene_id "viruses_HHV1gp00p77"; transcript_id "viruses_"; db_xref "GeneID:2703395"; gbkey "Gene"; gene "RL1"; gene_biotype "protein_coding"; locus_tag "HHV1gp00p77";
viruses_NC_001806.2 RefSeq  exon    513 1256    .   +   0   gene_id "viruses_HHV1gp00p77"; transcript_id "viruses_unassigned_transcript_2"; gbkey "exon"; gene "RL1"; locus_tag "HHV1gp00p77"; product "neurovirulence protein ICP34.5"; protein_id "YP_009137073.1"; exon_number "1";
viruses_NC_001806.2 RefSeq  start_codon 513 515 .   +   0   gene_id "viruses_HHV1gp00p77"; transcript_id "viruses_unassigned_transcript_2"; gbkey "exon"; gene "RL1"; locus_tag "HHV1gp00p77"; product "neurovirulence protein ICP34.5"; protein_id "YP_009137073.1"; exon_number "1";
viruses_NC_001806.2 RefSeq  stop_codon  1257    1259    .   +   0   gene_id "viruses_HHV1gp00p77"; transcript_id "viruses_unassigned_transcript_2"; gbkey "exon"; gene "RL1"; locus_tag "HHV1gp00p77"; product "neurovirulence protein ICP34.5"; protein_id "YP_009137073.1"; exon_number "1";
viruses_NC_001806.2 RefSeq  gene    2087    5699    .   +   .   gene_id "viruses_HHV1gp00p76"; transcript_id "viruses_"; db_xref "GeneID:2703389"; gbkey "Gene"; gene "RL2"; gene_biotype "protein_coding"; locus_tag "HHV1gp00p76"; note "immediate early gene";
viruses_NC_001806.2 RefSeq  transcript  2114    5699    .   +   .   gene_id "viruses_HHV1gp00p76"; transcript_id "viruses_unassigned_transcript_3"; gbkey "mRNA"; gene "RL2"; locus_tag "HHV1gp00p76"; product "ubiquitin E3 ligase ICP0"; transcript_biotype "mRNA";
viruses_NC_001806.2 RefSeq  exon    2114    2318    .   +   .   gene_id "viruses_HHV1gp00p76"; transcript_id "viruses_unassigned_transcript_3"; gene "RL2"; locus_tag "HHV1gp00p76"; product "ubiquitin E3 ligase ICP0"; transcript_biotype "mRNA"; exon_number "1";
viruses_NC_001806.2 RefSeq  exon    3084    3750    .   +   .   gene_id "viruses_HHV1gp00p76"; transcript_id "viruses_unassigned_transcript_3"; gene "RL2"; locus_tag "HHV1gp00p76"; product "ubiquitin E3 ligase ICP0"; transcript_biotype "mRNA"; exon_number "2";
viruses_NC_001806.2 RefSeq  exon    3887    5699    .   +   .   gene_id "viruses_HHV1gp00p76"; transcript_id "viruses_unassigned_transcript_3"; gene "RL2"; locus_tag "HHV1gp00p76"; product "ubiquitin E3 ligase ICP0"; transcript_biotype "mRNA"; exon_number "3";
viruses_NC_001806.2 RefSeq  exon    2262    2318    .   +   0   gene_id "viruses_HHV1gp00p76"; transcript_id "viruses_unassigned_transcript_3"; gbkey "exon"; gene "RL2"; locus_tag "HHV1gp00p76"; product "ubiquitin E3 ligase ICP0"; protein_id "YP_009137074.1"; exon_number "1";
viruses_NC_001806.2 RefSeq  exon    3084    3750    .   +   0   gene_id "viruses_HHV1gp00p76"; transcript_id "viruses_unassigned_transcript_3"; gbkey "exon"; gene "RL2"; locus_tag "HHV1gp00p76"; product "ubiquitin E3 ligase ICP0"; protein_id "YP_009137074.1"; exon_number "2";
viruses_NC_001806.2 RefSeq  exon    3887    5487    .   +   2   gene_id "viruses_HHV1gp00p76"; transcript_id "viruses_unassigned_transcript_3"; gbkey "exon"; gene "RL2"; locus_tag "HHV1gp00p76"; product "ubiquitin E3 ligase ICP0"; protein_id "YP_009137074.1"; exon_number "3";
viruses_NC_001806.2 RefSeq  start_codon 2262    2264    .   +   0   gene_id "viruses_HHV1gp00p76"; transcript_id "viruses_unassigned_transcript_3"; gbkey "exon"; gene "RL2"; locus_tag "HHV1gp00p76"; product "ubiquitin E3 ligase ICP0"; protein_id "YP_009137074.1"; exon_number "1";

viruses_ is my delimiter for viral genes and GRCH38_ is for human genes. I have changed CDS to exon, otherwise cellranger wouldn't let me proceed. However, after running STAR Solo, the only genes that are recognised are:

'viruses_NC_001499.1',
 'viruses_NC_015521.1',
 'viruses_NC_004102.1',
 'viruses_NC_009823.1',
 'viruses_NC_009827.1',
 'viruses_NC_002645.1',
 'viruses_NC_001806.1',
 'viruses_NC_001798.1',
 'viruses_NC_009333.1',
 'viruses_NC_018711.1',
 'viruses_NC_001552.1',
 'viruses_NC_010708.1',
 'viruses_NC_001672.1',
 'viruses_NC_006998.1',
 'viruses_gi|9627396|lcl|HPV9REF.1|',
 'viruses_gi|396940|lcl|HPV19REF.1|',
 'viruses_gi|1020210|lcl|HPV29REF.1|',
 'viruses_gi|12084981|lcl|HPV71REF.1|',
 'viruses_gi|256807731|lcl|HPV118REF.1|'

Any idea?

yyoshiaki commented 2 years ago

I think that your gtf was not correct. The attributes look different from my gtf form for the reference.

chrX    HAVANA  gene    49250436        49270477        .       -       .       gene_id "ENSG00000049768"; gene_version "15"; gene_type "protein_coding"; gene_name "FOXP3"; level 2; hgnc_id "HGNC:6106"; tag "overlapping_locus"; havana_gene "OTTHUMG00000024135.8";
chrX    HAVANA  transcript      49250436        49264924        .       -       .       gene_id "ENSG00000049768"; gene_version "15"; transcript_id "ENST00000376207"; transcript_version "9"; gene_type "protein_coding"; gene_name "FOXP3"; transcript_type "protein_coding"; transcript_name "FOXP3-203"; level 2; protein_id "ENSP00000365380.4"; transcript_support_level "1"; hgnc_id "HGNC:6106"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS14323.1"; havana_gene "OTTHUMG00000024135.8"; havana_transcript "OTTHUMT00000060814.2";
chrX    HAVANA  exon    49264661        49264924        .       -       .       gene_id "ENSG00000049768"; gene_version "15"; transcript_id "ENST00000376207"; transcript_version "9"; gene_type "protein_coding"; gene_name "FOXP3"; transcript_type "protein_coding"; transcript_name "FOXP3-203"; exon_number 1; exon_id "ENSE00003849333"; exon_version "1"; level 2; protein_id "ENSP00000365380.4"; transcript_support_level "1"; hgnc_id "HGNC:6106"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS14323.1"; havana_gene "OTTHUMG00000024135.8"; havana_transcript "OTTHUMT00000060814.2";
chrX    HAVANA  exon    49258296        49258527        .       -       .       gene_id "ENSG00000049768"; gene_version "15"; transcript_id "ENST00000376207"; transcript_version "9"; gene_type "protein_coding"; gene_name "FOXP3"; transcript_type "protein_coding"; transcript_name "FOXP3-203"; exon_number 2; exon_id "ENSE00001316456"; exon_version "2"; level 2; protein_id "ENSP00000365380.4"; transcript_support_level "1"; hgnc_id "HGNC:6106"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS14323.1"; havana_gene "OTTHUMG00000024135.8"; havana_transcript "OTTHUMT00000060814.2";
chrX    HAVANA  CDS     49258296        49258505        .       -       0       gene_id "ENSG00000049768"; gene_version "15"; transcript_id "ENST00000376207"; transcript_version "9"; gene_type "protein_coding"; gene_name "FOXP3"; transcript_type "protein_coding"; transcript_name "FOXP3-203"; exon_number 2; exon_id "ENSE00001316456"; exon_version "2"; level 2; protein_id "ENSP00000365380.4"; transcript_support_level "1"; hgnc_id "HGNC:6106"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS14323.1"; havana_gene "OTTHUMG00000024135.8"; havana_transcript "OTTHUMT00000060814.2";

By the way, I actually prefer taking the whole viral genome as the gene to quantifying each gene for scRNAseq to maximize the detection of viral reads. PRJNA608742 (COVID19 BALF) can be used as the confirmation of the workflow. For your information, I'm attaching my workflow for the dataset.

image

  1. added SARS-CoV2 genome to genome.fa (refdata-gex-GRCh38-2020-A)
>SARS-CoV2
ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAA
CGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAAC
TAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTG
TTGCAGCCGATCATCAGCACATCTAGGTTTCGTCCGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTC
CCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTGCCTGTTTTACAGGTTCGCGACGTGCTCGTAC
GTGGCTTTGGAGACTCCGTGGAGGAGGTCTTATCAGAGGCACGTCAACATCTTAAAGATGGCACTTGTGG
CTTAGTAGAAGTTGAAAAAGGCGTTTTGCCTCAACTTGAACAGCCCTATGTGTTCATCAAACGTTCGGAT
GCTCGAACTGCACCTCATGGTCATGTTATGGTTGAGCTGGTAGCAGAACTCGAAGGCATTCAGTACGGTC
GTAGTGGTGAGACACTTGGTGTCCTTGTCCCTCATGTGGGCGAAATACCAGTGGCTTACCGCAAGGTTCT
TCTTCGTAAGAACGGTAATAAAGGAGCTGGTGGCCATAGTTACGGCGCCGATCTAAAGTCATTTGACTTA
GGCGACGAGCTTGGCACTGATCCTTATGAAGATTTTCAAGAAAACTGGAACACTAAACATAGCAGTGGTG
TTACCCGTGAACTCATGCGTGAGCTTAACGGAGGGGCATACACTCGCTATGTCGATAACAACTTCTGTGG
CCCTGATGGCTACCCTCTTGAGTGCATTAAAGACCTTCTAGCACGTGCTGGTAAAGCTTCATGCACTTTG
TCCGAACAACTGGACTTTATTGACACTAAGAGGGGTGTATACTGCTGCCGTGAACATGAGCATGAAATTG
CTTGGTACACGGAACGTTCTGAAAAGAGCTATGAATTGCAGACACCTTTTGAAATTAAATTGGCAAAGAA
ATTTGACACCTTCAATGGGGAATGTCCAAATTTTGTATTTCCCTTAAATTCCATAATCAAGACTATTCAA
CCAAGGGTTGAAAAGAAAAAGCTTGATGGCTTTATGGGTAGAATTCGATCTGTCTATCCAGTTGCGTCAC
CAAATGAATGCAACCAAATGTGCCTTTCAACTCTCATGAAGTGTGATCATTGTGGTGAAACTTCATGGCA
GACGGGCGATTTTGTTAAAGCCACTTGCGAATTTTGTGGCACTGAGAATTTGACTAAAGAAGGTGCCACT
ACTTGTGGTTACTTACCCCAAAATGCTGTTGTTAAAATTTATTGTCCAGCATGTCACAATTCAGAAGTAG
GACCTGAGCATAGTCTTGCCGAATACCATAATGAATCTGGCTTGAAAACCATTCTTCGTAAGGGTGGTCG
CACTATTGCCTTTGGAGGCTGTGTGTTCTCTTATGTTGGTTGCCATAACAAGTGTGCCTATTGGGTTCCA
...
  1. added whole genome region of SARS-CoV2 to genes.gtf
SARS-CoV2       unknown exon    1       29903   .       +       .       gene_id "SARS-CoV2"; transcript_id "SARS-CoV2"; gene_name "SARS-CoV2"; gene_biotype "protein_coding";
  1. made a reference

    cellranger mkref --genome=hg38_virus \
    --fasta=genome.fa \
    --genes=genes.gtf
  2. quantification by cellranger

    
    for SRR in `cat SRR_Acc_List.txt`
    do
    echo processing ${SRR} ...
    if [ ! -f ${SRR}_cr/outs.web_summary.html ]; then
    # fasterq-dump
    mkdir -p SRR11181954
    cp ${SRR}_1.fastq.gz ${SRR}/${SRR}_S1_L001_R1_001.fastq.gz
    cp ${SRR}_2.fastq.gz ${SRR}/${SRR}_S1_L001_R2_001.fastq.gz

cellranger count --id=${SRR}_cr \ --fastqs=/home/yyasumizu/NGS_public/PRJNA608742_COVID_BALF_scRNAseq/${SRR} \ --sample=${SRR} \ --transcriptome=/home/yyasumizu/NGS_public/PRJNA608742_COVID_BALF_scRNAseq/cellranger_ref/hg38_virus fi done


5. downstream analysis by seurat

<img width="404" alt="image" src="https://user-images.githubusercontent.com/19543497/190951107-8bb118cb-1e36-42c3-a27a-980067f699e7.png">
<img width="376" alt="image" src="https://user-images.githubusercontent.com/19543497/190951115-27851f7f-1111-4910-ae39-c330a0e56ba8.png">
andreanuzzo commented 2 years ago

Hi Yoshi,

The file you show is the original GRCH reference, I was showing the merged reference after the cellranger mkref command. (I used the gencode.v41.annotation.gtf file downloaded from NCBI, not sure if it's the same reference you had, as it seems to me the positions are slightly different).

I managed to run the STARSolo command quantifying the single genes per virus, which is what we are actually interested into (the whole virus does not give us enough information). The trick was to avoid genes with the same transcript_id in the GTF file, because the cellranger mkref command automatically takes only the first occurence of the genes and not the others.

For example, a lot of genes in the gtf files that are available in the NCBI Assembly database do have a gene name but transcript_id marked as "unassigned". I just replaced that "unassigned" with a dummy (i.e. virus accession + an unique id) and they were included in the results of the cellranger mkref gtf file. After that, the STARSolo pipeline was able to count those genes in the samples as expected.

yyoshiaki commented 2 years ago

Hi,

I'm using the reference downloaded at cellranger website, https://support.10xgenomics.com/single-cell-gene-expression/software/downloads/latest? . Anyway, Thank you for your information! Happy to hear that you succeeded in counting single genes.

cstrlln commented 1 year ago

@andreanuzzo This is interesting, could you provide the code you used to get VIRTUS2 to work with single cell seq? I found the instructions a bit vague. Thanks!