Normal Sample - Githubissues

LucaMannino commented 3 months ago

Hello,

I have a question regarding the use of normal samples in ScanNeo2.

According to the information provided in the data section of the wiki, it states: "In addition, normal allows to specify normal samples but is not used currently. Multiple normal samples can be separated by spaces."

How does the software ensure that the identified mutations are somatic if variant calling for normal DNA sequencing is not being performed?

I recently completed a test run with my data, and it appears that no output was generated for the normal DNA data. I suspect that either the normal DNA data is not being used or I have filled in the config file incorrectly. Below is how I filled in the data section of the config file. Could you please confirm whether this is correct?

data:
  name: bN4
  dnaseq:
    dna_tumor: Path/to/File/bN4DNA_EKDN230058818-1A_H2CMHDSXC_L3_1.fq.gz Path/to/File/bN4DNA_EKDN230058818-1A_H2CMHDSXC_L3_2.fq.gz
  rnaseq:
    rna_tumor: Path/to/File/bN4_1.fastq.gz Path/to/File/bN4_2.fastq.gz
  normal:
    dna_normal1: Path/to/File/bN4DNAbl_EKDN230058820-1A_222T7MLT4_L1_1.fq.gz Path/to/File/bN4DNAbl_EKDN230058820-1A_222T7MLT4_L1_2.fq.gz
    dna_normal2: Path/to/File/bN4DNAbl_EKDN230058820-1A_H27YKDSXC_L4_1.fq.gz Path/to/File/bN4DNAbl_EKDN230058820-1A_H27YKDSXC_L4_2.fq.gz

riasc commented 3 months ago

Hi, so when no normal sample is provided, we depend a bit on the output of the tools used. For example, GATK has a tumor-only mode, and we use FilterMutectCalls, which includes a Panel of Normals to remove false positives.

In any case it should generate an output (but also SNVs/indels - as the other sources are generated from transcriptomic data). Your config looks good. The only thing is that you could specify normal: dna_normal1 dna_normal2 to tell it that these are the normal samples. These are for example excluded in the genotyping.

However, it should print an output. Does it not generate any output at all, like the alignment, for example? Also, what mode have you specified for the indel module? https://github.com/ylab-hi/ScanNeo2/blob/adac49d2f96c23a8767a58ce76df3fd536e43618/config/config.yaml#L75 This needs to be set to BOTH or DNA; otherwise it will only be active for the RNA samples.

LucaMannino commented 3 months ago

Hi, thank you for the prompt reply. On line 75 of config.yaml I have selected mode: BOTH. It does generate an output, but only for the tumor samples; it doesn't include any alignment for the normal samples. It looks as if it is currently not using the normal sample to filter the germline mutations, but only the Panel of Normals to remove false positives.

/dnaseq/reads$ ls
dna_tumor_preproc_failed.fq.gz
dna_tumor_preproc_report.json
dna_tumor_R1_preproc_unpaired.fq.gz
dna_tumor_R2_preproc_unpaired.fq.gz
dna_tumor_preproc_report.html
dna_tumor_R1_preproc.fq.gz
dna_tumor_R2_preproc.fq.gz

dnaseq/align$ ls
dna_tumor_aligned_BWA.bam
dna_tumor_final_BWA.bam
dna_tumor_final_BWA.bam.bai
dna_tumor_final_BWA_split

Could it be that I need to rewrite the data portion of the config.yaml file as follows?

data:
  name: bN4
  dnaseq:
    dna_tumor: Path/to/File/bN4DNA_EKDN230058818-1A_H2CMHDSXC_L3_1.fq.gz Path/to/File/bN4DNA_EKDN230058818-1A_H2CMHDSXC_L3_2.fq.gz
  rnaseq:
    rna_tumor: Path/to/File/bN4_1.fastq.gz Path/to/File/bN4_2.fastq.gz
  normal: dna_normal1 Path/to/File/bN4DNAbl_EKDN230058820-1A_222T7MLT4_L1_1.fq.gz Path/to/File/bN4DNAbl_EKDN230058820-1A_222T7MLT4_L1_2.fq.gz dna_normal2 Path/to/File/bN4DNAbl_EKDN230058820-1A_H27YKDSXC_L4_1.fq.gz Path/to/File/bN4DNAbl_EKDN230058820-1A_H27YKDSXC_L4_2.fq.gz

Did I interpret the "The only thing is that you could specify normal: dna_normal1 dna_normal2 to tell it that these are the normal samples." of your reply correctly?

riasc commented 3 months ago

Hi,

Ah, so I think I misread your post before. Sorry about that. In normal, you only need to list the identifiers (e.g., dna_normal1) exactly as they have been defined. The dna_normal samples themselves, with their file paths, have to be defined under dnaseq, like this:

# General settings
reference:
  release: 111
  nonchr: false
threads: 30
mapq: 30  # overall required mapping quality
basequal: 20  # overall required base quality 

data:
  name:  bN4
  dnaseq: 
    dna_normal1: Path/to/File/bN4DNAbl_EKDN230058820-1A_222T7MLT4_L1_1.fq.gz  Path/to/File/bN4DNAbl_EKDN230058820-1A_222T7MLT4_L1_2.fq.gz
    dna_normal2: Path/to/File/bN4DNAbl_EKDN230058820-1A_H27YKDSXC_L4_1.fq.gz Path/to/File/bN4DNAbl_EKDN230058820-1A_H27YKDSXC_L4_2.fq.gz
    dna_tumor: Path/to/File/bN4DNA_EKDN230058818-1A_H2CMHDSXC_L3_1.fq.gz Path/to/File/bN4DNA_EKDN230058818-1A_H2CMHDSXC_L3_2.fq.gz
  rnaseq:
    rna_tumor: Path/to/File/bN4_1.fastq.gz Path/to/File/bN4_2.fastq.gz
  normal: dna_normal1 dna_normal2

  custom:
    variants:
    hlatyping:
      MHC-I:
      MHC-II:

For example, when we do the indel calling, ScanNeo2 searches for the keys under rnaseq/dnaseq and uses the data that is defined under them.
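To illustrate the lookup described above, here is a minimal Python sketch. It is not ScanNeo2's actual code: the function name `collect_samples`, the dictionary layout, and the shortened file names are assumptions made for illustration. It only shows why sample definitions nested under `normal` would never be visited if the workflow iterates over the keys of `dnaseq` and `rnaseq`.

```python
def collect_samples(config):
    """Return {group: {sample_id: [fastq paths]}} for dnaseq and rnaseq.

    Hypothetical helper: only the keys under data.dnaseq and data.rnaseq
    are inspected, mirroring the behaviour described in the thread.
    """
    data = config.get("data", {})
    samples = {}
    for group in ("dnaseq", "rnaseq"):
        entries = data.get(group) or {}
        # Paired-end files are given as one space-separated string.
        samples[group] = {
            sample_id: str(paths).split()
            for sample_id, paths in entries.items()
        }
    return samples


# Config shaped like the first attempt in this thread (file names shortened,
# and therefore hypothetical): the normal sample is defined under `normal`.
example_config = {
    "data": {
        "name": "bN4",
        "dnaseq": {"dna_tumor": "tumor_1.fq.gz tumor_2.fq.gz"},
        "rnaseq": {"rna_tumor": "rna_1.fastq.gz rna_2.fastq.gz"},
        "normal": {"dna_normal1": "normal1_1.fq.gz normal1_2.fq.gz"},
    }
}

samples = collect_samples(example_config)
# Only dna_tumor is found; the entry under `normal` is never visited.
print(sorted(samples["dnaseq"]))
```

With this layout, moving `dna_normal1` under `dnaseq` is what makes it visible to the lookup.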

Like here: https://github.com/ylab-hi/ScanNeo2/blob/adac49d2f96c23a8767a58ce76df3fd536e43618/workflow/rules/common.smk#L612

Since it was defined on the same level, it was probably missed. I will think about some routines to catch this. Thanks for the hint.
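A check of the kind mentioned above could look roughly like the following Python sketch. This is only an assumption about what such a routine might do, not something that exists in ScanNeo2; the function name `check_normal_samples`, the message wording, and the shortened file names are hypothetical.

```python
def check_normal_samples(config):
    """Return a list of problems found in data.normal (empty if none).

    Hypothetical validation: data.normal should be a space-separated
    string of identifiers that are defined under data.dnaseq or data.rnaseq.
    """
    data = config.get("data", {})
    normal = data.get("normal")
    problems = []
    if normal is None:
        return problems
    if isinstance(normal, dict):
        # Sample definitions were placed under `normal` itself.
        problems.append(
            "data.normal contains sample definitions; move them under "
            "data.dnaseq and list only their identifiers in data.normal"
        )
        return problems
    defined = set(data.get("dnaseq") or {}) | set(data.get("rnaseq") or {})
    for sample_id in str(normal).split():
        if sample_id not in defined:
            problems.append(
                f"normal sample '{sample_id}' is not defined under "
                "data.dnaseq or data.rnaseq"
            )
    return problems


# Layout from the first attempt in this thread (file names shortened).
misplaced = {"data": {
    "dnaseq": {"dna_tumor": "tumor_1.fq.gz tumor_2.fq.gz"},
    "normal": {"dna_normal1": "normal1_1.fq.gz normal1_2.fq.gz"},
}}

# Corrected layout: normals defined under dnaseq, referenced by identifier.
corrected = {"data": {
    "dnaseq": {
        "dna_normal1": "normal1_1.fq.gz normal1_2.fq.gz",
        "dna_tumor": "tumor_1.fq.gz tumor_2.fq.gz",
    },
    "normal": "dna_normal1",
}}

for problem in check_normal_samples(misplaced):
    print(problem)
print(check_normal_samples(corrected))
```

Returning a list of messages rather than raising on the first problem lets a caller report every misconfigured identifier at once.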

In the final prioritization file (mhc-I_neoepitopes.txt), you should then also find neoantigens with the dna_normal group (in the group field). If you redo the analysis, you might need to delete some intermediate files, for example in the annotation/variants directories, because Snakemake works bottom-up and only checks whether a file is present, regardless of how it was generated (in your case, only from the tumor samples).

Let me know if this helps, Thanks

ylab-hi / ScanNeo2

Normal Sample #33