ylab-hi / ScanNeo2

Snakemake-based computational workflow for neoantigen prediction from diverse sources
MIT License

Calling long indels from both DNA-seq and RNA-seq BAM files #22

Closed nttg8100 closed 1 month ago

nttg8100 commented 1 month ago

I reviewed the publication: https://academic.oup.com/bioinformatics/article/39/11/btad659/7330407 As far as I can tell, the pipeline uses the BAM files from both RNA-seq and DNA-seq to call long indels. However, the figure in the publication shows only the RNA-seq BAM file as input for this step. If I have misunderstood anything, please let me know.

riasc commented 1 month ago

Hi,

Yes, this is a feature we added later (after publication) to check for shared neoantigens between DNA-seq and RNA-seq data. That's why it's not in the figure. If you don't want this, you can control it using https://github.com/ylab-hi/ScanNeo2/blob/52a3818ec3189af502f18eba6b6a1a69b9b3a8c3/config/config.yaml#L75

nttg8100 commented 1 month ago

Thank you for sharing. Did you test this on the TESLA dataset? I wonder how much the results improve when DNA variant calling is added. Do the DNA-seq variants finally recover the last neoantigen that was missed in the publication (37/38)?

riasc commented 1 month ago

Hi,

I couldn't find the remaining neoantigens from TESLA. But this is mainly because the TESLA dataset does not provide the patient10 lung cancer sample. Maybe the patient consent wasn't there. If you look at the results from the publication, you can see that patient10 includes three immunogenic neoantigens. Interestingly, we detect one of them using the more tolerant settings. So it's probably the best we could do.

However, DNA-seq variants definitely help to filter out false positives when only the variants found in both DNA-seq and RNA-seq are considered. I'm also currently working on v0.3, which provides more values to filter on (like sequence similarity) and hopefully improves the accuracy further.

Cheers
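The idea of keeping only variants found in both DNA-seq and RNA-seq can be illustrated with a minimal Python sketch. This is not ScanNeo2's actual implementation: the `Variant` tuple, the `shared_variants` helper, and the exact-match rule on chromosome, position, reference and alternate allele are all assumptions made for illustration.

```python
from typing import Iterable, NamedTuple, Set


class Variant(NamedTuple):
    """Hypothetical minimal variant record (not a ScanNeo2 type)."""
    chrom: str
    pos: int
    ref: str
    alt: str


def shared_variants(dna: Iterable[Variant], rna: Iterable[Variant]) -> Set[Variant]:
    """Return variants present in both call sets.

    Assumption: two calls are 'the same' only if all four fields match exactly.
    """
    return set(dna) & set(rna)


# Toy call sets, invented for this example.
dna_calls = [Variant("chr1", 100, "A", "T"), Variant("chr2", 200, "G", "GA")]
rna_calls = [Variant("chr1", 100, "A", "T"), Variant("chr3", 300, "C", "G")]

# Only the chr1 variant is supported by both data types.
shared = shared_variants(dna_calls, rna_calls)
```

An exact-match intersection is the strictest choice. Indels in particular can be represented differently by different callers, so a real comparison would probably need normalization before matching.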

nttg8100 commented 1 month ago

It is good to know that you are working on v0.3. Besides that, do you think the indel calling subworkflow can be improved? I saw that the pipeline uses your lab's own Python software. I have not seen any publication that benchmarks this software for indel calling. If there is one, could you share it with me?

riasc commented 1 month ago

Yes, it definitely can be improved. For long indel calling (length >= 10) we are using transIndel. You can see some benchmarks in the original transIndel publication. For small indels and SNVs we use GATK. Here, we follow best practices: we first use HaplotypeCaller to detect indels, identify the most significant ones, and then use this information for variant calling with Mutect2. But we are not using the germline variants for anything else at the moment. Using multiple variant callers or even incorporating SVs could also enhance the workflow.
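The split between long and small indels described above can be sketched as a simple length-based classification. This is only an illustration under assumptions: the `classify` helper and its labels are hypothetical, the threshold of 10 is taken from the comment, and interpreting it as a length in bases is an assumption. In the workflow itself the two classes come from different callers rather than from a post-hoc split like this.

```python
# Threshold taken from the comment above; the unit (bases) is an assumption.
LONG_INDEL_MIN_LENGTH = 10


def indel_length(ref: str, alt: str) -> int:
    """Length difference between reference and alternate allele."""
    return abs(len(ref) - len(alt))


def classify(ref: str, alt: str) -> str:
    """Label a variant by size (hypothetical labels, for illustration only)."""
    length = indel_length(ref, alt)
    if length == 0:
        # Same allele length: a substitution, not an indel.
        return "snv_or_mnv"
    if length >= LONG_INDEL_MIN_LENGTH:
        return "long_indel"
    return "short_indel"


# A 3-base deletion stays below the threshold.
small = classify("ACGT", "A")
# A 10-base insertion reaches the threshold.
long_call = classify("A", "A" + "C" * 10)
```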