yuchaojiang / MARATHON

Integrative pipeline for profiling DNA copy number and inferring tumor phylogeny
GNU General Public License v2.0

Input for CODEX2 and Get bias matrix #7

mathisnozais closed 8 months ago

mathisnozais commented 8 months ago

Hi there! First, I would really like to thank you for your tools and the documentation for this pipeline, which is exactly what I was looking for. When following "4.4.2. Running CODEX2 to get bias matrix", I'm not sure what input I should use to get the three matrices. I'm a bit lost. I think for the SNP matrix we start with positions coming from a somatic variant call VCF? For the two other matrices, the tumor part has to come from a somatic variant call VCF, but is the normal part extracted from a germline variant call VCF, or from the paired T/N somatic variant call VCF using the information for the normal sample of the pair? I might have some other questions regarding those inputs, but those are the main ones, and I'm pretty sure there is something I don't understand well. Thank you for your kind help, Mathis

yuchaojiang commented 8 months ago

Hi,

The three matrices are: 1) BED files for the SNPs, 2) allelic coverage for the SNPs, and 3) genotypes for the SNPs.

CODEX2 needs a matrix of total coverage at each SNP (reference allele + alternative allele) as well as the locations of the SNPs (so that the GC content of a region centered at each SNP can be calculated).
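As a rough illustration of those two inputs (not MARATHON or CODEX2 code), the sketch below sums reference and alternative counts into a total-coverage matrix and builds BED-style windows centered at each SNP. The input layout, the function names `total_coverage` and `snp_windows`, and the `half_width` value are all assumptions made for this example; a SNP with no record for a sample is stored as `None` so that it is not mistaken for zero coverage.

```python
def total_coverage(allelic_counts, samples):
    """Build a SNP-by-sample matrix of total coverage (ref + alt).

    allelic_counts is a hypothetical layout for this sketch:
        {(chrom, pos): {sample_name: (ref_count, alt_count)}}
    """
    # Plain tuple sort: chromosomes sort as strings, positions as integers.
    snps = sorted(allelic_counts)
    matrix = []
    for snp in snps:
        row = []
        for sample in samples:
            counts = allelic_counts[snp].get(sample)
            # No record for this sample at this SNP: keep it explicit as None.
            row.append(None if counts is None else counts[0] + counts[1])
        matrix.append(row)
    return snps, matrix


def snp_windows(snps, half_width=50):
    """Return BED-style (chrom, start, end) windows centered at each SNP.

    Positions are 1-based (as in VCF); windows are 0-based, half-open
    (as in BED). half_width is an arbitrary illustrative value.
    """
    windows = []
    for chrom, pos in snps:
        start = max(0, pos - 1 - half_width)
        end = pos + half_width
        windows.append((chrom, start, end))
    return windows
```

For example, with counts for a tumor sample `"T"` and a normal sample `"N"`, `total_coverage(counts, ["T", "N"])` returns the sorted SNP list together with one row of totals per SNP, and `snp_windows` turns that SNP list into intervals over which GC content could be computed.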

mathisnozais commented 8 months ago

Hi, I'm sorry, but I'm still having difficulty understanding everything. To get used to CODEX2, I ran it as in your CODEX2 GitHub demo with multiple control samples. It worked perfectly fine, but when getting back to MARATHON for the bias matrix I'm a bit lost. For the BED files, if I understand correctly, we need a file with the combined SNP positions from the different samples. For matrices 2 & 3, I might be getting something wrong, but how do you extract the allelic coverage and genotype for all the SNPs given in matrix 1, since each VCF only contains information for its own SNPs? For allelic coverage I found "CollectAllelicCounts" from GATK, but I don't know if it's the right way.

Thanks a lot for your help !

yuchaojiang commented 8 months ago

Hi, this is unfortunately outside the scope of MARATHON, but the VCF file will have, for each mutation, a genotype slot and an allelic coverage slot for both the reference allele and the alternative allele. I believe we have an example of extracting them in R, but your VCF might have a different output format. You can read more about the VCF file format.
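A minimal sketch of pulling those two slots out of one VCF data line, written in Python rather than R and not taken from MARATHON. It assumes the caller writes the genotype under the `GT` FORMAT key and the per-allele depths under `AD` (reference first, then alternative); other callers may use different keys, so check the header of your own VCF. The function name `parse_sample_fields` is made up for this example.

```python
def parse_sample_fields(vcf_line, sample_index=0):
    """Extract genotype and ref/alt allelic depths for one sample.

    Assumes a tab-delimited VCF data line with a FORMAT column (column 9)
    followed by one column per sample, and that GT and AD are present.
    Missing or '.' values are returned as None.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]

    # Pair each FORMAT key with the matching value in the sample column.
    keys = fields[8].split(":")
    values = fields[9 + sample_index].split(":")
    sample = dict(zip(keys, values))

    genotype = sample.get("GT")
    ref_count = alt_count = None
    depths = sample.get("AD")
    if depths and depths != ".":
        parts = depths.split(",")
        # Only the first alternative allele is used in this sketch.
        if len(parts) >= 2 and parts[0] != "." and parts[1] != ".":
            ref_count, alt_count = int(parts[0]), int(parts[1])

    return {
        "chrom": chrom,
        "pos": pos,
        "ref": ref,
        "alt": alt,
        "genotype": genotype,
        "ref_count": ref_count,
        "alt_count": alt_count,
    }
```

In a paired tumor/normal VCF, calling the function once per sample column (`sample_index=0`, `sample_index=1`) gives the genotype and allelic counts for each sample at that site; which column is tumor and which is normal depends on the header line of your file.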