zstephens / neat-genreads

NEAT read simulation tools

Question about mutation models #60

Open XavierMialhe opened 5 years ago

XavierMialhe commented 5 years ago

Hello,

I would like to create a dataset for benchmarking somatic variant callers with your software. Thanks for the clear documentation and the well-designed tool! I just don't fully understand how the mutation model is generated from the provided file.

I have already spiked somatic mutations into a BAM with BAMSurgeon and got a VCF (linked at the end) with controlled VAFs (between 1% and 30%). If I use this VCF as input to the simulation, all variants turn into germline variants with a VAF close to 50% (checked with IGV).

How should I proceed to spike the same somatic variants into my tumor BAM, at the same frequencies as in my VCF?

truth_mark_sorted.zip

zstephens commented 5 years ago

Greetings,

The read simulator was designed around drawing reads from a ground-truth set of alleles, such that it could be used for benchmarking and assessing variant phasing algorithms in addition to standard variant calling pipelines.

As a result, there isn't really a notion of arbitrary variant-allele frequency in this tool: under the hood, every variant it works with must belong to one of the phased copies of the reference sequence that it samples reads from. E.g. if you specified a ploidy of 2, the possible allele frequencies would be (0, 0.5, 1.0); with ploidy 3 they would be (0, 0.33, 0.67, 1.0), etc.
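To make the arithmetic concrete, here is a minimal sketch of the allele frequencies that are expressible at a given ploidy, assuming each variant sits on a whole number of the phased copies. The function names are hypothetical and are not part of the tool's code; they only illustrate why a VCF with 1% to 30% VAFs ends up near 0 or 50% at ploidy 2.

```python
from fractions import Fraction


def achievable_allele_fractions(ploidy):
    """Expected allele fractions when a variant sits on k of `ploidy` copies."""
    if ploidy < 1:
        raise ValueError("ploidy must be a positive integer")
    # k = 0 means the variant is absent; k = ploidy means it is on every copy.
    return [Fraction(k, ploidy) for k in range(ploidy + 1)]


def nearest_achievable(vaf, ploidy):
    """Closest expressible allele fraction to a requested VAF (illustration only)."""
    return min(achievable_allele_fractions(ploidy),
               key=lambda f: abs(float(f) - vaf))


if __name__ == "__main__":
    for ploidy in (2, 3):
        print(ploidy, [str(f) for f in achievable_allele_fractions(ploidy)])
    # A requested 30% VAF is not expressible with two copies.
    print(nearest_achievable(0.30, 2))
```

These are expected values; the VAF observed in the simulated reads will scatter around them because of read sampling.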

See this post for some possible ways of creating somatic datasets with this tool: https://github.com/zstephens/neat-genreads/issues/55#issuecomment-461495062

Hope this helps!