williamslab / ped-sim

Pedigree simulator
GNU General Public License v3.0
25 stars 11 forks source link

VCF out of order or chomosome not present #13

Closed stephenturner closed 3 years ago

stephenturner commented 3 years ago

Hello! Thanks for making this open-source and for putting together such an informative README. I'm still wrapping my brain around branching and the pedigree def file, but for now, I'm using your examples trying to simulate genetic data from a VCF file, and I'm getting an unexpected error about chromosomes in my VCF either out of order or not present.

I first created the map file exactly as specified in the README, named it the same, refined_mf.simmap. My ped def file (sims.def) file is pretty simple:

# test definition
def testped 5 3
1 1
2 1
3 1

When not using an input VCF I don't run into any issue:

$ ped-sim -d sims.def -m refined_mf.simmap --intf ../nu_p_campbell.tsv  -o sims/test-gbr
Pedigree simulator!  v1.1.16    (Released  8 Feb 2021)

  Def file:             sims.def
  Map file:             refined_mf.simmap
  Input VCF:            [none: no genetic data]
  Output prefix:        sims/test-gbr

  Random seed:          2550799409

  Interference file:    ../nu_p_campbell.tsv

Simulating haplotype transmissions... done.
Printing IBD segments... done.

To simulate genetic data, must use an input VCF with 20 founders.

However, when I do try to simulate variants using an input VCF, I run into a problem:

$ ped-sim -d sims.def -m refined_mf.simmap --intf ../nu_p_campbell.tsv -i gbr.22.snps.vcf.gz -o sims/test-gbr
Pedigree simulator!  v1.1.16    (Released  8 Feb 2021)

  Def file:             sims.def
  Map file:             refined_mf.simmap
  Input VCF:            gbr.22.snps.vcf.gz
  Output prefix:        sims/test-gbr

  Random seed:          4250646984

  Interference file:    ../nu_p_campbell.tsv

  Genotype error rate:  1.0e-03
  Opposite homozygous error rate:       0.00
  Missingness rate:     1.0e-03
  Pseudo-haploid rate:  0

  Not retaining extra samples in output VCF (printing only simulated samples)
  Output VCF will contain unphased data

Simulating haplotype transmissions... done.
Printing IBD segments... done.
Reading input VCF meta data... done.
  Input contains 40 samples, using 20 as founders, and retaining 0
Generating VCF file... ERROR: chromosome 22 in VCF file either out of order or not present
       in genetic map

Prior to running this I checked that there were no SNPs in the VCF outside the range defined in the map. I created the VCF with minimal pre-processing starting from the GRCh37 1000 Genomes chromosome 22 VCF. I first collected sample IDs for 20 unrelated samples from GBR, then subsetted the VCF getting only those samples, only variants on chromosome 22 between the region 17152611-51175626, and the -v snps -m2 -M2 -i 'INFO/AF>0.05' gets only biallelic SNPs with a global MAF>0.05, then finally sorts the output and writes a new VCF.

bcftools view ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
    --regions 22:17152611-51175626 \
    --samples 40-gbr.txt \
    -v snps -m2 -M2 -i 'INFO/AF>0.05' \
    | bcftools sort -Oz -o gbr.22.snps.vcf.gz

Here's that VCF in case you want it for a reproducible example:

gbr.22.snps.vcf.gz

Looking at the first few and last few chromosome 22 entries on the sex-specific map:

$ grep ^22 refined_mf.simmap | sort -nk2 | head -n2
22      17087705        0       0
22      17152611        0.003573334924  0.0023823617488

$ grep ^22 refined_mf.simmap | sort -nk2 | tail -n2
22      51175626        61.01549487409  76.3046256962048
22      51229805        61.023018820178 76.3068982668672

... shows that subsetting the vcf to the region 22:17152611-51175626 shouldn't result in any variants outside the range defined by the map. Indeed, checking the VCF itself shows that the first record is at position 17152611, and the last is at position 51175626, both included in the sex-specific map (refined_mf.simmap).

$ bcftools view --no-header gbr.22.snps.vcf.gz | head -n1 | cut -f1-6
22      17152611        rs4819849       A       G       100

$ bcftools view --no-header gbr.22.snps.vcf.gz | tail -n1 | cut -f1-6
22      51175626        rs3810648       A       G       100

Thanks, and please let me know if I can help with reproducing this issue on your end!

stephenturner commented 3 years ago

Here's the sex-specific map I'm using (compressed)

refined_mf.simmap.gz

williamslab commented 3 years ago

Thanks for posting! If you remove all chromosomes except 22 from the refined_mf.simmap file, it will work. Though this can change, for now, though you can omit later chromosomes, Ped-sim requires the VCF's first and successive chromosomes to match the first and later chromosomes in the genetic map. So it's looking for chromosome 1 here.

williamslab commented 3 years ago

Note that when using sex-specific maps, simulating all chromosomes simultaneously is necessary to ensure that the parent sexes are the same on all chromosomes, so if you can, try to merge the autosomal data into one VCF. (Support for chrX is coming hopefully in March.)

stephenturner commented 3 years ago

Thanks for the quick reply! Removing non chr22 from that map should be an easy fix.

Regarding the second - my initial intent was to simulate only a single chromosome's worth of data, not simulating genome-wide data one chromosome at a time. That should still work, right?

williamslab commented 3 years ago

For sure, should work! Hope the documentation helps.