Open kforner opened 10 years ago
Hi again, Thinking about it, IMHO, for sake of reproducible results and simplicity, I would rather impose for snpgds/GenotypeData this rule: Allele1 < Allele2 lexicographically. The problem with major/minor is that it can no longer be true for a subset of samples. The problem is when you try to compare genotypes using integer codes, it is quite difficult if in one dataset the Allele1 is not the same than in the other. If you agree with this rule, then a check function could be implemented in SnpRelate, and all functions producing snpds files should sort the alleles and recode genotypes accordingly. What do you think ?
I am afraid I do not agree that Allele1 should always <= Allele2 in SNPRelate.
In practice, when we import SNP data from Illumina raw text files (chip or sequencing) instead of PLINK files, we define Allele1 is the reference allele on the forward strand, so that it is easy to merge or compare different datasets.
But how do you do to compare and merge datasets for which some alleles are inversed then ?
The function snpgdsCombineGeno
compares snp.allele
to determine whether the alleles are inverted. However, sometimes some alleles are from the reverse strand not the forward strand, which could cause an issue of strand ambiguity. In this case, snpgdsCombineGeno
decides the order of alleles by comparing the allele frequency.
Therefore, in order to avoid any ambiguity, it is the best to clearly define Allele1/Allele2 at the beginning.
I agree, but I do not think that coding the strand with the order of the alleles is the right approach. There are examples where major/minor alleles are inversed in sub datasets (simpson paradox). In this case snpgdsCombineGeno is fooled. IMHO it would be more robust and understandable to code this strand information as an additional variable.
I think you have missed something:
I have checked the source code carefully. I change coding from PLINK BED to GDS correctly at SNPRelate.cpp (line 877 to 911):
const C_UInt8 cvt[4] = { 2, 3, 1, 0 };
00 Homozygous for first allele in .bim file --> 2 in GDS 01 Missing genotype --> 3 in GDS 10 Heterozygous --> 1 in GDS 11 Homozygous for second allele in .bim file --> 0 in GDS
the number in GDS refers to the number of first allele (reference allele).
I did not say the coding was wrong, just that the alleles ordering did not reflect the intention : in snpgds allele1 is the reference allele, though in .bim allele1 is the minor allele. So after the conversion, all first alleles from the snpgds are minor alleles, which is not exactly what is understood (at least by me) from the documentation. Once again I strongly believe that some enforced standard should be less error-prone (like allele1 < allele2 + additional variables, e.g. strand). Thanks
Hi, Related to this topic, when converting VCF to gds (snpgdsVCF2GDS), I would like some clarification on how genotype dose is coded. From the R code: geno.str <- c("0|0", "0|1", "1|0", "1|1", "0/0", "0/1", "1/0", "1/1", "0", "1", "0|0|0", "0|0|1", "0|1|0", "0|1|1", "1|0|0", "1|0|1", "1|1|0", "1|1|1", "0/0/0", "0/0/1", "0/1/0", "0/1/1", "1/0/0", "1/0/1", "1/1/0", "1/1/1") geno.code <- as.integer(c(2, 1, 1, 0, 2, 1, 1, 0, 1, 0, 2, 1, 1, 1, 1, 1, 1, 0, 2, 1, 1, 1, 1, 1, 1, 0)) My understanding is that dose should represent dose of alternate/non-reference allele, however, this looks like reference allele (0|0) is coded as 2. Do I just have this incorrect?
Many thanks.
We use 2 bits to store a SNP genotype, which is defined as the count of the reference allele. So "(0|0)" should be coded as 2, since "0" represents reference alleles.
If you are interested in specifying which allele is reference allele, please consider the argument ref.allele
in the function snpgdsVCF2GDS
.
Has this issue been solved?
I met the same problem as kforner recently. I just wonder whether GDS A1/A2 and PED A1/A2 are consistent now.
I got gds files from other person. After I convert gds to plink bed with snpgdsGDS2BED function, I found the result swaped A1 and A2. Is there a way to appoint which one is REF allele when using snpgdsGDS2BED?
Hi, According to the doc of snpgdsCreateGeno: snp.allele{the reference/non-reference alleles} and in snpgdsBED2GDS, line 298:
But according plink2 documentation (cf https://www.cog-genomics.org/plink2/formats#bim): Allele 1 (corresponding to clear bits in .bed; usually minor) Allele 2 (corresponding to set bits in .bed; usually major)
So it seems that you should swap those alleles before writing them, or I missed something.