zhengxwen / SNPRelate

R package: parallel computing toolset for relatedness and principal component analysis of SNP data (Development version only)
http://www.bioconductor.org/packages/SNPRelate
98 stars 25 forks source link

alleles switch in snpgdsBED2GDS ? #4

Open kforner opened 10 years ago

kforner commented 10 years ago

Hi, According to the doc of snpgdsCreateGeno: snp.allele{the reference/non-reference alleles} and in snpgdsBED2GDS, line 298:

bimD <- read.table(bim.fn, header=FALSE, stringsAsFactors=FALSE)
names(bimD) <- c("chr", "snp.id", "map", "pos", "allele1", "allele2")

But according plink2 documentation (cf https://www.cog-genomics.org/plink2/formats#bim): Allele 1 (corresponding to clear bits in .bed; usually minor) Allele 2 (corresponding to set bits in .bed; usually major)

So it seems that you should swap those alleles before writing them, or I missed something.

kforner commented 10 years ago

Hi again, Thinking about it, IMHO, for sake of reproducible results and simplicity, I would rather impose for snpgds/GenotypeData this rule: Allele1 < Allele2 lexicographically. The problem with major/minor is that it can no longer be true for a subset of samples. The problem is when you try to compare genotypes using integer codes, it is quite difficult if in one dataset the Allele1 is not the same than in the other. If you agree with this rule, then a check function could be implemented in SnpRelate, and all functions producing snpds files should sort the alleles and recode genotypes accordingly. What do you think ?

zhengxwen commented 10 years ago

I am afraid I do not agree that Allele1 should always <= Allele2 in SNPRelate.

In practice, when we import SNP data from Illumina raw text files (chip or sequencing) instead of PLINK files, we define Allele1 is the reference allele on the forward strand, so that it is easy to merge or compare different datasets.

kforner commented 10 years ago

But how do you do to compare and merge datasets for which some alleles are inversed then ?

zhengxwen commented 10 years ago

The function snpgdsCombineGeno compares snp.allele to determine whether the alleles are inverted. However, sometimes some alleles are from the reverse strand not the forward strand, which could cause an issue of strand ambiguity. In this case, snpgdsCombineGeno decides the order of alleles by comparing the allele frequency.

Therefore, in order to avoid any ambiguity, it is the best to clearly define Allele1/Allele2 at the beginning.

kforner commented 10 years ago

I agree, but I do not think that coding the strand with the order of the alleles is the right approach. There are examples where major/minor alleles are inversed in sub datasets (simpson paradox). In this case snpgdsCombineGeno is fooled. IMHO it would be more robust and understandable to code this strand information as an additional variable.

zhengxwen commented 10 years ago

I think you have missed something:

I have checked the source code carefully. I change coding from PLINK BED to GDS correctly at SNPRelate.cpp (line 877 to 911):

const C_UInt8 cvt[4] = { 2, 3, 1, 0 };

00 Homozygous for first allele in .bim file --> 2 in GDS 01 Missing genotype --> 3 in GDS 10 Heterozygous --> 1 in GDS 11 Homozygous for second allele in .bim file --> 0 in GDS

the number in GDS refers to the number of first allele (reference allele).

kforner commented 10 years ago

I did not say the coding was wrong, just that the alleles ordering did not reflect the intention : in snpgds allele1 is the reference allele, though in .bim allele1 is the minor allele. So after the conversion, all first alleles from the snpgds are minor alleles, which is not exactly what is understood (at least by me) from the documentation. Once again I strongly believe that some enforced standard should be less error-prone (like allele1 < allele2 + additional variables, e.g. strand). Thanks

awalsh17 commented 9 years ago

Hi, Related to this topic, when converting VCF to gds (snpgdsVCF2GDS), I would like some clarification on how genotype dose is coded. From the R code: geno.str <- c("0|0", "0|1", "1|0", "1|1", "0/0", "0/1", "1/0", "1/1", "0", "1", "0|0|0", "0|0|1", "0|1|0", "0|1|1", "1|0|0", "1|0|1", "1|1|0", "1|1|1", "0/0/0", "0/0/1", "0/1/0", "0/1/1", "1/0/0", "1/0/1", "1/1/0", "1/1/1") geno.code <- as.integer(c(2, 1, 1, 0, 2, 1, 1, 0, 1, 0, 2, 1, 1, 1, 1, 1, 1, 0, 2, 1, 1, 1, 1, 1, 1, 0)) My understanding is that dose should represent dose of alternate/non-reference allele, however, this looks like reference allele (0|0) is coded as 2. Do I just have this incorrect?

Many thanks.

zhengxwen commented 9 years ago

We use 2 bits to store a SNP genotype, which is defined as the count of the reference allele. So "(0|0)" should be coded as 2, since "0" represents reference alleles.

If you are interested in specifying which allele is reference allele, please consider the argument ref.allele in the function snpgdsVCF2GDS.

yangli-ai commented 5 years ago

Has this issue been solved?

I met the same problem as kforner recently. I just wonder whether GDS A1/A2 and PED A1/A2 are consistent now.

I got gds files from other person. After I convert gds to plink bed with snpgdsGDS2BED function, I found the result swaped A1 and A2. Is there a way to appoint which one is REF allele when using snpgdsGDS2BED?