zhengxwen / SeqArray

Data management of large-scale whole-genome sequence variant calls (Development version only)
http://www.bioconductor.org/packages/SeqArray
43 stars 12 forks source link

re-calibrate ref/alt allele #33

Closed thierrygosselin closed 6 years ago

thierrygosselin commented 6 years ago

When a VCF file is imported SeqArray reads and use the default ref/alt alleles. However, when samples are removed, depending on MAC/MAF, ref/alt alleles might change ...

Question Is there a quick way to re-calibrate what is considered a ref allele in SeqArray/GDS ?

Best Thierry

zhengxwen commented 6 years ago

Rearranging ref. and alt. alleles requires recompressing data. I don't have such a function to rearrange alleles, but we could consider adding this function.

thierrygosselin commented 6 years ago

I currently do it for small data set, within R. I think it could be a welcomed addition to your packages (SeqArray, SeqVarTools), lots of student are removing samples and most of the time it requires re-calibrating ...

thierrygosselin commented 6 years ago

otherwise the statistics based on MAC/MAF are wrong...

smgogarten commented 6 years ago

Ref/alt alleles should not change when samples are removed - they depend on the reference genome to which the reads were aligned. While the ref allele is usually major, there are almost always cases where it is actually minor in a given sample set. If you are calculating MAC/MAF from a VCF or SeqArray file, you should always check the allele frequency first:

freq <- seqAlleleFreq(gds)
maf <- pmin(freq, 1-freq)

If you have multi-allelic variants, then the concept of minor allele frequency becomes ambiguous. In the above code, maf might represent the cumulative allele frequency of multiple alternate alleles.

thierrygosselin commented 6 years ago

I agree we should always check the freq first, but it’s good to have a calibrated vcf too, because most people are not aware of this... and take the vcf for granted.

For those of us who work in population genomic of non-human species, the current technology is RADseq. Coverage is usually low (< 50).

For most software if not all, the REF and ALT are defined by counts (hence minor allele counts or frequency...). With a reference genome or de novo, it’s the same. I understand the ambiguity with haplotypes vcf. But the REF allele is still the most abundant one.

It’s not uncommon to discard individuals during filtering steps, sometimes as many as 30% of the original dataset. This changes things.

So I reiterate my comment, the REF\ALT allele DO change with discarded samples, I see this with almost all the datasets I come across.

Still not convince ? Pick one in Molecular Ecology... The latest one sent to me for review had 200K, unfiltered biallelic SNPs. 5000 SNP had the wrong REF, because author had remove samples along their pipeline...

Cheers Thierry