Hi there,
I first posted this as a comment on an existing similar issue, but given that issue has had little activity in recent years, I thought I might also post this as a new issue.
The dataset in question derives from whole-genome sequencing data, and has 3,309,311 total SNPs. The .vcf file used as input to OutFLANK can be downloaded from my Dropbox. The R script I used to process this dataset with OutFLANK can also be downloaded from my Dropbox here. In short, I used the following code for the primary analysis:
OutFLANK(FstDataFrame,LeftTrimFraction=0.05, RightTrimFraction=0.05, Hmin=0.1, NumberOfSamples=11, qthreshold=0.05)
I also tried the above code with NumberOfSamples=3. Both NumberOfSamples values resulted in the same output: 2 outlier SNPs identified out of 3,309,311.
Using pcadapt, with the same dataset, I identified between 11,079 and 556,880 outlier SNPs (depending on the value of k and outlier cutoff method used).
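In case it helps anyone compare numbers, here is a minimal sketch of how the outlier count and fitted parameters could be read from the object returned by OutFLANK. `summarise_outflank` is a hypothetical helper name, not part of the package, and the field names used (`results$OutlierFlag`, `FSTbar`, `dfInferred`) are assumed to match the list that `OutFLANK()` returns.

```r
# Hypothetical helper (not part of OutFLANK): summarise a result list.
# Assumes `of` has the fields results$OutlierFlag, FSTbar and dfInferred.
summarise_outflank <- function(of) {
  flags <- of$results$OutlierFlag
  c(
    n_loci     = length(flags),             # all loci in the results table
    n_flagged  = sum(!is.na(flags)),        # loci that received a TRUE/FALSE flag
    n_outliers = sum(flags, na.rm = TRUE),  # loci flagged as outliers
    FSTbar     = of$FSTbar,                 # mean FST of the trimmed distribution
    dfInferred = of$dfInferred              # inferred chi-square degrees of freedom
  )
}

# Usage, assuming FstDataFrame already exists and OutFLANK is installed:
# library(OutFLANK)
# out <- OutFLANK(FstDataFrame, LeftTrimFraction = 0.05, RightTrimFraction = 0.05,
#                 Hmin = 0.1, NumberOfSamples = 11, qthreshold = 0.05)
# summarise_outflank(out)
```

The `na.rm = TRUE` is there in case some loci carry no flag (an assumption on my part), so the count covers only loci that were actually evaluated.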
I understand that OutFLANK reduces the Type I error rate compared with other programs, but the difference between 2 and 11,079 outliers seems extreme.
I am also working with 3 other datasets from the same group of samples, which differ in coverage level. The dataset in question was not downsampled to standardize coverage, so coverage varies from 1x to 27x, with n=41 samples. The other datasets have fewer samples (n=14, 25, and 39 for 10x, 5x, and 2x downsampling, respectively). For these datasets I ran the same OutFLANK procedure as for the non-downsampled dataset, and obtained 0 outliers for all 3.
I produced all the possible plots in OutFLANK, but I am not sure how to interpret them to explain this large discrepancy in outlier identification between OutFLANK and pcadapt.
Any insights you might have would be greatly appreciated. I'm happy to provide more information as needed.
Best, Jilda
Originally posted by @jcaccavo in https://github.com/whitlock/OutFLANK/issues/10#issuecomment-1696996417