whitlock / OutFLANK

A procedure to find Fst outliers based on an inferred distribution of neutral Fst
18 stars 9 forks source link

Input size limit? #1

Open lidiaiprou opened 9 years ago

lidiaiprou commented 9 years ago

Dear OutFLANK developers and users, I am trying to use OutFlank to find Fst outliers in a set of 3,749,072 SNPs. I am trying to use the function MakeDiploidFSTMat() to create the dataframe to run OutFlank but it seems in only works when the input is smaller or equal to 10,000 SNPs.

When the input is 10,000 SNPs or smaller it works perfectly:

pos<-read.table("test10000.pos",header=FALSE)
> PopNames<-read.table("test10000.pop",header=FALSE)
> SNPmat<-read.table("test10000.0129", header=FALSE)
> LocusNames<-mapply(function(x,y) paste(x,y,sep=":"), pos$V1, pos$V2)
> FSTMat<-MakeDiploidFSTMat(SNPmat,LocusNames,PopNames) 
Calculating FSTs, may take a few minutes...
[1] "10000 done of 10000"
 Result<-OutFLANK(FstDataFrame=FSTMat, LeftTrimFraction=0.05, RightTrimFraction=0.05, Hmin=0.1, NumberOfSamples=2, qthreshold=0.05)
> head(Result)
$FSTbar
[1] 0.02457544

$FSTNoCorrbar
[1] 0.04532807

$dfInferred
[1] 2

$numberLowFstOutliers
[1] 0

$numberHighFstOutliers
[1] 0

$results
      LocusName         He           FST            T1         T2    FSTNoCorr
1     2L:530008         NA            NA            NA         NA           NA
2     2L:530022 0.13265306 -2.286908e-02 -1.515983e-03 0.06628965 1.566891e-03
...

However, when the input matrix is larger than 10,000 SNPs, it dies with a "Execution halted"...

> pos<-read.table("test100000.pos",header=FALSE)
> PopNames<-read.table("test100000.pop",header=FALSE)
> SNPmat<-read.table("test100000.0129", header=FALSE)
> LocusNames<-mapply(function(x,y) paste(x,y,sep=":"), pos$V1, pos$V2)
> FSTMat<-MakeDiploidFSTMat(SNPmat,LocusNames,PopNames) 
Calculating FSTs, may take a few minutes...
[1] "10000 done of 100000"
Error in Sample_Mat[, 1] : subscript out of bounds
Calls: MakeDiploidFSTMat ... unlist -> getFSTs_diploids -> WC_FST_Diploids_2Alleles
Execution halted

Have you encountered a similar problem? Do you know how could I solve it?

Thanks!

DrK-Lo commented 9 years ago

The program works with input files of any size, but there was an error in the printing (i.e. it only printed "10,000" once).

I suspect your other error

Error in Sample_Mat[, 1] : subscript out of bounds was due to a problem with a locus in your input file. Please make sure that you don't have any fixed genotypes in your input file. If you find the problem locus, you can send us the allele frequencies and we'll check it out.

Katie


Dr. K.E. Lotterhos Assistant Professor, Biology Department, Wake Forest University, Box 7325 Reynolda Station, Winston-Salem, NC 27109 http://college.wfu.edu/biology/people/faculty/lotterhos/ Office: 336-758-3157


On Nov 20, 2014, at 6:39 AM, lidiaiprou wrote:

Dear OutFLANK developers and users, I am trying to use OutFlank to find Fst outliers in a set of 3,749,072 SNPs. I am trying to use the function MakeDiploidFSTMat() to create the dataframe to run OutFlank but it seems in only works when the input is smaller or equal to 10,000 SNPs.

When the input is 10,000 SNPs or smaller it works perfectly:

pos<-read.table("test10000.pos",header=FALSE)

PopNames<-read.table("test10000.pop",header=FALSE) SNPmat<-read.table("test10000.0129", header=FALSE) LocusNames<-mapply(function(x,y) paste(x,y,sep=":"), pos$V1, pos$V2) FSTMat<-MakeDiploidFSTMat(SNPmat,LocusNames,PopNames) Calculating FSTs, may take a few minutes... [1] "10000 done of 10000" Result<-OutFLANK(FstDataFrame=FSTMat, LeftTrimFraction=0.05, RightTrimFraction=0.05, Hmin=0.1, NumberOfSamples=2, qthreshold=0.05) head(Result) $FSTbar [1] 0.02457544

$FSTNoCorrbar [1] 0.04532807

$dfInferred [1] 2

$numberLowFstOutliers [1] 0

$numberHighFstOutliers [1] 0

$results LocusName He FST T1 T2 FSTNoCorr 1 2L:530008 NA NA NA NA NA 2 2L:530022 0.13265306 -2.286908e-02 -1.515983e-03 0.06628965 1.566891e-03 ... However, when the input matrix is larger than 10,000 SNPs, it dies with a "Execution halted"...

pos<-read.table("test100000.pos",header=FALSE) PopNames<-read.table("test100000.pop",header=FALSE) SNPmat<-read.table("test100000.0129", header=FALSE) LocusNames<-mapply(function(x,y) paste(x,y,sep=":"), pos$V1, pos$V2) FSTMat<-MakeDiploidFSTMat(SNPmat,LocusNames,PopNames) Calculating FSTs, may take a few minutes... [1] "10000 done of 100000" Error in Sample_Mat[, 1] : subscript out of bounds Calls: MakeDiploidFSTMat ... unlist -> getFSTs_diploids -> WC_FST_Diploids_2Alleles Execution halted Have you encountered a similar problem? Do you know how could I solve it?

Thanks!

— Reply to this email directly or view it on GitHub.

whitlock commented 9 years ago

I reckon that you sent this to them separately? Because I was the only one on the address line. Figured I would flag it just in case...

Michael Whitlock Department of Zoology University of British Columbia 6270 University Blvd. Vancouver, BC V6T1Z4 Canada

On Nov 20, 2014, at 7:54 AM, Katie Lotterhos notifications@github.com wrote:

The program works with input files of any size, but there was an error in the printing (i.e. it only printed "10,000" once).

I suspect your other error

Error in Sample_Mat[, 1] : subscript out of bounds was due to a problem with a locus in your input file. Please make sure that you don't have any fixed genotypes in your input file. If you find the problem locus, you can send us the allele frequencies and we'll check it out.

Katie


Dr. K.E. Lotterhos Assistant Professor, Biology Department, Wake Forest University, Box 7325 Reynolda Station, Winston-Salem, NC 27109 http://college.wfu.edu/biology/people/faculty/lotterhos/ Office: 336-758-3157


On Nov 20, 2014, at 6:39 AM, lidiaiprou wrote:

Dear OutFLANK developers and users, I am trying to use OutFlank to find Fst outliers in a set of 3,749,072 SNPs. I am trying to use the function MakeDiploidFSTMat() to create the dataframe to run OutFlank but it seems in only works when the input is smaller or equal to 10,000 SNPs.

When the input is 10,000 SNPs or smaller it works perfectly:

pos<-read.table("test10000.pos",header=FALSE)

PopNames<-read.table("test10000.pop",header=FALSE) SNPmat<-read.table("test10000.0129", header=FALSE) LocusNames<-mapply(function(x,y) paste(x,y,sep=":"), pos$V1, pos$V2) FSTMat<-MakeDiploidFSTMat(SNPmat,LocusNames,PopNames) Calculating FSTs, may take a few minutes... [1] "10000 done of 10000" Result<-OutFLANK(FstDataFrame=FSTMat, LeftTrimFraction=0.05, RightTrimFraction=0.05, Hmin=0.1, NumberOfSamples=2, qthreshold=0.05) head(Result) $FSTbar [1] 0.02457544

$FSTNoCorrbar [1] 0.04532807

$dfInferred [1] 2

$numberLowFstOutliers [1] 0

$numberHighFstOutliers [1] 0

$results LocusName He FST T1 T2 FSTNoCorr 1 2L:530008 NA NA NA NA NA 2 2L:530022 0.13265306 -2.286908e-02 -1.515983e-03 0.06628965 1.566891e-03 ... However, when the input matrix is larger than 10,000 SNPs, it dies with a "Execution halted"...

pos<-read.table("test100000.pos",header=FALSE) PopNames<-read.table("test100000.pop",header=FALSE) SNPmat<-read.table("test100000.0129", header=FALSE) LocusNames<-mapply(function(x,y) paste(x,y,sep=":"), pos$V1, pos$V2) FSTMat<-MakeDiploidFSTMat(SNPmat,LocusNames,PopNames) Calculating FSTs, may take a few minutes... [1] "10000 done of 100000" Error in Sample_Mat[, 1] : subscript out of bounds Calls: MakeDiploidFSTMat ... unlist -> getFSTs_diploids -> WC_FST_Diploids_2Alleles Execution halted Have you encountered a similar problem? Do you know how could I solve it?

Thanks!

— Reply to this email directly or view it on GitHub.

— Reply to this email directly or view it on GitHub.

lidiaiprou commented 9 years ago

Dear developers and OutFLANK users, As you mentioned, the problem is not related to input size. As you pointed out, the problem came from the genotypes themselves. After spending some time trying to find the problem locus, I decided to try some toy examples using your code and I think I have found a problematic situation that was causing errors in my analysis. The problem comes when the allele frequencies are exactly equal in the populations you are analyzing (yes, it might be an unusual scenario... but it happens!). When the allele frequencies are exactly the same in all the populations, the sum of squares is equal to 0 and the function WC_FST_Diploids_2Alleles returns 0 and breaks the execution so that the remaining positions are not analyzed. I don't know if this is the behavior you expected. To bypass the problem I have substituted the if(s2==0){return(0); break} by this if(s2==0){return (list(He=NA,FST=NA, T1=NA, T2=NA,FSTNoCorr=NA, T1NoCorr=NA, T2NoCorr=NA,meanAlleleFreq = NA))}. I am not sure this is the best option but at least the script should keep on analyzing the rest of the file and I would only lose information about one position. I understand having s2==0 is a problem because to calculate the non corrected Fst you would have a division by 0 error... but may be you could find a better way to handle this error... I leave here the code for the toy example.

# This is your function with the small proposed modification

WC_FST_Diploids_2Alleles<-function(Sample_Mat){
  ##Calculate both Fst and Fst NoCorr at the same time, from WC84
  #Sample Mat has three columns (homo_p,m heterozygotes, and homo_q) and a row for each population

  sample_sizes = rowSums(Sample_Mat)
  n_ave = mean(sample_sizes)
  n_pops = nrow(Sample_Mat) #r
  r = n_pops
  n_c = (n_pops*n_ave - sum(sample_sizes^2)/(n_pops*n_ave))/(n_pops-1)
  p_freqs = (Sample_Mat[,1] + Sample_Mat[,2]/2) /sample_sizes
  p_ave = sum(sample_sizes*p_freqs)/(n_ave*n_pops)

  s2 = sum(sample_sizes*(p_freqs - p_ave)^2)/((n_pops-1)*n_ave)
  #if(s2==0){return(0); break}  
  if(s2==0){return (list(He=NA,FST=NA, T1=NA, T2=NA,FSTNoCorr=NA, T1NoCorr=NA, T2NoCorr=NA,meanAlleleFreq = NA))}
  h_freqs = Sample_Mat[,2]/sample_sizes
  h_ave = sum(sample_sizes*h_freqs)/(n_ave*n_pops)

  a <- n_ave/n_c*(s2 - 1/(n_ave-1)*(p_ave*(1-p_ave)-((r-1)/r)*s2-(1/4)*h_ave))

  b <- n_ave/(n_ave-1)*(p_ave*(1-p_ave) - (r-1)/r*s2 - (2*n_ave - 1)/(4*n_ave)*h_ave)

  c <- 1/2*h_ave

  aNoCorr <- n_ave/n_c*(s2)

  bNoCorr <- (p_ave*(1-p_ave) - (r-1)/r*s2 - (2*n_ave)/(4*n_ave)*h_ave)

  cNoCorr <- 1/2*h_ave

  He <- 1-sum(p_ave^2, (1-p_ave)^2)

  FST <- a/(a+b+c) 
  FSTNoCorr = aNoCorr/(aNoCorr+bNoCorr+cNoCorr)
  return(list(He=He,FST=FST, T1=a, T2=(a+b+c),FSTNoCorr=FSTNoCorr, T1NoCorr=aNoCorr, T2NoCorr=(aNoCorr+bNoCorr+cNoCorr),meanAlleleFreq = p_ave))
}

# let's imagine we have two populations (Eur_It and Eur_Sw) with 5 individuals each.
popnames<-c("Eur_It","Eur_It","Eur_It","Eur_It","Eur_It","Eur_Sw","Eur_Sw","Eur_Sw","Eur_Sw","Eur_Sw")

# here you can try several scenarios:
snpDataColumn<-c(0,1,1,0,0,0,0,0,1,1) #NO DIFFERENTIATION WITH SAME ALLELE FREQUENCIES
#snpDataColumn<-c(2,2,2,2,2,0,0,1,1,0) # moderate differentiation
#snpDataColumn<-c(2,2,2,2,2,0,0,0,0,0) # maximum differentiation

snpDataTemp=snpDataColumn[snpDataColumn!=9]
popNameTemp=popnames[which(snpDataColumn!=9)]
HetCounts <- tapply(snpDataTemp, list(popNameTemp,snpDataTemp), length)
HetCounts[is.na(HetCounts)] = 0

if(dim(HetCounts)[2]==2){
  if(paste(colnames(HetCounts),collapse="")=="01"){HetCounts=cbind(HetCounts,"2"=0)}
  if(paste(colnames(HetCounts),collapse="")=="12"){HetCounts=cbind("0"=0,HetCounts)} 
  if(paste(colnames(HetCounts),collapse="")=="02"){HetCounts=cbind(HetCounts[,1],"1"=0, HetCounts[,2])}
}

Sample_Mat<-HetCounts

sample_sizes = rowSums(HetCounts)
n_ave = mean(sample_sizes)
n_pops = nrow(HetCounts) #r
r = n_pops
n_c = (n_pops*n_ave - sum(sample_sizes^2)/(n_pops*n_ave))/(n_pops-1)
p_freqs = (Sample_Mat[,1] + Sample_Mat[,2]/2) /sample_sizes
p_ave = sum(sample_sizes*p_freqs)/(n_ave*n_pops)
s2 = sum(sample_sizes*(p_freqs - p_ave)^2)/((n_pops-1)*n_ave)
h_freqs = Sample_Mat[,2]/sample_sizes
h_ave = sum(sample_sizes*h_freqs)/(n_ave*n_pops)

a <- n_ave/n_c*(s2 - 1/(n_ave-1)*(p_ave*(1-p_ave)-((r-1)/r)*s2-(1/4)*h_ave))

b <- n_ave/(n_ave-1)*(p_ave*(1-p_ave) - (r-1)/r*s2 - (2*n_ave - 1)/(4*n_ave)*h_ave)

c <- 1/2*h_ave

FST <- a/(a+b+c)

OutFLANK<-WC_FST_Diploids_2Alleles(Sample_Mat)
emilya21 commented 4 years ago

Hi, I am having the same issue as the author above: Error in Sample_Mat[, 1] : subscript out of bounds

I have a very large dataset of over 10 million SNPs so I have been unable to locate where the problem is. Can you suggest some way of identifying where the problem might be or what to look for? I do not think that I have any fixed genotypes. Additionally, if I am having the same issue as above, that some allele frequencies are exactly equal in my populations, has this issue been fixed? Thank you!

DrK-Lo commented 4 years ago

Hi Emily, The code within the Outflank functions is relatively straightforward to run line by line, which should help you isolate the issue. OutFLANK assumes that the user has already filtered their data appropriately. On May 27, 2020, 2:54 PM -0400, Emily Abernathy notifications@github.com, wrote:

Hi, I am having the same issue as the author above: Error in Sample_Mat[, 1] : subscript out of bounds I have a very large dataset of over 10 million SNPs so I have been unable to locate where the problem is. Can you suggest some way of identifying where the problem might be or what to look for? I do not think that I have any fixed genotypes. Additionally, if I am having the same issue as above, that some allele frequencies are exactly equal in my populations, has this issue been fixed? Thank you! — You are receiving this because you commented. Reply to this email directly, view it on GitHub, or unsubscribe.

rturba commented 2 years ago

Hello, I'm also running into the same issues, not sure why. I've followed the recommendations on cleaning the data:

subG <- subset(tG, rownames(tG) %in% genoIDs)

identical(rownames(subG), as.character(submeta$sample_id_old)) TRUE

my_fst <- MakeDiploidFSTMat(subG, locusNames = 1:ncol(subG), popNames = submeta$locality_short) Calculating FSTs, may take a few minutes... Error in Sample_Mat[, 1] : subscript out of bounds


I have also tried editing the function `WC_FST_Diploids_2Alleles` as mentioned here and I still get the same thing. Does anyone have any insight what might be wrong?