Closed bibb closed 9 years ago
Bernabé,
Happy to help. I will need some additional information from you. Can you show me the command line you ran? Does the code produce any output besides the SDTERR?
Hi, Thanks for the quick answer:
This is the command line I'm using:
$ vcflib/bin/pFst --target 0,1,2,3,4,5,6,7,8,9,10 --background 406,407,408,409,410,411,414,415,416,417,418,419,420,421 --file myvcf.vcf.gz --type PL
The program do not create any kind of output, just stopped with the segmentation fault error.
Can you provide me a line of your VCF file or a minimum test example that generates the error? What variant caller did you use?
Thank you for reporting this issue by the way.
I think I've found the problem, I'm using a VCF file generated by PLINK1.9 recode option, using bim, bed and fam files. The likelihoods flags GP, GL or PL are not present. I tried to use just genotipe counts with the --counts options but the program still requires a Genotype likelihood parameter.
There is an example line of m vcf:
1 57952 rs189727433 C A . . PR GT 0/0 0/0 0/0 1 69511 rs75062661 G A . . PR GT 0/0 0/0 0/0 1 86018 rs142878000 C G . . PR GT ./. 0/0 0/0 1 356537 rs183318141 G A . . PR GT 1/1 1/1 1/1 1 713977 rs74512038 C T . . PR GT 0/0 0/0 0/0 1 752721 rs3131972 G A . . PR GT 0/0 0/0 0/0 1 754334 rs3131967 C T . . PR GT 0/0 0/0 0/0 1 754503 rs3115859 A G . . PR GT 0/0 0/0 0/0 1 754894 rs184597775 G A . . PR GT 0/0 0/1 0/0 1 754964 rs3131966 T C . . PR GT 0/0 0/0 0/0 1 755274 rs78408995 C T . . PR GT 0/0 0/0 0/0
I've calculated pairwise Weir and Cockerham Fst values for each SNP in PLINK in order to detect divergent locus. In this paper (http://dx.doi.org/10.1038/ng.78) they use a Fst cutoff of 0.65 to detetec significantly divergent SNPs, but reading other articles, is more common to do a statistical approach and calculate p values according to empirical Fst distributions to detect outliers, that's why I wanted to use your program. But I think maybe I'm not gonna be able to use it with this kind of input. Do you know other options to do this?. Thanks again.
Ahh. The option "--type GT" should work for you. pFst is a test for allele frequency differentiation between two populations. If you want a strictly Fst based method try "wcFst" and bootstrap the output.
Thanks again, I really appreiciate your help!. I'm running the program with that option, it's giving output already. So pFst is not an actual Fixation index statistic like WC ?, it's like a chi-square statistic?. And the other thing, I'm introducing myself to this kind of analysis, when you say boostrap the output is to manually do permutations? is there a way to do this by a software?.
You are correct, it is not a fixation index as it does not use heterozygosity. A good paper that describes the gist of pFst is :
"Design of association studies with pooled or un-pooled next-generation sequencing data."
So here is my typical workflow:
run wcFst. plot the output using the GPAT++ R scripts. run permutations to assign an emperical p-value to the Weir and Cockerham 1984 Fst values. I haven't pushed the the permutation code into GPAT++, but if you are interested I can do that next week.
If you have a candidate region you can also run "bFst", which will give you a confidence interval around the Fst estimate. WARNING: the code is very slow as it has to do MCMC.
That's very clear, thanks a lot. It would be great if you can push the permutation code to GPAT++, It will be very useful to detect truly FST outliers, thus making a more accurate biological interpretation of the results. I'll definitely look forward to that upgrade.
Thanks again, have a great weekend! Cheers
Bernabé
Hello! I wanted to know if you have any news about implementing the boostraping option to GPAT++ as you said last week?
Cheers!
Bernabé
I'm almost finished. The code will be ready by Monday. Thank you for reminding me.
Thank you, I'll look forward to it!
Hello, I was reading about the outlier methods for detecting highly divergent SNPs and I've noticed the 95th percentile method following an empirical distribution is one of the most used one. Is this the method you are applying to GPAT?
Greetings,
The code is in place: https://github.com/jewmanchue/vcflib/wiki/permutation
You will need to re-install GPAT.
The method I implemented reports the "percentile - or - empirical p-value".
Thank you very much!, I'll will try it right away. So the input file is the output file from wcFST right?
Yes. I will update the code and documents after I defend my thesis this Thursday.
Thanks again and good luck!
I'm downloading and reinstalling everything but this error appears: make: *\ No rule to make target 'src/wham-burden.cpp', needed by 'bin/vcfecho'. Stop. That file "wham-burden.cpp" is no more present to complie?
sorry. Try now.
Thanks, but this appears now
Makefile:172: recipe for target 'bin/iHS' failed make: *\ [bin/iHS] Error 1
I'm not able to replicate this bug. Please try a clean install:
rm -rf vcflib
git clone --recursive https://github.com/jewmanchue/vcflib.git vcflib
cd vcflib
make
Yes, I did it again, this is the whole Error output...
src/iHS.cpp:126:18: error: call of overloaded ‘isnan(double&)’ is ambiguous if(isnan(iHSA) || isnan(iHSR)){ ^ src/iHS.cpp:126:18: note: candidates are: In file included from /usr/include/features.h:374:0, from /usr/include/x86_64-linux-gnu/c++/4.9/bits/os_defines.h:39, from /usr/include/x86_64-linux-gnu/c++/4.9/bits/c++config.h:430, from /usr/include/c++/4.9/bits/stl_algobase.h:59, from /usr/include/c++/4.9/vector:60, from src/Variant.h:4, from src/iHS.cpp:1: /usr/include/x86_64-linux-gnu/bits/mathcalls.h:234:1: note: int isnan(double) MATHDECL_1 (int,isnan,, (Mdouble value)) attribute ((const)); ^ In file included from /usr/include/c++/4.9/random:38:0, from /usr/include/c++/4.9/bits/stl_algo.h:66, from /usr/include/c++/4.9/algorithm:62, from ./smithwaterman/SmithWatermanGotoh.h:4, from src/Variant.h:19, from src/iHS.cpp:1: /usr/include/c++/4.9/cmath:626:3: note: constexpr bool std::isnan(long double) isnan(long double x) ^ /usr/include/c++/4.9/cmath:622:3: note: constexpr bool std::isnan(double) isnan(double x) ^ /usr/include/c++/4.9/cmath:618:3: note: constexpr bool std::isnan(float) isnan(float x) ^ src/iHS.cpp:126:33: error: call of overloaded ‘isnan(double&)’ is ambiguous if(isnan(iHSA) || isnan(iHSR)){ ^ src/iHS.cpp:126:33: note: candidates are: In file included from /usr/include/features.h:374:0, from /usr/include/x86_64-linux-gnu/c++/4.9/bits/os_defines.h:39, from /usr/include/x86_64-linux-gnu/c++/4.9/bits/c++config.h:430, from /usr/include/c++/4.9/bits/stl_algobase.h:59, from /usr/include/c++/4.9/vector:60, from src/Variant.h:4, from src/iHS.cpp:1: /usr/include/x86_64-linux-gnu/bits/mathcalls.h:234:1: note: int isnan(double) __MATHDECL_1 (int,isnan,, (Mdouble value)) attribute ((const)); ^ In file included from /usr/include/c++/4.9/random:38:0, from /usr/include/c++/4.9/bits/stl_algo.h:66, from /usr/include/c++/4.9/algorithm:62, from ./smithwaterman/SmithWatermanGotoh.h:4, from src/Variant.h:19, from src/iHS.cpp:1: /usr/include/c++/4.9/cmath:626:3: note: constexpr bool std::isnan(long double) isnan(long double x) ^ /usr/include/c++/4.9/cmath:622:3: note: constexpr bool std::isnan(double) isnan(double x) ^ /usr/include/c++/4.9/cmath:618:3: note: constexpr bool std::isnan(float) isnan(float __x) ^ Makefile:172: recipe for target 'bin/iHS' failed make: *\ [bin/iHS] Error 1
"should" be fixed now.
Thanks again!, Its working now. One last thing, I performed 1000 permutations on my wcFST output, Just to make sure that I'm interpreting right the results:
1 12354 0.0454545 0.142857 0.000859343 257 1000 0.257 1 12356 0.3636 0.3928 -0.0439 10 1000 0.01
The first result, column 5 shows a Fst value of 0.000085 then 1000 permutations with 257 successes, then yielding an empirical p value of 0.257. The second SNP has an empirical p-value of 0.031. When in papers they say:
"We calculated the global FST for each single SNP among the three populations. For autosomal SNPs, we considered a candidate selection if the SNP FST >= 0.58, which corresponds to a genome-wide empirical significance level of alpha 0.01."
Then, it means that if I want to emulate this in my data, I have to select all values from 0.01 and below?. How this empirical p value is matched with a Fst value like the example?.
yes, you want empirical p-values below 0.01 or the top 1% of all Fst values.
Ok, filtering all empirical p values by 0.01 and below wild yield a range of FST values from low to high, not only from a certain threshold like the example citation. Maybe I'm getting this wrong. I need to apply two filters?, first by the empirical p value and then the upper 95th percentile of the filtered Fst to see the most divergent ones?.
Thanks for all the feedback.
I was doing the inverse permutation. The code is updated and here is a plot vetting the code:
Please let me know if the permutation test is working as you desired.
--Zev
Thank you very much Zev for implementing the permutation tool. I ran the tool several times now and I was thinking about the method of getting the empirical probability. The plots I'm having follows almost the same distribution shape of the plot you showed above. My question is that there is no high Fst values with poor pvalues, I mean, Fst values > 0.6 and I was thinking there sould be some snps with high Fst values but with low empirical palue not passing the 95th percentile treshold which I'm currently applying to get the significant results. I'm I correct with that thinking?.
Just to make myself clear with example:
Freq1 Freq2 S N P 0.8 0.1 321 10000 0.0321
The example says that doing 10.000 permutations on the frequency values of Freq1 and Freq2 I got 321 times the same Fst number as obtained with the fixed Freq values?.
Would you mind sharing a plot? Just drop the image in the text box.
Maybe I'm just having troubles interpreting the results? With this plot, I made a cutoff at the 95th percentile of the empirical distribution (as is used in papers), applying that criteria to this plot there should be a horizontal line a little bit avobe the value 1, which is going to represent the p value of 0.05 (95th upper threshold), so when that line touches the points when the Fst values in X axis is 0.42, so thats is my Fst threshold to detect significantly divergent snps. That's the criteria I've been using to select variants.
Would you like to arrange a video call? It might be easiest.
Off course, we can use google talk, my email is bibb77@gmail.com.
Hello Zev. How can I cite your tools in a publication?
For now please cite the github page. If you tell me which tools you used I can give you the appropriate citations.
also please feel free to contact me @ zev.kronenberg@gmail.com
I have used wcFST and permuteGPAT++. I would like to cite both, Thanks.
G'day Zev and Bibb,
I am currently implementing wcFst and then running the new permute function and everything is working well. This conversation has been extremely helpful for me to be able to use both wcFst and permute, however, I am having an issue interpreting the results as bibb expressed a couple of weeks ago.
I get extremely similar plots to the ones Bibb posted and am wondering whether it is appropriate to simply take those points that fall below an empirical p-value of 0.01 as being significantly divergent SNPs (or -log10(p) >= 2).
I ran 10,000 permutations in wcFst and have plotted the wcFst results from several different data treatments.
Thank you for your help.
Regards, Adam
Adam,
I'm happy to hear that wcFst and the permutation are working for you. That is really nice graph. The catch with Fst is it is only a random variable. Permutations will give you the empirical probability of each Fst value within a single dataset. Comparing empirical probabilities across multiple treatments is not valid. Have a look at (http://en.wikipedia.org/wiki/Empirical_probability)
I'm not sure what system you are working in, but I usually like to see a correlation between Fst values and genomic positions. Do you see regions with elevated Fst values? If there is a lot of noise, smoothing might be helpful (GPAT++ : smoother). I could also build a permutation test for regions if that would be useful.
--Zev
bib - here are the citations:
Weir, B.S., and Cockerham, C.C. (1984). Estimating F-statistics for the analysis of population structure. Evolution (N. Y). 1358–1370.
how to cite a github repo: http://academia.stackexchange.com/questions/14010/how-do-you-cite-a-github-repository
G'day Zev,
Thanks for the response and advice, I hadn't intending to compare probabilities across treatments, it was just easier to include everything in the same graph. If I am understanding correctly, the empirical probability is simply an indication of how likely you are to get the Fst value (which is subject to variation due to chance) for a given SNP. So basically, I can be more confident in the Fst value of SNPs with low empirical probabilities.
I am using genotype-by-sequencing SNP data with no reference genome so unfortunately I don't have any information about genomic position. It's not ideal...
Thanks again, Adam
Hi Zev,
I'm coming back to this analysis after a while. I'm having troubles understanding the permutation procedure of permuteGPAT++ (I'm not an statistician). I know that an empirical probability is the chance of finding a value within the data (as the software documentation says), so then what is being used in the permutation? I know that the software is not permuting individuals (e.g. target-background population individuals), so what is used?, what do you mean with number of trials and successes?. I would really appreciate your help in this, I'll look forward to your answer.
Cheers
Bernabe
@bibb lets say you've run wcFst and then permuteGPAT++. The output is the empirical probability of observing a value greater than the input.
example:
Observed Fst value : 0.7
random sample from the same dataset: 0.1 random sample from the same dataset: 0.75 random sample from the same dataset: 0.2
The empirical probability is 1/3.
The number of successes is 1.
The number of trials is 3.
If you want to shuffle individuals you will have to write small script to randomly subset the target and background. I like the idea of shuffling the individuals, i will add it to the list of thing to implement.
sorry for the late reply.
--Zev
Hello,
I'm usign GPAT++ pFST function to get Fst Pvalues across populations. I'have followed the instructions for installation and then used the program but an error ocurrs:
INFO: There are 11 individuals in the target INFO: target ids: 0,1,2,3,4,5,6,7,8,9,10 INFO: There are 14 individuals in the background INFO: background ids: 406,407,408,409,410,411,414,415,416,417,418,419,420,421 INFO: File: myvcf.vcf.gz INFO: genotype likelihoods set to: PL Segmentation fault (core dumped)
My SO is Ubuntu 14.10 64-bits
Hope you can help me
Best regards,
--Bernabé