zeeev / vcflib

a simple C++ library for parsing and manipulating VCF files, + many command-line utilities
https://github.com/ekg/vcflib#vcflib
MIT License
19 stars 6 forks source link

Segmentation fault #6

Closed bibb closed 9 years ago

bibb commented 9 years ago

Hello,

I'm usign GPAT++ pFST function to get Fst Pvalues across populations. I'have followed the instructions for installation and then used the program but an error ocurrs:

INFO: There are 11 individuals in the target INFO: target ids: 0,1,2,3,4,5,6,7,8,9,10 INFO: There are 14 individuals in the background INFO: background ids: 406,407,408,409,410,411,414,415,416,417,418,419,420,421 INFO: File: myvcf.vcf.gz INFO: genotype likelihoods set to: PL Segmentation fault (core dumped)

My SO is Ubuntu 14.10 64-bits

Hope you can help me

Best regards,

--Bernabé

zeeev commented 9 years ago

Bernabé,

Happy to help. I will need some additional information from you. Can you show me the command line you ran? Does the code produce any output besides the SDTERR?

bibb commented 9 years ago

Hi, Thanks for the quick answer:

This is the command line I'm using:

$ vcflib/bin/pFst --target 0,1,2,3,4,5,6,7,8,9,10 --background 406,407,408,409,410,411,414,415,416,417,418,419,420,421 --file myvcf.vcf.gz --type PL

The program do not create any kind of output, just stopped with the segmentation fault error.

zeeev commented 9 years ago

Can you provide me a line of your VCF file or a minimum test example that generates the error? What variant caller did you use?

Thank you for reporting this issue by the way.

bibb commented 9 years ago

I think I've found the problem, I'm using a VCF file generated by PLINK1.9 recode option, using bim, bed and fam files. The likelihoods flags GP, GL or PL are not present. I tried to use just genotipe counts with the --counts options but the program still requires a Genotype likelihood parameter.

There is an example line of m vcf:

fileformat=VCFv4.2

fileDate=20150410

source=PLINKv1.90

INFO=ID=PR,Number=0,Type=Flag,Description="Provisional reference allele, may not be based on real reference genome"

FORMAT=ID=GT,Number=1,Type=String,Description="Genotype"

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT I1 I2 I3 I4

1 57952 rs189727433 C A . . PR GT 0/0 0/0 0/0 1 69511 rs75062661 G A . . PR GT 0/0 0/0 0/0 1 86018 rs142878000 C G . . PR GT ./. 0/0 0/0 1 356537 rs183318141 G A . . PR GT 1/1 1/1 1/1 1 713977 rs74512038 C T . . PR GT 0/0 0/0 0/0 1 752721 rs3131972 G A . . PR GT 0/0 0/0 0/0 1 754334 rs3131967 C T . . PR GT 0/0 0/0 0/0 1 754503 rs3115859 A G . . PR GT 0/0 0/0 0/0 1 754894 rs184597775 G A . . PR GT 0/0 0/1 0/0 1 754964 rs3131966 T C . . PR GT 0/0 0/0 0/0 1 755274 rs78408995 C T . . PR GT 0/0 0/0 0/0

I've calculated pairwise Weir and Cockerham Fst values for each SNP in PLINK in order to detect divergent locus. In this paper (http://dx.doi.org/10.1038/ng.78) they use a Fst cutoff of 0.65 to detetec significantly divergent SNPs, but reading other articles, is more common to do a statistical approach and calculate p values according to empirical Fst distributions to detect outliers, that's why I wanted to use your program. But I think maybe I'm not gonna be able to use it with this kind of input. Do you know other options to do this?. Thanks again.

zeeev commented 9 years ago

Ahh. The option "--type GT" should work for you. pFst is a test for allele frequency differentiation between two populations. If you want a strictly Fst based method try "wcFst" and bootstrap the output.

bibb commented 9 years ago

Thanks again, I really appreiciate your help!. I'm running the program with that option, it's giving output already. So pFst is not an actual Fixation index statistic like WC ?, it's like a chi-square statistic?. And the other thing, I'm introducing myself to this kind of analysis, when you say boostrap the output is to manually do permutations? is there a way to do this by a software?.

zeeev commented 9 years ago

You are correct, it is not a fixation index as it does not use heterozygosity. A good paper that describes the gist of pFst is :

"Design of association studies with pooled or un-pooled next-generation sequencing data."

So here is my typical workflow:

run wcFst. plot the output using the GPAT++ R scripts. run permutations to assign an emperical p-value to the Weir and Cockerham 1984 Fst values. I haven't pushed the the permutation code into GPAT++, but if you are interested I can do that next week.

If you have a candidate region you can also run "bFst", which will give you a confidence interval around the Fst estimate. WARNING: the code is very slow as it has to do MCMC.

bibb commented 9 years ago

That's very clear, thanks a lot. It would be great if you can push the permutation code to GPAT++, It will be very useful to detect truly FST outliers, thus making a more accurate biological interpretation of the results. I'll definitely look forward to that upgrade.

Thanks again, have a great weekend! Cheers

Bernabé

bibb commented 9 years ago

Hello! I wanted to know if you have any news about implementing the boostraping option to GPAT++ as you said last week?

Cheers!

Bernabé

zeeev commented 9 years ago

I'm almost finished. The code will be ready by Monday. Thank you for reminding me.

bibb commented 9 years ago

Thank you, I'll look forward to it!

bibb commented 9 years ago

Hello, I was reading about the outlier methods for detecting highly divergent SNPs and I've noticed the 95th percentile method following an empirical distribution is one of the most used one. Is this the method you are applying to GPAT?

zeeev commented 9 years ago

Greetings,

The code is in place: https://github.com/jewmanchue/vcflib/wiki/permutation

You will need to re-install GPAT.

The method I implemented reports the "percentile - or - empirical p-value".

bibb commented 9 years ago

Thank you very much!, I'll will try it right away. So the input file is the output file from wcFST right?

zeeev commented 9 years ago

Yes. I will update the code and documents after I defend my thesis this Thursday.

bibb commented 9 years ago

Thanks again and good luck!

bibb commented 9 years ago

I'm downloading and reinstalling everything but this error appears: make: *\ No rule to make target 'src/wham-burden.cpp', needed by 'bin/vcfecho'. Stop. That file "wham-burden.cpp" is no more present to complie?

zeeev commented 9 years ago

sorry. Try now.

bibb commented 9 years ago

Thanks, but this appears now

Makefile:172: recipe for target 'bin/iHS' failed make: *\ [bin/iHS] Error 1

zeeev commented 9 years ago

I'm not able to replicate this bug. Please try a clean install:

rm -rf vcflib

git clone --recursive https://github.com/jewmanchue/vcflib.git vcflib

cd vcflib

make

bibb commented 9 years ago

Yes, I did it again, this is the whole Error output...

src/iHS.cpp:126:18: error: call of overloaded ‘isnan(double&)’ is ambiguous if(isnan(iHSA) || isnan(iHSR)){ ^ src/iHS.cpp:126:18: note: candidates are: In file included from /usr/include/features.h:374:0, from /usr/include/x86_64-linux-gnu/c++/4.9/bits/os_defines.h:39, from /usr/include/x86_64-linux-gnu/c++/4.9/bits/c++config.h:430, from /usr/include/c++/4.9/bits/stl_algobase.h:59, from /usr/include/c++/4.9/vector:60, from src/Variant.h:4, from src/iHS.cpp:1: /usr/include/x86_64-linux-gnu/bits/mathcalls.h:234:1: note: int isnan(double) MATHDECL_1 (int,isnan,, (Mdouble value)) attribute ((const)); ^ In file included from /usr/include/c++/4.9/random:38:0, from /usr/include/c++/4.9/bits/stl_algo.h:66, from /usr/include/c++/4.9/algorithm:62, from ./smithwaterman/SmithWatermanGotoh.h:4, from src/Variant.h:19, from src/iHS.cpp:1: /usr/include/c++/4.9/cmath:626:3: note: constexpr bool std::isnan(long double) isnan(long double x) ^ /usr/include/c++/4.9/cmath:622:3: note: constexpr bool std::isnan(double) isnan(double x) ^ /usr/include/c++/4.9/cmath:618:3: note: constexpr bool std::isnan(float) isnan(float x) ^ src/iHS.cpp:126:33: error: call of overloaded ‘isnan(double&)’ is ambiguous if(isnan(iHSA) || isnan(iHSR)){ ^ src/iHS.cpp:126:33: note: candidates are: In file included from /usr/include/features.h:374:0, from /usr/include/x86_64-linux-gnu/c++/4.9/bits/os_defines.h:39, from /usr/include/x86_64-linux-gnu/c++/4.9/bits/c++config.h:430, from /usr/include/c++/4.9/bits/stl_algobase.h:59, from /usr/include/c++/4.9/vector:60, from src/Variant.h:4, from src/iHS.cpp:1: /usr/include/x86_64-linux-gnu/bits/mathcalls.h:234:1: note: int isnan(double) __MATHDECL_1 (int,isnan,, (Mdouble value)) attribute ((const)); ^ In file included from /usr/include/c++/4.9/random:38:0, from /usr/include/c++/4.9/bits/stl_algo.h:66, from /usr/include/c++/4.9/algorithm:62, from ./smithwaterman/SmithWatermanGotoh.h:4, from src/Variant.h:19, from src/iHS.cpp:1: /usr/include/c++/4.9/cmath:626:3: note: constexpr bool std::isnan(long double) isnan(long double x) ^ /usr/include/c++/4.9/cmath:622:3: note: constexpr bool std::isnan(double) isnan(double x) ^ /usr/include/c++/4.9/cmath:618:3: note: constexpr bool std::isnan(float) isnan(float __x) ^ Makefile:172: recipe for target 'bin/iHS' failed make: *\ [bin/iHS] Error 1

zeeev commented 9 years ago

"should" be fixed now.

bibb commented 9 years ago

Thanks again!, Its working now. One last thing, I performed 1000 permutations on my wcFST output, Just to make sure that I'm interpreting right the results:

1 12354 0.0454545 0.142857 0.000859343 257 1000 0.257 1 12356 0.3636 0.3928 -0.0439 10 1000 0.01

The first result, column 5 shows a Fst value of 0.000085 then 1000 permutations with 257 successes, then yielding an empirical p value of 0.257. The second SNP has an empirical p-value of 0.031. When in papers they say:

"We calculated the global FST for each single SNP among the three populations. For autosomal SNPs, we considered a candidate selection if the SNP FST >= 0.58, which corresponds to a genome-wide empirical significance level of alpha 0.01."

Then, it means that if I want to emulate this in my data, I have to select all values from 0.01 and below?. How this empirical p value is matched with a Fst value like the example?.

zeeev commented 9 years ago

yes, you want empirical p-values below 0.01 or the top 1% of all Fst values.

bibb commented 9 years ago

Ok, filtering all empirical p values by 0.01 and below wild yield a range of FST values from low to high, not only from a certain threshold like the example citation. Maybe I'm getting this wrong. I need to apply two filters?, first by the empirical p value and then the upper 95th percentile of the filtered Fst to see the most divergent ones?.

zeeev commented 9 years ago

Thanks for all the feedback.

I was doing the inverse permutation. The code is updated and here is a plot vetting the code: screen shot 2015-04-20 at 1 44 14 pm

zeeev commented 9 years ago

Please let me know if the permutation test is working as you desired.

--Zev

bibb commented 9 years ago

Thank you very much Zev for implementing the permutation tool. I ran the tool several times now and I was thinking about the method of getting the empirical probability. The plots I'm having follows almost the same distribution shape of the plot you showed above. My question is that there is no high Fst values with poor pvalues, I mean, Fst values > 0.6 and I was thinking there sould be some snps with high Fst values but with low empirical palue not passing the 95th percentile treshold which I'm currently applying to get the significant results. I'm I correct with that thinking?.

Just to make myself clear with example:

Freq1 Freq2 S N P 0.8 0.1 321 10000 0.0321

The example says that doing 10.000 permutations on the frequency values of Freq1 and Freq2 I got 321 times the same Fst number as obtained with the fixed Freq values?.

zeeev commented 9 years ago

Would you mind sharing a plot? Just drop the image in the text box.

bibb commented 9 years ago

fst distribution

bibb commented 9 years ago

Maybe I'm just having troubles interpreting the results? With this plot, I made a cutoff at the 95th percentile of the empirical distribution (as is used in papers), applying that criteria to this plot there should be a horizontal line a little bit avobe the value 1, which is going to represent the p value of 0.05 (95th upper threshold), so when that line touches the points when the Fst values in X axis is 0.42, so thats is my Fst threshold to detect significantly divergent snps. That's the criteria I've been using to select variants.

zeeev commented 9 years ago

Would you like to arrange a video call? It might be easiest.

bibb commented 9 years ago

Off course, we can use google talk, my email is bibb77@gmail.com.

bibb commented 9 years ago

Hello Zev. How can I cite your tools in a publication?

zeeev commented 9 years ago

For now please cite the github page. If you tell me which tools you used I can give you the appropriate citations.

also please feel free to contact me @ zev.kronenberg@gmail.com

bibb commented 9 years ago

I have used wcFST and permuteGPAT++. I would like to cite both, Thanks.

acardilini commented 9 years ago

G'day Zev and Bibb,

I am currently implementing wcFst and then running the new permute function and everything is working well. This conversation has been extremely helpful for me to be able to use both wcFst and permute, however, I am having an issue interpreting the results as bibb expressed a couple of weeks ago.

I get extremely similar plots to the ones Bibb posted and am wondering whether it is appropriate to simply take those points that fall below an empirical p-value of 0.01 as being significantly divergent SNPs (or -log10(p) >= 2).

I ran 10,000 permutations in wcFst and have plotted the wcFst results from several different data treatments.

plot_wcfst

Thank you for your help.

Regards, Adam

zeeev commented 9 years ago

Adam,

I'm happy to hear that wcFst and the permutation are working for you. That is really nice graph. The catch with Fst is it is only a random variable. Permutations will give you the empirical probability of each Fst value within a single dataset. Comparing empirical probabilities across multiple treatments is not valid. Have a look at (http://en.wikipedia.org/wiki/Empirical_probability)

I'm not sure what system you are working in, but I usually like to see a correlation between Fst values and genomic positions. Do you see regions with elevated Fst values? If there is a lot of noise, smoothing might be helpful (GPAT++ : smoother). I could also build a permutation test for regions if that would be useful.

--Zev

zeeev commented 9 years ago

bib - here are the citations:

Weir, B.S., and Cockerham, C.C. (1984). Estimating F-statistics for the analysis of population structure. Evolution (N. Y). 1358–1370.

how to cite a github repo: http://academia.stackexchange.com/questions/14010/how-do-you-cite-a-github-repository

acardilini commented 9 years ago

G'day Zev,

Thanks for the response and advice, I hadn't intending to compare probabilities across treatments, it was just easier to include everything in the same graph. If I am understanding correctly, the empirical probability is simply an indication of how likely you are to get the Fst value (which is subject to variation due to chance) for a given SNP. So basically, I can be more confident in the Fst value of SNPs with low empirical probabilities.

I am using genotype-by-sequencing SNP data with no reference genome so unfortunately I don't have any information about genomic position. It's not ideal...

Thanks again, Adam

bibb commented 8 years ago

Hi Zev,

I'm coming back to this analysis after a while. I'm having troubles understanding the permutation procedure of permuteGPAT++ (I'm not an statistician). I know that an empirical probability is the chance of finding a value within the data (as the software documentation says), so then what is being used in the permutation? I know that the software is not permuting individuals (e.g. target-background population individuals), so what is used?, what do you mean with number of trials and successes?. I would really appreciate your help in this, I'll look forward to your answer.

Cheers

Bernabe

zeeev commented 8 years ago

@bibb lets say you've run wcFst and then permuteGPAT++. The output is the empirical probability of observing a value greater than the input.

example:

Observed Fst value : 0.7

random sample from the same dataset: 0.1 random sample from the same dataset: 0.75 random sample from the same dataset: 0.2

The empirical probability is 1/3.

The number of successes is 1.

The number of trials is 3.

If you want to shuffle individuals you will have to write small script to randomly subset the target and background. I like the idea of shuffling the individuals, i will add it to the list of thing to implement.

sorry for the late reply.

--Zev