zmaroti / correctKin

GNU General Public License v3.0
2 stars 0 forks source link

running correctKIN with empty results #3

Closed gilibean closed 1 year ago

gilibean commented 1 year ago

I've been trying to run correctKIN on a group of 28 individuals, which share varying degrees of kinship (1sth - 6th degree), which have been identified with other tools.

Starting with a .vcf file with the genotypes of the individuals (in which all genotypes are already haploids), I have converted it to plink format, ran the markerOverlap tool to calculate the pairwise marker overlap fraction matrix. I have also ran pcangsd over the same genotype file in plink format to create a csv covariance matrix, which I then converted to the .npy format. These two matrices were then given as input to filterRelates tool and it outputted the corrected covariance matrix (containing the correct sample names), and also the summary statistics: bin ID min overlap frac max overlap frac median overlap frac mean (corr. coeff) SD (corr. coeff) 6.00 sigma (corr. coeff) 95% confidence (corr. coeff) bin size 0 0.035412 0.998000 0.726281 -0.094283 0.390245 2.341468 0.764880 378

however - for the kinship I only got the title line:

kin group ID1 ID2 uncorr. kin. coeff. overlap frac. corr. kin. coeff. 6.00 sigma threshold 95 conf (lower) 95 conf (upper) est. relatedness

Can you tell me what I did wrong? I am attaching the corrected covariance matrix and am happy to send you any of the input/output files and logs. 35neolethic.4beagleP.corr.tsv.txt

Thank you

zmaroti commented 1 year ago

Hi,

The PCRelate algorithm that is implemented in angsd package to estimate the uncorrected kinship coefficients is using PCA to to identify related and unrelated individuals and to regress out similarities due to IBS from the most appropriate reference subset of your data set to identify the most probable amount of IBD between the pair of tested samples. In case your data set contains mostly related people from a large family, it can happen that identification of unrelated and related subgroups in your data set is improper or there are not enough unrelated individuals to use as a reference. Which leads to unregressed IBS and falsely estimated kinship coefficients.

In the quickstart we described that the best result can be achieved when your test individuals are co-analyzed with unrelated public data, where the avegrage genome coverage and the included reference individual genetic makeup is similar to your test individuals. To find the best matching controls you could use fastNGSAdmix PCA, or prior knowledge to select a dozen of controls as a reference. The method is pretty much scalable, so unless you include largely different genome structure individuals as a reference there is no draback and it helps to regress true IBS better as PCRelate would automatically identify the best reference from your data set for the analyzed samples.

Since I had seen that the error model and the 6 sigma values are unexpectedly high and also your corrected kinship coefficient matrice contains very large negative values that should not happen in any normal circumstances even at very low coverage data it could happen that your problem here is the lack of unrelated reference included in your data set.

Also keep in mind that VCF files are not best for getting the proper genotypes at the 1240K positions. In case of modern fully typed data, especially from a JOINT vcf file containing a cohort of samples it is less of of an issue. However, based on your genotype call approach it can happen that the missing data (no read at the given position) and hom reference alleles are not emitted into your individual VCF files, as VCF is a reference based format where usually only variants compared to the reference are emitted. Thus the imported PLINK data set from such VCF files could be WRONG, especially for sparse aDNA genotypes. Unfortunately PLINK is not reference based but minor/major based thus importing, merging data into PLINK has to be done with keeping this in mind! Especially when the data is coming from different sources, where it can happen that without the --keep-allele-order option the minor/major alleles are flipped based on the actual minor allele of the included samples.

Without the actual data set, and the workflow/commands you used it is hard to tell what is the underlying problem.

I'd suggest, if the original BAM files or the 1240K eigenstrat/plink data sets are available for the given samples it is best to start from that using the guide in our quickstart document. Also as suggested above please merge some population/genome striucture matching unrelated individuals to your data set, so the PCArelate algorithm can regress out IBS best. Adding more reference data to your dataset also help to make a better error model for filterRelates.

gilibean commented 1 year ago

Thank you for your quick and elaborated reply. The dataset contains individuals from several different families, only partly related to each other. In addition, 7 individuals of the 28 are not related to anyone. So I would think there is enough 'unrelatedness' in the dataset to provide negative control for the algorithm. What you are saying about plink conversion from the .vcf is interesting. I will look more closely into that. Even though, I have converted .vcf to plink before, for example, for the KING algorithm, and this worked fine and provided expected results.

zmaroti commented 1 year ago

Looking through your description I am a bit puzzled about the kinship covariance matrix part. If you are using the version 0.99 pcangsd that is required to run the uncorrected kinship coefficient estimation, you will have your output as a numpy file already.

When you issue the command

python pcangsd.py -plink 35neolethic.4beagleP -o 35neolethic.4beagleP -inbreed 1 -kinship

(considering that your plink data set has a prefix of 35neolethic.4beagleP)

You should get two numpy file as an output: 35neolethic.4beagleP.kinship.npy and 35neolethic.4beagleP.inbreed.py

Next step is using filterRelates 35neolethic.4beagleP.overlap 35neolethic.4beagleP.kinship.py

So there should be no step to convert covariance matrices to a numpy format. Are you sure you are using the pcangsd 0.99 version with the proper options?

If you have still issues please contact me privately at the corresponding author email of our manuscript. Looking through your corrected kinship coefficients the values seems to be off. To improve our workflow it would be very useful for us to debug the cause and correct it or at least indicate at which scenario we have such trouble and what is the solution to fix it.

Thank you for advance, Zoltan

zmaroti commented 1 year ago

I am closing the issue as it turned out that wrong version of pcangsd was used and kinship coefficient estimation was not performed with the recommended way. As it turned out, user path installation of the required pcangsd version 0.99 is not trivial so I will update the quickstart to include the commands to build pcangsd locally:

NOTE: You need the developmental python package (python3-devel or python3-dev based on your OS).

wget https://github.com/Rosemeis/pcangsd/archive/refs/tags/v.0.99.tar.gz
tar xzf v.0.99.tar.gz
cd pcangsd-v.0.99
pip install --user -r requirements.txt
python setup.py build_ext --inplace

This will build pcangsd version 0.99 in the pcangsd-v.0.99 subdirectory. Running the tool from the local subdirectory is done like: python pcangsd-v.0.99/pcangsd.py --help

NOTE: if your default python version is not python3.x change pip to pyp3, and python to python3.