npsonis commented 3 weeks ago

Hi, after sucessfully creating the overlap and the two npy files I get an empty output (only header) after the final step. Note that I am using pseudohaploids made with pileupCaller instead of ANGSD and another list of autosomal SNPs (5M.auto of Koptekin etal 2023 instean of 1240K, which after maf filtering ~1.7M sites are used). I changed the fam file to include genetic sex. The overlap file looks like this (intentionally removed samplenames here):

0.417696 0.048045 0.000000 0.104065 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.295744 0.364739 0.309113 0.304920 0.302163 0.330828 0.244136 0.061205 0.350210 0.214533 0.031814 0.124510 0.338656 0.255223 0.231324 0.034371 0.115255 0.342897 0.145665 0.194706 0.198616 0.394226 0.366818 0.415577 0.076866 0.091260 0.079383 0.076192 0.079682 0.084651 0.064128 0.018537 0.090614 0.057763 0.008359 0.033965 0.087306 0.066254 0.062898 0.009547 0.032299 0.089204 0.038720 0.052442 0.052278 0.100514 0.093647 0.103850 0.674867 0.589512 0.498984 0.492567 0.486464 0.532217 0.390898 0.097062 0.564064 0.344749 0.051001 0.198245 0.546007 0.412365 0.371899 0.054848 0.183827 0.552718 0.233425 0.313095 0.319687 0.635947 0.592526 0.671521 0.000000 0.858700 0.619121 0.623239 0.595695 0.667425 0.482216 0.113453 0.698306 0.419869 0.062887 0.241475 0.677077 0.510004 0.450456 0.065622 0.220905 0.682195 0.286721 0.381133 0.394494 0.791574 0.738671 0.850625 0.000000 0.000000 0.708661 0.517409 0.508380 0.556946 0.407821 0.099766 0.590388 0.359371 0.053294 0.206448 0.571709 0.431639 0.387021 0.056721 0.190235 0.577871 0.243319 0.326760 0.334020 0.666419 0.621206 0.705043 0.000000 0.000000 0.000000 0.717950 0.497484 0.557472 0.402058 0.094536 0.583331 0.350340 0.052481 0.201150 0.565241 0.425360 0.375770 0.054526 0.184023 0.569721 0.239301 0.318527 0.328479 0.661271 0.617153 0.710987 0.000000 0.000000 0.000000 0.000000 0.680662 0.540692 0.399919 0.101725 0.575995 0.354233 0.052376 0.204728 0.556865 0.420893 0.382975 0.056783 0.190578 0.564777 0.239270 0.321947 0.327489 0.647146 0.602273 0.678243 0.000000 0.000000 0.000000 0.000000 0.000000 0.769570 0.441656 0.106759 0.631472 0.383182 0.057323 0.222916 0.611398 0.461195 0.412347 0.060521 0.204944 0.617836 0.262179 0.346786 0.359160 0.714154 0.664965 0.763091 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.553793 0.082323 0.464132 0.285227 0.042286 0.166950 0.448705 0.339113 0.308213 0.045833 0.154859 0.455157 0.194388 0.257573 0.265549 0.522322 0.485624 0.550971 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.129500 0.114564 0.074838 0.010936 0.044692 0.109921 0.083508 0.081750 0.013076 0.043631 0.113125 0.050049 0.067516 0.067175 0.126042 0.117346 0.129247 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.800279 0.408216 0.060615 0.235656 0.647197 0.488222 0.439878 0.064909 0.217984 0.655324 0.276812 0.370371 0.379686 0.754341 0.702245 0.796495 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.479530 0.037247 0.147185 0.394192 0.298718 0.274496 0.041138 0.137776 0.400827 0.170999 0.230217 0.233562 0.457078 0.425572 0.477910 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.071826 0.021601 0.058548 0.044313 0.040381 0.006092 0.020318 0.059487 0.025199 0.033836 0.034617 0.068056 0.063432 0.071499 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.276691 0.227342 0.172216 0.159676 0.024160 0.081435 0.231518 0.099566 0.132777 0.135996 0.263749 0.245147 0.275622 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.775701 0.474118 0.425120 0.062548 0.210256 0.634257 0.267530 0.357370 0.368163 0.730599 0.680331 0.772016 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.583310 0.322176 0.047322 0.158922 0.478707 0.201742 0.270317 0.278381 0.550532 0.512934 0.580806 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.514235 0.044947 0.150406 0.432628 0.184472 0.249026 0.252769 0.491825 0.458149 0.512696 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.074769 0.023066 0.063923 0.027325 0.037440 0.037485 0.072016 0.067131 0.074592 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.252565 0.214463 0.092863 0.125009 0.126053 0.242671 0.225650 0.251858 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.781273 0.271452 0.363434 0.372327 0.738380 0.687038 0.777935 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.328630 0.154693 0.158310 0.310934 0.289364 0.327081 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.434904 0.211895 0.414211 0.386259 0.433350 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.452112 0.427062 0.397533 0.450046 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.908922 0.794016 0.904367 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.847777 0.842918 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.982903

I repeated the analyses by changing the fam file and making all samples having the same family ID (population) but I got the same result.

zmaroti commented 1 week ago

Hi,

Could you please describe the workflow? (Just the commands).

// I am only guessing As I read pileupCaller can output to eigenstrat format. So I guess you performed haploid calls with it and then used convertf to change to plink, then used markeroverlap counts, and pcangsd to perform the uncorrected kinship estimates. And lastly you used filter relates and it gave you just the header?

It could be that you got no relatives at all at the default 6SD threshold. You could try to add the "-matrix" option and save your corrected/uncorrected kins hip coefficients (upper/lower triangles) in a matrix.

The correction of the kin coeff and the statistically difefrent from unrelated kin coeff are two things. The correctKin method needs sufficient number of unrelated individual pair combinations in the test data to work best. The post filtration estimates the distribution of unrelated kin coeffs in your test data (that is factor of coverage, genotyping erro rate, etc), and only outputs the likely kins that are not fitting in that distribution.

The overlap shows that hopefully oyur data was imported/transfomed properly.

However, it also shows you have very low number of individuals. With this method it is recommended to include unrelated controls, the more the better. Unless the population is some very distinct population your control does not have to be the same (for example any european is good for other european, or east asians for east asians... although the better the match the better the analysis). Our results suggests (as discussed in the manuscript) that ~50 individuals is usually sufficient, but the method is very scalable and if you are unsure whether there are admixed individuals adding 100 or lot more samples will be fine. You only have to avoid adding totally different population structure individuals (ie Africans when you analyse Europeans) as they will form new PCA axes that are irrelevant to your data and makes the estimations and regression less precise. And lastly, in case your extra controls are fully typed modern data, it is recommended that you random thin the control to the range of genotyped markers that is typical in your ancient data. So the method can incorporate the uncertainity due to partial data in the coeff estimations.

npsonis commented 1 week ago

Hi zmaroti, thanks for the reply.

The workflow is as you decribed it (albeit pileupCaller can output plink-compatible files, too). To be honest, I was not aware that the software needs >50 individuals (Perhaps you may add this info in the README file). I used a dataset of 26 in which there are at least 3 cases of 1st-2nd degree of genetic relationships, no 3rd-degree (according to other kinship tools), but I suspect based on IBDs that there might be a couple of 4th degree (or larger). So, I though that this was a good dataset to test your tool that according to the description can infer up-to 4th degree.

I now used additional individuals from another study period, but same geographic area (more all less all main 3 ancestral components of Europeans are present on all individuals). It worked (I got results, but I might need to remove related individuals within the second study, as I get only those as a result).

(Perhaps to increase the ease-of-use of your tool, maybe you can provide the 1240K European (E Asian, African etc) overlap and npy files with 100 unrelated individuals that can be merged (this will need change to your code) with the users overlap file (in case the user wants to test just a few individuals and not to have to make a huge dataset from scratch). Alternatively, you may change the code in order to get as input a file with specific coefficients that you will have estimated from 100 unrelated individuals 1240K Europeans (E Asians, Africans etc) and provided to the user.

zmaroti commented 1 week ago

In your initial post you wrote that is that you have a more dense marker set. You cannot mix two different sets of data for analyzing together as pcangsd need the same markers to make the PCA for kin coeff estimation with the implemented PCRELATE algorithm. If you have batch effect do to different markers you will not get realistic results (like a PCA is also wrong in such case).

The denser marker set is theoretically better for this analysis. The problem is that there is no easy premade data like the AADR data with 1240K marker set to use as a control or anything so unless your project includes lots of cemetaries where the people are mainly same genome structure and likely lots of unrelated individuals are within your data then you are out of luck. We do have the 1KG GRCh37 joint vcf data on our server, so in case you email me your marker positions and write me which populations should be included I can filter you a subset and transform them to PLINK data format.

If you restrict the data to the 1240K AADR positions then you can easily have your reference controls the AADR data set (from 1KG or any ancient that matches your test individuals best). It is 5.8Gb that shouldn't be a problem on a desktop machine either. You can use convertf to make it packedped (plin binary format) so it can be easily used with correctKin. If it is ancient you can include them into the data set, as most of them will have various genotyping rate and they are also pseudohaploid as well. If modern then you need to reandom pseudohaploidize, and it is also advised to marker thin it randomly to match your test data with the included "depleteIndivs <DATA.(bed,geno)> <minMarker> [maxMarker]" command before merging them to your data.

Please note, that this method (and most other method as well) will give you the most realistic results when

1) you have reference that is population matched,

2) that contains more or less similar genotyping errors, PMD that your test data

3) and lastly that your reference sample data has genotyping fractions similar to your test data

All of these factors that increase PCA uncertainity (genoptyping errors, partial information) will be included in the statistical model of correctKin filterRelates tool. So you can test whether your sample fits into the distributuion of kin coeffs that exists between unrelated individuals, or it is a true relative pair that has significantly larger kin coeff which cannot come from PCA uncertainity due to low marker count, genotyping errors, not fully matching relatives.

Last important NOTE, DON'T USE the 1KG .DG (diploid) data from version v54.1 of the AADR data set. I am not sure if newer version addressed the issue with 1KG diploid data, however it has utter garbage on the sex chromosomes, furthermore it was filtered for a subset of the 1240K marker set (like 800K markers genotyped only, and what worse the same 800K markers for all 1kg .DG samples and this could lead to batch effect and bias). It also had a "PASS_MAYBE_REPLACE_IN_FUTURE_WITH_LESS_FILTERED_VERSION_WITH_MORE_SNPS_COVERED" flag.

On the other hand the same 1KG .SG samples are OK, they are properly genotyped, and the sex chromosome data is also ok (not high quality but at least not garbage like the .DG data).

npsonis commented 1 week ago

Thanks zmaroti. I appreciate the help. You may close this.

zmaroti / correctKin

no results in the final step #4

1) you have reference that is population matched,

2) that contains more or less similar genotyping errors, PMD that your test data

3) and lastly that your reference sample data has genotyping fractions similar to your test data