zmaroti / correctKin

GNU General Public License v3.0
2 stars 0 forks source link

[Question] [Documentation]Any plans to provide with a complete "quick-start" example ? #2

Closed MaelLefeuvre closed 1 year ago

MaelLefeuvre commented 1 year ago

On a related note following issue #1 , I was wondering If there were any plans to add a quick start example for your method's documentation, as I'm having trouble to reproduce some of the results in your paper.

In particular, I'm failing to understand which population(s) and/or dataset(s) (if any) were used as reference populations, when applying your method on archaic samples, such as those found in your first example, using Medieval Samples, or your second example, using Corded Ware culture sampls from (Mathiesson et al. 2015).

In case a specific reference population was indeed used in conjunction with these samples, is there a specific procedure to merge all samples that users should be aware of, or can a trivial merge, using plink or convertf be considered enough ?

Any help / clarification regarding this second matter would be hugely appreciated!

In all cases, thanks in advance for your answer, and thank you very much for this great scientific contribution.

Cheers!

zmaroti commented 1 year ago

There is a quick start document in the manuscript (Additional file 8: Note S1..

However it is not complete. And more use case scenarios should be covered especially for preparing the data set for ancient samples.

For example we should mention when using fully typed modern data as a reference you should random deplete modern samples (depleteIndiv tool) to match the genotyped marker counts of your ancient data. Unless you do this, you will have over optimistic model for the calculated kinship coefficient confidence intervals.

For the Mediveal samples, and the CWC samples you can safely use either the AADR data set with the ~2500 Eurasian samples that had >100k genotyped markers, or you can use marker depleted EUR samples from the 1KG dataset. As described in the manuscript you don't have to specify the reference population for the method to work, you only have to make sure that the reference population exists in your data set. For this either you include all potential and available samples with similar genome structure or you can use fastNGSAdmix PCA to objectively select samples with largely similar genome compositions (note that your test samples might have distinct genome structure not shown in the first few PCA vectors, thus such preselection should be used with care).

We deposited the used 1KG diploid data sets on Zenodo (see manuscript) that were used to analyze the CWC data and we also supplied the selected AADR sample IDs in the monster XLSX that can be also used as a reference. Note, we uploaded the diploid data set, for 1KG samples, thus you need to pseudoHaploidize and also do depleteIndiv on the pseudohaploidized 1KG reference before joining with the test data sets.

MaelLefeuvre commented 9 months ago

Hello Zoltán,

Thank you for your clear answer, and sorry for replying after so many months. Your comment was really helpful and cleared up most of my questions. However, I'm still a bit confused as to what you were implying by the following sentence:

you should random deplete modern samples (depleteIndiv tool) to match the genotyped marker counts of your ancient data.

By "match the genotyped marker count", do you imply matching the average number of non-missing genotypes within the ancient samples, or rather matching the number of positions found on the target SNP panel using during the study (i.e.: the AADR 1240K dataset), regardless of the actual coverage of my ancient samples' on said panel ?

I've been trying both approaches with a set of 0.1X-coverage WGS samples (which roughly amounts to an average of 115,000 non-missing autosomal positions on the AADR-1240K dataset). In other terms, I applied correctKIN on both:

  1. A Plink bfileset containing my ancient samples (~115,000 non-missing position on the 1240K panel) + 503 fully typed (~1,150,000 non-missing positions) 1000g-phase3 EUR samples, plucked out from the AADR dataset

  2. A Plink bfileset containing the same ancient samples + 503 random depleted samples 1000g-phase3 EUR samples (~115,000 non-missing positions, after applying depleteIndiv). These reference individuals were also plucked from the AADR dataset

Surprinsingly, approach n°1 gave me much more sound results, despite your recommendation of using depleteIndiv, and I'm thus now wondering If I understood your guideline correctly.

Again, Thank you very much for your previous answer, and your helpful insights on this matter.

Cheers

zmaroti commented 9 months ago

Thank you very much for the comparison, it seems interesting, however I should ask what do you define a "more sound result"?

It seems to me that we are looking at the issue from two different angles :-) My concern is always controlling false positives. Thus for the error model, it is advantegous to have marker depleted reference, as you will have lots of unrelated sample pairs in your data set that have the same marker overlap as you have between the ancient individuals. This helps to assess all the random variations due to missing information and pairwise differences on the whole analysis (PCA, kinship coeff, etc).

(Since our ancient WGS are usually 0.5x or higher, and 2-3x is not rare either we simply run a large cohort of population matched Carpathian basin people and don't need to mix in additional reference pops. We tested earlier to analyze them alone and together and since we got the same results we stick to the more realistic only aDNA dataset especially since we have lots of samples and the method does not have any scaling ussue like some other method where analysing 50-60 indivs are a concern)

However thinking over your use case scenario it may very well happen that it boosts the method a lot if the underlying PCA is as good as it can be thus having the reference modern samples with sufficient genotyping rate does help as it leads to better estimation of the IBS/residuals. In our simulations we used marker depletion between 100k-full set, and had lots of individuals, thus we could simulate marker overlap between 0.01-1 (100k is bit less than 10% so considering a sample pair that is <1% random marker overlap). However this also means that in our setting we had considerable number of high genotype fraction modern samples in our simulations, and likely the PCA on that cohort was pretty accurate that helped the whole method. While if you deplete all references then the PCA can suffer and the whole method will suffer due to the innaccuracy.

So I should likely revise my advice. To try to be good at both side, likely a combined approach would be more sufficient. Ie, have a few modern reference samples from all EUR pop fully genotyped (likely 50 individuals would be more than enough). This will make the underlying PCA analysis part much accurate. And have the rest of reference deplete to match the genotyped marker fraction of your ancient samples. This way you will have a lots of unrelated sample pair with matching marker overlap fraction, so the error model is also good for removing false positives and you don't have an over optimistic model at the relative filtration.

If you happen to do this additional analysis and you can prove this hypothesis I will gladly rewrite the recomendation for the quickstart and include all the important information so we can help others with similar situation.

Thanks!

zmaroti commented 9 months ago

One more thing. I forgot to explicitly answer the "matching genotype" question. So yes I meant that the genotype fraction of your reference data should be similar to your ancient data on the 1240K subset (actually it losely correlates with genome coverages). So if you have ~100K genotyped markers and ~1M missing marker in your samples then in the reference data you should deplete ref individuals so they have similar amount of genotyped markers like your ancient data (at least in a considerable amount of included references, while keeping some with full marker set for better PCA, see above post).

The rationale is that, in the error model we use all pairwise kinship coefficient estimates to calculate the threshold between unrelated and likely related individuals. This threshold is different for different marker overlap bins and is extracted from the empirical estimates. If all of your reference data is fully typed, and we consider that in your ancient samples you have ~100K random marker genotyped then all of the unrelated kin relations represented by sample pairs combinations of all reference indivs vs all ancient will have 10% marker overlap. That is much higher marker overlap than you would have between two ancient individuals where you would expect slightly less than 1% of marker overlap (considering less than ~10% of genotyped marker fraction vs all markers and random markers missing/genotyped in the two samples). Thus the error model will have only statistical estimates on the relevant marker overlap fraction between ancient indivs that exists in the data set between unrelated ancient indivs. Accordingly it will have much less statistical power as it could, when you have additional reference data depleted to the same marker count, providing lot more relevant kin coeff estimated between truly unrelated sample pairs. Note the kin coeff matrix between of all sample pair combination is N*(N-1)/2 thus the number of data you can use in this statistic is ~ quadratic.