Is the sstar can compatible with all reference genome?

zhang919 commented 1 year ago

Hi, Thanks for development sstar, I konw the S* fisrt version (freezing-archer you mentioned) is base on hg19. is the sstar can compatible with all reference genome?

Best regards. Zhang

xin-huang commented 1 year ago

Hi @zhang919

No reference genome is hard-coded in sstar. Therefore, you can use any reference genome for your data.

zhang919 commented 1 year ago

Hi, Thanks for your quickly response, It's a cool work to develop a friendly pipeline for S*. I read your paper in Molecular Biology and Evolution, may i ask some more questions？

Did you reimplement all S (freezing-archer) code? if so, is the sstar can get consistent output with S if I run them in data same as Vernot et al, Science, 2016.
The code of S* is hard to read and understand, he uses a lot of incomprehensible intermediate data, for example .bbg and .bsg with a list of custom tools and hard-coded parmas (such as reference genome version), How do you handle this part of the workflow？

Best regards. Zhang

xin-huang commented 1 year ago

Sure.

I think there are three steps in freezing-archer.

Calculating S* scores without a source genome and determine whether a score is significant or not.
Calculating match percentages and p-values for match percentages with a source genome and a reference panel.
Calculating posterior probabilities with p-values for match percentages to determine whether a fragment is from Neanderthals or Denisovans.

For S* scores, I believe they are consistent from both sstar and freezing-archer. For match percentages, the calculation is the same. However, in freezing-archer, several conditions are hard-coded to remove fragments, for example, https://github.com/bvernot/freezing-archer/blob/master/bin/s_star_fns.py#L215 (this line is also not correct, it should be fixed by if (test_len < 10000) or (test_mapped_bases_bin < 5000) or (test_tot_sites < 8):). So the results may be not consistent. I removed the calculation for p-values and posterior probabilities. Because I found this approach is not stable from my experiments.

zhang919 commented 1 year ago

Thank you for your answer, sorting out the workflow of S* is a cumbersome job（I'm trying but it's too complicated), sstar is a great work, it is very helpful for my ongoing work. I noticed that the calculation of p-value and posterior probabilities is an important improvement in Vernot et al, Science, 2016 compared to Vernot et al, Science, 2014. If this does not involve your still unpublished work, can you share some details on how it became unstable

xin-huang commented 1 year ago

Sorry, the experiment results were lost due to the data migration in our cluster, but I can briefly describe my observation.

To obtain the posterior probabilities, freezing-archer (https://github.com/bvernot/freezing-archer/blob/master/bin/LL_analysis_fn.R#L18) uses the kde2d fuction to interpolate p-values from Neanderthals and Denisovans.

One important parameter is the bandwidth in the kde2d function. They set the bandwidth to 1.5 for the null hypothesis (no introgression) and 6 for the alternative hypothesis (introgression) (https://github.com/bvernot/freezing-archer/blob/master/bin/pval_LL_methods_pick_a_model.R#L124). It is unclear why they chose these two values.

Another issue is raised by this line https://github.com/bvernot/freezing-archer/blob/master/bin/s_star_fns.py#L267 One is added in both the numerator and denominator to avoid ZeroDivisionError. I removed 1s in both the numerator and denominator to get p-values. However, I got totally different results for the posterior probabilities, indicating this approach is not numerically stable.

zhang919 commented 1 year ago

Hi, Huang

Very clear answer！I'm also annoyed by hardcoded parameters in S*，I basic agree with you in paragraph 2 and paragraph 3:

To obtain the posterior probabilities, freezing-archer (https://github.com/bvernot/freezing-archer/blob/master/bin/LL_analysis_fn.R#L18) uses the kde2d fuction to interpolate p-values from Neanderthals and Denisovans.

One important parameter is the bandwidth in the kde2d function. They set the bandwidth to 1.5 for the null hypothesis (no introgression) and 6 for the alternative hypothesis (introgression) (https://github.com/bvernot/freezing-archer/blob/master/bin/pval_LL_methods_pick_a_model.R#L124). It is unclear why they chose these two values.

But, I have a different opinion on the last point，I think there is not for avoid ZeroDivisionError, loosely speaking, this is to calculate an empirical P value (in https://github.com/bvernot/freezing-archer/blob/master/bin/s_star_fns.py#L267), In fact, adding one to both the numerator and denominator is a correction for the P value [1]. Although controversial[2], it is an acceptable practice[3], and widely using

[1]. AC Davison et. al., Bootstrap methods and their application. No. 1. Cambridge university press, 1995. [2]. B.V. North et. al., 1997, doi: 10.1086/341527 [3]. B.V. North et. al., 2003, doi: 10.1086/346173

xin-huang commented 1 year ago

Yes, you are right. They are empirical p-values. I remembered most p-values did not differ too much using r/n or (r+1)/(n+1).

zhang919 commented 1 year ago

Yes, there are no significance differ p-values, especially when n is large, the two are almost identical. Practically, their differences can be ignored when n and r is large from a big set. In fact, I remembered r/n is also another calculation formula, I am not have the specific literatures, now, but there should be reports like this

xin-huang commented 1 year ago

Anyway, it needs more effort to test their approach for determining the origin of an introgressed fragment. Currently, I am focusing on ghost introgression, so I don't have the bandwidth to investigate it.

I will close this issue. Please feel free to reopen it or open a new issue if you have further questions to discuss. Thank you.

xin-huang / sstar

Is the sstar can compatible with all reference genome? #17