Closed zhang919 closed 1 year ago
Hi @zhang919
No reference genome is hard-coded in sstar
.
Therefore, you can use any reference genome for your data.
Hi, Thanks for your quickly response, It's a cool work to develop a friendly pipeline for S*. I read your paper in Molecular Biology and Evolution, may i ask some more questions?
.bbg
and .bsg
with a list of custom tools and hard-coded parmas (such as reference genome version), How do you handle this part of the workflow?Best regards. Zhang
Sure.
I think there are three steps in freezing-archer
.
For S* scores, I believe they are consistent from both sstar
and freezing-archer
.
For match percentages, the calculation is the same. However, in freezing-archer
, several conditions are hard-coded to remove fragments, for example, https://github.com/bvernot/freezing-archer/blob/master/bin/s_star_fns.py#L215 (this line is also not correct, it should be fixed by if (test_len < 10000) or (test_mapped_bases_bin < 5000) or (test_tot_sites < 8):
). So the results may be not consistent.
I removed the calculation for p-values and posterior probabilities. Because I found this approach is not stable from my experiments.
Thank you for your answer, sorting out the workflow of S* is a cumbersome job(I'm trying but it's too complicated), sstar is a great work, it is very helpful for my ongoing work. I noticed that the calculation of p-value and posterior probabilities is an important improvement in Vernot et al, Science, 2016 compared to Vernot et al, Science, 2014. If this does not involve your still unpublished work, can you share some details on how it became unstable
Sorry, the experiment results were lost due to the data migration in our cluster, but I can briefly describe my observation.
To obtain the posterior probabilities, freezing-archer
(https://github.com/bvernot/freezing-archer/blob/master/bin/LL_analysis_fn.R#L18) uses the kde2d fuction to interpolate p-values from Neanderthals and Denisovans.
One important parameter is the bandwidth in the kde2d
function. They set the bandwidth to 1.5 for the null hypothesis (no introgression) and 6 for the alternative hypothesis (introgression) (https://github.com/bvernot/freezing-archer/blob/master/bin/pval_LL_methods_pick_a_model.R#L124). It is unclear why they chose these two values.
Another issue is raised by this line https://github.com/bvernot/freezing-archer/blob/master/bin/s_star_fns.py#L267
One is added in both the numerator and denominator to avoid ZeroDivisionError
. I removed 1
s in both the numerator and denominator to get p-values. However, I got totally different results for the posterior probabilities, indicating this approach is not numerically stable.
Hi, Huang
Very clear answer!I'm also annoyed by hardcoded parameters in S*,I basic agree with you in paragraph 2 and paragraph 3:
To obtain the posterior probabilities, freezing-archer (https://github.com/bvernot/freezing-archer/blob/master/bin/LL_analysis_fn.R#L18) uses the kde2d fuction to interpolate p-values from Neanderthals and Denisovans.
One important parameter is the bandwidth in the kde2d function. They set the bandwidth to 1.5 for the null hypothesis (no introgression) and 6 for the alternative hypothesis (introgression) (https://github.com/bvernot/freezing-archer/blob/master/bin/pval_LL_methods_pick_a_model.R#L124). It is unclear why they chose these two values.
But, I have a different opinion on the last point,I think there is not for avoid ZeroDivisionError
, loosely speaking, this is to calculate an empirical P value (in https://github.com/bvernot/freezing-archer/blob/master/bin/s_star_fns.py#L267), In fact, adding one to both the numerator and denominator is a correction for the P value [1]. Although controversial[2], it is an acceptable practice[3], and widely using
[1]. AC Davison et. al., Bootstrap methods and their application. No. 1. Cambridge university press, 1995. [2]. B.V. North et. al., 1997, doi: 10.1086/341527 [3]. B.V. North et. al., 2003, doi: 10.1086/346173
Yes, you are right. They are empirical p-values.
I remembered most p-values did not differ too much using r/n
or (r+1)/(n+1)
.
Yes, there are no significance differ p-values, especially when n is large, the two are almost identical. Practically, their differences can be ignored when n and r is large from a big set. In fact, I remembered r/n
is also another calculation formula, I am not have the specific literatures, now, but there should be reports like this
Anyway, it needs more effort to test their approach for determining the origin of an introgressed fragment. Currently, I am focusing on ghost introgression, so I don't have the bandwidth to investigate it.
I will close this issue. Please feel free to reopen it or open a new issue if you have further questions to discuss. Thank you.
Hi, Thanks for development sstar, I konw the S* fisrt version (freezing-archer you mentioned) is base on hg19. is the sstar can compatible with all reference genome?
Best regards. Zhang