xf-omics / SHINE

prediction of pathogenicity for inframe indels

The test dataset provided is larger than what is described in the paper #4

Open GriffithLin opened 1 year ago

GriffithLin commented 1 year ago

I would like to ask how to obtain the exact test dataset used for SHINE. The provided test dataset appears to contain more samples than the number reported in the published article.

xf-omics commented 1 year ago

"Overlapping indels with the training-validation datasets were removed from the test datasets". Did you remove them?

GriffithLin commented 1 year ago

I understand that you have removed the overlapping samples from the test dataset as mentioned. However, I still have not obtained a test dataset that aligns with the quantity mentioned in the published article. As the test dataset is crucial for my research, it is important to have a dataset that matches the intended quantity.

Therefore, I kindly request your assistance in either providing the correct test dataset or sharing detailed steps on how the test dataset was processed, so that I can replicate the process and obtain the accurate test dataset.

Thank you for your attention to this matter.

xiaofan-lab commented 1 year ago

The test data provided in this GitHub repository are the original individual-level data, so that researchers can use them for different purposes. If you would like to replicate our test datasets:

  1. Remove variants that appear in the training and validation datasets.
  2. Remove variants with conflicting interpretations, i.e. variants carried by both affected and unaffected individuals, including UKBB.
  3. Deduplicate the remaining variants. The numbers reported in the manuscript count unique variants, whereas the test datasets here include every individual-level variant, so the same variant may appear more than once.
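
The three steps above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code used for the paper: the field names `chrom`, `pos`, `ref`, `alt`, and `affected`, and the helper names `variant_key` and `replicate_test_set`, are hypothetical and would need to be mapped to the actual columns in the repository's files. It also assumes a variant is identified by chromosome, position, reference allele, and alternate allele.

```python
from collections import defaultdict


def variant_key(row):
    # Assumption: a variant is uniquely identified by these four fields.
    return (row["chrom"], row["pos"], row["ref"], row["alt"])


def replicate_test_set(test_rows, train_val_rows):
    """Return {variant_key: affected} for the filtered, unique test variants."""
    # Step 1: collect every variant seen in training or validation.
    seen_in_training = {variant_key(r) for r in train_val_rows}

    # Gather the set of carrier statuses observed for each remaining variant.
    statuses = defaultdict(set)
    for row in test_rows:
        key = variant_key(row)
        if key in seen_in_training:
            continue  # overlaps with training/validation, drop it
        statuses[key].add(bool(row["affected"]))

    # Step 2: drop variants carried by both affected and unaffected individuals.
    # Step 3: keying by variant collapses repeated individual-level rows.
    return {key: next(iter(s)) for key, s in statuses.items() if len(s) == 1}
```

Counting the keys of the returned dictionary, split by label, gives the unique-variant totals, which is the kind of number the manuscript reports rather than the individual-level row count.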