wenhao-gao / mol_opt

MIT License

Discrepancies in Reproducing Results for Hyperparameter Tuning in REINVENT #28

Status: Open. ankur56 opened this issue 11 months ago.

ankur56 commented 11 months ago

Subject: Inconsistencies in Replicating Results from Supplemental Information (Figures 14 and 15, Section D.2)

Hello,

We've been working on reproducing the results for hyperparameter tuning in REINVENT, specifically for the zaleplon_mpo and perindopril_mpo oracles as presented in Figures 14 and 15, Section D.2 of the supplemental information. Despite following the installation and execution instructions in the README, our results differ from those published.

Issue Details:

  1. Discrepancy in Mean AUC Top-10 for zaleplon_mpo:

    • Published Result: Table 4 reports a mean AUC Top-10 of 0.358±0.062 across five independent runs.
    • Our Result: We observed a mean AUC Top-10 of 0.503±0.02.
  2. Performance Difference Between Sigma Values:

    • Published Behavior: A significant performance difference is reported between sigma values of 500 and 60 (Figure 14, Section D.2).
    • Our Observation: We found only a minimal performance difference between these sigma values for zaleplon_mpo (mean AUC Top-10 of 0.503 for sigma=500 vs 0.482 for sigma=60).
  3. Other Discrepancies: We also noted discrepancies in several other mean AUC Top-10 values reported in Table 4.
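For reference when comparing numbers like these, the AUC Top-10 metric can be sketched as follows. This is a minimal sketch assuming the definition described in the mol_opt paper: the running mean of the 10 best scores, sampled every `freq_log` oracle calls, integrated with the trapezoidal rule, and normalized by the 10,000-call budget. The helpers `top_k_auc` and `summarize` are hypothetical names, and the repository's own implementation may differ in details such as duplicate handling or early stopping. `summarize` uses the sample standard deviation; the paper may use the population form.

```python
import statistics


def top_k_auc(scores, k=10, freq_log=100, max_oracle_calls=10_000):
    """Normalized area under the running top-k mean curve (hypothetical helper).

    `scores` lists oracle values in query order, one per unique molecule.
    """
    scores = list(scores)[:max_oracle_calls]
    if not scores:
        return 0.0

    def top_k_mean(n):
        # Mean of the k best scores seen within the first n oracle calls.
        return statistics.fmean(sorted(scores[:n], reverse=True)[:k])

    area, prev_mean, prev_n = 0.0, 0.0, 0
    # Sample the curve every freq_log calls, always including the final call.
    checkpoints = list(range(freq_log, len(scores), freq_log)) + [len(scores)]
    for n in checkpoints:
        cur = top_k_mean(n)
        area += (n - prev_n) * (cur + prev_mean) / 2  # trapezoid rule
        prev_mean, prev_n = cur, n
    # If the run stopped early, hold the final top-k mean flat up to the budget.
    area += (max_oracle_calls - prev_n) * prev_mean
    return area / max_oracle_calls


def summarize(run_aucs):
    """Mean and sample standard deviation across independent runs (seeds)."""
    return statistics.mean(run_aucs), statistics.stdev(run_aucs)
```

Computing `top_k_auc` once per seed and passing the five values to `summarize` gives a mean ± std pair that can be compared directly against a published entry, provided both sides use the same oracle version and call budget.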

Seeking Clarification:

We would like to analyze the behavior of the hyperparameter sigma thoroughly and make sure our results are accurate. Could you please help us verify that our methodology aligns with your implementation? We want to rule out both overlooked mistakes on our end and potential bugs in the code.

Any insights or suggestions you could provide would be greatly appreciated.

Thank you for your assistance.

MorganCThomas commented 11 months ago

I noticed this as well, so I looked into it and found that some bug fixes were made in TDC after the benchmark was published, affecting the following oracles: zaleplon_mpo, sitagliptin_mpo, C11H24, C9H10N2O2PF2Cl. This has led to inconsistent benchmark results since publication, especially since zaleplon_mpo was used for hparam tuning.

For now, I omit these oracles when comparing benchmark results, but this also raises the question: are you planning to update the publication with corrected results?
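The workaround of omitting the affected oracles can be sketched as a small filter applied before summing per-oracle scores. This assumes results are held as a mapping from oracle name to mean AUC Top-10; that layout and the name `comparable_sum` are hypothetical, not mol_opt's actual result format. Only the oracle list comes from this thread.

```python
from __future__ import annotations

# Oracles named in this thread as affected by TDC bug fixes after publication.
AFFECTED_ORACLES = frozenset(
    {"zaleplon_mpo", "sitagliptin_mpo", "C11H24", "C9H10N2O2PF2Cl"}
)


def comparable_sum(results: dict[str, float]) -> tuple[float, list[str]]:
    """Sum per-oracle AUC Top-10, skipping oracles whose scoring changed.

    `results` maps oracle name -> mean AUC Top-10 (a hypothetical layout).
    Returns the total and the sorted list of oracles that contributed.
    """
    kept = sorted(name for name in results if name not in AFFECTED_ORACLES)
    return sum(results[name] for name in kept), kept
```

Returning the `kept` list alongside the total makes it easy to confirm that two result sets cover exactly the same oracles before their totals are compared.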

ankur56 commented 11 months ago

@MorganCThomas Thank you for your reply. We have also noticed a minor deviation in the reported values for the perindopril_mpo oracle, the other target of the hparam search in the publication. Therefore, it may not be accurate to conclude that sigma=500 outperforms the values of 60 and 120 used in previous REINVENT studies for these oracles.
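To make the role of sigma concrete: in the original REINVENT paper, the agent is trained toward an augmented log-likelihood, the prior's log-likelihood plus sigma times the oracle score, so sigma sets how strongly the score pulls the agent away from the prior. The sketch below uses plain Python instead of PyTorch; the name `reinvent_loss` is hypothetical, and whether mol_opt's REINVENT code matches this formulation term for term is an assumption.

```python
def reinvent_loss(prior_logp, agent_logp, scores, sigma):
    """Mean squared gap between augmented and agent log-likelihoods.

    Sketch of the augmented-likelihood objective from the original REINVENT
    paper: target = log P_prior(x) + sigma * S(x).
    """
    if not scores or not (len(prior_logp) == len(agent_logp) == len(scores)):
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for prior, agent, score in zip(prior_logp, agent_logp, scores):
        augmented = prior + sigma * score  # larger sigma -> score outweighs prior
        total += (augmented - agent) ** 2
    return total / len(scores)
```

With a prior log-likelihood of -30 and a score of 0.5, the target is 0 at sigma=60 but 220 at sigma=500, so the score term dominates the prior far more strongly at the higher value.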

MorganCThomas commented 11 months ago

Good to know, thanks. I've assessed REINVENT before here and here (Fig 3e-g), and I would never recommend using sigma=500; based on personal experience, I didn't even think to test a value this high.