yuqianghan / editretro

Retrosynthesis Prediction with an Iterative String Editing Model
MIT License

get_ranked_topk cannot be reproduced #6

Closed: 1121091694 closed this issue 1 month ago

1121091694 commented 2 months ago

Great work! Could you provide a sample input file for get_ranked_topk.py? I can't reproduce the data augmentation results from your work. Also, why not compute the top-k accuracy directly on the augmented test set? For example, suppose each test item is augmented into 20 SMILES, and 20 predictions are generated from those 20 inputs. As long as 1 of those 20 predictions matches the ground truth in the test set, it counts as a hit. Why use such a complex ranking instead? Thanks for your answer.

yuqianghan commented 2 months ago

A usage example for get_ranked_topk.py can be found in the interaction folder. The script retrieves the top-k predictions for a given molecular SMILES input.

During inference, each augmented SMILES of a molecule is treated independently, so each one generates its own beam of predictions (e.g., 20). Consequently, there are beam * augs candidates in total (e.g., 20 * 20 = 400). The purpose of get_ranked_topk.py is to compute a score for each candidate (a proxy for its probability) and rank the candidates by this score.

A key observation is that the same prediction may appear in the beams of several different augmented SMILES. The scoring function therefore takes into account both the rank of a candidate within each beam and how often it appears across beams. For a more in-depth analysis, please refer to the Augmented Transformer [1].

[1] Tetko, Igor V., et al. "State-of-the-art augmented NLP transformer models for direct and single-step retrosynthesis." Nature Communications 11.1 (2020): 5575.
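
To make the rank-and-frequency idea concrete, here is a minimal sketch of how duplicate candidates from several beams could be merged into one ranked list. It is not the code in get_ranked_topk.py: the function name `rank_candidates`, the `alpha` parameter, and the weight `1 / (1 + alpha * rank)` are assumptions chosen for illustration, and the sketch assumes the predictions have already been canonicalized so that identical molecules compare equal as strings.

```python
from collections import defaultdict


def rank_candidates(beams, alpha=1.0, topk=10):
    """Merge per-augmentation beams into one ranked candidate list.

    beams: list of lists; each inner list holds the predictions for one
           augmented SMILES, best first (assumed already canonicalized).
    alpha: hypothetical decay factor; larger values penalize low ranks more.
    Returns up to topk (candidate, score) pairs, highest score first.
    """
    scores = defaultdict(float)
    for beam in beams:
        seen = set()  # count each candidate at most once per beam
        for rank, candidate in enumerate(beam):
            if candidate in seen:
                continue
            seen.add(candidate)
            # Assumed weighting: rank 0 contributes 1.0, later ranks less.
            scores[candidate] += 1.0 / (1.0 + alpha * rank)
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:topk]


# Toy example with placeholder strings standing in for canonical SMILES.
example_beams = [
    ["A", "B", "C"],
    ["B", "A", "D"],
    ["B", "C", "A"],
]
example_ranking = rank_candidates(example_beams)
print(example_ranking)
```

In this toy input, "B" ends up first because it is ranked highly in all three beams, while "D" appears only once at a low rank. This also illustrates the difference from the "any match among all predictions" rule asked about above: merging yields an ordered list, so top-1, top-3, and so on can each be evaluated separately.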