wenhao-gao / mol_opt

MIT License

Question about top auc metric #24

Open vincrichard opened 1 year ago

vincrichard commented 1 year ago

Thank you for sharing your code and your work on benchmarking.

I have a question regarding the top_auc metrics especially this part: https://github.com/wenhao-gao/mol_opt/blob/caa98f339fb11b3a7b12f3fd5df1422daea27fb2/main/optimizer.py#L46-L48.

If I understand correctly, since we have a dictionary of SMILES and their corresponding scores, the condition will always evaluate to true whenever uniqueness is less than 1. This could create issues if the model generates a large number of duplicate molecules.

One edge case would be a model that, each time it generates a SMILES, necessarily generates two other identical copies of it. This means only 1/3 of the model's 10,000 molecules are unique. Let's assume this model learns, so the score of each new unique molecule increases across oracle calls. In this case I think the current implementation of top_auc includes a bias. I think it's clearer with graphs:

top_auc = 0.5831 [graph]

top_auc = 0.4995 [graph]

The values added to the sum consist of:

- `[(top_k_scores + prev) / 2] * update_frequency` in the for loop,
- `(len(scores) - limit_update) * [(top_k_scores + prev) / 2]`, and
- `(len(results) - len(scores)) * [top_k_scores]`.
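For reference, here is how I read those three terms in code. This is a paraphrase of my understanding, not the repo's exact implementation; `scores` is assumed to be the list of unique molecules' scores in first-seen order, and `num_results` the total number of oracle calls including duplicates:

```python
import numpy as np

def top_auc_sketch(scores, num_results, top_k=10, update_frequency=100):
    total = 0.0
    prev = 0.0
    limit_update = 0
    # term 1: one trapezoid per update_frequency unique molecules
    for idx in range(update_frequency, len(scores), update_frequency):
        top_k_scores = np.mean(sorted(scores[:idx], reverse=True)[:top_k])
        total += update_frequency * (top_k_scores + prev) / 2
        prev = top_k_scores
        limit_update = idx
    # term 2: unique molecules past the last full window
    top_k_scores = np.mean(sorted(scores, reverse=True)[:top_k])
    total += (len(scores) - limit_update) * (top_k_scores + prev) / 2
    # term 3: all duplicate calls, credited at the *final* top-k mean
    total += (num_results - len(scores)) * top_k_scores
    return total / num_results
```

With `top_k=1` and linearly increasing `scores`, this returns 0.5 when there are no duplicates (`num_results == len(scores)`), but rises as soon as duplicates inflate the third term.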

In this case, I think it is apparent that removing duplicates biases the top_auc computation. Am I missing something, or is this expected? Could you enlighten me on this? What is the intuition behind it?

wenhao-gao commented 1 year ago

Thank you for your comment. The storage buffer for molecules in the form of canonical SMILES is implemented as a dictionary. This ensures that identical molecules do not appear more than once.
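A minimal sketch of such a buffer (illustrative names, not the repo's exact API). Because the keys are canonical SMILES, a molecule that is generated again is simply not stored a second time:

```python
# Dictionary keyed by canonical SMILES; mol_opt canonicalises with RDKit
# first, but here we assume `smiles` is already canonical to keep the
# example dependency-free.
mol_buffer = {}

def store(smiles, score):
    if smiles not in mol_buffer:
        # value = (score, index of first appearance)
        mol_buffer[smiles] = (score, len(mol_buffer) + 1)

store("CCO", 0.4)
store("CCO", 0.4)        # duplicate call: buffer is unchanged
store("c1ccccc1", 0.7)
print(len(mol_buffer))   # → 2
```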

vincrichard commented 1 year ago

@wenhao-gao Thank you for your answer. My main concern is that the buffer does not track molecules appearing more than once. As I tried to explain above, keeping only one copy of each identical molecule, rather than every generated molecule, creates a bias in my opinion. Right now, duplicate molecules are effectively treated as if they had the best score of the run, which creates the final plateau on the graph above.

Counting every molecule ensures that a model whose score increases linearly over time gets a score of 0.5 (in this corner case the best score is reached at the last call). A score of 0.5 makes sense to me here: as I understand it, top_auc measures how quickly a model converges to a high score, so linear convergence is analogous to the random baseline for a regular AUC, i.e. one of the worst-case scenarios. The issue with the current top_auc is that the score is 0.5 only if there are no duplicates.
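The 0.5 intuition for linear convergence can be checked with a toy computation (my own construction, a simplified top-1 variant of the metric, not the repo's function):

```python
import numpy as np

def auc_running_best(call_scores):
    # top-1 AUC over every oracle call (duplicates kept on the x-axis):
    # trapezoidal area under the running-best curve, normalised to [0, 1]
    best = np.maximum.accumulate(call_scores)
    return np.sum((best[1:] + best[:-1]) / 2) / (len(best) - 1)

linear = np.linspace(0.0, 1.0, 101)         # score climbs linearly to 1.0
print(round(auc_running_best(linear), 3))   # → 0.5

# If only every third call is unique and the 2/3 duplicate calls are
# instead credited at the run's best score (the behaviour questioned
# above), the same linear learner scores well above 0.5:
unique = linear[::3]
inflated = np.concatenate([unique, np.full(len(linear) - len(unique), unique.max())])
print(round(auc_running_best(inflated), 3))  # → 0.827
```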

Another example would be a model that stagnates on a small subset of low-scoring molecules and then manages to reach high-scoring molecules at the end. This looks like the following graphs. You can clearly see that the plateau is not taken into account by the current AUC; the plateau could have been at the start with a score of 0 or at the end with a score of 1, and the top_auc metric would not have changed.

top_auc = 0.959725 [graph]

top_auc = 0.584775 [graph]
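To make the plateau point concrete, here is a toy comparison (my own construction; `dedup_auc` is my reading of the questioned behaviour, not the repo's code). The dedup variant only sees the unique-score sequence, so it cannot distinguish an early low plateau from a late high one:

```python
import numpy as np

def full_auc(call_scores):
    # top-1 AUC over every oracle call (duplicates kept on the x-axis)
    best = np.maximum.accumulate(call_scores)
    return np.sum((best[1:] + best[:-1]) / 2) / (len(best) - 1)

def dedup_auc(unique_scores, num_results):
    # dedup variant: area over unique molecules only, with the
    # (num_results - num_unique) duplicate calls credited at the final
    # best score
    best = np.maximum.accumulate(unique_scores)
    area = np.sum((best[1:] + best[:-1]) / 2)
    area += (num_results - len(unique_scores)) * best[-1]
    return area / (num_results - 1)

uniques = np.linspace(0.1, 1.0, 10)                             # 10 unique molecules
stall_then_climb = np.concatenate([np.full(90, 0.1), uniques])  # plateau first
climb_then_stall = np.concatenate([uniques, np.full(90, 1.0)])  # plateau last

print(round(full_auc(stall_then_climb), 3))   # → 0.141
print(round(full_auc(climb_then_stall), 3))   # → 0.959
print(round(dedup_auc(uniques, 100), 3))      # same value for both runs → 0.959
```

Both runs contain the same 10 unique molecules and 90 duplicate calls, yet `dedup_auc` rates the early-plateau run as if the plateau sat at the end at the top score.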

I understand the examples I am showing are corner cases, since they contain a lot of duplicate molecules. However, I think this issue might affect the score in real cases as well; it is just less visible. Once again, I am not 100% confident, which is why I would like your opinion on this. If you could explain the reasoning behind keeping only unique molecules, it would be appreciated.