wilkinghoff / icassp2023

Accompanying code for the paper Design Choices for Learning Embeddings from Auxiliary Tasks for Domain Generalization in Anomalous Sound Detection.
GNU General Public License v3.0

Some questions about the AUC & pAUC between different domains #2

Closed: A-New-Page closed this issue 1 year ago

A-New-Page commented 1 year ago

Thank you for your meaningful work.

In the code, the source AUC and target AUC are calculated using data from the corresponding domain only.

But the dataset you used (DCASE 2022) focuses on domain generalization, so the source AUC should be calculated using all data from the source domain plus the anomalous data from the target domain. The same holds for the source pAUC, target AUC and target pAUC.

The detailed official evaluation metrics and explanations can be found at https://dcase.community/challenge2022/task-unsupervised-anomalous-sound-detection-for-machine-condition-monitoring#evaluation

Do you have any results on these? Or please let me know if I have misunderstood the evaluation metrics.

Thanks so much.

wilkinghoff commented 1 year ago

Thank you for your interest. I am not sure if I understand your question correctly.

As you have stated, for each machine type (and for the means over several types), I calculate all performance measures using only the samples belonging to the source or target domain individually, but I also evaluate the performance jointly on data from both domains, which, for domain generalization, is the whole purpose of the system. The individual calculations just provide additional insight into why the system performs as it does.

For example, in the following lines of code you can see that I first output the results for the source domain, followed by the ones for the target domain, and finally for both domains together:

# harmonic means over all machine types (hmean is scipy.stats.hmean)
mean_auc_source = hmean(aucs_source)
print('mean AUC for source domain: ' + str(mean_auc_source * 100))
mean_p_auc_source = hmean(p_aucs_source)
print('mean pAUC for source domain: ' + str(mean_p_auc_source * 100))
mean_auc_target = hmean(aucs_target)
print('mean AUC for target domain: ' + str(mean_auc_target * 100))
mean_p_auc_target = hmean(p_aucs_target)
print('mean pAUC for target domain: ' + str(mean_p_auc_target * 100))
mean_auc = hmean(aucs)
print('mean AUC: ' + str(mean_auc * 100))
mean_p_auc = hmean(p_aucs)
print('mean pAUC: ' + str(mean_p_auc * 100))

The output for each individual machine type is computed similarly. I hope that this clarifies your question.
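
For illustration (this is a minimal sketch with sklearn, not the exact repository code), the per-domain measures for one machine type could be computed roughly like this, assuming arrays y_true (1 = anomalous), y_pred and a domain_list giving each sample's domain:

    import numpy as np
    from sklearn import metrics

    # restrict to the samples of one domain (both normal and anomalous)
    source_mask = np.array([d == "source" for d in domain_list])
    target_mask = ~source_mask

    # AUC and pAUC evaluated on each domain individually (max_fpr=0.1 as in the DCASE setup)
    auc_source = metrics.roc_auc_score(y_true[source_mask], y_pred[source_mask])
    p_auc_source = metrics.roc_auc_score(y_true[source_mask], y_pred[source_mask], max_fpr=0.1)
    auc_target = metrics.roc_auc_score(y_true[target_mask], y_pred[target_mask])
    p_auc_target = metrics.roc_auc_score(y_true[target_mask], y_pred[target_mask], max_fpr=0.1)

    # AUC and pAUC evaluated jointly on both domains
    auc_both = metrics.roc_auc_score(y_true, y_pred)
    p_auc_both = metrics.roc_auc_score(y_true, y_pred, max_fpr=0.1)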

A-New-Page commented 1 year ago

Thanks for your reply!

I agree that the AUC and pAUC calculated by using both domains' data can indeed be used for domain generalization.

But my concern is that the calculation of these evaluation metrics differs from the official metrics.

The official source AUC, target AUC, and pAUC are calculated by:

    # Extract scores for calculation of AUC (source) and AUC (target)
    y_true_s_auc = [y_true[idx] for idx in range(len(y_true)) if domain_list[idx]=="source" or y_true[idx]==1]
    y_pred_s_auc = [y_pred[idx] for idx in range(len(y_true)) if domain_list[idx]=="source" or y_true[idx]==1]
    y_true_t_auc = [y_true[idx] for idx in range(len(y_true)) if domain_list[idx]=="target" or y_true[idx]==1]
    y_pred_t_auc = [y_pred[idx] for idx in range(len(y_true)) if domain_list[idx]=="target" or y_true[idx]==1]

    # extract scores for calculation of precision, recall, F1 score for each domain
    y_true_s = [y_true[idx] for idx in range(len(y_true)) if domain_list[idx]=="source"]
    y_pred_s = [y_pred[idx] for idx in range(len(y_true)) if domain_list[idx]=="source"]
    y_true_t = [y_true[idx] for idx in range(len(y_true)) if domain_list[idx]=="target"]
    y_pred_t = [y_pred[idx] for idx in range(len(y_true)) if domain_list[idx]=="target"]

    # calculate AUC, pAUC, precision, recall, F1 score
    auc_s = metrics.roc_auc_score(y_true_s_auc, y_pred_s_auc)
    auc_t = metrics.roc_auc_score(y_true_t_auc, y_pred_t_auc)
    p_auc = metrics.roc_auc_score(y_true, y_pred, max_fpr=param["max_fpr"])

This code is from https://github.com/Kota-Dohi/dcase2022_task2_baseline_ae/blob/main/01_test.py, which is the baseline for DCASE 2022 Task 2.
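
To make the difference concrete, here is a small toy example (my own sketch, not code from either repository): the official source AUC keeps all source-domain samples plus the anomalous target-domain samples, whereas a purely per-domain source AUC keeps the source-domain samples only.

    import numpy as np
    from sklearn import metrics

    # toy data: 3 normal + 2 anomalous source samples, 3 normal + 2 anomalous target samples
    domain_list = ["source"] * 5 + ["target"] * 5
    y_true = np.array([0, 0, 0, 1, 1, 0, 0, 0, 1, 1])
    y_pred = np.array([0.1, 0.2, 0.3, 0.8, 0.9, 0.4, 0.5, 0.6, 0.55, 0.25])

    src = np.array([d == "source" for d in domain_list])

    # official definition: source-domain samples plus all anomalous samples
    mask_official = src | (y_true == 1)
    auc_source_official = metrics.roc_auc_score(y_true[mask_official], y_pred[mask_official])

    # per-domain definition: source-domain samples only
    auc_source_per_domain = metrics.roc_auc_score(y_true[src], y_pred[src])

    print(auc_source_official, auc_source_per_domain)  # they differ here: ~0.92 vs. 1.0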

I would like to discuss the difference between these metrics with you. I hope this explains my concern.

Thank you

wilkinghoff commented 1 year ago

Thanks for the clarification! All threshold-independent performance measures (AUC and pAUC) are included in the script. I only added additional ones because I wanted to have a more complete picture when developing the system. The official score of the challenge "is given by the harmonic mean of the AUC and pAUC scores over all the machine types, sections, and domains", which is the final score I am printing.

But you can also run the official evaluation script on the output files created by my script. This is also what I did to compare the performance of the system with other published systems in the paper.
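
As a rough illustration of that official score (my own sketch, with made-up numbers), it boils down to a single harmonic mean over all individual AUC and pAUC values:

    import numpy as np
    from scipy.stats import hmean

    # hypothetical per-(machine type, section) results; the real values come from the evaluation
    aucs_source = [0.71, 0.65, 0.80]  # AUC (source)
    aucs_target = [0.58, 0.60, 0.66]  # AUC (target)
    p_aucs = [0.55, 0.53, 0.62]       # pAUC

    # official final score: harmonic mean of all AUC and pAUC values together
    official_score = hmean(np.concatenate([aucs_source, aucs_target, p_aucs]))
    print(official_score)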

A-New-Page commented 1 year ago

Thanks a lot. I now understand the code you pointed me to. Thank you again.