Closed: dxmxyx closed this issue 3 years ago
There are also problems with the way they calculate the F1-score.
"Suppose this model output of [x1 x2 x3 x4] is [1,3,4,2].Groundtruth is [1,2,2,3] So the normal accuracy of this model is 50%" -> why?
Sorry, my fault. The accuracy over all emotions is acc_all = 25%. Suppose this model's output for [x1 x2 x3 x4] is [1,3,4,2] and the groundtruth is [1,2,2,3]. Then the prediction for x1 is correct, but x2, x3, x4 are mispredicted, so the accuracy over all emotions is acc_all = correct_prediction_num / samples_num = 1/4 = 25%.
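(A quick check of this number with sklearn, purely for illustration:)

```python
from sklearn.metrics import accuracy_score

# acc_all for the toy example: only x1 is predicted correctly -> 1/4 = 0.25
print(accuracy_score([1, 2, 2, 3], [1, 3, 4, 2]))
```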
I tried to understand the issue you mentioned, but I failed. It seems quite reasonable: if the output is [1,3,4,2] and the groundtruth is [1,2,2,3], the accuracy is only 25%.
Yep, acc_all = 25%. I typed the wrong number :)
"Take the accuracy of emotion 1 as an example : if the element in output([1,3,4,2]) not equal to 1,they replace the element as 0.On the contrary,replace the element as 0. So the replaced output is [1,0,0,0]." -> what does it mean here?
I am sorry, but I can't understand your words. Can you rephrase your question?
In your code, for example:
results =
[[[0.12,0.44] ,
[0.56,0.14] ,
[0.22,0.05] ,
[0.66,0.22]] ,
[[0.12,0.11] ,
[0.56,0.14] ,
[0.22,0.88] ,
[0.66,0.22]] ,
[[0.12,0.11] ,
[0.56,0.14] ,
[0.22,0.05] ,
[0.66,0.8]] ,
[[0.12,0.11] ,
[0.56,0.77] ,
[0.22,0.05] ,
[0.66,0.22]]]
test_truth =
[[1,0,0,0],
[0,1,0,0],
[0,1,0,0],
[0,0,1,0]]
test_preds = results.view(-1, 4, 2).cpu().detach().numpy(); the shape of test_preds is [sample_num, emotion_num, 2].
test_truth = truths.view(-1, 4).cpu().detach().numpy(); the shape of test_truth is [sample_num, emotion_num].
test_preds_i = np.argmax(test_preds[:, emo_ind], axis=1) gives test_preds_i = [1,0,0,0] (the same as the replaced output [1,0,0,0]); its shape is [sample_num], and it indicates whether each sample is predicted as emotion emo_ind. test_truth_i = test_truth[:, emo_ind] gives test_truth_i = [1,0,0,0] (the same as the replaced groundtruth [1,0,0,0]).
The accuracy of emotion emo_ind (emotion 1) = sklearn.metrics.accuracy_score(test_preds_i, test_truth_i) = sklearn.metrics.accuracy_score([1,0,0,0], [1,0,0,0]) = 100%.
Similarly, doing the same for the other emotions: emotion 2 = sklearn.metrics.accuracy_score([0,0,0,1], [0,1,1,0]) = 25%; emotion 3 = sklearn.metrics.accuracy_score([0,1,0,0], [0,0,0,1]) = 50%; emotion 4 = sklearn.metrics.accuracy_score([0,0,1,0], [0,0,0,0]) = 75%.
Those per-emotion accuracies (calculated your way) are mostly higher than acc_all (25%).
On the dataset you provide, the per-emotion accuracies (calculated your way) look very good. Such high accuracy makes it seem that the classification ability of the model is very strong. But if we calculate accuracy as correct_prediction_num / samples_num, we find that the classification ability of this model is not as good as it appears.
```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def eval_iemocap(results, truths, single=-1):
    emos = ["Neutral", "Happy", "Sad", "Angry"]
    if single < 0:
        test_preds = results.view(-1, 4, 2).cpu().detach().numpy()
        test_truth = truths.view(-1, 4).cpu().detach().numpy()
        for emo_ind in range(4):
            print(f"{emos[emo_ind]}: ")
            test_preds_i = np.argmax(test_preds[:, emo_ind], axis=1)
            test_truth_i = test_truth[:, emo_ind]
            f1 = f1_score(test_truth_i, test_preds_i, average='weighted')
            acc = accuracy_score(test_truth_i, test_preds_i)
            print(" - F1 Score: ", f1)
            print(" - Accuracy: ", acc)
```
(accuracy of each emotion, calculated in your way)
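To make the walkthrough above reproducible, here is a minimal self-contained sketch of the same per-emotion computation on the toy results / test_truth (plain numpy arrays stand in for the tensors; the numbers match the 100% / 25% / 50% / 75% above):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy data from the example above: 4 samples, 4 emotions, 2 scores per emotion.
results = np.array([
    [[0.12, 0.44], [0.56, 0.14], [0.22, 0.05], [0.66, 0.22]],
    [[0.12, 0.11], [0.56, 0.14], [0.22, 0.88], [0.66, 0.22]],
    [[0.12, 0.11], [0.56, 0.14], [0.22, 0.05], [0.66, 0.80]],
    [[0.12, 0.11], [0.56, 0.77], [0.22, 0.05], [0.66, 0.22]],
])
test_truth = np.array([
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
])

for emo_ind in range(4):
    preds_i = np.argmax(results[:, emo_ind], axis=1)  # binary prediction for this emotion
    truth_i = test_truth[:, emo_ind]                  # binary groundtruth for this emotion
    print(emo_ind, accuracy_score(truth_i, preds_i))  # 1.0, 0.25, 0.5, 0.75
```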
Sorry, I forgot to say that the dataset I used is the IEMOCAP data provided with the code.
I have trained the model (using the hyperparameters you provide), and the per-emotion acc and F1 I got are similar to yours. But the classification ability of the model I trained is not good.
The way I calculate acc_all: I calculate it this way because every sample has only one emotion.
Although the per-emotion accuracies are high, acc_all is very low.
I changed the output (results) shape to (sample_num, emotion_num), i.e. [N, 4], and calculated acc_all and f1_all as sketched below:
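A minimal sketch of this acc_all / f1_all calculation (assuming results has shape [N, 4, 2] as above and truths is the one-hot [N, 4] label matrix; the score of class 1 is taken as each emotion's confidence, and the predicted emotion is its argmax; the function name and exact steps are illustrative, not the original snippet):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def eval_single_label(results, truths):
    """Sketch of acc_all / f1_all: treat each sample as having exactly one emotion."""
    scores = results.reshape(-1, 4, 2)[:, :, 1]             # positive-class score per emotion: [N, 4]
    pred_label = np.argmax(scores, axis=1)                  # predicted emotion index per sample
    true_label = np.argmax(truths.reshape(-1, 4), axis=1)   # index of the single true emotion
    acc_all = accuracy_score(true_label, pred_label)
    f1_all = f1_score(true_label, pred_label, average='weighted')
    return acc_all, f1_all
```

On the toy example above this gives acc_all = 0.25 (predicted emotions [1,3,4,2] vs groundtruth [1,2,2,3], 1-indexed), matching the hand calculation.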
I find that the classification ability of this model is not good (acc_all and f1_all are very low), even though the per-emotion acc and F1 (calculated your way) are high.
We investigated this a bit more, and would like to thank you again for the close inspection of our code. On IEMOCAP, we essentially trained MulT as a binary classifier on each of the labels (i.e., a 2-way classification in a multitask fashion), because we regarded the task as a multi-label one (see paper for the dataset setting we consider). This was the setting used by similar prior works on this dataset, and we recycled their eval implementation and data processing pipeline. In a multilabel setting, the question we essentially asked the model to train on was a simple rule: "given a sample X and an emotion label E (out of multiple Es), did E occur in X or not?"--- which makes sense for multi-label predictions. (E.g., see https://github.com/Justin1904/Low-rank-Multimodal-Fusion/issues/11#issuecomment-650517936)
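As a toy sketch of that framing (hypothetical shapes and a plain per-emotion cross-entropy, not the repo's actual training code): each emotion gets its own 2-way classifier over the shared outputs, answering the binary question independently.

```python
import torch
import torch.nn.functional as F

# Toy illustration: one 2-way classifier per emotion, trained jointly (multi-task),
# each answering "did emotion E occur in sample X or not?".
N = 8
logits = torch.randn(N, 4, 2)           # model outputs: [samples, emotions, 2]
targets = torch.randint(0, 2, (N, 4))   # binary multi-label targets: [samples, emotions]

loss = sum(F.cross_entropy(logits[:, e], targets[:, e]) for e in range(4)) / 4
print(loss)
```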
However, we acknowledge that we did not print & check every single sample in IEMOCAP. And we agree that, if it's indeed the case that all samples in the entire dataset have a unique label, then doing a 4-way classification makes sense (like conventional image classification).
Nonetheless, the authors of https://github.com/A2Zadeh/CMU-MultimodalSDK mentioned that IEMOCAP is indeed a multi-label dataset, yet the distribution of labels is skewed. In particular, most of the people in the videos are identified as having the emotion "neutral".
This is the acc_all I calculated with your label settings. My code can be summed up as the same argmax-based acc_all / f1_all calculation sketched above.
The acc_all (calculated my way) should be high if your model performs well, even though you said "we essentially trained MulT as a binary classifier on each of the labels (i.e., a 2-way classification in a multitask fashion)". But the acc_all is very low (<20%). I have also tried a random classifier and can still get high scores under your eval implementation.
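As a rough illustration of that check (synthetic labels, not the actual IEMOCAP data): even a classifier that guesses an emotion uniformly at random scores noticeably higher under the per-emotion scheme than under acc_all.

```python
import numpy as np
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
N = 10000

true_label = rng.integers(0, 4, size=N)   # toy single-label groundtruth, roughly uniform
rand_label = rng.integers(0, 4, size=N)   # a "classifier" that guesses an emotion at random

# Single-label accuracy: about 0.25 for random guessing over 4 classes.
print("acc_all:", accuracy_score(true_label, rand_label))

# Per-emotion binary accuracy, computed the same way as the eval code above: about 0.62 each.
for e in range(4):
    print("emotion", e, accuracy_score(true_label == e, rand_label == e))
```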
The IEMOCAP data you provide should be multi-label, but the samples are in fact single-label, so the difference between the multi-label task and the single-label task is small on your dataset. If your model can handle the multi-label task (harder than the single-label task) well, it should also perform well on the single-label task.
I think the reason why most samples in IEMOCAP are "neutral" is that the dataset needs to match the real-world distribution: in most cases, our emotions are neutral.
I have checked that every single sample in the dataset they provide has only one emotion.
Example: x1 -> emotion1, x2 -> emotion2; none of the samples they provide look like x1 -> emotion1, emotion2.
So the ultimate goal of classification can be represented as x1 -> model -> one emotion.
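The check itself is a one-liner; a sketch, assuming the labels are loaded as the one-hot truths matrix of shape [N, 4] used above (the variable name is illustrative, whatever the data loader returns as the label matrix):

```python
import numpy as np

def every_sample_has_exactly_one_emotion(truths):
    # truths: [N, 4] one-hot/multi-hot label matrix from the provided IEMOCAP data
    return bool(np.all(truths.sum(axis=1) == 1))
```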
OK, let's move on to the topic. Example: we have samples [x1 x2 x3 x4], and all emotions of the samples belong to [1, 2, 3, 4].
Suppose this model's output for [x1 x2 x3 x4] is [1,3,4,2] and the groundtruth is [1,2,2,3]. So the normal accuracy of this model is 25%.
But the way this code calculates acc is different. Take the accuracy of emotion 1 as an example: if an element in the output ([1,3,4,2]) is not equal to 1, they replace it with 0; otherwise, they replace it with 1. So the replaced output is [1,0,0,0]. Do the same thing to the groundtruth ([1,2,2,3]); the replaced groundtruth is [1,0,0,0]. Calculated like this, the accuracy of emotion 1 is 100% (sklearn.metrics.accuracy_score([1,0,0,0], [1,0,0,0]) = 100%). By calculating their way, the accuracy of a specific emotion is much higher than the normal accuracy.
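In code, that per-emotion "replacement" is just a binarization of both the predictions and the groundtruth; an illustrative numpy version of the emotion-1 example:

```python
import numpy as np
from sklearn.metrics import accuracy_score

output = np.array([1, 3, 4, 2])   # predicted emotion per sample
truth = np.array([1, 2, 2, 3])    # true emotion per sample

emo = 1                            # emotion 1
pred_bin = (output == emo).astype(int)    # -> [1, 0, 0, 0]
truth_bin = (truth == emo).astype(int)    # -> [1, 0, 0, 0]

print(accuracy_score(truth_bin, pred_bin))   # 1.0, even though acc_all is only 0.25
```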
I used a random classifier on the dataset and calculated accuracy their way, and that accuracy is also high. However, high accuracy calculated their way does not mean that the classification performance of the model is good!