zihaohe123 / speak-turn-emb-dialog-act-clf


Length of predictions does not match input #5

Open angoodkind opened 2 years ago

angoodkind commented 2 years ago

I am using inference mode. My test.csv file has 4,480 utterances, but preds_on_new.pkl contains a list that is 12,342 items long. Am I doing something wrong?

zihaohe123 commented 2 years ago

Hi Adam, sorry about that. There was a bug in my code: the saved labels included the padding utterances. It should work now.
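For anyone hitting the same symptom, the off-by-padding issue means the saved predictions included positions that were only padding. A minimal sketch of that kind of fix, assuming predictions come in fixed-length chunks with a 0/1 padding mask (shapes and names are hypothetical, not this repo's actual internals):

```python
import numpy as np

# Hypothetical shapes: preds and mask are (num_chunks, chunk_len),
# where mask is 1 for real utterances and 0 for padding.
preds = np.array([[3, 7, 7, 0], [5, 3, 0, 0]])
mask = np.array([[1, 1, 1, 0], [1, 1, 0, 0]])

# Keeping only unpadded positions restores the original utterance count.
labels = preds[mask.astype(bool)]
print(labels)  # [3 7 7 5 3] -> 5 labels for 5 real utterances
```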

angoodkind commented 2 years ago

Unfortunately it's still off. My test.csv has 4,879 utterances, but the list of predictions is 4,895 items long. It's much closer, but there's still a misalignment.

zihaohe123 commented 2 years ago

Could you provide me with your test.csv?

angoodkind commented 2 years ago

Let me know if this works. Thanks!

https://drive.google.com/file/d/1zSnPC5pkYZXN0tCex8DLG-M9wZVSy99D/view?usp=sharing

zihaohe123 commented 2 years ago

On my side the output length is 4,879. I just used the model I shared with you to run the inference.

In the "speaker" column, the values should be 0s and 1s indicating the speaker turns, so what I did was map subject1 and subject2 to 0 and 1. Without doing this the code couldn't run; I'm not sure how it ran on your side.
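A minimal pandas sketch of that speaker mapping, assuming the raw column holds the strings "subject1" and "subject2" as described above:

```python
import pandas as pd

df = pd.read_csv("test.csv")

# Map the speaker labels to the 0/1 turn indicators the code expects.
# "subject1"/"subject2" are the values assumed from this thread.
df["speaker"] = df["speaker"].map({"subject1": 0, "subject2": 1})

df.to_csv("test.csv", index=False)
```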

Regarding the utterance ID: the output should be the same length as the input csv file, so I don't see the point of adding one. If there's a length mismatch, something must be wrong, and in that case I wouldn't trust the output even if it had utterance IDs.

angoodkind commented 2 years ago

Interesting. I saw in the changes in the last commit that you changed how model.pt is named. Did you rename it for your run as well?

zihaohe123 commented 2 years ago

The model I shared with you was named "model_swda.pt". I renamed it to "model.pt" so that it could be loaded by the inference() function.

angoodkind commented 2 years ago

Ok, right, that's what I'd been doing. I'm still getting a result of the wrong length. Would you mind sending me the correct-length list, and I'll try to do more debugging later?

zihaohe123 commented 2 years ago

https://drive.google.com/file/d/1GupSeKyoxgNu4-XBYeLEkTysVfXXpxpL/view?usp=sharing

zihaohe123 commented 2 years ago

The results don't look good to me. I suspect that's because your data is very clean, which makes it very different from the Switchboard corpus, which has a lot of punctuation and symbols; you can check a few utterances in Switchboard to see. Another researcher who used this repo once faced the same issue. He solved it by training a new model on a cleaned Switchboard corpus and then using that model to run inference on his own dataset.

angoodkind commented 2 years ago

1) Thanks for the pickled file. It does work.

2) I agree about the results. I don't suppose you have a link to the cleaned corpus, do you? If not, it seems pretty trivial to do on my own.

zihaohe123 commented 2 years ago

That's right, I don't have a cleaned corpus.

If you do it on your own:

1) Pay attention to the contents in square brackets.

2) In Switchboard there are a lot of utterances labeled as interrupted and continued later ('+' tags, 42 in our case), which means they are not full utterances. It's better to concatenate those, since in your dataset every utterance is a full utterance.

3) Utterances labeled as non-verbal ('x' tags, 26 in our case) should be removed, since they don't exist in your dataset.

(See the sketch below for steps 2) and 3).)

It seems this corpus has already done 2) and 3).
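A rough sketch of steps 2) and 3), assuming a flat per-utterance export with hypothetical act_tag, caller, and text columns (the real SWDA column names may differ):

```python
import pandas as pd

# Hypothetical flat export of SWDA with one utterance per row.
df = pd.read_csv("swda_utterances.csv")

# 3) Drop non-verbal utterances ('x' tags, label 26 in this repo's mapping).
df = df[df["act_tag"] != "x"]

# 2) Fold '+' continuations (label 42) into the previous utterance
# from the same speaker, so every row is a full utterance.
rows = []
for _, row in df.iterrows():
    if row["act_tag"] == "+":
        for prev in reversed(rows):
            if prev["caller"] == row["caller"]:
                prev["text"] += " " + str(row["text"])
                break
        else:
            rows.append(row.to_dict())  # no earlier turn to attach to
    else:
        rows.append(row.to_dict())

clean = pd.DataFrame(rows)
clean.to_csv("swda_cleaned.csv", index=False)
```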

angoodkind commented 2 years ago

Thanks!

SebastianSpeer commented 1 year ago

Hi,

I'm running into a similar issue: the dataset I'm running inference on is much cleaner than the swda dataset, and as a result a lot of the utterances are coded as 26 (non-verbal) or 42 (other). @angoodkind & @zihaohe123 did it help to work with a cleaner version of swda? Does either of you have a cleaned-up version available? I would very much appreciate the help. Thanks in advance!

SebastianSpeer commented 1 year ago

@zihaohe123 in the cleaned-up version you referred to in your previous post, the conv_id is missing. Is that an issue?

zihaohe123 commented 1 year ago

@SebastianSpeer Hi Sebastian,

Thank you for your interest! I'm sorry that the results don't look good on your own dataset. I'm not sure what the problem might be, but it shouldn't be because of a missing conv_id. Please take a closer look at the data in swda and at your own data and see how they differ.

SebastianSpeer commented 1 year ago

So the conv_id has no influence on the classification accuracy?

zihaohe123 commented 1 year ago

@SebastianSpeer Sorry, my bad! conv_id is actually very important, as it indicates which utterances belong to the same conversation. You should add it to your data.
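To illustrate what the column does, here is a minimal example with hypothetical IDs: rows sharing a conv_id are treated as one conversation, so the model's context never crosses conversation boundaries.

```python
import pandas as pd

# Hypothetical test.csv fragment: conv_id groups utterances into
# conversations; context is built within each group only.
df = pd.DataFrame({
    "conv_id": ["c01", "c01", "c01", "c02", "c02"],
    "speaker": [0, 1, 0, 0, 1],
    "text": ["hi", "hello", "how are you", "ready to start?", "yes"],
})

for conv_id, conv in df.groupby("conv_id", sort=False):
    print(conv_id, "->", len(conv), "utterances")
```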

SebastianSpeer commented 1 year ago

@zihaohe123, thanks for your reply. I have it in my data, but the cleaned version of the Switchboard data you were referring to (https://github.com/NathanDuran/Switchboard-Corpus) does not have the conv_id. Do you know of other cleaned versions of the Switchboard corpus?

zihaohe123 commented 1 year ago

@SebastianSpeer NathanDuran's corpus does have the conv_id. Go to swda_data/train: each .txt file is a conversation, and the file name is the conv_id.
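A sketch of building a conv_id column from those files, assuming each line is a pipe-separated speaker|text|label record (check that repo's README for the exact layout):

```python
import pandas as pd
from pathlib import Path

rows = []
# Each .txt file under swda_data/train is one conversation; the file
# name (without extension) serves as the conv_id.
for path in sorted(Path("swda_data/train").glob("*.txt")):
    for line in path.read_text().splitlines():
        # Assumed format: "speaker|utterance text|DA label" per line.
        speaker, text, label = line.split("|")
        rows.append({"conv_id": path.stem, "speaker": speaker,
                     "text": text, "label": label})

df = pd.DataFrame(rows)
print(df.head())
```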