zcxu-eric / Ego4d_TalkNet_ASD

Misalignment of audio-visual frames and labels #2

Open SAGNIKMJR opened 1 year ago

SAGNIKMJR commented 1 year ago

Hi,

There is a chance of misalignment between the AV frames and the labels in dataLoader.py due to the interpolation at https://github.com/zcxu-eric/Ego4d_TalkNet_ASD/blob/ab9f345efc49fd70ed163c6cca674c3aff88e2b6/dataLoader.py#L158. I think there are two ways to handle this: 1) use contiguous AV frames and interpolate the labels, or 2) use discontiguous AV frames and skip label interpolation. Do you expect either of these options to work better?
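To make the two options concrete, here is a minimal sketch. It assumes the labels are binary and given per annotated frame number, and that the annotated frame numbers are sorted but may have gaps. The function names are hypothetical and are not taken from dataLoader.py; nearest-neighbour filling is just one possible interpolation choice.

```python
import numpy as np


def contiguous_with_interpolated_labels(frame_ids, labels):
    """Option 1: use every frame in [first, last] and fill label gaps.

    Each missing frame takes the label of the closest annotated frame
    (ties go to the later frame), so the labels stay binary.
    """
    frame_ids = np.asarray(frame_ids)
    labels = np.asarray(labels)
    all_frames = np.arange(frame_ids[0], frame_ids[-1] + 1)
    # Index of the first annotated frame >= each target frame.
    right = np.clip(np.searchsorted(frame_ids, all_frames), 0, len(frame_ids) - 1)
    left = np.clip(right - 1, 0, len(frame_ids) - 1)
    # Pick whichever annotated neighbour is strictly closer on the left.
    use_left = (all_frames - frame_ids[left]) < (frame_ids[right] - all_frames)
    nearest = np.where(use_left, left, right)
    return all_frames, labels[nearest]


def annotated_frames_only(frame_ids, labels):
    """Option 2: keep only annotated frames; labels are used as-is.

    The returned frame list may be discontiguous in time.
    """
    return np.asarray(frame_ids), np.asarray(labels)


frames, filled = contiguous_with_interpolated_labels([10, 11, 14, 15], [0, 0, 1, 1])
print(frames.tolist(), filled.tolist())
```

With option 1 the AV frames and labels have the same length by construction; with option 2 they also match, but the model sees temporal jumps wherever annotations are missing.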

P.S. This is similar to https://github.com/zcxu-eric/Ego4d_TalkNet_ASD/issues/1#issue-1306619210.

Thanks, Sagnik

zcxu-eric commented 1 year ago

Hi, I think either option is good enough as long as there is no difference in performance, but that is hard to evaluate because the labels are not very accurate. Feel free to open a pull request if you would like to help fix it. Thanks a lot.

SAGNIKMJR commented 1 year ago

Also, some clip frame numbers that appear in the bbox annotations (stored in data/ego4d/bbox) are missing from data/video_imgs. For example, frame number 8991 appears at index 291 of the list stored in data/ego4d/bbox/34bc808a-4b98-42c5-a1ea-6f565dd4aa20:track_0:29.json, but there is no corresponding image frame, i.e., data/video_imgs/34bc808a-4b98-42c5-a1ea-6f565dd4aa20/img_08991.jpg doesn't exist. I looked at https://github.com/zcxu-eric/Ego4d_TalkNet_ASD/blob/main/scripts/extract_frame.sh and it looks like it dumps all clip frames into data/video_imgs. Any idea why this happens? Thanks.

zcxu-eric commented 1 year ago

Hi, we didn't encounter this bug on our side. One quick solution is to check the data sanity before training and discard these labels if necessary.
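A sanity check along these lines could look roughly like the sketch below. It assumes the `img_XXXXX.jpg` naming seen in the path above (five-digit, zero-padded frame number). The function name is hypothetical, and how frame numbers are read out of the bbox JSON is left to the caller because the schema is not shown in this thread.

```python
import os


def split_by_frame_availability(frame_numbers, img_dir):
    """Split annotated frame numbers into those with and without an image.

    Assumes images are stored as img_<5-digit frame number>.jpg in img_dir.
    """
    kept, missing = [], []
    for n in frame_numbers:
        path = os.path.join(img_dir, f"img_{n:05d}.jpg")
        # Keep the label only if its image frame was actually extracted.
        (kept if os.path.isfile(path) else missing).append(n)
    return kept, missing
```

Labels for the frame numbers returned in `missing` can then be discarded before training, or logged to help track down why the frames were not extracted.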