yardencsGitHub / tweetynet

Hybrid convolutional-recurrent neural networks for segmentation of birdsong and classification of elements
BSD 3-Clause "New" or "Revised" License

Error processing files with one annotation #204

Closed: vivinastase closed this issue 2 years ago

vivinastase commented 2 years ago

When processing files with only one annotation, the system raises an error in crowsetta/validation.py, line 65, in column_or_row_or_1d. This seems to happen because, when the annotation file is loaded, the function notmat2annot in notmat.py uses evfuncs.load_notmat, which in turn calls loadmat with the option squeeze_me=True; that option turns a one-element list into a scalar.
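For reference, here is a minimal sketch reproducing that behavior. This is not the vak/crowsetta code itself; the file name and onset value are made up for illustration.

```python
import numpy as np
from scipy.io import loadmat, savemat

# toy .mat file with a single onset, standing in for a one-annotation .not.mat
savemat("single_annot.mat", {"onsets": np.array([[2336.50793651]])})

plain = loadmat("single_annot.mat")
squeezed = loadmat("single_annot.mat", squeeze_me=True)

print(plain["onsets"].shape)                 # (1, 1) -- still an array
print(np.ndim(squeezed["onsets"]))           # 0 -- squeezed down to a scalar
print(np.asarray(squeezed["onsets"]).shape)  # () -- the "bad input shape ()" below
```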

NickleDave commented 2 years ago

Hi @vivinastase, sorry you're having this issue, and thank you for letting us know.

Your description of the source of the error sounds right to me.

Could you please provide a little information just to help us squash the bug?

vivinastase commented 2 years ago

Hi Dave

Here is the error traceback:

```
Traceback (most recent call last):
  File "/home/vivi/anaconda3/envs/tweetynet/bin/vak", line 8, in <module>
    sys.exit(main())
  File "/home/vivi/anaconda3/envs/tweetynet/lib/python3.8/site-packages/vak/__main__.py", line 45, in main
    cli.cli(command=args.command, config_file=args.configfile)
  File "/home/vivi/anaconda3/envs/tweetynet/lib/python3.8/site-packages/vak/cli/cli.py", line 30, in cli
    COMMAND_FUNCTION_MAP[command](...)
  File "/home/vivi/anaconda3/envs/tweetynet/lib/python3.8/site-packages/vak/cli/prep.py", line 132, in prep
    vak_df, csv_path = core.prep(
  File "/home/vivi/anaconda3/envs/tweetynet/lib/python3.8/site-packages/vak/core/prep.py", line 205, in prep
    vak_df = dataframe.from_files(
  File "/home/vivi/anaconda3/envs/tweetynet/lib/python3.8/site-packages/vak/io/dataframe.py", line 112, in from_files
    annot_list = scribe.from_file(annot_files)
  File "/home/vivi/anaconda3/envs/tweetynet/lib/python3.8/site-packages/crowsetta/notmat.py", line 86, in notmat2annot
    notmat_seq = Sequence.from_keyword(labels=np.asarray(list(notmat_dict['labels'])),
  File "/home/vivi/anaconda3/envs/tweetynet/lib/python3.8/site-packages/crowsetta/sequence.py", line 382, in from_keyword
    labels) = cls._validate_onsets_offsets_labels(onsets_s,
  File "/home/vivi/anaconda3/envs/tweetynet/lib/python3.8/site-packages/crowsetta/sequence.py", line 267, in _validate_onsets_offsets_labels
    onsets_s = column_or_row_or_1d(onsets_s)
  File "/home/vivi/anaconda3/envs/tweetynet/lib/python3.8/site-packages/crowsetta/validation.py", line 64, in column_or_row_or_1d
    raise ValueError("bad input shape {0}".format(shape))
ValueError: bad input shape ()
```

With regards to the dataset, I was using my own. Here is what the problem-causing annotation file looks like when loaded with loadmat:

```
{'header': b'MATLAB 5.0 MAT-file Platform: posix, Created on: Thu Mar 10 10:24:41 2022',
 'version': '1.0',
 'globals': [],
 'Fs': array([[44100]]),
 'fname': array(['R3406_40911.56229478_1_3_15_37_9.wav'], dtype='<U36'),
 'onsets': array([[2336.50793651]]),
 'offsets': array([[2449.70521542]]),
 'num_sylls': array([[1]]),
 'labels': array(['a'], dtype='<U1')}
```

and here is the same file loaded with evfuncs.load_notmat:

```
{'header': b'MATLAB 5.0 MAT-file Platform: posix, Created on: Thu Mar 10 10:24:41 2022',
 'version': '1.0',
 'globals': [],
 'Fs': 44100,
 'fname': 'R3406_40911.56229478_1_3_15_37_9.wav',
 'onsets': 2336.5079365079364,
 'offsets': 2449.705215419501,
 'num_sylls': 1,
 'labels': 'a'}
```

After patching the code to turn the scalar back into an array, this part of the processing worked fine.
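The patch was along these lines (just a sketch of the kind of fix, with np.atleast_1d standing in for the exact change):

```python
import numpy as np

# onsets as returned by evfuncs.load_notmat for a file with one annotation
onsets = 2336.5079365079364

# np.atleast_1d turns the scalar back into a 1-d array,
# and leaves multi-annotation arrays unchanged
onsets = np.atleast_1d(onsets)
print(onsets.shape)  # (1,)
```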

NickleDave commented 2 years ago

I see, thank you @vivinastase, that helps track down the source.

I think the right fix might be to add a final if that catches cases where there's only one annotated segment. I need to double-check, but I'm pretty sure there's other code that expects to get 1-d arrays when loading the .not.mat format, so removing squeeze_me=True might cause other issues.
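Roughly, the kind of check I have in mind would look like this (just a sketch, not the actual crowsetta code; the helper name is made up):

```python
import numpy as np

def notmat_fields_to_1d(notmat_dict):
    """Coerce onsets, offsets, and labels from a loaded .not.mat dict to 1-d arrays.

    Guards against the single-annotation case, where loading with
    squeeze_me=True collapses each field to a scalar.
    """
    onsets = np.asarray(notmat_dict["onsets"])
    offsets = np.asarray(notmat_dict["offsets"])
    labels = np.asarray(list(notmat_dict["labels"]))
    if onsets.ndim == 0:  # only one annotated segment in this file
        onsets = onsets.reshape(1)
        offsets = offsets.reshape(1)
    return onsets, offsets, labels
```

A check like that could run right before the Sequence.from_keyword call shown in the traceback above.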

Would it be easier for you to work with another format? You can use a simple .csv file of annotations, as described here: https://vak.readthedocs.io/en/latest/howto/howto_user_annot.html
Please let us know if that how-to guide is not clear; we could revise it. We could also point to it somewhere else in the vak docs--maybe the tutorial gave you the impression you needed to use the .not.mat format?

You can also use other formats that crowsetta can parse, like Praat TextGrid files (not sure how you're annotating your data).

vivinastase commented 2 years ago

No, removing squeeze_me would cause other problems because of the data shape expected by the column_or_row_or_1d function. I added an if-based patch, as you suggested.

In response to your question about data formatting: no, I didn't have the impression that I had to use this format. It was just easier, because the example data used this format and it was not a problem to reproduce it. I had tried the csv format, but got some errors that were harder to track down (sorry, I deleted that version of the dataset, so I cannot post a traceback). Maybe having some example data with csv annotations would make it easier for people who want to use this format? It's easier to just see the file than to read explanations and possibly misinterpret them :)

NickleDave commented 2 years ago

> Maybe having some example data with csv annotations would make it easier for people who want to use this format? It's easier to just see the file than to read explanations and possibly misinterpret them :)

Yes, agreed, thank you

NickleDave commented 2 years ago

@vivinastase I am going to close this -- I just opened an issue on the vak repo so I can track and fix it there.

I included a link to the discussion here. Thank you for your valuable feedback. I don't mean to give you the impression that we will not fix this; I just want to make sure I handle it as an issue with vak and crowsetta, not tweetynet itself.