Results basically show that you could achieve a comparable segment error rate with a classifier that uses features extracted from segments, if you had a perfect, cleaned-up segmentation. But of course you don't have that IRL, so the resulting false positives inflate the error rate.

The final figure will have labels for the mean lines (TweetyNet vs. SVM).
I've convinced myself that the higher error I originally saw was due to an off-by-one error in the previous implementation of the predict function.
That version looked like this:

```python
from pathlib import Path

import joblib
import numpy as np
import pandas as pd
from tqdm import tqdm


def predict(extract_csv_path, clf_path, labelset, predict_dst, split):
    import vak  # to avoid circular imports

    predict_dst = Path(predict_dst).expanduser().resolve()
    if not predict_dst.exists() or not predict_dst.is_dir():
        raise NotADirectoryError(
            f'predict_dst not found, or not recognized as a directory:\n{predict_dst}'
        )

    # convert to Path so we can use .stem when naming the output file below
    extract_csv_path = Path(extract_csv_path)
    extract_df = pd.read_csv(extract_csv_path)
    extract_df = extract_df[extract_df.split == split]

    clf = joblib.load(clf_path)

    labelset = vak.converters.labelset_to_set(labelset)
    labelmap = vak.labels.to_map(labelset, map_unlabeled=False)
    inverse_labelmap = {v: k for k, v in labelmap.items()}

    # load per-file feature dataframes and concatenate them into one
    ftr_paths = extract_df.features_path.values.tolist()
    ftr_dfs = []
    for row_num, ftr_path in enumerate(tqdm(ftr_paths)):
        ftr_df = pd.read_csv(ftr_path)
        # "foreign key" maps back to a row of extract_df,
        # so we can figure out which predictions are for which row
        ftr_df['foreign_key'] = row_num
        ftr_dfs.append(ftr_df)
    ftr_df = pd.concat(ftr_dfs)

    x_pred = ftr_df.drop(labels=['labels', 'foreign_key'], axis="columns").values
    y_pred = clf.predict(x_pred)

    # split the flat vector of predictions back into one sequence per source file
    split_inds = np.nonzero(np.diff(ftr_df.foreign_key.values))[0]
    y_pred_list = np.split(y_pred, split_inds)
    y_pred_list = [
        ''.join([inverse_labelmap[el] for el in y_pred]) + "\n"
        for y_pred in y_pred_list
    ]

    pred_path = predict_dst / (extract_csv_path.stem + '.pred.txt')
    with pred_path.open('w') as fp:
        fp.writelines(y_pred_list)

    return pred_path
```
The problem here is that if you put in a `breakpoint()` and inspect `split_inds`, you'll find that they are off by one, because `np.diff` by definition returns `out[i] = a[i+1] - a[i]`.

This means that if the column of monotonically increasing integers in `ftr_df` that tracks which annotation/audio file each row comes from, `'foreign_key'`, changes from 0 to 1 at row 26

```
(Pdb) ftr_df.foreign_key.values[:27]
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1])
```

then the first value of `split_inds` will be 25
```
(Pdb) np.diff(ftr_df.foreign_key.values)[:26]
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1])
```

(because `split_inds = np.nonzero(np.diff(ftr_df.foreign_key.values))[0]`).
This off-by-one error might at first look like it only affects two sequences, the first and last, if you only inspect the lengths of the ground-truth and predicted sequences

```
(Pdb) annot_lens
[26, 110, 72, 113, 220, 135, 116, 130, 216, 38, 194, 134, 173, 213, 89, 297, 31, 18, 131, 119, 174]
(Pdb) [len(y_pred) for y_pred in y_pred_list]
[25, 110, 72, 113, 220, 135, 116, 130, 216, 38, 194, 134, 173, 213, 89, 297, 31, 18, 131, 119, 175]
```
but what this misses is that every sequence is off by one: each predicted sequence is missing its own last label, and every sequence after the first starts with a label that belongs to the previous file. This obviously inflates the syllable error rate.
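For reference, here is a minimal sketch of one way to fix the off-by-one, using made-up toy arrays rather than the actual data: add 1 to the indices returned by `np.nonzero(np.diff(...))` so that `np.split` cuts after the last row of each annotation instead of one row early.

```python
import numpy as np

# toy example (hypothetical): 26 segments from audio file 0, then 3 from file 1
foreign_key = np.array([0] * 26 + [1] * 3)
y_pred = np.array(['a'] * 26 + ['b'] * 3)  # pretend one predicted label per segment

# buggy version: np.diff marks index 25, so np.split cuts one element too early
buggy_inds = np.nonzero(np.diff(foreign_key))[0]       # array([25])
buggy_chunks = np.split(y_pred, buggy_inds)
print([len(c) for c in buggy_chunks])                  # [25, 4] -- should be [26, 3]
print(buggy_chunks[1][0])                              # 'a' -- leaked in from file 0

# fixed version: add 1 so each split falls after the last row of an annotation
fixed_inds = np.nonzero(np.diff(foreign_key))[0] + 1   # array([26])
fixed_chunks = np.split(y_pred, fixed_inds)
print([len(c) for c in fixed_chunks])                  # [26, 3] -- matches the annotations
```

An alternative would be to group rows by `foreign_key` directly (e.g., with a pandas `groupby`), which avoids computing split indices at all.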
Fixed other errors as detailed in commits, re-ran all results.
Will do further figure tweaking but merging in now.
@yardencsGitHub I will squash these commits before merging, but wanted to start the PR to draw your attention to the results.
I think we should include this as a main figure, especially given the rewrite of the intro to focus on limitations of alternative approaches. These results provide real evidence of those limitations, so we can say "as we show in Results".