Any explanation on feature window re-ordering?

xinjli / allosaurus

Allosaurus is a pretrained universal phone recognizer for more than 2000 languages

GNU General Public License v3.0

Any explanation on feature window re-ordering? #49

Open Jackbennett opened 2 years ago

Jackbennett commented 2 years ago

Hi, I'm looking at shrinking the processing window so that it no longer covers the entire audio file at once.

Could you shed any light on this line? https://github.com/xinjli/allosaurus/blob/d9f1adaf47d3b3765b41f4177da62a051516d636/allosaurus/pm/utils.py#L19 Why does it use np.roll to shift the frames forward and backward, and then join all 3 copies together to widen each sample?

I've spread out the one-liner as below to try to figure it out.

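    import numpy as np

    def feature_window(feature):  # wrapper name assumed; the post showed only the body
        rollup   = np.roll(feature, 1, axis=0)   # shift down by one: last frame becomes first
        rolldown = np.roll(feature, -1, axis=0)  # shift up by one: first frame becomes last

        # join [previous, current, next] along the second axis
        combined = np.concatenate((rollup, feature, rolldown), axis=1)
        windowed = combined[::3, ]  # keep every third row, dropping the overlapping windows

        return windowed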

It seems to stack each frame with its neighbors into 1 wider sample and then drop all the overlaps by keeping every 3rd row. But why the np.roll?

xinjli commented 2 years ago

Hi, thanks for your question.

The intention here is to make each frame cover a longer audio span; the roll lets each frame pick up its neighboring features.

Let's say you originally have features [1,2,3,4,5,6].

Rolling up and rolling down creates two more features, [6,1,2,3,4,5] and [2,3,4,5,6,1].
Concatenating them with the original gives you [[6,1,2], [1,2,3], ..., [4,5,6], [5,6,1]].
Dropping the overlapping ones then leaves [[6,1,2], [3,4,5]].

So now you have a smaller number of features (6 -> 2), but each feature covers a longer range (1 -> 3).

There are some mistakes at the beginning and at the end, because 6 should not come before 1, but this is usually a small error and can be ignored.
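
The walk-through above can be reproduced with a small NumPy sketch. It assumes each frame is a single scalar (a column vector of six frames) purely so the output is easy to read; real features would have more columns. The variable names are illustrative and not taken from the repository.

```python
import numpy as np

# Six frames, each a 1-dimensional "feature", so shape is (6, 1).
feature = np.arange(1, 7).reshape(-1, 1)

previous_frame = np.roll(feature, 1, axis=0)   # rows: 6,1,2,3,4,5 (wraps around)
next_frame = np.roll(feature, -1, axis=0)      # rows: 2,3,4,5,6,1 (wraps around)

# Each row becomes [previous, current, next].
combined = np.concatenate((previous_frame, feature, next_frame), axis=1)

# Keep every third row so the remaining windows do not overlap.
windowed = combined[::3]

print(combined.tolist())
# [[6, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6], [5, 6, 1]]
print(windowed.tolist())
# [[6, 1, 2], [3, 4, 5]]
```

The first row shows the wrap-around: frame 6 appears before frame 1 only because np.roll is circular.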

willstott101 commented 2 years ago

We're experimenting with creating a live-streaming version of this project.

Would you accept a PR to change this logic to work better for live-streaming?

    [1,2,3,4,5,6,7]   -> [[1,2,3],[4,5,6],[7,7,7]]
    [1,2,3,4,5,6,7,8] -> [[1,2,3],[4,5,6],[7,8,8]]

Would changing this have any ramifications for phoneme timings? If so, and if they're not easily resolvable, perhaps this would work better:

    [1,2,3,4,5,6,7]     -> [[1,1,2],[3,4,5],[6,7,7]]
    [1,2,3,4,5,6,7,8,9] -> [[1,1,2],[3,4,5],[6,7,8],[9,9,9]]
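
For reference, here is a rough sketch of how the two proposed mappings could be written with NumPy. Both function names (window_pad_end, window_pad_both) are hypothetical and not part of allosaurus. The sketch assumes a 2-D array of shape (frames, dims) and a fixed window of 3, and it only reproduces the examples above; it says nothing about the phoneme-timing question.

```python
import numpy as np

def window_pad_end(feature, size=3):
    """Group frames into non-overlapping windows, repeating the last frame to fill the final one."""
    pad = (-len(feature)) % size  # frames missing from the last window
    padded = np.pad(feature, ((0, pad), (0, 0)), mode="edge")
    # Consecutive frames are flattened side by side, like concatenating on axis 1.
    return padded.reshape(len(padded) // size, size * feature.shape[1])

def window_pad_both(feature, size=3):
    """Same grouping, but with the first frame repeated once at the start."""
    shifted = np.concatenate((feature[:1], feature), axis=0)
    return window_pad_end(shifted, size)

def frames(n):
    """Hypothetical helper: n scalar frames numbered 1..n, shape (n, 1)."""
    return np.arange(1, n + 1).reshape(-1, 1)

print(window_pad_end(frames(7)).tolist())   # [[1, 2, 3], [4, 5, 6], [7, 7, 7]]
print(window_pad_end(frames(8)).tolist())   # [[1, 2, 3], [4, 5, 6], [7, 8, 8]]
print(window_pad_both(frames(7)).tolist())  # [[1, 1, 2], [3, 4, 5], [6, 7, 7]]
print(window_pad_both(frames(9)).tolist())  # [[1, 1, 2], [3, 4, 5], [6, 7, 8], [9, 9, 9]]
```

Unlike the np.roll version, neither variant wraps the last frame around to the first window, so the output for earlier frames does not depend on audio that has not arrived yet.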