Jackbennett opened this issue 2 years ago
Hi, thanks for your question.
The intention here is to make each frame cover a longer audio span; `np.roll` lets each frame pick up its neighboring features.
Say you originally have features [1,2,3,4,5,6]; after rolling, stacking, and taking every third frame, they become [[6,1,2],[3,4,5]].
So now you have fewer features (6 -> 2), but each feature covers a longer range (1 -> 3).
There are small mistakes at the beginning and the end because `np.roll` wraps around (6 should not come before 1), but the error is usually minor and can be ignored.
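A minimal sketch of the roll-and-stack trick described above (`stack_frames` is a hypothetical name for illustration, not the actual function in `utils.py`):

```python
import numpy as np

def stack_frames(x):
    """Stack each frame with its left and right neighbours, then keep every 3rd frame.

    np.roll(x, 1) shifts right (the last element wraps to the front) and
    np.roll(x, -1) shifts left (the first element wraps to the end), so
    stacking [rolled_right, x, rolled_left] gives [prev, cur, next] per frame.
    """
    prev = np.roll(x, 1, axis=0)    # left neighbour (wraps at index 0)
    nxt = np.roll(x, -1, axis=0)    # right neighbour (wraps at the end)
    stacked = np.stack([prev, x, nxt], axis=1)
    return stacked[::3]             # drop overlapping windows: 6 frames -> 2

x = np.array([1, 2, 3, 4, 5, 6])
print(stack_frames(x).tolist())  # [[6, 1, 2], [3, 4, 5]] -- note the wrapped 6
```

The wrapped `6` in the first window is exactly the small boundary mistake mentioned above.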
We're experimenting with a live-streaming version of this project.
Would you accept a PR to change this logic to work better for live-streaming?
[1,2,3,4,5,6,7]
-> [[1,2,3],[4,5,6],[7,7,7]]
[1,2,3,4,5,6,7,8]
-> [[1,2,3],[4,5,6],[7,8,8]]
Are there any ramifications for phoneme timings if we were to change this? If so, and if they're not easily resolvable, perhaps this would work better:
[1,2,3,4,5,6,7]
-> [[1,1,2],[3,4,5],[6,7,7]]
[1,2,3,4,5,6,7,8,9]
-> [[1,1,2],[3,4,5],[6,7,8],[9,9,9]]
Hi, I'm looking at shrinking the processing window down from the entire audio file at once to something smaller.
Could you shed any light on this line? https://github.com/xinjli/allosaurus/blob/d9f1adaf47d3b3765b41f4177da62a051516d636/allosaurus/pm/utils.py#L19 Why does it use `np.roll` to shift the frames toward the front and toward the end, and then join all three copies together to widen the sample?
I spread out the one-liner as below to try to figure it out.
It seems to stack the overlapping features into one deeper sample and then drop the overlaps by taking every 3rd item. But why the `np.roll`?
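For reference, `np.roll` shifts elements cyclically, wrapping the ends around, which is why the last frame can end up in front of the first:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])
print(np.roll(x, 1).tolist())   # [6, 1, 2, 3, 4, 5] -- last element wraps to the front
print(np.roll(x, -1).tolist())  # [2, 3, 4, 5, 6, 1] -- first element wraps to the end
```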