Hi @xinjli,

I have added two arguments to the forward() function in /allosaurus/allosaurus/am/allosaurus_torch.py:
:return_lstm: a list containing the output embeddings and their respective lengths
:return_both: a tuple containing (a list with the output embeddings and their respective lengths, and the output of the phone layer)
This would let a user extract the embeddings for any downstream task, for example intent classification.
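For reference, the shape of the change could look like the minimal sketch below. The module, layer sizes, and names here are hypothetical stand-ins to illustrate the proposed flags, not the actual allosaurus code:

```python
import torch
import torch.nn as nn

class TinyAM(nn.Module):
    """Hypothetical stand-in for the acoustic model, showing the two flags."""

    def __init__(self, feat_dim=40, hidden=64, n_phones=10):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden)          # input: [len, batch, feat_dim]
        self.phone_layer = nn.Linear(hidden, n_phones)

    def forward(self, x, lengths, return_lstm=False, return_both=False):
        out, _ = self.lstm(x)                          # [len, batch, hidden]
        if return_lstm:
            # embeddings only, plus their lengths
            return [out, lengths]
        phones = self.phone_layer(out)                 # [len, batch, n_phones]
        if return_both:
            # embeddings (with lengths) and the phone-layer output together
            return ([out, lengths], phones)
        return phones
```

With both flags off, the behavior is unchanged, so existing callers are unaffected.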
How to use it, given a list of paths to wav files:
```python
import torch
import numpy as np
from torch.nn.utils.rnn import pad_sequence
from allosaurus.audio import read_audio
from allosaurus.app import read_recognizer
from allosaurus.am.utils import *

recognizer = read_recognizer()

wav_paths = ['/home/hemant/cmu/fluent_speech_commands_dataset/wavs/speakers/k5bqyxx2lzIbrlg9/16f1a930-452a-11e9-a843-8db76f4b5e29.wav',
             '/home/hemant/cmu/fluent_speech_commands_dataset/wavs/speakers/NgQEvO2x7Vh3xy2xz/5a9e2580-45bd-11e9-8ec0-7bf21d1cfe30.wav']

feats, feat_lens = [], []
for wav_path in wav_paths:
    feat = torch.tensor(recognizer.pm.compute(read_audio(wav_path)))    # [len, features]
    feat_len = torch.tensor(np.array([feat.shape[0]], dtype=np.int32))  # 1-element array
    feats.append(feat)
    feat_lens.append(feat_len)

feats = pad_sequence(feats, batch_first=True, padding_value=0)  # [batch, len, features]
feat_lens = pad_sequence(feat_lens, batch_first=True, padding_value=0).squeeze()

# sort the batch by descending length, as required by the LSTMs in the AM
idx = torch.argsort(feat_lens, descending=True)

# move to tensors on the configured device
tensor_batch_feat, tensor_batch_feat_len = move_to_tensor([feats[idx], feat_lens[idx]], recognizer.config.device_id)

# extract the embeddings
output_tensor, input_lengths = recognizer.am(tensor_batch_feat, tensor_batch_feat_len, return_lstm=True)  # [len, batch, features]
```
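Once the embeddings come back as [len, batch, features], a downstream task such as intent classification would typically mean-pool each utterance over its valid frames before feeding a classifier. A minimal sketch, assuming the shapes above (the helper name `mean_pool` is my own, not part of allosaurus):

```python
import torch

def mean_pool(output_tensor, lengths):
    """Mean-pool [len, batch, features] embeddings over the valid frames of each utterance."""
    L, B, F = output_tensor.shape
    # mask[t, b] = 1 while frame t is within utterance b's length, else 0
    mask = (torch.arange(L).unsqueeze(1) < lengths.unsqueeze(0)).float()  # [len, batch]
    summed = (output_tensor * mask.unsqueeze(-1)).sum(dim=0)              # [batch, features]
    return summed / lengths.unsqueeze(-1).float()                         # [batch, features]
```

The pooled [batch, features] tensor can then go straight into, e.g., a linear intent classifier.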
LMK your thoughts on this.
Thanks for open-sourcing the work. Really appreciate it.