xiph / LPCNet

Efficient neural speech synthesis
BSD 3-Clause "New" or "Revised" License

Inconsistency of recurrent matrix block sparse format between training (Keras) and runtime (C) #123

Closed: patrickltobing closed this issue 4 years ago

patrickltobing commented 4 years ago

I may have found a subtle, but probably very important, difference in the sparse matrix implementation between training and runtime.

In the runtime code (C), the sparse matrix multiplication is done by sparse_sgemv_accum16, where the multiplication and accumulation are performed on blocks of 16 rows out of a total of 3x384 rows (for the update, reset, and new gates), i.e., in 72 iterations.

For each block of 16 rows, the indices of the non-sparse weight columns are stored in nnet_data.c: the index array contains, for each of the 72 row blocks, the number of non-sparse columns for that block, followed by their column indices in [0, 383].

So the matrix structure, when loaded from nnet_data.c at runtime, is assumed to be something like this:

    1st   2nd   .... 24th | 25th   ..... 48th | 49th   ..... 72nd | 
0   |==| |==|  ..... |==| | |==|   ..... |==| | |==|   ..... |==| |
1   |==| |==|  ..... |==| | |==|   ..... |==| | |==|   ..... |==| |
..      ....
383 |==| |==|  ..... |==| | |==|   ..... |==| | |==|   ..... |==| |

where each |==| contains 16 values and corresponds to the output part of the matrix multiplication, i.e., the rows, while 0, 1, ..., 383 correspond to the input part, i.e., the columns.
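To make the assumed runtime walk concrete, here is a rough NumPy sketch of how I understand the index list is consumed; the function name, arguments, and the flat weight layout are just my assumptions for illustration, not the actual sparse_sgemv_accum16 code:

```python
import numpy as np

def sparse_gemv_accum16(out, weights, idx, x):
    """out += W*x, where W is stored as 16-value column blocks plus an index list.

    idx layout (per the description above): for each block of 16 output rows,
    one count of non-sparse columns, followed by that many column indices.
    weights: the surviving 16x1 blocks, flattened in the same order.
    """
    n_blocks = len(out) // 16
    w = 0  # position in the flattened block weights
    p = 0  # position in the index list
    for b in range(n_blocks):
        n_cols = idx[p]; p += 1          # how many non-sparse columns in this row block
        for _ in range(n_cols):
            col = idx[p]; p += 1         # column index in [0, 383]
            out[16*b:16*(b+1)] += weights[w:w+16] * x[col]
            w += 16
    return out
```

For the 3x384-row recurrent matrix this outer loop runs 72 times, matching the 72 row blocks in the sketch above.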

The row/column notation is a little confusing here, because the Keras GRU recurrent matrix is actually stored as the transpose of the layout shown in the sketch above.

However, in training (I'm not sure why), a transpose (A = np.transpose(A, (1, 0)) and mask = np.transpose(mask, (1, 0))) is used inside the Sparsify callback, even though the Keras recurrent matrix is already stored in the transposed layout.
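For reference, here is a condensed sketch of that transpose-then-block-prune pattern on a single gate slice; this is my simplification for illustration (the thresholding rule and function name are assumptions), not the actual Sparsify code:

```python
import numpy as np

def sparsify_gate(A, density=0.1, block=16):
    """A: one (N, N) gate slice of the Keras recurrent_kernel (N = 384 in LPCNet)."""
    N = A.shape[0]
    At = np.transpose(A, (1, 0))                  # the transpose the issue points at
    L = np.reshape(At, (N, N // block, block))    # split each row of At into 16-wide blocks
    S = np.sum(L * L, axis=-1)                    # block "energies"
    thresh = np.sort(S, axis=None)[int(S.size * (1 - density))]
    mask = np.repeat((S >= thresh).astype(A.dtype), block, axis=1)
    mask = np.transpose(mask, (1, 0))             # transposed back after pruning
    return A * mask                               # prune the original-orientation weights
```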

So during training, when the matrix is sparsified, the block-sparse format actually becomes the transpose of the one used at runtime (i.e., inconsistent), something like this:

        0    16   ... 368
1st    |==| |==| .... |==|
2nd    |==| |==| .... |==|
...
384th  |==| |==| .... |==|
385th  |==| |==| .... |==|
...
768th  |==| |==| .... |==|
769th  |==| |==| .... |==|
...
1152nd |==| |==| .... |==|

Note that here the block of 16 values |==| lies along the input part of the matrix multiplication instead of the output part. After sparsification the matrix is transposed back, so the resulting sparse block structure is the transpose of the one in the first sketch.

Now, if the training and runtime sparse matrix structures were consistent, the total number of entries in the sparse GRU recurrent weight indices in nnet_data.c should be roughly (0.1x384x384 // 16) + (384 // 16) = 993. The first term comes from the 0.1 sparsity target on the recurrent weights and the block size of 16 rows; the second term accounts for the per-block counts of non-sparse columns, i.e., 72 entries.

However, because it is inconsistent, i.e., the sparse block structure saved from training is actually the transpose of the structure assumed at runtime, the number of entries in the sparse GRU recurrent weight indices in nnet_data.c is 2837.

If training is done with PyTorch, the transpose is needed during sparsification (assuming the runtime reads the sparse block structure as in the first sketch), because the PyTorch GRU recurrent matrix is the transpose of the Keras one. But if training is done with Keras, the transpose should not be needed during sparsification.
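As a quick shape-only illustration of the layout difference I am assuming (no framework imports needed):

```python
# Keras stores the GRU recurrent kernel as (units, 3*units), while PyTorch's
# weight_hh_l0 is (3*hidden, hidden), i.e. the transpose. So a PyTorch recurrent
# matrix would need a transpose before sparsification to match the Keras layout,
# as argued above.
import numpy as np

units = 384
keras_recurrent_kernel = np.zeros((units, 3 * units))    # (384, 1152)
pytorch_weight_hh = np.zeros((3 * units, units))         # (1152, 384)
assert pytorch_weight_hh.T.shape == keras_recurrent_kernel.shape
```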

This is probably a bit confusing, but if my understanding is correct, the computation time could be decreased, because the number of iterations over columns would be reduced by nearly a factor of 3, i.e., to about 993.

Please let me know if my understanding is incorrect, and thank you for this fundamental work on real-time/low-latency neural vocoding.

zhangjh915 commented 4 years ago

I am not sure if I understand your question correctly, but I once had a similar question when I worked on this code. Basically, I found that there is a transpose operation in the Sparsify callback, and the transpose of A does not cancel it out, so the GRU recurrent kernel is sparsified along columns. However, dump_lpcnet dumps the weights sparsified along rows. This confused me for some time, and after some experiments I noticed that when the weights are loaded, the rows and columns are transposed because use_gpu is false here, so not CuDNNGRU but the normal GRU layout is used. And Keras has its own transformation between the different types of GRU. Check this out, which helped me understand the differences. Hope this answer helps with your question.

patrickltobing commented 4 years ago

> I noticed that when the weights are loaded, the rows and columns are transposed because use_gpu is false here, so not CuDNNGRU but the normal GRU layout is used

I'm not quite sure, but both CuDNNGRU and GRU in Keras seem to have the recurrent matrix format [hidden_size, 3*hidden_size], i.e., the transposed version, or maybe I'm wrong?

In any case, if the structure were consistent, the number of recurrent indices in nnet_data.c, printed out here, should be around 993, not 2837.

zhangjh915 commented 4 years ago

> I noticed that when the weights are loaded, the rows and columns are transposed because use_gpu is false here, so not CuDNNGRU but the normal GRU layout is used
>
> I'm not quite sure, but both CuDNNGRU and GRU in Keras seem to have the recurrent matrix format [hidden_size, 3*hidden_size], i.e., the transposed version, or maybe I'm wrong?
>
> In any case, if the structure were consistent, the number of recurrent indices in nnet_data.c, printed out here, should be around 993, not 2837.

The dimension of the GRU recurrent matrix is [hidden_size, 3*hidden_size], which means there are 3 gates and each gate is a square matrix. The CuDNNGRU and GRU recurrent kernel weights in Keras have a transpose-like relation, meaning each of the 3 gates has a transpose-like relation, while the overall dimension of the GRU weights is still [hidden_size, 3*hidden_size].
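A small sketch of that per-gate transpose-like relation (purely illustrative, not the actual Keras conversion code):

```python
# Both layouts keep the recurrent kernel at (units, 3*units); converting between
# them transposes each (units, units) gate block individually, not the whole matrix.
import numpy as np

units = 4
W = np.arange(units * 3 * units, dtype=float).reshape(units, 3 * units)
gates = [W[:, k * units:(k + 1) * units] for k in range(3)]   # update, reset, new
W_converted = np.concatenate([g.T for g in gates], axis=1)
assert W_converted.shape == W.shape                            # still (units, 3*units)
```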

patrickltobing commented 4 years ago

My computation was wrong; it should be 0.1x3x384x384 // 16 + 3x384 // 16 ≈ 2837, so the number of entries in nnet_data.c is already correct.
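For completeness, the corrected back-of-envelope count (illustrative arithmetic only; the exact count depends on the density actually achieved):

```python
density = 0.1      # sparsity target for the GRU recurrent weights
N = 384            # GRU units
gates = 3          # update, reset, new
block = 16         # block size used by sparse_sgemv_accum16

weight_index_entries = round(density * gates * N * N / block)  # ~2765 column indices
count_entries = gates * N // block                             # 72 per-block counts
print(weight_index_entries + count_entries)                    # -> 2837
```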