zpzim / SCAMP

The fastest way to compute matrix profiles on CPU and GPU!
http://www.cs.ucr.edu/~eamonn/MatrixProfile.html
MIT License

Implications of Z-Normalization in the Matrix Profile #127

Open hexagonhill opened 7 months ago

hexagonhill commented 7 months ago

Hello,

I am struggling to apply the suggestions from the paper (Implications of Z-Normalization in the Matrix Profile) to SCAMP (CPU only at the moment). While the formula itself is not too difficult, its computation is not local in the code, and my statistics background is weak.

I was hoping for some guidance on how to implement it in SCAMP. My current thought is that the correction needs to be calculated in the handle_row functions (where I think the correlation is being calculated), with the standard deviation of the noise passed in via SCAMPKernelInputArgs, which would require some extra members. The standard deviations of the subsequences would come from the variables normsa and normsb in handle_row.

Would this be a suitable approach?

Thanks

zpzim commented 7 months ago

Hi, I have not read this paper thoroughly, but I assume you are talking about algorithm 1 mentioned in the paper?

[Image: screenshot of Algorithm 1 from the paper]

Internally, SCAMP computes the correlation between every pair of subsequences using the formula outlined in the SCAMP paper. See issue #39 for some detailed discussion of the derivation, which may or may not be helpful.

SCAMP converts that correlation into distance as a final post-processing step, using the formula mentioned in the SCAMP paper.
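For reference, that post-processing step (equation 7 in the SCAMP paper, restated later in this thread as dist_squared = 2*m*(1-corr)) can be sketched as:

```cpp
#include <cassert>
#include <cmath>

// Convert the max correlation SCAMP tracks in the kernel into a
// z-normalized Euclidean distance. m is the subsequence length.
// (Equation 7 in the SCAMP paper.)
double corr_to_dist(double corr, int m) {
  return std::sqrt(2.0 * m * (1.0 - corr));
}
```

A perfect match (corr = 1) maps to distance 0, and because the mapping is monotonically decreasing, maximizing correlation in the kernel is equivalent to minimizing distance.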

I think in order to get things to work in SCAMP, you would need to rewrite Algorithm 1 in that paper in terms of the formulas used in SCAMP. Since this formula applies a correction to the distance, you will need to represent that as a correction to the correlation SCAMP is computing. I think this might be possible, but we would need to derive the correction formula.

@kavj was the one who originally derived the factorization of the correlation used in the SCAMP paper, he might be able to help if he's around.

hexagonhill commented 7 months ago

Thanks. Sorry, I assumed matrix profile research was a small world. I am not a researcher; I just use your code on a Jetson Nano with some minor changes, e.g. writing to a std::vector instead of a file.

Yes, that is the formula, and I am aware you convert the correlation to distance after all the heavy computation is done. I read some of your paper (Matrix Profile XIV: Scaling Time Series Motif Discovery with GPUs to Break a Quintillion Pairwise Comparisons a Day and Beyond) and had the impression that Section 3.3, equation 6, corresponds to the handle_row function in src/cpu_kernels.cpp, and that all that had to be done was something like (pseudo C++, ignoring the Eigen vectorisation):

corr = cov * normsa * args.normsb[info.row] - k * (args.noise_std_deviation / std::max(normsa, args.normsb[info.row]));

I guess I am just asking, based on your experience, whether I have the right interpretation of corrDist relative to your equation.

zpzim commented 7 months ago

Okay, so I took a look at the formulas again:

normsa and normsb in the SCAMP kernel should correspond to 1/(std_a^2) and 1/(std_b^2) respectively.

Given that we are computing correlation in the kernel and not distance, d^2 in Algorithm 1 above corresponds to 2m(1-corr). I think the correct formula for the correction factor, in terms of the variables we have available in the kernel, is:

(noise_std_dev)^2 / max(1/normsa, 1/normsb)

However, I have not fully derived if this correction factor can be used as is, or needs to be modified because we are dealing with correlation here.

If you want to compute the distance directly in the kernel, it is going to look like this:

corr = cov * normsa * normsb // eq 6 in the SCAMP paper
dist_squared = 2*m*(1-corr) // uses eq 7 in the SCAMP paper
correction = (noise_std_dev)^2 / max(1/normsa, 1/normsb) // derived above
dist_corrected = sqrt(dist_squared - (2+m)*correction) // via Algorithm 1 from the paper you referenced

However, we need to make sure we keep everything as a corrected correlation inside the kernel if possible; otherwise we need to change how the reduction happens. We take the max of all computed correlations, whereas if we compute distances it needs to be the min. Can you show how you derived the equation in your comment, specifically where the k factor comes from? I assume k is the sublen, right? I think it might need to be something like 1/k with some added terms, but I'm not sure.

Remember, the kernel is computing max(corr_i) for all subsequences i, then we convert it to distance at the end using equation 7. We can try to work backwards from there to derive the formula we need for the corrected correlation.
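Working backwards as suggested: if the correction subtracts (2+m)*correction from the squared distance, and dist_squared = 2m(1-corr), then the equivalent correlation-domain correction is to add (2+m)*correction/(2m) to the correlation. A sketch under those assumptions (the (2+m) factor is taken from the pseudocode above and still needs verification; noise_std_dev being supplied via SCAMPKernelInputArgs is the proposal from the opening comment, not existing API):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Sketch: apply the noise correction while staying in the correlation
// domain, using the variables the kernel already holds. normsa/normsb
// are the inverse (squared) norms; noise_std_dev is assumed to be
// passed in by the caller. m is the subsequence length.
double corrected_corr(double cov, double normsa, double normsb,
                      double noise_std_dev, int m) {
  double corr = cov * normsa * normsb;  // eq 6 in the SCAMP paper
  double correction = (noise_std_dev * noise_std_dev) /
                      std::max(1.0 / normsa, 1.0 / normsb);
  // dist^2 = 2m(1 - corr), so subtracting (2+m)*correction from dist^2
  // is equivalent to adding (2+m)*correction / (2m) to the correlation.
  return corr + (2.0 + m) * correction / (2.0 * m);
}
```

Because the correction is added to the correlation, taking the max over corrected correlations still corresponds to taking the min over corrected distances, so the existing reduction would not need to change.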

hexagonhill commented 7 months ago

Thanks for your help, it has given me more confidence. Good catch about the correction using the variance where my description said standard deviation. If the changes suggested by the paper I referenced in the issue title work for my data, I will try to code it so it can be enabled via arguments and send you a pull request for both CPU and GPU, if you are interested.

hexagonhill commented 7 months ago

I have it working for CPU at the moment. Are you interested in an initial pull request, with the GPU implementation to come later?

zpzim commented 7 months ago

Awesome! Feel free to send me a PR for the CPU code. I actually have been working for a while on cleaning up the GPU kernels, so it would be good to take a look at the GPU implementation after I am able to merge my existing changes.

hexagonhill commented 6 months ago

Done. Be mindful that the PR is a little incomplete. The changes can produce NaNs, though SCAMP converts them to zeros: the correction can push the Pearson correlation above one, so converting to distance takes the square root of a negative number. Given the nature of the data (mostly flat and a bit noisy), the minimums are probably of little interest.
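One way the sqrt-of-negative case could be guarded against (purely illustrative; clamping the corrected correlation to the valid range is an assumption on my part, not necessarily what the PR does):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Clamp the corrected correlation to at most 1 before converting to
// distance, so sqrt never sees a negative argument. Whether clamping
// (versus propagating NaN and zeroing it, as SCAMP currently does) is
// the right policy depends on how the minima are used downstream.
double safe_corr_to_dist(double corrected_corr, int m) {
  double clamped = std::min(corrected_corr, 1.0);
  return std::sqrt(2.0 * m * (1.0 - clamped));
}
```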

Fairly confident the implementation is correct, because the output matches the results of a naive matrix profile version I wrote.