Delta filter over specific dimension/axis

jkmacc-LANL commented 5 years ago

Hello, and thank you for your work on this fantastic package. I'm constantly surprised by the thought that has gone into it.

I've got a 4D array in which I know that values in the 3rd dimension are highly correlated and would benefit from differencing. I've tested it by manually differencing before compression, and the benefit is 2-3x compared to using no Delta filter. When I use the filter instead of manually differencing, however, compression is slightly worse than not using it at all.

I see in the source code for the Delta filter that data are flattened before differencing. I think this flattening step assumes that the input array represents data of the same "behavior" in all directions, which in some cases will not be true.

class Delta(Codec):
    ...
    def encode(self, buf):
        ...
        # flatten to simplify implementation
        arr = arr.reshape(-1, order='A')
        ...
        # compute differences
        enc[1:] = np.diff(arr)

        return enc

    def decode(self, buf, out=None):
        ...
        # flatten to simplify implementation
        enc = enc.reshape(-1, order='A')
        ...
        # decode differences
        np.cumsum(enc, out=dec)
        ...

        return out

Do you foresee any problems if I were to propose adding an axis=None keyword argument to the Delta class and a small bit of logic to encode/diff and decode/cumsum in this direction?

Thanks for your attention.

jkmacc-LANL commented 5 years ago

I'm actually seeing a flaw in my proposal already. It's easy to difference a flattened array without changing its shape by leaving the original first sample in place, and replacing the rest of the chunk with the differences. Diffing along a specific axis would require storing all the first samples in that axis, which would change the shape of the output chunk. Is keeping the same array/chunk shape before and after filtering critical?
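One possible way around the shape concern, sketched with plain NumPy: keep the first slice along the chosen axis in place and write the differences into the remaining slices, so the encoded array has the same shape as the input. The names `delta_encode_axis` and `delta_decode_axis` are hypothetical and are not part of numcodecs; this is only an illustration of the idea, not a proposed implementation.

```python
import numpy as np


def delta_encode_axis(arr, axis):
    """Difference along `axis`, keeping the first slice so the shape is unchanged."""
    arr = np.asarray(arr)
    enc = np.empty_like(arr)
    # index tuples selecting the first slice and the remaining slices along `axis`
    first = [slice(None)] * arr.ndim
    first[axis] = slice(0, 1)
    rest = [slice(None)] * arr.ndim
    rest[axis] = slice(1, None)
    enc[tuple(first)] = arr[tuple(first)]       # first samples stay as they are
    enc[tuple(rest)] = np.diff(arr, axis=axis)  # everything else becomes a difference
    return enc


def delta_decode_axis(enc, axis):
    """Invert delta_encode_axis with a cumulative sum along the same axis."""
    return np.cumsum(enc, axis=axis, dtype=enc.dtype)


# same layout as the 4D example in this thread: (days, hours, samples, channels)
ndays, nhours, nsamples, nchan = 2, 3, 5, 3
a = 1000 + np.arange(ndays * nhours * nsamples * nchan).reshape(nchan, -1).T
a = a.reshape((ndays, nhours, nsamples, nchan))

enc = delta_encode_axis(a, axis=2)
dec = delta_decode_axis(enc, axis=2)
```

Here `enc` has the same shape as `a`, and every entry past the first sample along axis 2 is a small difference. Whether a codec would also need to handle `astype` conversion and arbitrary memory layouts is left open in this sketch.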

jkmacc-LANL commented 5 years ago

Here's an illustration:

In [29]: ndays, nhours, nsamples, nchan = 2, 3, 5, 3
    ...: a = 1000 + np.arange(ndays*nhours*nsamples*nchan).reshape(nchan, -1).T
    ...: a = a.reshape((ndays, nhours, nsamples, nchan))
    ...: a
    ...:
Out[29]:
array([[[[1000, 1030, 1060],
         [1001, 1031, 1061],
         [1002, 1032, 1062],
         [1003, 1033, 1063],
         [1004, 1034, 1064]],

        [[1005, 1035, 1065],
         [1006, 1036, 1066],
         [1007, 1037, 1067],
         [1008, 1038, 1068],
         [1009, 1039, 1069]],

        [[1010, 1040, 1070],
         [1011, 1041, 1071],
         [1012, 1042, 1072],
         [1013, 1043, 1073],
         [1014, 1044, 1074]]],

       [[[1015, 1045, 1075],
         [1016, 1046, 1076],
         [1017, 1047, 1077],
         [1018, 1048, 1078],
         [1019, 1049, 1079]],

        [[1020, 1050, 1080],
         [1021, 1051, 1081],
         [1022, 1052, 1082],
         [1023, 1053, 1083],
         [1024, 1054, 1084]],

        [[1025, 1055, 1085],
         [1026, 1056, 1086],
         [1027, 1057, 1087],
         [1028, 1058, 1088],
         [1029, 1059, 1089]]]])

In [30]: # this is how the Delta filter sees it
    ...: a.reshape(-1, order='A')
    ...:
Out[30]:
array([1000, 1030, 1060, 1001, 1031, 1061, 1002, 1032, 1062, 1003, 1033,
       1063, 1004, 1034, 1064, 1005, 1035, 1065, 1006, 1036, 1066, 1007,
       1037, 1067, 1008, 1038, 1068, 1009, 1039, 1069, 1010, 1040, 1070,
       1011, 1041, 1071, 1012, 1042, 1072, 1013, 1043, 1073, 1014, 1044,
       1074, 1015, 1045, 1075, 1016, 1046, 1076, 1017, 1047, 1077, 1018,
       1048, 1078, 1019, 1049, 1079, 1020, 1050, 1080, 1021, 1051, 1081,
       1022, 1052, 1082, 1023, 1053, 1083, 1024, 1054, 1084, 1025, 1055,
       1085, 1026, 1056, 1086, 1027, 1057, 1087, 1028, 1058, 1088, 1029,
       1059, 1089])

In [31]: # this is how it gets differenced
    ...: codec = Delta(dtype='i8', astype='i1')
    ...: codec.encode(a)
    ...:
Out[31]:
array([-24,  30,  30, -59,  30,  30, -59,  30,  30, -59,  30,  30, -59,
        30,  30, -59,  30,  30, -59,  30,  30, -59,  30,  30, -59,  30,
        30, -59,  30,  30, -59,  30,  30, -59,  30,  30, -59,  30,  30,
       -59,  30,  30, -59,  30,  30, -59,  30,  30, -59,  30,  30, -59,
        30,  30, -59,  30,  30, -59,  30,  30, -59,  30,  30, -59,  30,
        30, -59,  30,  30, -59,  30,  30, -59,  30,  30, -59,  30,  30,
       -59,  30,  30, -59,  30,  30, -59,  30,  30, -59,  30,  30],
      dtype=int8)

The flattening destroys the correlations. Maybe the quickest solution is to re-arrange the array so that the dimension with correlations works with the flattening. I wonder if this will be a problem for large arrays that somehow get copied as a result of reshaping.
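As a rough sketch of the re-arrangement idea, assuming the correlated dimension is axis 2 (samples) as in the illustration: `np.moveaxis` moves that axis to the end as a view, and `np.ascontiguousarray` then makes a C-ordered copy so that flattening walks along the samples first. The copy is the cost mentioned above.

```python
import numpy as np

ndays, nhours, nsamples, nchan = 2, 3, 5, 3
a = 1000 + np.arange(ndays * nhours * nsamples * nchan).reshape(nchan, -1).T
a = a.reshape((ndays, nhours, nsamples, nchan))

# move the correlated axis (samples) to the last position; this is only a view
moved = np.moveaxis(a, 2, -1)

# a C-contiguous layout is needed for flattening to follow the last axis,
# and producing it copies the data
rearranged = np.ascontiguousarray(moved)
copied = not np.shares_memory(a, rearranged)

# differences of the flattened, re-arranged array
flat_diffs = np.diff(rearranged.reshape(-1))
n_small = np.count_nonzero(flat_diffs == 1)
```

With this layout most of the flattened differences are 1, and larger jumps remain only where the flattening crosses from one run of samples to the next.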

jakirkham commented 5 years ago

Would a different chunking achieve what you want?

jkmacc-LANL commented 5 years ago

Possibly. Is the Delta filter applied per chunk, so that chunk_data.reshape(-1, order='A') could align with the correlated dimension of a suitably shaped chunk?

jkmacc-LANL commented 5 years ago

Convinced myself your suggestion will work for me. Thank you!
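For anyone following along, here is a small NumPy-only sketch of why chunking can help, assuming filters see one chunk at a time and flatten it as in the Delta source quoted above. The chunk shape `(1, 1, nsamples, 1)` and the helper `iter_chunks` are hypothetical choices for illustration; they do not use the Zarr API.

```python
import numpy as np

ndays, nhours, nsamples, nchan = 2, 3, 5, 3
a = 1000 + np.arange(ndays * nhours * nsamples * nchan).reshape(nchan, -1).T
a = a.reshape((ndays, nhours, nsamples, nchan))

# hypothetical chunk shape: full length along samples, length 1 elsewhere
chunk_shape = (1, 1, nsamples, 1)


def iter_chunks(arr, chunk_shape):
    """Yield regular, non-overlapping chunks (assumes the shape divides evenly)."""
    grid = [n // c for n, c in zip(arr.shape, chunk_shape)]
    for idx in np.ndindex(*grid):
        sel = tuple(slice(i * c, (i + 1) * c) for i, c in zip(idx, chunk_shape))
        yield arr[sel]


# flatten each chunk the way the Delta filter does, then difference it
chunk_diffs = [
    np.diff(chunk.reshape(-1, order='A'))
    for chunk in iter_chunks(a, chunk_shape)
]
all_ones = all(np.all(d == 1) for d in chunk_diffs)
```

Each flattened chunk then runs along the samples axis only, so the differences stay small. Chunks this thin may bring their own overhead, so a practical chunk shape would probably be a compromise.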

jakirkham commented 5 years ago

Glad to hear it. If you find you need this again, would be happy to discuss further. 🙂

ehgus commented 1 month ago

I need a Delta filter over a specific dimension/axis. I'm working with time-lapse 3D data, and I want to use delta encoding, which is widely used in 2D video formats. I made a SpatialDelta filter and confirmed that it works as expected. Here is an example:

import numpy as np
from numcodecs.spatial_delta import SpatialDelta

x = np.arange(27, dtype=np.uint16).reshape(3, 3, 3)
for axis in range(3):
    # similar to the Delta filter, except that the axis is configurable
    codec = SpatialDelta(axis=axis, dtype='u2')
    # round trip should reproduce the input for every axis
    print(np.all(codec.decode(codec.encode(x)) == x))  # True

I would like to add this feature to numcodecs. @jakirkham, can I open a pull request with this new filter?

zarr-developers / numcodecs

Delta filter over specific dimension/axis #198