jkmacc-LANL closed 5 years ago
I'm actually seeing a flaw in my proposal already. It's easy to difference a flattened array without changing its shape by leaving the original first sample in place, and replacing the rest of the chunk with the differences. Diffing along a specific axis would require storing all the first samples in that axis, which would change the shape of the output chunk. Is keeping the same array/chunk shape before and after filtering critical?
Here's an illustration:
In [29]: ndays, nhours, nsamples, nchan = 2, 3, 5, 3
...: a = 1000 + np.arange(ndays*nhours*nsamples*nchan).reshape(nchan, -1).T
...: a = a.reshape((ndays, nhours, nsamples, nchan))
...: a
...:
Out[29]:
array([[[[1000, 1030, 1060],
[1001, 1031, 1061],
[1002, 1032, 1062],
[1003, 1033, 1063],
[1004, 1034, 1064]],
[[1005, 1035, 1065],
[1006, 1036, 1066],
[1007, 1037, 1067],
[1008, 1038, 1068],
[1009, 1039, 1069]],
[[1010, 1040, 1070],
[1011, 1041, 1071],
[1012, 1042, 1072],
[1013, 1043, 1073],
[1014, 1044, 1074]]],
[[[1015, 1045, 1075],
[1016, 1046, 1076],
[1017, 1047, 1077],
[1018, 1048, 1078],
[1019, 1049, 1079]],
[[1020, 1050, 1080],
[1021, 1051, 1081],
[1022, 1052, 1082],
[1023, 1053, 1083],
[1024, 1054, 1084]],
[[1025, 1055, 1085],
[1026, 1056, 1086],
[1027, 1057, 1087],
[1028, 1058, 1088],
[1029, 1059, 1089]]]])
In [30]: # this is how the Delta filter sees it
...: a.reshape(-1, order='A')
...:
Out[30]:
array([1000, 1030, 1060, 1001, 1031, 1061, 1002, 1032, 1062, 1003, 1033,
1063, 1004, 1034, 1064, 1005, 1035, 1065, 1006, 1036, 1066, 1007,
1037, 1067, 1008, 1038, 1068, 1009, 1039, 1069, 1010, 1040, 1070,
1011, 1041, 1071, 1012, 1042, 1072, 1013, 1043, 1073, 1014, 1044,
1074, 1015, 1045, 1075, 1016, 1046, 1076, 1017, 1047, 1077, 1018,
1048, 1078, 1019, 1049, 1079, 1020, 1050, 1080, 1021, 1051, 1081,
1022, 1052, 1082, 1023, 1053, 1083, 1024, 1054, 1084, 1025, 1055,
1085, 1026, 1056, 1086, 1027, 1057, 1087, 1028, 1058, 1088, 1029,
1059, 1089])
In [31]: # this is how it gets differenced
...: codec = Delta(dtype='i8', astype='i1')
...: codec.encode(a)
...:
Out[31]:
array([-24, 30, 30, -59, 30, 30, -59, 30, 30, -59, 30, 30, -59,
30, 30, -59, 30, 30, -59, 30, 30, -59, 30, 30, -59, 30,
30, -59, 30, 30, -59, 30, 30, -59, 30, 30, -59, 30, 30,
-59, 30, 30, -59, 30, 30, -59, 30, 30, -59, 30, 30, -59,
30, 30, -59, 30, 30, -59, 30, 30, -59, 30, 30, -59, 30,
30, -59, 30, 30, -59, 30, 30, -59, 30, 30, -59, 30, 30,
-59, 30, 30, -59, 30, 30, -59, 30, 30, -59, 30, 30],
dtype=int8)
The flattening destroys the correlations. Maybe the quickest solution is to rearrange the array so that the correlated dimension lines up with the flattening order. I wonder whether this will be a problem for large arrays, which may get copied as a result of reshaping.
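A minimal sketch of that rearrangement, using only NumPy. The helper `flat_delta` is a hypothetical stand-in that mimics the flatten-then-difference behaviour described above; it is not the numcodecs implementation. It rebuilds the array from the illustration, moves the sample axis to the end, and counts how many differences equal the true sample step of 1.

```python
import numpy as np

ndays, nhours, nsamples, nchan = 2, 3, 5, 3
a = 1000 + np.arange(ndays * nhours * nsamples * nchan).reshape(nchan, -1).T
a = a.reshape((ndays, nhours, nsamples, nchan))


def flat_delta(arr):
    """Hypothetical stand-in: flatten, keep the first value, difference the rest."""
    flat = arr.reshape(-1, order='A')
    enc = np.empty_like(flat)
    enc[0] = flat[0]
    enc[1:] = np.diff(flat)
    return enc


# np.moveaxis only returns a view; np.ascontiguousarray is what copies the
# data so that the sample axis (axis 2) becomes the fastest-varying one.
b = np.ascontiguousarray(np.moveaxis(a, 2, -1))

# Count the differences equal to 1 (the step between adjacent samples).
ones_before = int(np.count_nonzero(flat_delta(a)[1:] == 1))
ones_after = int(np.count_nonzero(flat_delta(b)[1:] == 1))
print(ones_before, ones_after)
```

The copy made by `np.ascontiguousarray` is the cost raised above: for a large array the rearranged data needs its own memory.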
Would a different chunking achieve what you want?
Possibly. Does the Delta filter get applied per chunk, such that chunk_data.reshape(-1, order='A') could align with the proper dimension of a well-posed chunk?
Convinced myself your suggestion will work for me. Thank you!
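One way to picture the chunking approach, as a NumPy-only sketch that assumes the filter is applied to each chunk independently. The chunk shape `(1, 1, nsamples, 1)` is a hypothetical choice for illustration, and the loop only simulates chunking by slicing; it does not call zarr or numcodecs.

```python
import numpy as np

ndays, nhours, nsamples, nchan = 2, 3, 5, 3
a = 1000 + np.arange(ndays * nhours * nsamples * nchan).reshape(nchan, -1).T
a = a.reshape((ndays, nhours, nsamples, nchan))

# Hypothetical chunk shape: each chunk holds one full run of samples for a
# single (day, hour, channel) combination.
n_chunks = 0
all_unit_steps = True
for d, h, c in np.ndindex(ndays, nhours, nchan):
    chunk = a[d:d + 1, h:h + 1, :, c:c + 1]
    flat = np.ascontiguousarray(chunk).reshape(-1)
    # Within such a chunk, flattening follows the sample axis, so every
    # difference is the step between adjacent samples.
    all_unit_steps = all_unit_steps and bool(np.all(np.diff(flat) == 1))
    n_chunks += 1

print(n_chunks, all_unit_steps)
```

Chunks this small are only for illustration; with real data, very small chunks add per-chunk overhead, so the chunk shape is a trade-off.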
Glad to hear it. If you find you need this again, would be happy to discuss further. 🙂
I need the Delta filter to operate over a specific dimension/axis. I'm working with time-lapse 3D data, and I want to use delta encoding, which is widely used in 2D video formats. I made a SpatialDelta filter and confirmed it works as expected. Here is an example use case:
import numpy as np
from numcodecs.spatial_delta import SpatialDelta  # the filter proposed in this comment

x = np.arange(27, dtype=np.uint16).reshape(3, 3, 3)
for axis in range(3):
    # similar to the Delta filter, except that differencing runs along `axis`
    codec = SpatialDelta(axis=axis, dtype='u2')
    print(np.all(codec.decode(codec.encode(x)) == x))  # True
I want to add this feature to numcodecs. @jakirkham, can I open a pull request with this new filter?
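For readers without that module, here is a NumPy-only sketch of axis-wise delta encoding. The function names `delta_encode_axis` and `delta_decode_axis` are hypothetical and this is not the SpatialDelta implementation from the comment above. It keeps the first slice along the chosen axis in place and stores differences for the rest, so the output has the same shape as the input.

```python
import numpy as np


def delta_encode_axis(arr, axis=-1):
    """Keep the first slice along `axis`; store differences for the rest."""
    arr = np.asarray(arr)
    first = np.take(arr, [0], axis=axis)  # index list keeps the axis, length 1
    return np.concatenate([first, np.diff(arr, axis=axis)], axis=axis)


def delta_decode_axis(enc, axis=-1):
    """Invert delta_encode_axis with a cumulative sum along `axis`."""
    enc = np.asarray(enc)
    # Accumulating in the input dtype lets unsigned wrap-around cancel out.
    return np.cumsum(enc, axis=axis, dtype=enc.dtype)


x = np.arange(27, dtype=np.uint16).reshape(3, 3, 3)
for axis in range(3):
    enc = delta_encode_axis(x, axis=axis)
    dec = delta_decode_axis(enc, axis=axis)
    print(axis, enc.shape == x.shape, bool(np.all(dec == x)))
```

Because the first slice stays in place, the encoded array has the same shape and dtype as the input.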
Hello, and thank you for your work on this fantastic package. I'm constantly surprised by the thought that has gone into it.
I've got a 4D array in which I know that values in the 3rd dimension are highly correlated and would benefit from differencing. I've tested it by manually differencing before compression, and the benefit is 2-3x compared to using no Delta filter. When I use the filter instead of manually differencing, however, compression is slightly worse than not using it at all.
I see in the source code for the Delta filter that data are flattened before differencing. I think this flattening step assumes that the input array represents data of the same "behavior" in all directions, which in some cases will not be true.
Do you foresee any problems if I were to propose adding an axis=None keyword argument to the Delta class and a small bit of logic to encode/diff and decode/cumsum in this direction?
Thanks for your attention.