yeesian / ArchGDAL.jl

A high level API for GDAL - Geospatial Data Abstraction Library
https://yeesian.github.io/ArchGDAL.jl/stable/
Other
142 stars 26 forks source link

Support PermutedDimsArray for interactive datasets #225

Open maxfreu opened 3 years ago

maxfreu commented 3 years ago

Hi, I noticed the following:

img = AG.readraster("somefile");
typeof(img)
# ArchGDAL.RasterDataset{Int16, ArchGDAL.IDataset}
@time collect(img)  # fast, 0.6s

imgp = permutedims(img, (3,1,2));
typeof(imgp)
# DiskArrays.PermutedDiskArray{Int16, 3, PermutedDimsArray{Int16, 3, (3, 1, 2), (2, 3, 1), ArchGDAL.RasterDataset{Int16, ArchGDAL.IDataset}}}
# So underneath it's already a PermudedDimsArray somehow
@time collect(imgp) # fast 0.7s

imgpda = PermutedDimsArray(img, (3,1,2));
typeof(imgpda)
# PermutedDimsArray{Int16, 3, (3, 1, 2), (2, 3, 1), ArchGDAL.RasterDataset{Int16, ArchGDAL.IDataset}}
@time collect(imgpda) # super slow, 43s

So I assume the PermutedDimsArray translates to super inefficient disk access. As ArchGDAL already handles the interactive dataset lazily with permutedims I propose the super simple fix:

Base.PermutedDimsArray(A::RasterDataset{Any, IDataset}, perm) = permutedims(A, perm)
visr commented 3 years ago

Hmm, interesting.

Perhaps it's best to raise this issue with DiskArrays? cc @meggart

Over there permutedims is forwarded to PermutedDiskArray: https://github.com/meggart/DiskArrays.jl/blob/7fa75d4a0eb216d7e25a6718b223b8533d0b92c4/src/permute_reshape.jl#L64-L67

Probably Base.PermutedDimsArray should be forwarded to the same (like your proposed fix, but in DiskArrays).

meggart commented 3 years ago

Probably Base.PermutedDimsArray should be forwarded to the same (like your proposed fix, but in DiskArrays).

This proposed fix does not work because the PermutedDiskArray already wraps a PermuteDimsArray, and just gives it disk-efficient access patterns, so there is something cyclic when you try to construct such a thing. One could solve this by adding some invoke calls, but in general I think it would also be strange if a constructor of PermutedDimsArray would return an object of a PermutedDiskArray.

What would speak against directly calling permutedims, which gives you the desired performance?

I really don't see another solution to this problem. If you wrap your array into a PermutedDiskArray, which is defined in Base, this is what you get and if this type assumes efficient random access to the underlying data it will be slow.

maxfreu commented 3 years ago

What would speak against directly calling permutedims, which gives you the desired performance?

Not much. I just would have expected that it's equally fast. This finding resulted from trying to avoid eager loading from disk. Constructing a PermutedDimsArray is the canonical way to do things lazily (just like view for getindex), so I tried that first and failed. I thought maybe others might fall into that trap too, if they write "lazy" code. I leave it up to you to close this (and the other issue), because I have no further ideas how to handle this; returning a PermutedDiskArray sounds strange indeed.