yeesian / ArchGDAL.jl

A high level API for GDAL - Geospatial Data Abstraction Library
https://yeesian.github.io/ArchGDAL.jl/stable/
Other
141 stars 27 forks source link

performance and type stability fixes for geometries #316

Closed rafaqz closed 1 year ago

rafaqz commented 2 years ago

Improves type stability where possible by keeping enums as types in Val{wkbLineString} and by forcing child object types from getgeom.

Also adds a low allocation getpoint method.

Edit: review when I get tests passing, can't always test on a local machine at present

rafaqz commented 2 years ago

The failures are an unrelated issue with GDAL_jll on mac. Otherwise this is good to go.

rafaqz commented 2 years ago

@evetion do you have time to review this?

evetion commented 2 years ago

@evetion do you have time to review this?

Just got back from holiday, will take a look soon

rafaqz commented 2 years ago

@evetion getpoint works great, it's just a bit unrelated and less finished, so I will PR and test separately. Either we can use GDAL to get all the points eagerly (allocating a vector) or get them lazily one at a time only allocating one point all up. I need to test which is best.

As for performance improvements this is just from getting rid of all the red and yellow bits in ProfileView.jl rasterising GADM.jl output with Rasters.jl.

The before and after included all the changes to Rasters.jl and DimensionalData.jl as well as I was optimising the whole stack... so I'm not totally clear on total times attributable to ArchGDAL. But the fraction in ArchGDAL.jl shrunk a lot from this PR combined with low-allocating getpoint.

Mostly I think the improvement for my use cases from this PR is type stability in getgeom on polygons, getting a known IGeometry{wkbLineString} (or at least a small union with wkbUnkown). It lets any functions using getgeom compile clean native code instead of dealing with boxed results.

rafaqz commented 2 years ago

I put together some benchmarks of these changes. There is still another 2x improvment (or more) in fixing getgeom, but just type stability gives order of magnitude performance gains for rasterization:

Setup code (using GADM for an easy example, which is using ArchGDAL for loading geometries):

using Rasters, GADM, BenchmarkTools
using Rasters.LookupArrays
kw = (; order=ForwardOrdered(), span=Regular(0.1), sampling=Intervals(Start()), crs=EPSG(4326))
ds = Y(Projected(15.0:0.1:55.0; kw...)), X(Projected(70.0:0.1:140; kw...))
x = GADM.get("CHN")

This branch:

julia> @btime rasterize(x.geom; to=ds, missingval=0, fill=1, boundary=:touches);
  512.476 ms (5890904 allocations: 116.34 MiB)

julia> @benchmark rasterize(x.geom; to=ds, missingval=0, fill=1, boundary=:touches)
BenchmarkTools.Trial: 8 samples with 1 evaluation.
 Range (min … max):  508.071 ms … 681.143 ms  ┊ GC (min … max): 0.00% … 9.94%
 Time  (median):     675.751 ms               ┊ GC (median):    9.88%
 Time  (mean ± σ):   653.399 ms ±  59.138 ms  ┊ GC (mean ± σ):  8.62% ± 3.44%

  ▁                                                   ▁    ▁▁██
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁████ ▁
  508 ms           Histogram: frequency by time          681 ms <

 Memory estimate: 116.34 MiB, allocs estimate: 5890904.

On master:

julia> @btime rasterize(x.geom; to=ds, missingval=0, fill=1, boundary=:touches);
  7.986 s (29741237 allocations: 939.08 MiB)

julia> @benchmark rasterize(x.geom; to=ds, missingval=0, fill=1, boundary=:touches)
BenchmarkTools.Trial: 2 samples with 1 evaluation.
 Range (min … max):  4.445 s …    5.336 s  ┊ GC (min … max): 4.43% … 4.82%
 Time  (median):     4.890 s               ┊ GC (median):    4.65%
 Time  (mean ± σ):   4.890 s ± 629.958 ms  ┊ GC (mean ± σ):  4.65% ± 0.27%

  █                                                        █
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  4.44 s         Histogram: frequency by time         5.34 s <

 Memory estimate: 939.08 MiB, allocs estimate: 29741237.

Edit: with this PR and specialised getpoint for wkbLineString

julia> @btime rasterize($x.geom; to=$ds, missingval=0, fill=1, boundary=:touches);
344.444 ms (2964101 allocations: 71.68 MiB)

Edit2 making sure Rasters.jl allways uses the fast path:

julia> @btime rasterize(x.geom; to=ds, missingval=0, fill=1, boundary=:touches)
   216.206 ms (37298 allocations: 27.02 MiB)

At this point rasterize is 95% n PolygonInbounds.jl code, so ArchGDAL seems 2 orders of magnitude faster.

rafaqz commented 2 years ago

The abstract types in this PR are getting out of control, because we dont get dimensionality or object type from the enum, we have to use huge Union types to define everything in groups we can dispatch on.

It might be nicer to use generic type parameters to indicate the kinds of objects and if they are 2D, 3D or have M fields, and delete all the Union groups for e.g. all 3d object types, all M object types etc.

So, something like:

abstract type AbstractGeometry{E,T,D,M} end
struct Geometry{E,T,D,M} <: AbstractGeometry{E,T,D,M}
    ....
end
struct IGeometry{E,T,D,M} <: AbstractGeometry{E,T,D,M} 
    ...
end
# With default constructors like this:
Geometry{E}(x) where E = Geometry{W,geomtype(W),dimensionality(W),hasm(W)}(x)

Then, for example, the types would be:

Geometry{wkbLineStringZM,LineString,3,HasM}
Geometry{wkbPoint,Point,2,NoM}

And we can just use dispatch to match any of the type paramters instead of using so many huge Union types.

Any comments on this @yeesian @evetion ?

yeesian commented 2 years ago

I think that sounds reasonable; it feels nice to have reached a point where this doesn't feel like over-engineering. I didn't fully follow the example (E.g. would Geometry{wkbLineStringZM,Polygon,3,NoM} correspond to a valid type?) but it makes sense to have dimensionality, geomtype, etc as type parameters to me.

We might want to think a little more about whether we want to restrain it such the type parameters are independent of each other and all constructed types are valid (versus e.g. allowing for potentially invalid types to be constructed, but having it be easier for implementing the dispatch logic), and the pros and cons of the options, but I think yeah there'll definitely be something which is better than the status quo haha.

rafaqz commented 2 years ago

Probably that type shouldn't exist and we could make sure of that in the inner constructor.

We could also not include the enum in the type at all, and generate it when needed.

yeesian commented 1 year ago

We could also not include the enum in the type at all, and generate it when needed.

I think I like this more. How about something like

Geometry{wkbLineStringZM,3}
Geometry{wkbPoint,2}

And we can generate (i) LineString and HasM from (wkbLineStringZM, 3) and (ii) Point and NoM from (wkbPoint,2).

rafaqz commented 1 year ago

I actually meant the other way round:

Geometry{LineString,3,HasM}

As wkbLineString has no use in dispatch, so we would still need all the big unions I have defined.

We can get wkbLineString From the other three.

It might be better to merge this as-is when I fix the tests, and think about this separately. I'm not sure.

yeesian commented 1 year ago

I actually meant the other way round

Ah yes sorry you're right!

It might be better to merge this as-is when I fix the tests, and think about this separately. I'm not sure.

Yeah I think it'll be a big change regardless of the outcome of the proposal, and agree about thinking about it separately. Happy with this PR once the tests are fixed

rafaqz commented 1 year ago

I think this is good to go.

I found a few existing type-related problems while doing this: flattento2d! and setcoorddim! etc don't actually work because we don't update the type. Even though its and in place function we will have to use the return value in future. I'll make some issues and fix elsewhere.

rafaqz commented 1 year ago

It's still saying changes requested. @evetion do you want to check this and approve if its ok?

rafaqz commented 1 year ago

Ok, made a few changes. Measures handling was really not ready so I wound that back a little. This means GI.m and ismeasured are not tested for the true case as there isn't an obvious way to make measured objects.

I've tested most of the types more extensively for is3d, for type parameters and child type parameters. The missing lines now (besides GI.m and ismeasured) seem to be errors in coverage, like constants counted as lines of code that don't run.

yeesian commented 1 year ago

I'm going to consider the review by @evetion in https://github.com/yeesian/ArchGDAL.jl/pull/316#pullrequestreview-1072118118 as resolved by https://github.com/yeesian/ArchGDAL.jl/pull/316#issuecomment-1245931594 and the subsequent commits to increase test coverage.