yeesian / ArchGDAL.jl

A high level API for GDAL - Geospatial Data Abstraction Library
https://yeesian.github.io/ArchGDAL.jl/stable/
Other
142 stars 26 forks source link

DataFrame conversion does not work with an ESRI shape file layer containing mixed wkbLineString and wkbMultiLineString geometries #223

Closed mathieu17g closed 3 years ago

mathieu17g commented 3 years ago

In OGR the reading of an ESRI shapefile layer containing one geometry field instantiated with a mix of wkbLineString and wkbMultiLineString leads to a geomdefn of type wkbLineString instead of wkbUnknown.

This due to OGR ESRI shapefile driver :

Note that when reading a Shapefile of type SHPT_ARC, the corresponding layer will be reported as of type wkbLineString, but depending on the number of parts of each geometry, the actual type of the geometry for each feature can be either OGRLineString or OGRMultiLineString. The same applies for SHPT_POLYGON shapefiles, reported as layers of type wkbPolygon, but depending on the number of parts of each geometry, the actual type can be either OGRPolygon or OGRMultiPolygon.

With this kind of layer, with ArchGDAL.jl version 0.7.2, in tables.jl the geomtype inserted in the Tables.Schema of the Tables interface is the DataType[ArchGDAL.IGeometry{ArchGDAL.wkbLineString}]

Therefore the layer's conversion to a DataFrame raises a conversion error when handling the wkbMultiLineString features of the layer. Below an example:

julia> using ArchGDAL; const AG = ArchGDAL
ArchGDAL

julia> using DataFrames

julia> # Clean previously created files for new dataset
       isdir("mixed_geom") && rm("mixed_geom", recursive=true)

julia> # Create dataset with ESRI shapefile driver from scratch with one layer 
       # containing features with geom of type AG.wkbUnknown and two fields id (Int64) and name (String)
       # Add two features, one with geom type wkbLineString and the other with geom type wkbMultiLineString
       AG.create(
           "mixed_geom"; 
           driver=AG.getdriver("ESRI shapefile"), 
       ) do newdataset
           AG.createlayer(
               name="mixed_geom", 
               dataset=newdataset, 
               geom=AG.wkbUnknown
           ) do newlayer
               # Add field definitions to layer
               AG.createfielddefn("id", AG.OFTInteger64) do newfielddefn
                   AG.addfielddefn!(newlayer, newfielddefn)
               end
               AG.createfielddefn("name", AG.OFTString) do newfielddefn
                   AG.addfielddefn!(newlayer, newfielddefn)
               end
               # Add features to layer
               AG.addfeature(newlayer) do newfeature
                   AG.setfield!(newfeature, 0, 1)
                   AG.setfield!(newfeature, 1, "line1")
                   AG.setgeom!(newfeature, 0, AG.createlinestring([(i,i+1) for i in 1.0:3.0]))
               end
               AG.addfeature(newlayer) do newfeature
                   AG.setfield!(newfeature, 0, 2)
                   AG.setfield!(newfeature, 1, "multiline1")
                   AG.setgeom!(newfeature, AG.createmultilinestring([[(i,i+1) for i in j:j+3] for j in 1.0:5.0:6.0]))
               end
           end
       end
NULL FeatureLayer

julia> DataFrame(AG.getlayer(AG.read("mixed_geom"), 0))
ERROR: MethodError: Cannot `convert` an object of type 
  ArchGDAL.IGeometry{wkbMultiLineString} to an object of type 
  ArchGDAL.IGeometry{wkbLineString}
Closest candidates are:
  convert(::Type{var"#s435"} where var"#s435"<:ArchGDAL.AbstractGeometry, ::GeoFormatTypes.GML) at /Users/Mathieu/.julia/packages/ArchGDAL/erkjx/src/convert.jl:48
  convert(::Type{var"#s435"} where var"#s435"<:ArchGDAL.AbstractGeometry, ::GeoFormatTypes.GeoJSON) at /Users/Mathieu/.julia/packages/ArchGDAL/erkjx/src/convert.jl:45
  convert(::Type{var"#s435"} where var"#s435"<:ArchGDAL.AbstractGeometry, ::GeoFormatTypes.WellKnownBinary) at /Users/Mathieu/.julia/packages/ArchGDAL/erkjx/src/convert.jl:42
  ...
Stacktrace:
 [1] setindex!(A::Vector{ArchGDAL.IGeometry{ArchGDAL.wkbLineString}}, x::ArchGDAL.IGeometry{ArchGDAL.wkbMultiLineString}, i1::Int64)
   @ Base ./array.jl:839
 [2] add!
   @ ~/.julia/packages/Tables/gg6Id/src/fallbacks.jl:127 [inlined]
 [3] eachcolumns
   @ ~/.julia/packages/Tables/gg6Id/src/utils.jl:111 [inlined]
 [4] buildcolumns(schema::Tables.Schema{(Symbol(""), :id, :name), Tuple{ArchGDAL.IGeometry{ArchGDAL.wkbLineString}, Int64, String}}, rowitr::ArchGDAL.IFeatureLayer)
   @ Tables ~/.julia/packages/Tables/gg6Id/src/fallbacks.jl:135
 [5] columns
   @ ~/.julia/packages/Tables/gg6Id/src/fallbacks.jl:253 [inlined]
 [6] DataFrame(x::ArchGDAL.IFeatureLayer; copycols::Nothing)
   @ DataFrames ~/.julia/packages/DataFrames/vQokV/src/other/tables.jl:58
 [7] DataFrame(x::ArchGDAL.IFeatureLayer)
   @ DataFrames ~/.julia/packages/DataFrames/vQokV/src/other/tables.jl:49
 [8] top-level scope
   @ REPL[10]:1

It used to work with version 0.6.0. The geomtype inserted in the Tables.Schema of the Tables interface was just ArchGDAL.IGeometry

I have tried with the GeoJSON driver and ArchGDAL.jl version 0.7.2, there is no issues since the driver does not enforce the geom type of the layerdefn to ArchGDAL.wkbLineString after feature creations, and keeps it at ArchGDAL.wkbUnknown

Maybe, in function Tables.schemaof tables.jl, the type of each feature instance could be checked and in case of mixed geometry types the geometry type in Tables.Schema set to DataType[ArchGDAL.IGeometry{ArchGDAL.wkbUnknown}] or a Union of all the mixed geometry types ?

visr commented 3 years ago

Thanks for the clear report!

So do I understand correctly that for shapefiles in particular, the geometry type reported by GDAL can in some cases just be wrong? As far as I can see in https://github.com/yeesian/ArchGDAL.jl/blob/51f22608470e41f9c14e6a3b9372e6719bef9e1a/src/tables.jl we don't only look at the first feature, but use ogr_gfld_gettype, right?

Having to go over all features to determine the schema doesn't sound great. Would it work if, for shapefiles only, we just always return a union of wkbLineString and wkbMultiLineString if the geometry type is reported as wkbLineString, and return a union of wkbPolygon and wkbMultiPolygon if the geometry type is reported as wkbPolygon? Or alternatively wkbUnknown in either case.

mathieu17g commented 3 years ago

It seems efficient to handle the special case of the "ESRI shapefile" driver for wkbLineString and wkbPolygon

Do you know how to get a layer's owner (dataset), in order to get its driver and avoid passing the dataset to Tables.schema function ?

visr commented 3 years ago

Do you know how to get a layer's owner (dataset)

The ownedby field has a reference to this. But since getproperty is overloaded for these types, you'll need to use getfield(layer, :ownedby).

https://github.com/yeesian/ArchGDAL.jl/blob/51f22608470e41f9c14e6a3b9372e6719bef9e1a/src/types.jl#L118-L121

I was also just thinking, instead of a union or an unknown, can we not just set the geometry type for shapefile linestring into multilinestring, and polygon into multipolygon? This is very much related, https://github.com/JuliaGeo/Shapefile.jl/pull/40, where the Shapefile Polygon is read to the GeometryBasics MultiPolygon because in the Shapefile it can have multiple outer rings.

yeesian commented 3 years ago

Sorry I'm not currently available to work on this, but happy to review a pull request.

mathieu17g commented 3 years ago

@visr

julia> using ArchGDAL; const AG = ArchGDAL
[ Info: Precompiling ArchGDAL [c9ce4bd3-c3d5-55b8-8973-c0e20141b8c3]
ArchGDAL

julia> ls = AG.createlinestring([(i,i+1) for i in 1.0:3.0])
       convert(AG.IGeometry{AG.wkbMultiLineString}, ls)
ERROR: MethodError: Cannot `convert` an object of type 
  ArchGDAL.IGeometry{wkbLineString} to an object of type 
  ArchGDAL.IGeometry{wkbMultiLineString}
Closest candidates are:
  convert(::Type{var"#s435"} where var"#s435"<:ArchGDAL.AbstractGeometry, ::GeoFormatTypes.GML) at /Users/Mathieu/.julia/packages/ArchGDAL/u1bzp/src/convert.jl:48
  convert(::Type{var"#s435"} where var"#s435"<:ArchGDAL.AbstractGeometry, ::GeoFormatTypes.GeoJSON) at /Users/Mathieu/.julia/packages/ArchGDAL/u1bzp/src/convert.jl:45
  convert(::Type{var"#s435"} where var"#s435"<:ArchGDAL.AbstractGeometry, ::GeoFormatTypes.WellKnownBinary) at /Users/Mathieu/.julia/packages/ArchGDAL/u1bzp/src/convert.jl:42
  ...
Stacktrace:
 [1] top-level scope
   @ REPL[39]:2

However, converting wkbLineString or wbkMultiLineStringto a wkbUnknown works

julia> ls = AG.createlinestring([(i,i+1) for i in 1.0:3.0])
       convert(AG.IGeometry{AG.wkbUnknown}, ls)
Geometry: LINESTRING (1 2,2 3,3 4)

julia> mls = AG.createmultilinestring([[(i,i+1) for i in j:j+3] for j in 1.0:5.0:6.0])
       convert(AG.IGeometry{AG.wkbUnknown}, mls)
Geometry: MULTILINESTRING ((1 2,2 3,3 4,4 5),(6 7,7 8,8 9,9 10))

Option 0: geom type set to IGeometry{AG.wkbUnknown} in Tables.Schema

The conversion of a layer with mixed wkbLineStringand wkbMultiLineString features and a layerdefn with wbkLineString geometry type, works also when setting the geomtype to wkbUnkown in the Tables.Schema, providing that the layer does not contain any empty geometry

julia> """
           get_mixed_geom_ds_with_shp_driver_and_built_from_scratch()::AG.IDataset

       Build a ESRI shapefile from scratch with one layer containing features with:
       - one geom of type `wkbUnknown`,
       - two fields `id::Int64` and `name::String`

       Add two features, one with geom type `wkbLineString` and the other with geom type `wkbMultiLineString`

       If `withmissing=true`, add a feature without any geometry
       Returns the corresponding dataset
       """
       function get_mixed_geom_ds_with_shp_driver_and_built_from_scratch(;
           drvshortname::AbstractString="ESRI shapefile", 
           withmissing::Bool
       )::AG.IDataset
           # Clean previously created files for new dataset           
           isdir("mixed_geom") && rm("mixed_geom", recursive=true)
           # Create dataset with ESRI shapefile driver from scratch with one layer 
           # containing features with geom of type AG.wkbUnknown and two fields id (Int64) and name (String)
           # Add two features, one with geom type wkbLineString and the other with geom type wkbMultiLineString
           AG.create(
               "mixed_geom"; 
               driver=AG.getdriver(drvshortname), 
           ) do newdataset
               AG.createlayer(
                   name="mixed_geom", 
                   dataset=newdataset, 
                   geom=AG.wkbUnknown
               ) do newlayer
                   # @show AG.layerdefn(newlayer)
                   # Add field definitions to layer
                   AG.createfielddefn("id", AG.OFTInteger64) do newfielddefn
                       AG.addfielddefn!(newlayer, newfielddefn)
                   end
                   AG.createfielddefn("name", AG.OFTString) do newfielddefn
                       AG.addfielddefn!(newlayer, newfielddefn)
                   end
                   # Add features to layer
                   AG.addfeature(newlayer) do newfeature
                       AG.setfield!(newfeature, 0, 1)
                       AG.setfield!(newfeature, 1, "line")
                       AG.setgeom!(newfeature, 0, AG.createlinestring([(i,i+1) for i in 1.0:3.0]))
                   end                   AG.addfeature(newlayer) do newfeature
                       AG.setfield!(newfeature, 0, 2)
                       AG.setfield!(newfeature, 1, "multilineline")
                       AG.setgeom!(newfeature, AG.createmultilinestring([[(i,i+1) for i in j:j+3] for j in 1.0:5.0:6.0]))
                   end
                   if withmissing
                       AG.addfeature(newlayer) do newfeature
                           AG.setfield!(newfeature, 0, 3)
                           AG.setfield!(newfeature, 1, "emptygeom")
                       end
                   end
               end
           end

           return AG.read("mixed_geom")
       end
get_mixed_geom_ds_with_shp_driver_and_built_from_scratch

julia> function Tables.schema(layer::AG.AbstractFeatureLayer)::Union{Tables.Schema, Nothing}
           geom_names, field_names, featuredefn, fielddefns =
               AG.schema_names(AG.layerdefn(layer))
           ngeom = ArchGDAL.ngeom(featuredefn)
           if AG.shortname(AG.getdriver(Base.getfield(layer, :ownedby))) != "ESRI Shapefile"
               geom_types = (AG.IGeometry{AG.gettype(AG.getgeomdefn(AG.layerdefn(layer), i))} for i in 0:ngeom-1)
           else
               geom_types = (AG.IGeometry{AG.wkbUnknown} for _ in 1:ngeom)
           end
           field_types =
               (convert(DataType, AG.gettype(fielddefn)) for fielddefn in fielddefns)
           return Tables.Schema(
               (geom_names..., field_names...),
               (geom_types..., field_types...),
           )
       end

julia> using DataFrames

julia> ds = get_mixed_geom_ds_with_shp_driver_and_built_from_scratch(;withmissing=false)
       layer = AG.getlayer(ds, 0)
Layer: mixed_geom
  Geometry 0 (): [wkbLineString], LINESTRING (1 2,2 3,3 4), ...
     Field 0 (id): [OFTInteger64], 1, 2
     Field 1 (name): [OFTString], line, multilineline

julia> df = DataFrame(layer)
2×3 DataFrame
 Row │                       id     name          
     │ IGeometr…             Int64  String        
─────┼────────────────────────────────────────────
   1 │ Geometry: wkbUnknown      1  line
   2 │ Geometry: wkbUnknown      2  multilineline

julia> eltype(df[:, 1])
ArchGDAL.IGeometry{ArchGDAL.wkbUnknown}

julia> df[1, 1]
Geometry: LINESTRING (1 2,2 3,3 4)

But it fails when the layer have missing geometries

julia> ds = get_mixed_geom_ds_with_shp_driver_and_built_from_scratch(;withmissing=true)
       layer = AG.getlayer(ds, 0)
Layer: mixed_geom
  Geometry 0 (): [wkbLineString], LINESTRING (1 2,2 3,3 4), ...
     Field 0 (id): [OFTInteger64], 1, 2, 3
     Field 1 (name): [OFTString], line, multilineline, emptygeom

julia> DataFrame(layer)
ERROR: MethodError: Cannot `convert` an object of type Missing to an object of type ArchGDAL.IGeometry{ArchGDAL.wkbUnknown}
Closest candidates are:
  convert(::Type{var"#s435"} where var"#s435"<:ArchGDAL.AbstractGeometry, ::GeoFormatTypes.GML) at /Users/Mathieu/.julia/packages/ArchGDAL/u1bzp/src/convert.jl:48
  convert(::Type{var"#s435"} where var"#s435"<:ArchGDAL.AbstractGeometry, ::GeoFormatTypes.GeoJSON) at /Users/Mathieu/.julia/packages/ArchGDAL/u1bzp/src/convert.jl:45
  convert(::Type{var"#s435"} where var"#s435"<:ArchGDAL.AbstractGeometry, ::GeoFormatTypes.WellKnownBinary) at /Users/Mathieu/.julia/packages/ArchGDAL/u1bzp/src/convert.jl:42
  ...
Stacktrace:
 [1] setindex!(A::Vector{ArchGDAL.IGeometry{ArchGDAL.wkbUnknown}}, x::Missing, i1::Int64)
   @ Base ./array.jl:839
 [2] add!
   @ ~/.julia/packages/Tables/gg6Id/src/fallbacks.jl:127 [inlined]
 [3] eachcolumns
   @ ~/.julia/packages/Tables/gg6Id/src/utils.jl:111 [inlined]
 [4] buildcolumns(schema::Tables.Schema{(Symbol(""), :id, :name), Tuple{ArchGDAL.IGeometry{ArchGDAL.wkbUnknown}, Int64, String}}, rowitr::ArchGDAL.IFeatureLayer)
   @ Tables ~/.julia/packages/Tables/gg6Id/src/fallbacks.jl:135
 [5] columns
   @ ~/.julia/packages/Tables/gg6Id/src/fallbacks.jl:253 [inlined]
 [6] DataFrame(x::ArchGDAL.IFeatureLayer; copycols::Nothing)
   @ DataFrames ~/.julia/packages/DataFrames/pVFzb/src/other/tables.jl:58
 [7] DataFrame(x::ArchGDAL.IFeatureLayer)
   @ DataFrames ~/.julia/packages/DataFrames/pVFzb/src/other/tables.jl:49
 [8] top-level scope
   @ REPL[74]:1

Option 1: Tables.schema returns nothing

There is an easy solution consisting in returning nothing in Tables.schema. But the DataFrame geometry column's type will be Union{Missing, ArchGDAL.IGeometry}

julia> function Tables.schema(layer::AG.AbstractFeatureLayer)::Union{Tables.Schema, Nothing}
           geom_names, field_names, featuredefn, fielddefns =
               AG.schema_names(AG.layerdefn(layer))
           ngeom = ArchGDAL.ngeom(featuredefn)
           if AG.shortname(AG.getdriver(Base.getfield(layer, :ownedby))) != "ESRI Shapefile"
               geom_types = (AG.IGeometry{AG.gettype(AG.getgeomdefn(AG.layerdefn(layer), i))} for i in 0:ngeom-1)
           else
               return nothing
           end
           field_types =
               (convert(DataType, AG.gettype(fielddefn)) for fielddefn in fielddefns)
           return Tables.Schema(
               (geom_names..., field_names...),
               (geom_types..., field_types...),
           )
       end

julia> df = DataFrame(layer)
3×3 DataFrame
 Row │                               id     name          
     │ IGeometr…?                    Int64  String        
─────┼────────────────────────────────────────────────────
   1 │ Geometry: wkbLineString           1  line
   2 │ Geometry: wkbMultiLineString      2  multilineline
   3 │ missing                           3  emptygeom

julia> eltype(df[:, 1])
Union{Missing, ArchGDAL.IGeometry}

julia> df[1, 1]
Geometry: LINESTRING (1 2,2 3,3 4)

Option 2: geom types set to Union of all features geometry types

Since the modifications introduced recently in ArchGDAL.jl seemed to have detailed geometry types, we can build the geometry column's type through Tables.schema by extracting a Union from all features' geometry types

julia> function Tables.schema(layer::AG.AbstractFeatureLayer)::Union{Tables.Schema, Nothing}
           geom_names, field_names, featuredefn, fielddefns =
               AG.schema_names(AG.layerdefn(layer))
           ngeom = ArchGDAL.ngeom(featuredefn)
           if AG.shortname(AG.getdriver(Base.getfield(layer, :ownedby))) != "ESRI Shapefile"
               geom_types = (AG.IGeometry{AG.gettype(AG.getgeomdefn(AG.layerdefn(layer), i))} for i in 0:ngeom-1)
           else
               geom_types_sets = Vector{Set{DataType}}(undef, ngeom)
               fill!(geom_types_sets, Set{DataType}())
               for j in 1:ngeom
                   for i in 0:ArchGDAL.nfeature(layer)-1
                       AG.getfeature(layer, i) do feature
                           feature_type::DataType = AG.IGeometry{AG.getgeomtype(AG.getgeom(feature))}
                           feature_type == AG.IGeometry{AG.wkbUnknown} && (feature_type = Missing)
                           push!(geom_types_sets[j], feature_type)
                       end
                   end
               end
               geom_types = (Union{gts...} for gts in geom_types_sets)
           end
           field_types =
               (convert(DataType, AG.gettype(fielddefn)) for fielddefn in fielddefns)
           return Tables.Schema(
               (geom_names..., field_names...),
               (geom_types..., field_types...),
           )
       end

julia> df = DataFrame(layer)
3×3 DataFrame
 Row │                               id     name          
     │ Union…?                       Int64  String        
─────┼────────────────────────────────────────────────────
   1 │ Geometry: wkbLineString           1  line
   2 │ Geometry: wkbMultiLineString      2  multilineline
   3 │ missing                           3  emptygeom

julia> eltype(df[:, 1])
Union{Missing, ArchGDAL.IGeometry{ArchGDAL.wkbMultiLineString}, ArchGDAL.IGeometry{ArchGDAL.wkbLineString}}

julia> df[1, 1]
Geometry: LINESTRING (1 2,2 3,3 4)

Performance

Benchmarking with the road.shp ESRI shapefile that can be found here (French motorways) we can see that Option 2 has a 20% performance overhead.

julia> road_ds = AG.read("road/road.shp")
       road_layer = AG.getlayer(road_ds, 0)
Layer: road
  Geometry 0 (): [wkbLineString], LINESTRING (874644.4...), ...
     Field 0 (gid): [OFTInteger], 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, ...
     Field 1 (roadcode): [OFTString], 01A903905CD, 01A903905CD, 01A903905CD, ...
     Field 2 (plod): [OFTString], DB1, DB2, DB3, DB4, DB1, DB2, DB1, DB2, ...
     Field 3 (absd): [OFTReal], 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
     Field 4 (plof): [OFTString], FB1, FB2, FB3, FB4, FB1, FB2, FB1, FB2, ...
...
 Number of Fields: 7

julia> using BenchmarkTools

julia> function Tables.schema(layer::AG.AbstractFeatureLayer)::Union{Tables.Schema, Nothing}
           geom_names, field_names, featuredefn, fielddefns =
               AG.schema_names(AG.layerdefn(layer))
           ngeom = ArchGDAL.ngeom(featuredefn)
           if AG.shortname(AG.getdriver(Base.getfield(layer, :ownedby))) != "ESRI Shapefile"
               geom_types = (AG.IGeometry{AG.gettype(AG.getgeomdefn(AG.layerdefn(layer), i))} for i in 0:ngeom-1)
           else
               return nothing
           end
           field_types =
               (convert(DataType, AG.gettype(fielddefn)) for fielddefn in fielddefns)
           return Tables.Schema(
               (geom_names..., field_names...),
               (geom_types..., field_types...),
           )
       end

julia> @benchmark DataFrame($road_layer)
BenchmarkTools.Trial: 8 samples with 1 evaluation.
 Range (min … max):  685.176 ms … 725.277 ms  ┊ GC (min … max): 5.65% … 6.50%
 Time  (median):     706.511 ms               ┊ GC (median):    6.55%
 Time  (mean ± σ):   705.000 ms ±  14.271 ms  ┊ GC (mean ± σ):  6.46% ± 1.18%
  ▁    ▁         ▁           ▁         ▁         █            ▁  
  █▁▁▁▁█▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  685 ms           Histogram: frequency by time          725 ms <

 Memory estimate: 237.74 MiB, allocs estimate: 3035032.

julia> function Tables.schema(layer::AG.AbstractFeatureLayer)::Union{Tables.Schema, Nothing}
           geom_names, field_names, featuredefn, fielddefns =
               AG.schema_names(AG.layerdefn(layer))
           ngeom = ArchGDAL.ngeom(featuredefn)
           if AG.shortname(AG.getdriver(Base.getfield(layer, :ownedby))) != "ESRI Shapefile"
               geom_types = (AG.IGeometry{AG.gettype(AG.getgeomdefn(AG.layerdefn(layer), i))} for i in 0:ngeom-1)
           else
               geom_types_sets = Vector{Set{DataType}}(undef, ngeom)
               fill!(geom_types_sets, Set{DataType}())
               for j in 1:ngeom
                   for i in 0:ArchGDAL.nfeature(layer)-1
                       AG.getfeature(layer, i) do feature
                           feature_type::DataType = AG.IGeometry{AG.getgeomtype(AG.getgeom(feature))}
                           feature_type == AG.IGeometry{AG.wkbUnknown} && (feature_type = Missing)
                           push!(geom_types_sets[j], feature_type)
                       end
                   end
               end
               geom_types = (Union{gts...} for gts in geom_types_sets)
           end
           field_types =
               (convert(DataType, AG.gettype(fielddefn)) for fielddefn in fielddefns)
           return Tables.Schema(
               (geom_names..., field_names...),
               (geom_types..., field_types...),
           )
       end

julia> @benchmark DataFrame($road_layer)
BenchmarkTools.Trial: 7 samples with 1 evaluation.
 Range (min … max):  816.748 ms … 843.214 ms  ┊ GC (min … max): 6.17% … 8.40%
 Time  (median):     829.371 ms               ┊ GC (median):    7.39%
 Time  (mean ± σ):   830.147 ms ±   9.908 ms  ┊ GC (mean ± σ):  7.28% ± 0.99%
  █         █             █    ██                            ██  
  █▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁██▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁██ ▁
  817 ms           Histogram: frequency by time          843 ms <

 Memory estimate: 270.52 MiB, allocs estimate: 3250798.

Browsing DataFrames.jl code, I haven't been able to find the use of the Tables.Schema, so I don't know if there any side effects with option 1.

And I don't know the benefit of having the detailed type in the DataFrame geom column type, as with option 2.

If one of the above options suits you, I can make a Pull Request

yeesian commented 3 years ago

I'm kind of leaning towards option 1 (of the three options in https://github.com/yeesian/ArchGDAL.jl/issues/223#issuecomment-888002541). @visr @evetion do you have any thoughts on the matter? In the case of a multi-type column, my hunch is that it isn't very helpful to dispatch on the column's type, and it isn't as important to be super precise about the column type for option 2 to be worth it.

evetion commented 3 years ago

Compliments for the research @mathieu17g ! 🥇

I'm not sure which option is best. I just like to automatically handle geometries, for example, make sure that all multipolygons are flattened to single polygons where applicable using dispatch. Neither option 1 (missing?) and 2 (union) will really help there, but option 2 seems the more correct of the two.

mathieu17g commented 3 years ago

There is a way to get option 1's performance (+~5% overhead) and option 2's detailed geometry. It is to define Base.promote_rule for each combination of IGeometry subtypes. Promote rules could be implemented with a macro similar to one used to implement convert functions.

Note: I don't see the benefit of implementing the Tables.schema function. See at the end of my comment the benchmark for a cleaned road.shp, between master's Table.schema and a one returning nothing

With the same shapefile as used in previous comment:

julia> road_ds = AG.read("road/road.shp")
       road_layer = AG.getlayer(road_ds, 0)
Layer: road
  Geometry 0 (): [wkbLineString], LINESTRING (874644.4...), ...
     Field 0 (gid): [OFTInteger], 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, ...
     Field 1 (roadcode): [OFTString], 01A903905CD, 01A903905CD, 01A903905CD, ...
     Field 2 (plod): [OFTString], DB1, DB2, DB3, DB4, DB1, DB2, DB1, DB2, ...
     Field 3 (absd): [OFTReal], 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
     Field 4 (plof): [OFTString], FB1, FB2, FB3, FB4, FB1, FB2, FB1, FB2, ...
...
 Number of Fields: 7

julia> using BenchmarkTools

Option 1

julia> df = DataFrame(road_layer)
       eltype(df[:, 1])
Union{Missing, ArchGDAL.IGeometry}

julia> @benchmark DataFrame($road_layer)
BenchmarkTools.Trial: 9 samples with 1 evaluation.
 Range (min … max):  600.909 ms … 618.806 ms  ┊ GC (min … max): 6.83% … 7.10%
 Time  (median):     612.044 ms               ┊ GC (median):    6.88%
 Time  (mean ± σ):   611.173 ms ±   6.829 ms  ┊ GC (mean ± σ):  7.03% ± 0.97%

  █       █      █           █         █        █           ███  
  █▁▁▁▁▁▁▁█▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁███ ▁
  601 ms           Histogram: frequency by time          619 ms <

 Memory estimate: 237.74 MiB, allocs estimate: 3035029.

Option 1 with promote_rule

julia> Base.promote_rule(::Type{ArchGDAL.IGeometry{ArchGDAL.wkbLineString}}, ::Type{ArchGDAL.IGeometry{ArchGDAL.wkbMultiLineString}}) = Union{ArchGDAL.IGeometry{ArchGDAL.wkbLineString}, ArchGDAL.IGeometry{ArchGDAL.wkbMultiLineString}}

julia> df = DataFrame(road_layer)
       eltype(df[:, 1])
Union{Missing, ArchGDAL.IGeometry{ArchGDAL.wkbLineString}, ArchGDAL.IGeometry{ArchGDAL.wkbMultiLineString}}

julia> @benchmark DataFrame($road_layer)
BenchmarkTools.Trial: 8 samples with 1 evaluation.
 Range (min … max):  615.584 ms … 654.362 ms  ┊ GC (min … max): 5.50% … 8.12%
 Time  (median):     641.108 ms               ┊ GC (median):    6.62%
 Time  (mean ± σ):   639.500 ms ±  12.582 ms  ┊ GC (mean ± σ):  7.00% ± 1.12%

  █                        █ █        █      █     █      █   █  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁█▁▁▁▁▁▁▁▁█▁▁▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁▁█▁▁▁█ ▁
  616 ms           Histogram: frequency by time          654 ms <

 Memory estimate: 237.74 MiB, allocs estimate: 3035027.

Option 2

julia> function Tables.schema(layer::AG.AbstractFeatureLayer)::Union{Tables.Schema, Nothing}
           geom_names, field_names, featuredefn, fielddefns =
               AG.schema_names(AG.layerdefn(layer))
           ngeom = ArchGDAL.ngeom(featuredefn)
           if AG.shortname(AG.getdriver(Base.getfield(layer, :ownedby))) != "ESRI Shapefile"
               geom_types = (AG.IGeometry{AG.gettype(AG.getgeomdefn(AG.layerdefn(layer), i))} for i in 0:ngeom-1)
           else
               geom_types_sets = Vector{Set{DataType}}(undef, ngeom)
               fill!(geom_types_sets, Set{DataType}())
               for j in 1:ngeom
                   for i in 0:ArchGDAL.nfeature(layer)-1
                       AG.getfeature(layer, i) do feature
                           feature_type::DataType = AG.IGeometry{AG.getgeomtype(AG.getgeom(feature))}
                           feature_type == AG.IGeometry{AG.wkbUnknown} && (feature_type = Missing)
                           push!(geom_types_sets[j], feature_type)
                       end
                   end
               end
               geom_types = (Union{gts...} for gts in geom_types_sets)
           end
           field_types =
               (convert(DataType, AG.gettype(fielddefn)) for fielddefn in fielddefns)
           return Tables.Schema(
               (geom_names..., field_names...),
               (geom_types..., field_types...),
           )
       end

julia> df = DataFrame(road_layer)
       eltype(df[:, 1])
Union{Missing, ArchGDAL.IGeometry{ArchGDAL.wkbLineString}, ArchGDAL.IGeometry{ArchGDAL.wkbMultiLineString}}

julia> @benchmark DataFrame($road_layer)
BenchmarkTools.Trial: 7 samples with 1 evaluation.
 Range (min … max):  773.121 ms … 802.173 ms  ┊ GC (min … max): 5.94% … 8.23%
 Time  (median):     787.602 ms               ┊ GC (median):    7.57%
 Time  (mean ± σ):   788.342 ms ±  11.600 ms  ┊ GC (mean ± σ):  7.83% ± 1.21%
  █    █                █       █                 █       █   █  
  █▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁█▁▁▁█ ▁
  773 ms           Histogram: frequency by time          802 ms <

 Memory estimate: 270.52 MiB, allocs estimate: 3250798.

Is Tables.Schema useful ?

In the end, I don't see the benefit of implementing the Tables.schema function. I have checked the difference between building a Tables.Schema for the shapefile road.shp stripped of features with wkbMultiLineString and missing geoms. It is not signifiant ( < σ )

julia> @benchmark DataFrame($layer)
BenchmarkTools.Trial: 9 samples with 1 evaluation.
 Range (min … max):  572.249 ms … 600.157 ms  ┊ GC (min … max): 5.75% … 8.35%
 Time  (median):     588.713 ms               ┊ GC (median):    7.96%
 Time  (mean ± σ):   587.146 ms ±   8.671 ms  ┊ GC (mean ± σ):  7.58% ± 1.00%

  █              █ █           █     █     █  █  █            █  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁█▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁█▁▁█▁▁█▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  572 ms           Histogram: frequency by time          600 ms <

 Memory estimate: 236.63 MiB, allocs estimate: 3028884.

julia> function Tables.schema(layer::AG.AbstractFeatureLayer)::Union{Tables.Schema, Nothing}
           return nothing
       end

julia> @benchmark DataFrame($layer)
BenchmarkTools.Trial: 9 samples with 1 evaluation.
 Range (min … max):  578.874 ms … 602.971 ms  ┊ GC (min … max): 6.94% … 7.42%
 Time  (median):     593.821 ms               ┊ GC (median):    7.95%
 Time  (mean ± σ):   590.695 ms ±   9.017 ms  ┊ GC (mean ± σ):  7.70% ± 1.03%

  █     █  █  █                        █    █      ██         █  
  █▁▁▁▁▁█▁▁█▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁█▁▁▁▁▁▁██▁▁▁▁▁▁▁▁▁█ ▁
  579 ms           Histogram: frequency by time          603 ms <

 Memory estimate: 236.00 MiB, allocs estimate: 3015267.
visr commented 3 years ago

Thanks for the investigations @mathieu17g. I find it hard to really oversee this issue and its context. Let me try writing it out a bit.

The main related PR here is #160. Before this there was only Geometry (like in GDAL) and this was reflected in Tables.schema. In that PR we added an enum in the type parameter, Geometry{OGRwkbGeometryType}, to be able to give more information for dispatch. I believe this type parameter is not currently used by ArchGDAL, but would be of interest to downstream packages or users.

Now this extra information seems to be leading to this issue, for shapefiles with mixed geometries. When using the Tables API to convert to a DataFrame, DataFrames will use Tables.schema to prepare a Vector{ArchGDAL.IGeometry{ArchGDAL.wkbLineString}} to hold all the geometries. This is a documented usecase of Tables.schema. The wkbLineString here is reported by GDAL itself but does not seem fully correct as documented in the shapefile driver docs. So then when it loops over the features and encounters a wkbMultiLineString and tries to put it in the geometry column it fails because the type does not match. In reality this should be no issue since they are all just Ptr{Cvoid} anyway, through here and here.

If geometries have to really be converted, there is ArchGDAL.forceto, but that is not the case here.

Is Tables.Schema useful?

In this benchmark indeed it is not. I don't have an example where it does make a big difference. The Tables documentation mentions it is especially useful for constructing similar tables. In principle we have the information to provide a schema, unfortunately in the case of shapefile it just may be wrong. I think my favorite options are 1 (without promote_rule), so give up on schema for shapefiles only. It is a bit unfortunate to have different behavior depending on the driver though, GDAL normally abstracts that. The other option is to consider reverting #160 altogether, though that would be more breaking, so we may want to avoid that.

I don't understand the value of defining promote_rule to the union of the two types. Isn't it supposed to pick the more general type that can represent both? That would be MultiLineString in case of LineString and MultiLineString.

mathieu17g commented 3 years ago

So then when it loops over the features and encounters a wkbMultiLineString and tries to put it in the geometry column it fails because the type does not match

It also fails for a feature with a missing geometry. It happens also for the "GeoJSON" driver. So the issue is not only about mixed LineString and MultiLineString or Polygon and MultiPolygon geometries allowed by the "ESRI Shapefile" driver with a featuredefn type set to LineString or MultiLineString, the missing geometry case has to be handled.

julia> open("missing_geom.geojson", "w") do f 
          print(f, 
               """{
       "type": "FeatureCollection",
       "name": "test_geoms",
       "features": [
       { "type": "Feature", "properties": { "id": 1, "name": "line1" }, "geometry": { "type": "LineString", "coordinates": [ [ 1.0, 2.0 ], [ 2.0, 3.0 ], [ 3.0, 4.0 ] ] } },
       { "type": "Feature", "properties": { "id": 2, "name": "line2" }, "geometry": { "type": "LineString", "coordinates": [ [ 3.0, 4.0 ], [ 4.0, 5.0 ], [ 5.0, 6.0 ] ] } },
       { "type": "Feature", "properties": { "id": 3, "name": "emptygeom" }, "geometry": null }
       ]
       }       """)
       end
       print(read("missing_geom.geojson", String))
{
"type": "FeatureCollection",
"name": "test_geoms",
"features": [
{ "type": "Feature", "properties": { "id": 1, "name": "line1" }, "geometry": { "type": "LineString", "coordinates": [ [ 1.0, 2.0 ], [ 2.0, 3.0 ], [ 3.0, 4.0 ] ] } },
{ "type": "Feature", "properties": { "id": 2, "name": "line2" }, "geometry": { "type": "LineString", "coordinates": [ [ 3.0, 4.0 ], [ 4.0, 5.0 ], [ 5.0, 6.0 ] ] } },
{ "type": "Feature", "properties": { "id": 3, "name": "emptygeom" }, "geometry": null }
]
}

julia> ds = AG.read("missing_geom.geojson")
GDAL Dataset (Driver: GeoJSON/GeoJSON)
File(s): 
  missing_geom.geojson

Number of feature layers: 1
  Layer 0: test_geoms (wkbLineString)

julia> layer = AG.getlayer(ds, 0)
Layer: test_geoms
  Geometry 0 (): [wkbLineString], LINESTRING (1 2,2 3,3 4), ...
     Field 0 (id): [OFTInteger], 1, 2, 3
     Field 1 (name): [OFTString], line1, line2, emptygeom

julia> using DataFrames

julia> DataFrame(layer)
ERROR: MethodError: Cannot `convert` an object of type Missing to an object of type ArchGDAL.IGeometry{ArchGDAL.wkbLineString}
Closest candidates are:
  convert(::Type{var"#s435"} where var"#s435"<:ArchGDAL.AbstractGeometry, ::GeoFormatTypes.GML) at /Users/Mathieu/.julia/packages/ArchGDAL/erkjx/src/convert.jl:48
  convert(::Type{var"#s435"} where var"#s435"<:ArchGDAL.AbstractGeometry, ::GeoFormatTypes.GeoJSON) at /Users/Mathieu/.julia/packages/ArchGDAL/erkjx/src/convert.jl:45
  convert(::Type{var"#s435"} where var"#s435"<:ArchGDAL.AbstractGeometry, ::GeoFormatTypes.WellKnownBinary) at /Users/Mathieu/.julia/packages/ArchGDAL/erkjx/src/convert.jl:42
  ...
Stacktrace:
 [1] setindex!(A::Vector{ArchGDAL.IGeometry{ArchGDAL.wkbLineString}}, x::Missing, i1::Int64)
   @ Base ./array.jl:839
 [2] add!
   @ ~/.julia/packages/Tables/E99hk/src/fallbacks.jl:127 [inlined]
 [3] eachcolumns
   @ ~/.julia/packages/Tables/E99hk/src/utils.jl:111 [inlined]
 [4] buildcolumns(schema::Tables.Schema{(Symbol(""), :id, :name), Tuple{ArchGDAL.IGeometry{ArchGDAL.wkbLineString}, Int32, String}}, rowitr::ArchGDAL.IFeatureLayer)
   @ Tables ~/.julia/packages/Tables/E99hk/src/fallbacks.jl:135
 [5] columns
   @ ~/.julia/packages/Tables/E99hk/src/fallbacks.jl:253 [inlined]
 [6] DataFrame(x::ArchGDAL.IFeatureLayer; copycols::Nothing)
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/other/tables.jl:58
 [7] DataFrame(x::ArchGDAL.IFeatureLayer)
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/other/tables.jl:49
 [8] top-level scope
   @ REPL[8]:1

In version 0.6.0 there were no missing value in the table. Instead getgeom in function nextnamedtuple was returning a IGeometry() when GDAL.ogr_f_getgeometryref(feature.ptr) == C_NULL, which is of IGeometry type In version 0.7.2 function Tables.getcolumn returns missing in case of missing geometry when geom.ptr == C_NULL

missing values are useful according to me, but they have to be handled on the Tables interface

By the way, missing values could also be set instead of "" in getdefault for fields whose fielddefn does not have any default value set

In reality this should be no issue since they are all just Ptr{Cvoid} anyway, through here and here.

If geometries have to really be converted, there is ArchGDAL.forceto, but that is not the case here.

Definitely :-)

Is Tables.Schema useful?

In this benchmark indeed it is not. I don't have an example where it does make a big difference. The Tables documentation mentions it is especially useful for constructing similar tables.

I have not found this in Tables.jl, but here in ArchGDAL Tables.columns(::ArchGDAL.IFeatureLayer) and Tables.rows(::ArchGDAL.IFeatureLayer) are not implemented.

Instead we rely on Tables.jl/src/fallback.jl implementations Tables.columns(x::T) where {T} and rows(x::T) where {T}.

A user of ArchGDAL Tables interface will most probably call one of those two generated functions which returns, besides columns or rows, a Tables.Schema generated if Tables.schema(::ArchGDAL.IFeatureLayer) returns nothing.

It is the case in DataFrames.jl there is no call to Tables.schema(::ArchGDAL.IFeatureLayer) before calling Tables.columns(::ArchGDAL.IFeatureLayer) in DataFrame(x::T; copycols::Union{Nothing, Bool}=nothing) where {T}. Maybe there are other users using Tables.schema before calling Tables.columns or Tables.rows. I will ask Tables.jl developers if we want to be thorough (https://github.com/JuliaData/Tables.jl/issues/245#issue-958855882)

If we assume that Tables.schema(::ArchGDAL.IFeatureLayer) is not used by any user of ArchGDAL Tables interface, the only optimization that it would bring is between buildcolumns(::Nothing, rowitr::T) where {T} and buildcolumns(schema, rowitr::T) where {T} in Tables.jl/src/fallback.jl.

This leads to implementing on ArchGDAL.jl side, at least to handle the missing geometry case, something faster than buildcolumns(::Nothing, rowitr::T) where {T} -> _buildcolumns -> eachcolumns / add_or_widen! -> __buildcolumns. I doubt that it could be done easily.

I don't understand the value of defining promote_rule to the union of the two types. Isn't it supposed to pick the more general type that can represent both? That would be MultiLineString in case of LineString and MultiLineString.

I agree, it looks like a dirty hack

Conclusion: unless Tables.jl developers point to any more concrete use of Tables.schema(::AG.IFeatureLayer) returning a Tables.Schema, I think it is best to return nothing all the time

Edit: Well I got some feedback from @quinnj (owner of Tables.jl) https://github.com/JuliaData/Tables.jl/issues/245#issuecomment-891814745 It seems that it may not be a bad idea to return nothing in Tables.schema(::AG.IFeatureLayer).
A Tables.Schema could be built within ArchGDAL.jl Table interface which may suits geometry handling needs in tabular data format better than the 'Tables.Schema' built in Tables.jl/src/fallback.jl. But since you are still discussing the best way to represent geometry objects within ArchGDAL.jl, it may worth postponing for later.

visr commented 3 years ago

It seems that it may not be a bad idea to return nothing in Tables.schema(::AG.IFeatureLayer).

Indeed sounds like it. I'd support that. Would be nice to switch to this since now the tables interface is broken for all the situations you indicated (not just shapefile). @evetion @yeesian any objections?

yeesian commented 3 years ago

I'd support returning nothing in Tables.schema(::AG.IFeatureLayer) (i.e. option 1 of https://github.com/yeesian/ArchGDAL.jl/issues/223#issuecomment-888002541) too.

visr commented 3 years ago

@yeesian note that the option 1 you linked to only returns nothing for shapefiles, whereas now the suggestion is to always return nothing, i.e. don't provide a schema. This because there is not only an issue with shapefiles, but also missing data for instance. GDAL doesn't really seem to support a fully accurate schema, therefore it is probably better to opt out and let the data speak for itself.

yeesian commented 3 years ago

Ah thanks for clarifying -- yeah it makes sense to return nothing for all the drivers since that's needed for not failing in the case of missing data.

mathieu17g commented 3 years ago

OK, I'm working on a PR with a few test cases testing from dataset creation to DataFrame. The existing test sets are already OK.

What about my proposition in https://github.com/yeesian/ArchGDAL.jl/issues/223#issuecomment-891589533

By the way, missing values could also be set instead of "" in getdefault for fields whose fielddefn does not have any default value set

Shall I make a second PR ?

visr commented 3 years ago

Great, thanks for working on the PR!

By the way, missing values could also be set instead of "" in getdefault for fields whose fielddefn does not have any default value set

Sorry, I missed this. To me it indeed sounds better to return missing instead of the empty string here. The GDAL docs say the return value is "default field value or NULL". I don't see getdefault being used in the tables interface. Is it supposed to be used there?

mathieu17g commented 3 years ago

The GDAL docs say the return value is "default field value or NULL". I don't see getdefault being used in the tables interface. Is it supposed to be used there?

Tables.jl/src/fallback.jl buildcolumns function with schema = nothing, will push missing to the Table

EDIT: Furthermore, it appears to be necessary with the GML driver to handle the NULL geometry case in geomname(::AbstractGeometry) and return missing. I will add it to the PR