Closed mathieu17g closed 3 years ago
Thanks for the clear report!
So do I understand correctly that for shapefiles in particular, the geometry type reported by GDAL can in some cases just be wrong? As far as I can see in https://github.com/yeesian/ArchGDAL.jl/blob/51f22608470e41f9c14e6a3b9372e6719bef9e1a/src/tables.jl we don't only look at the first feature, but use ogr_gfld_gettype
, right?
Having to go over all features to determine the schema doesn't sound great. Would it work if, for shapefiles only, we just always return a union of wkbLineString and wkbMultiLineString if the geometry type is reported as wkbLineString, and return a union of wkbPolygon and wkbMultiPolygon if the geometry type is reported as wkbPolygon? Or alternatively wkbUnknown in either case.
It seems efficient to handle the special case of the "ESRI shapefile" driver for wkbLineString
and wkbPolygon
Do you know how to get a layer's owner (dataset), in order to get its driver and avoid passing the dataset to Tables.schema
function ?
Do you know how to get a layer's owner (dataset)
The ownedby
field has a reference to this. But since getproperty
is overloaded for these types, you'll need to use getfield(layer, :ownedby)
.
I was also just thinking, instead of a union or an unknown, can we not just set the geometry type for shapefile linestring into multilinestring, and polygon into multipolygon? This is very much related, https://github.com/JuliaGeo/Shapefile.jl/pull/40, where the Shapefile Polygon is read to the GeometryBasics MultiPolygon because in the Shapefile it can have multiple outer rings.
Sorry I'm not currently available to work on this, but happy to review a pull request.
@visr
Thank you for pointing to Base.getfield
Regarding the idea of setting geometry type for shapefile linestring into multilinestring, and polygon into multipolygon, it may not work since one cannot convert a wkbLineString
to wkbMultiLineString
julia> using ArchGDAL; const AG = ArchGDAL
[ Info: Precompiling ArchGDAL [c9ce4bd3-c3d5-55b8-8973-c0e20141b8c3]
ArchGDAL
julia> ls = AG.createlinestring([(i,i+1) for i in 1.0:3.0])
convert(AG.IGeometry{AG.wkbMultiLineString}, ls)
ERROR: MethodError: Cannot `convert` an object of type
ArchGDAL.IGeometry{wkbLineString} to an object of type
ArchGDAL.IGeometry{wkbMultiLineString}
Closest candidates are:
convert(::Type{var"#s435"} where var"#s435"<:ArchGDAL.AbstractGeometry, ::GeoFormatTypes.GML) at /Users/Mathieu/.julia/packages/ArchGDAL/u1bzp/src/convert.jl:48
convert(::Type{var"#s435"} where var"#s435"<:ArchGDAL.AbstractGeometry, ::GeoFormatTypes.GeoJSON) at /Users/Mathieu/.julia/packages/ArchGDAL/u1bzp/src/convert.jl:45
convert(::Type{var"#s435"} where var"#s435"<:ArchGDAL.AbstractGeometry, ::GeoFormatTypes.WellKnownBinary) at /Users/Mathieu/.julia/packages/ArchGDAL/u1bzp/src/convert.jl:42
...
Stacktrace:
[1] top-level scope
@ REPL[39]:2
However, converting wkbLineString
or wbkMultiLineString
to a wkbUnknown
works
julia> ls = AG.createlinestring([(i,i+1) for i in 1.0:3.0])
convert(AG.IGeometry{AG.wkbUnknown}, ls)
Geometry: LINESTRING (1 2,2 3,3 4)
julia> mls = AG.createmultilinestring([[(i,i+1) for i in j:j+3] for j in 1.0:5.0:6.0])
convert(AG.IGeometry{AG.wkbUnknown}, mls)
Geometry: MULTILINESTRING ((1 2,2 3,3 4,4 5),(6 7,7 8,8 9,9 10))
IGeometry{AG.wkbUnknown}
in Tables.SchemaThe conversion of a layer with mixed wkbLineString
and wkbMultiLineString
features and a layerdefn with wbkLineString
geometry type, works also when setting the geomtype to wkbUnkown
in the Tables.Schema
, providing that the layer does not contain any empty geometry
julia> """
get_mixed_geom_ds_with_shp_driver_and_built_from_scratch()::AG.IDataset
Build a ESRI shapefile from scratch with one layer containing features with:
- one geom of type `wkbUnknown`,
- two fields `id::Int64` and `name::String`
Add two features, one with geom type `wkbLineString` and the other with geom type `wkbMultiLineString`
If `withmissing=true`, add a feature without any geometry
Returns the corresponding dataset
"""
function get_mixed_geom_ds_with_shp_driver_and_built_from_scratch(;
drvshortname::AbstractString="ESRI shapefile",
withmissing::Bool
)::AG.IDataset
# Clean previously created files for new dataset
isdir("mixed_geom") && rm("mixed_geom", recursive=true)
# Create dataset with ESRI shapefile driver from scratch with one layer
# containing features with geom of type AG.wkbUnknown and two fields id (Int64) and name (String)
# Add two features, one with geom type wkbLineString and the other with geom type wkbMultiLineString
AG.create(
"mixed_geom";
driver=AG.getdriver(drvshortname),
) do newdataset
AG.createlayer(
name="mixed_geom",
dataset=newdataset,
geom=AG.wkbUnknown
) do newlayer
# @show AG.layerdefn(newlayer)
# Add field definitions to layer
AG.createfielddefn("id", AG.OFTInteger64) do newfielddefn
AG.addfielddefn!(newlayer, newfielddefn)
end
AG.createfielddefn("name", AG.OFTString) do newfielddefn
AG.addfielddefn!(newlayer, newfielddefn)
end
# Add features to layer
AG.addfeature(newlayer) do newfeature
AG.setfield!(newfeature, 0, 1)
AG.setfield!(newfeature, 1, "line")
AG.setgeom!(newfeature, 0, AG.createlinestring([(i,i+1) for i in 1.0:3.0]))
end AG.addfeature(newlayer) do newfeature
AG.setfield!(newfeature, 0, 2)
AG.setfield!(newfeature, 1, "multilineline")
AG.setgeom!(newfeature, AG.createmultilinestring([[(i,i+1) for i in j:j+3] for j in 1.0:5.0:6.0]))
end
if withmissing
AG.addfeature(newlayer) do newfeature
AG.setfield!(newfeature, 0, 3)
AG.setfield!(newfeature, 1, "emptygeom")
end
end
end
end
return AG.read("mixed_geom")
end
get_mixed_geom_ds_with_shp_driver_and_built_from_scratch
julia> function Tables.schema(layer::AG.AbstractFeatureLayer)::Union{Tables.Schema, Nothing}
geom_names, field_names, featuredefn, fielddefns =
AG.schema_names(AG.layerdefn(layer))
ngeom = ArchGDAL.ngeom(featuredefn)
if AG.shortname(AG.getdriver(Base.getfield(layer, :ownedby))) != "ESRI Shapefile"
geom_types = (AG.IGeometry{AG.gettype(AG.getgeomdefn(AG.layerdefn(layer), i))} for i in 0:ngeom-1)
else
geom_types = (AG.IGeometry{AG.wkbUnknown} for _ in 1:ngeom)
end
field_types =
(convert(DataType, AG.gettype(fielddefn)) for fielddefn in fielddefns)
return Tables.Schema(
(geom_names..., field_names...),
(geom_types..., field_types...),
)
end
julia> using DataFrames
julia> ds = get_mixed_geom_ds_with_shp_driver_and_built_from_scratch(;withmissing=false)
layer = AG.getlayer(ds, 0)
Layer: mixed_geom
Geometry 0 (): [wkbLineString], LINESTRING (1 2,2 3,3 4), ...
Field 0 (id): [OFTInteger64], 1, 2
Field 1 (name): [OFTString], line, multilineline
julia> df = DataFrame(layer)
2×3 DataFrame
Row │ id name
│ IGeometr… Int64 String
─────┼────────────────────────────────────────────
1 │ Geometry: wkbUnknown 1 line
2 │ Geometry: wkbUnknown 2 multilineline
julia> eltype(df[:, 1])
ArchGDAL.IGeometry{ArchGDAL.wkbUnknown}
julia> df[1, 1]
Geometry: LINESTRING (1 2,2 3,3 4)
But it fails when the layer have missing geometries
julia> ds = get_mixed_geom_ds_with_shp_driver_and_built_from_scratch(;withmissing=true)
layer = AG.getlayer(ds, 0)
Layer: mixed_geom
Geometry 0 (): [wkbLineString], LINESTRING (1 2,2 3,3 4), ...
Field 0 (id): [OFTInteger64], 1, 2, 3
Field 1 (name): [OFTString], line, multilineline, emptygeom
julia> DataFrame(layer)
ERROR: MethodError: Cannot `convert` an object of type Missing to an object of type ArchGDAL.IGeometry{ArchGDAL.wkbUnknown}
Closest candidates are:
convert(::Type{var"#s435"} where var"#s435"<:ArchGDAL.AbstractGeometry, ::GeoFormatTypes.GML) at /Users/Mathieu/.julia/packages/ArchGDAL/u1bzp/src/convert.jl:48
convert(::Type{var"#s435"} where var"#s435"<:ArchGDAL.AbstractGeometry, ::GeoFormatTypes.GeoJSON) at /Users/Mathieu/.julia/packages/ArchGDAL/u1bzp/src/convert.jl:45
convert(::Type{var"#s435"} where var"#s435"<:ArchGDAL.AbstractGeometry, ::GeoFormatTypes.WellKnownBinary) at /Users/Mathieu/.julia/packages/ArchGDAL/u1bzp/src/convert.jl:42
...
Stacktrace:
[1] setindex!(A::Vector{ArchGDAL.IGeometry{ArchGDAL.wkbUnknown}}, x::Missing, i1::Int64)
@ Base ./array.jl:839
[2] add!
@ ~/.julia/packages/Tables/gg6Id/src/fallbacks.jl:127 [inlined]
[3] eachcolumns
@ ~/.julia/packages/Tables/gg6Id/src/utils.jl:111 [inlined]
[4] buildcolumns(schema::Tables.Schema{(Symbol(""), :id, :name), Tuple{ArchGDAL.IGeometry{ArchGDAL.wkbUnknown}, Int64, String}}, rowitr::ArchGDAL.IFeatureLayer)
@ Tables ~/.julia/packages/Tables/gg6Id/src/fallbacks.jl:135
[5] columns
@ ~/.julia/packages/Tables/gg6Id/src/fallbacks.jl:253 [inlined]
[6] DataFrame(x::ArchGDAL.IFeatureLayer; copycols::Nothing)
@ DataFrames ~/.julia/packages/DataFrames/pVFzb/src/other/tables.jl:58
[7] DataFrame(x::ArchGDAL.IFeatureLayer)
@ DataFrames ~/.julia/packages/DataFrames/pVFzb/src/other/tables.jl:49
[8] top-level scope
@ REPL[74]:1
There is an easy solution consisting in returning nothing
in Tables.schema
. But the DataFrame geometry column's type will be Union{Missing, ArchGDAL.IGeometry}
julia> function Tables.schema(layer::AG.AbstractFeatureLayer)::Union{Tables.Schema, Nothing}
geom_names, field_names, featuredefn, fielddefns =
AG.schema_names(AG.layerdefn(layer))
ngeom = ArchGDAL.ngeom(featuredefn)
if AG.shortname(AG.getdriver(Base.getfield(layer, :ownedby))) != "ESRI Shapefile"
geom_types = (AG.IGeometry{AG.gettype(AG.getgeomdefn(AG.layerdefn(layer), i))} for i in 0:ngeom-1)
else
return nothing
end
field_types =
(convert(DataType, AG.gettype(fielddefn)) for fielddefn in fielddefns)
return Tables.Schema(
(geom_names..., field_names...),
(geom_types..., field_types...),
)
end
julia> df = DataFrame(layer)
3×3 DataFrame
Row │ id name
│ IGeometr…? Int64 String
─────┼────────────────────────────────────────────────────
1 │ Geometry: wkbLineString 1 line
2 │ Geometry: wkbMultiLineString 2 multilineline
3 │ missing 3 emptygeom
julia> eltype(df[:, 1])
Union{Missing, ArchGDAL.IGeometry}
julia> df[1, 1]
Geometry: LINESTRING (1 2,2 3,3 4)
Since the modifications introduced recently in ArchGDAL.jl seemed to have detailed geometry types, we can build the geometry column's type through Tables.schema
by extracting a Union
from all features' geometry types
julia> function Tables.schema(layer::AG.AbstractFeatureLayer)::Union{Tables.Schema, Nothing}
geom_names, field_names, featuredefn, fielddefns =
AG.schema_names(AG.layerdefn(layer))
ngeom = ArchGDAL.ngeom(featuredefn)
if AG.shortname(AG.getdriver(Base.getfield(layer, :ownedby))) != "ESRI Shapefile"
geom_types = (AG.IGeometry{AG.gettype(AG.getgeomdefn(AG.layerdefn(layer), i))} for i in 0:ngeom-1)
else
geom_types_sets = Vector{Set{DataType}}(undef, ngeom)
fill!(geom_types_sets, Set{DataType}())
for j in 1:ngeom
for i in 0:ArchGDAL.nfeature(layer)-1
AG.getfeature(layer, i) do feature
feature_type::DataType = AG.IGeometry{AG.getgeomtype(AG.getgeom(feature))}
feature_type == AG.IGeometry{AG.wkbUnknown} && (feature_type = Missing)
push!(geom_types_sets[j], feature_type)
end
end
end
geom_types = (Union{gts...} for gts in geom_types_sets)
end
field_types =
(convert(DataType, AG.gettype(fielddefn)) for fielddefn in fielddefns)
return Tables.Schema(
(geom_names..., field_names...),
(geom_types..., field_types...),
)
end
julia> df = DataFrame(layer)
3×3 DataFrame
Row │ id name
│ Union…? Int64 String
─────┼────────────────────────────────────────────────────
1 │ Geometry: wkbLineString 1 line
2 │ Geometry: wkbMultiLineString 2 multilineline
3 │ missing 3 emptygeom
julia> eltype(df[:, 1])
Union{Missing, ArchGDAL.IGeometry{ArchGDAL.wkbMultiLineString}, ArchGDAL.IGeometry{ArchGDAL.wkbLineString}}
julia> df[1, 1]
Geometry: LINESTRING (1 2,2 3,3 4)
Benchmarking with the road.shp
ESRI shapefile that can be found here (French motorways) we can see that Option 2 has a 20% performance overhead.
julia> road_ds = AG.read("road/road.shp")
road_layer = AG.getlayer(road_ds, 0)
Layer: road
Geometry 0 (): [wkbLineString], LINESTRING (874644.4...), ...
Field 0 (gid): [OFTInteger], 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, ...
Field 1 (roadcode): [OFTString], 01A903905CD, 01A903905CD, 01A903905CD, ...
Field 2 (plod): [OFTString], DB1, DB2, DB3, DB4, DB1, DB2, DB1, DB2, ...
Field 3 (absd): [OFTReal], 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
Field 4 (plof): [OFTString], FB1, FB2, FB3, FB4, FB1, FB2, FB1, FB2, ...
...
Number of Fields: 7
julia> using BenchmarkTools
julia> function Tables.schema(layer::AG.AbstractFeatureLayer)::Union{Tables.Schema, Nothing}
geom_names, field_names, featuredefn, fielddefns =
AG.schema_names(AG.layerdefn(layer))
ngeom = ArchGDAL.ngeom(featuredefn)
if AG.shortname(AG.getdriver(Base.getfield(layer, :ownedby))) != "ESRI Shapefile"
geom_types = (AG.IGeometry{AG.gettype(AG.getgeomdefn(AG.layerdefn(layer), i))} for i in 0:ngeom-1)
else
return nothing
end
field_types =
(convert(DataType, AG.gettype(fielddefn)) for fielddefn in fielddefns)
return Tables.Schema(
(geom_names..., field_names...),
(geom_types..., field_types...),
)
end
julia> @benchmark DataFrame($road_layer)
BenchmarkTools.Trial: 8 samples with 1 evaluation.
Range (min … max): 685.176 ms … 725.277 ms ┊ GC (min … max): 5.65% … 6.50%
Time (median): 706.511 ms ┊ GC (median): 6.55%
Time (mean ± σ): 705.000 ms ± 14.271 ms ┊ GC (mean ± σ): 6.46% ± 1.18%
▁ ▁ ▁ ▁ ▁ █ ▁
█▁▁▁▁█▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
685 ms Histogram: frequency by time 725 ms <
Memory estimate: 237.74 MiB, allocs estimate: 3035032.
julia> function Tables.schema(layer::AG.AbstractFeatureLayer)::Union{Tables.Schema, Nothing}
geom_names, field_names, featuredefn, fielddefns =
AG.schema_names(AG.layerdefn(layer))
ngeom = ArchGDAL.ngeom(featuredefn)
if AG.shortname(AG.getdriver(Base.getfield(layer, :ownedby))) != "ESRI Shapefile"
geom_types = (AG.IGeometry{AG.gettype(AG.getgeomdefn(AG.layerdefn(layer), i))} for i in 0:ngeom-1)
else
geom_types_sets = Vector{Set{DataType}}(undef, ngeom)
fill!(geom_types_sets, Set{DataType}())
for j in 1:ngeom
for i in 0:ArchGDAL.nfeature(layer)-1
AG.getfeature(layer, i) do feature
feature_type::DataType = AG.IGeometry{AG.getgeomtype(AG.getgeom(feature))}
feature_type == AG.IGeometry{AG.wkbUnknown} && (feature_type = Missing)
push!(geom_types_sets[j], feature_type)
end
end
end
geom_types = (Union{gts...} for gts in geom_types_sets)
end
field_types =
(convert(DataType, AG.gettype(fielddefn)) for fielddefn in fielddefns)
return Tables.Schema(
(geom_names..., field_names...),
(geom_types..., field_types...),
)
end
julia> @benchmark DataFrame($road_layer)
BenchmarkTools.Trial: 7 samples with 1 evaluation.
Range (min … max): 816.748 ms … 843.214 ms ┊ GC (min … max): 6.17% … 8.40%
Time (median): 829.371 ms ┊ GC (median): 7.39%
Time (mean ± σ): 830.147 ms ± 9.908 ms ┊ GC (mean ± σ): 7.28% ± 0.99%
█ █ █ ██ ██
█▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁██▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁██ ▁
817 ms Histogram: frequency by time 843 ms <
Memory estimate: 270.52 MiB, allocs estimate: 3250798.
Browsing DataFrames.jl code, I haven't been able to find the use of the Tables.Schema, so I don't know if there any side effects with option 1.
And I don't know the benefit of having the detailed type in the DataFrame geom column type, as with option 2.
If one of the above options suits you, I can make a Pull Request
I'm kind of leaning towards option 1 (of the three options in https://github.com/yeesian/ArchGDAL.jl/issues/223#issuecomment-888002541). @visr @evetion do you have any thoughts on the matter? In the case of a multi-type column, my hunch is that it isn't very helpful to dispatch on the column's type, and it isn't as important to be super precise about the column type for option 2 to be worth it.
Compliments for the research @mathieu17g ! 🥇
I'm not sure which option is best. I just like to automatically handle geometries, for example, make sure that all multipolygons are flattened to single polygons where applicable using dispatch. Neither option 1 (missing?) and 2 (union) will really help there, but option 2 seems the more correct of the two.
There is a way to get option 1's performance (+~5% overhead) and option 2's detailed geometry. It is to define Base.promote_rule
for each combination of IGeometry
subtypes. Promote rules could be implemented with a macro similar to one used to implement convert
functions.
Note: I don't see the benefit of implementing the Tables.schema
function. See at the end of my comment the benchmark for a cleaned road.shp, between master's Table.schema and a one returning nothing
With the same shapefile as used in previous comment:
julia> road_ds = AG.read("road/road.shp")
road_layer = AG.getlayer(road_ds, 0)
Layer: road
Geometry 0 (): [wkbLineString], LINESTRING (874644.4...), ...
Field 0 (gid): [OFTInteger], 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, ...
Field 1 (roadcode): [OFTString], 01A903905CD, 01A903905CD, 01A903905CD, ...
Field 2 (plod): [OFTString], DB1, DB2, DB3, DB4, DB1, DB2, DB1, DB2, ...
Field 3 (absd): [OFTReal], 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
Field 4 (plof): [OFTString], FB1, FB2, FB3, FB4, FB1, FB2, FB1, FB2, ...
...
Number of Fields: 7
julia> using BenchmarkTools
julia> df = DataFrame(road_layer)
eltype(df[:, 1])
Union{Missing, ArchGDAL.IGeometry}
julia> @benchmark DataFrame($road_layer)
BenchmarkTools.Trial: 9 samples with 1 evaluation.
Range (min … max): 600.909 ms … 618.806 ms ┊ GC (min … max): 6.83% … 7.10%
Time (median): 612.044 ms ┊ GC (median): 6.88%
Time (mean ± σ): 611.173 ms ± 6.829 ms ┊ GC (mean ± σ): 7.03% ± 0.97%
█ █ █ █ █ █ ███
█▁▁▁▁▁▁▁█▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁███ ▁
601 ms Histogram: frequency by time 619 ms <
Memory estimate: 237.74 MiB, allocs estimate: 3035029.
julia> Base.promote_rule(::Type{ArchGDAL.IGeometry{ArchGDAL.wkbLineString}}, ::Type{ArchGDAL.IGeometry{ArchGDAL.wkbMultiLineString}}) = Union{ArchGDAL.IGeometry{ArchGDAL.wkbLineString}, ArchGDAL.IGeometry{ArchGDAL.wkbMultiLineString}}
julia> df = DataFrame(road_layer)
eltype(df[:, 1])
Union{Missing, ArchGDAL.IGeometry{ArchGDAL.wkbLineString}, ArchGDAL.IGeometry{ArchGDAL.wkbMultiLineString}}
julia> @benchmark DataFrame($road_layer)
BenchmarkTools.Trial: 8 samples with 1 evaluation.
Range (min … max): 615.584 ms … 654.362 ms ┊ GC (min … max): 5.50% … 8.12%
Time (median): 641.108 ms ┊ GC (median): 6.62%
Time (mean ± σ): 639.500 ms ± 12.582 ms ┊ GC (mean ± σ): 7.00% ± 1.12%
█ █ █ █ █ █ █ █
█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁█▁▁▁▁▁▁▁▁█▁▁▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁▁█▁▁▁█ ▁
616 ms Histogram: frequency by time 654 ms <
Memory estimate: 237.74 MiB, allocs estimate: 3035027.
julia> function Tables.schema(layer::AG.AbstractFeatureLayer)::Union{Tables.Schema, Nothing}
geom_names, field_names, featuredefn, fielddefns =
AG.schema_names(AG.layerdefn(layer))
ngeom = ArchGDAL.ngeom(featuredefn)
if AG.shortname(AG.getdriver(Base.getfield(layer, :ownedby))) != "ESRI Shapefile"
geom_types = (AG.IGeometry{AG.gettype(AG.getgeomdefn(AG.layerdefn(layer), i))} for i in 0:ngeom-1)
else
geom_types_sets = Vector{Set{DataType}}(undef, ngeom)
fill!(geom_types_sets, Set{DataType}())
for j in 1:ngeom
for i in 0:ArchGDAL.nfeature(layer)-1
AG.getfeature(layer, i) do feature
feature_type::DataType = AG.IGeometry{AG.getgeomtype(AG.getgeom(feature))}
feature_type == AG.IGeometry{AG.wkbUnknown} && (feature_type = Missing)
push!(geom_types_sets[j], feature_type)
end
end
end
geom_types = (Union{gts...} for gts in geom_types_sets)
end
field_types =
(convert(DataType, AG.gettype(fielddefn)) for fielddefn in fielddefns)
return Tables.Schema(
(geom_names..., field_names...),
(geom_types..., field_types...),
)
end
julia> df = DataFrame(road_layer)
eltype(df[:, 1])
Union{Missing, ArchGDAL.IGeometry{ArchGDAL.wkbLineString}, ArchGDAL.IGeometry{ArchGDAL.wkbMultiLineString}}
julia> @benchmark DataFrame($road_layer)
BenchmarkTools.Trial: 7 samples with 1 evaluation.
Range (min … max): 773.121 ms … 802.173 ms ┊ GC (min … max): 5.94% … 8.23%
Time (median): 787.602 ms ┊ GC (median): 7.57%
Time (mean ± σ): 788.342 ms ± 11.600 ms ┊ GC (mean ± σ): 7.83% ± 1.21%
█ █ █ █ █ █ █
█▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁█▁▁▁█ ▁
773 ms Histogram: frequency by time 802 ms <
Memory estimate: 270.52 MiB, allocs estimate: 3250798.
In the end, I don't see the benefit of implementing the Tables.schema
function.
I have checked the difference between building a Tables.Schema
for the shapefile road.shp stripped of features with wkbMultiLineString and missing geoms. It is not signifiant ( < σ )
julia> @benchmark DataFrame($layer)
BenchmarkTools.Trial: 9 samples with 1 evaluation.
Range (min … max): 572.249 ms … 600.157 ms ┊ GC (min … max): 5.75% … 8.35%
Time (median): 588.713 ms ┊ GC (median): 7.96%
Time (mean ± σ): 587.146 ms ± 8.671 ms ┊ GC (mean ± σ): 7.58% ± 1.00%
█ █ █ █ █ █ █ █ █
█▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁█▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁█▁▁█▁▁█▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
572 ms Histogram: frequency by time 600 ms <
Memory estimate: 236.63 MiB, allocs estimate: 3028884.
julia> function Tables.schema(layer::AG.AbstractFeatureLayer)::Union{Tables.Schema, Nothing}
return nothing
end
julia> @benchmark DataFrame($layer)
BenchmarkTools.Trial: 9 samples with 1 evaluation.
Range (min … max): 578.874 ms … 602.971 ms ┊ GC (min … max): 6.94% … 7.42%
Time (median): 593.821 ms ┊ GC (median): 7.95%
Time (mean ± σ): 590.695 ms ± 9.017 ms ┊ GC (mean ± σ): 7.70% ± 1.03%
█ █ █ █ █ █ ██ █
█▁▁▁▁▁█▁▁█▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁█▁▁▁▁▁▁██▁▁▁▁▁▁▁▁▁█ ▁
579 ms Histogram: frequency by time 603 ms <
Memory estimate: 236.00 MiB, allocs estimate: 3015267.
Thanks for the investigations @mathieu17g. I find it hard to really oversee this issue and its context. Let me try writing it out a bit.
The main related PR here is #160. Before this there was only Geometry
(like in GDAL) and this was reflected in Tables.schema
. In that PR we added an enum in the type parameter, Geometry{OGRwkbGeometryType}
, to be able to give more information for dispatch. I believe this type parameter is not currently used by ArchGDAL, but would be of interest to downstream packages or users.
Now this extra information seems to be leading to this issue, for shapefiles with mixed geometries. When using the Tables API to convert to a DataFrame, DataFrames will use Tables.schema
to prepare a Vector{ArchGDAL.IGeometry{ArchGDAL.wkbLineString}}
to hold all the geometries. This is a documented usecase of Tables.schema. The wkbLineString
here is reported by GDAL itself but does not seem fully correct as documented in the shapefile driver docs. So then when it loops over the features and encounters a wkbMultiLineString
and tries to put it in the geometry column it fails because the type does not match. In reality this should be no issue since they are all just Ptr{Cvoid}
anyway, through here and here.
If geometries have to really be converted, there is ArchGDAL.forceto, but that is not the case here.
Is Tables.Schema useful?
In this benchmark indeed it is not. I don't have an example where it does make a big difference. The Tables documentation mentions it is especially useful for constructing similar tables. In principle we have the information to provide a schema, unfortunately in the case of shapefile it just may be wrong. I think my favorite options are 1 (without promote_rule
), so give up on schema for shapefiles only. It is a bit unfortunate to have different behavior depending on the driver though, GDAL normally abstracts that. The other option is to consider reverting #160 altogether, though that would be more breaking, so we may want to avoid that.
I don't understand the value of defining promote_rule
to the union of the two types. Isn't it supposed to pick the more general type that can represent both? That would be MultiLineString in case of LineString and MultiLineString.
So then when it loops over the features and encounters a wkbMultiLineString and tries to put it in the geometry column it fails because the type does not match
It also fails for a feature with a missing geometry. It happens also for the "GeoJSON" driver. So the issue is not only about mixed LineString
and MultiLineString
or Polygon
and MultiPolygon
geometries allowed by the "ESRI Shapefile" driver with a featuredefn
type set to LineString
or MultiLineString
, the missing geometry case has to be handled.
julia> open("missing_geom.geojson", "w") do f
print(f,
"""{
"type": "FeatureCollection",
"name": "test_geoms",
"features": [
{ "type": "Feature", "properties": { "id": 1, "name": "line1" }, "geometry": { "type": "LineString", "coordinates": [ [ 1.0, 2.0 ], [ 2.0, 3.0 ], [ 3.0, 4.0 ] ] } },
{ "type": "Feature", "properties": { "id": 2, "name": "line2" }, "geometry": { "type": "LineString", "coordinates": [ [ 3.0, 4.0 ], [ 4.0, 5.0 ], [ 5.0, 6.0 ] ] } },
{ "type": "Feature", "properties": { "id": 3, "name": "emptygeom" }, "geometry": null }
]
} """)
end
print(read("missing_geom.geojson", String))
{
"type": "FeatureCollection",
"name": "test_geoms",
"features": [
{ "type": "Feature", "properties": { "id": 1, "name": "line1" }, "geometry": { "type": "LineString", "coordinates": [ [ 1.0, 2.0 ], [ 2.0, 3.0 ], [ 3.0, 4.0 ] ] } },
{ "type": "Feature", "properties": { "id": 2, "name": "line2" }, "geometry": { "type": "LineString", "coordinates": [ [ 3.0, 4.0 ], [ 4.0, 5.0 ], [ 5.0, 6.0 ] ] } },
{ "type": "Feature", "properties": { "id": 3, "name": "emptygeom" }, "geometry": null }
]
}
julia> ds = AG.read("missing_geom.geojson")
GDAL Dataset (Driver: GeoJSON/GeoJSON)
File(s):
missing_geom.geojson
Number of feature layers: 1
Layer 0: test_geoms (wkbLineString)
julia> layer = AG.getlayer(ds, 0)
Layer: test_geoms
Geometry 0 (): [wkbLineString], LINESTRING (1 2,2 3,3 4), ...
Field 0 (id): [OFTInteger], 1, 2, 3
Field 1 (name): [OFTString], line1, line2, emptygeom
julia> using DataFrames
julia> DataFrame(layer)
ERROR: MethodError: Cannot `convert` an object of type Missing to an object of type ArchGDAL.IGeometry{ArchGDAL.wkbLineString}
Closest candidates are:
convert(::Type{var"#s435"} where var"#s435"<:ArchGDAL.AbstractGeometry, ::GeoFormatTypes.GML) at /Users/Mathieu/.julia/packages/ArchGDAL/erkjx/src/convert.jl:48
convert(::Type{var"#s435"} where var"#s435"<:ArchGDAL.AbstractGeometry, ::GeoFormatTypes.GeoJSON) at /Users/Mathieu/.julia/packages/ArchGDAL/erkjx/src/convert.jl:45
convert(::Type{var"#s435"} where var"#s435"<:ArchGDAL.AbstractGeometry, ::GeoFormatTypes.WellKnownBinary) at /Users/Mathieu/.julia/packages/ArchGDAL/erkjx/src/convert.jl:42
...
Stacktrace:
[1] setindex!(A::Vector{ArchGDAL.IGeometry{ArchGDAL.wkbLineString}}, x::Missing, i1::Int64)
@ Base ./array.jl:839
[2] add!
@ ~/.julia/packages/Tables/E99hk/src/fallbacks.jl:127 [inlined]
[3] eachcolumns
@ ~/.julia/packages/Tables/E99hk/src/utils.jl:111 [inlined]
[4] buildcolumns(schema::Tables.Schema{(Symbol(""), :id, :name), Tuple{ArchGDAL.IGeometry{ArchGDAL.wkbLineString}, Int32, String}}, rowitr::ArchGDAL.IFeatureLayer)
@ Tables ~/.julia/packages/Tables/E99hk/src/fallbacks.jl:135
[5] columns
@ ~/.julia/packages/Tables/E99hk/src/fallbacks.jl:253 [inlined]
[6] DataFrame(x::ArchGDAL.IFeatureLayer; copycols::Nothing)
@ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/other/tables.jl:58
[7] DataFrame(x::ArchGDAL.IFeatureLayer)
@ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/other/tables.jl:49
[8] top-level scope
@ REPL[8]:1
In version 0.6.0 there were no missing
value in the table. Instead getgeom
in function nextnamedtuple
was returning a IGeometry()
when GDAL.ogr_f_getgeometryref(feature.ptr) == C_NULL
, which is of IGeometry
type
In version 0.7.2 function Tables.getcolumn
returns missing
in case of missing geometry when geom.ptr == C_NULL
missing
values are useful according to me, but they have to be handled on the Tables interface
By the way, missing
values could also be set instead of ""
in getdefault for fields whose fielddefn does not have any default value set
In reality this should be no issue since they are all just Ptr{Cvoid} anyway, through here and here.
If geometries have to really be converted, there is ArchGDAL.forceto, but that is not the case here.
Definitely :-)
Is Tables.Schema useful?
In this benchmark indeed it is not. I don't have an example where it does make a big difference. The Tables documentation mentions it is especially useful for constructing similar tables.
I have not found this in Tables.jl, but here in ArchGDAL Tables.columns(::ArchGDAL.IFeatureLayer)
and Tables.rows(::ArchGDAL.IFeatureLayer)
are not implemented.
Instead we rely on Tables.jl/src/fallback.jl implementations Tables.columns(x::T) where {T}
and rows(x::T) where {T}
.
A user of ArchGDAL Tables interface will most probably call one of those two generated functions which returns, besides columns or rows, a Tables.Schema
generated if Tables.schema(::ArchGDAL.IFeatureLayer)
returns nothing
.
It is the case in DataFrames.jl there is no call to Tables.schema(::ArchGDAL.IFeatureLayer)
before calling Tables.columns(::ArchGDAL.IFeatureLayer)
in DataFrame(x::T; copycols::Union{Nothing, Bool}=nothing) where {T}
. Maybe there are other users using Tables.schema
before calling Tables.columns
or Tables.rows
. I will ask Tables.jl developers if we want to be thorough (https://github.com/JuliaData/Tables.jl/issues/245#issue-958855882)
If we assume that Tables.schema(::ArchGDAL.IFeatureLayer)
is not used by any user of ArchGDAL Tables interface, the only optimization that it would bring is between buildcolumns(::Nothing, rowitr::T) where {T}
and buildcolumns(schema, rowitr::T) where {T}
in Tables.jl/src/fallback.jl.
This leads to implementing on ArchGDAL.jl side, at least to handle the missing geometry case, something faster than buildcolumns(::Nothing, rowitr::T) where {T}
-> _buildcolumns
-> eachcolumns
/ add_or_widen!
-> __buildcolumns
. I doubt that it could be done easily.
I don't understand the value of defining promote_rule to the union of the two types. Isn't it supposed to pick the more general type that can represent both? That would be MultiLineString in case of LineString and MultiLineString.
I agree, it looks like a dirty hack
Conclusion: unless Tables.jl developers point to any more concrete use of Tables.schema(::AG.IFeatureLayer)
returning a Tables.Schema
, I think it is best to return nothing
all the time
Edit: Well I got some feedback from @quinnj (owner of Tables.jl) https://github.com/JuliaData/Tables.jl/issues/245#issuecomment-891814745
It seems that it may not be a bad idea to return nothing
in Tables.schema(::AG.IFeatureLayer)
.
A Tables.Schema
could be built within ArchGDAL.jl Table interface which may suits geometry handling needs in tabular data format better than the 'Tables.Schema' built in Tables.jl/src/fallback.jl. But since you are still discussing the best way to represent geometry objects within ArchGDAL.jl, it may worth postponing for later.
It seems that it may not be a bad idea to return nothing in Tables.schema(::AG.IFeatureLayer).
Indeed sounds like it. I'd support that. Would be nice to switch to this since now the tables interface is broken for all the situations you indicated (not just shapefile). @evetion @yeesian any objections?
I'd support returning nothing
in Tables.schema(::AG.IFeatureLayer)
(i.e. option 1 of https://github.com/yeesian/ArchGDAL.jl/issues/223#issuecomment-888002541) too.
@yeesian note that the option 1 you linked to only returns nothing
for shapefiles, whereas now the suggestion is to always return nothing
, i.e. don't provide a schema. This because there is not only an issue with shapefiles, but also missing data for instance. GDAL doesn't really seem to support a fully accurate schema, therefore it is probably better to opt out and let the data speak for itself.
Ah thanks for clarifying -- yeah it makes sense to return nothing
for all the drivers since that's needed for not failing in the case of missing data.
OK, I'm working on a PR with a few test cases testing from dataset creation to DataFrame. The existing test sets are already OK.
What about my proposition in https://github.com/yeesian/ArchGDAL.jl/issues/223#issuecomment-891589533
By the way, missing values could also be set instead of "" in getdefault for fields whose fielddefn does not have any default value set
Shall I make a second PR ?
Great, thanks for working on the PR!
By the way, missing values could also be set instead of "" in getdefault for fields whose fielddefn does not have any default value set
Sorry, I missed this. To me it indeed sounds better to return missing
instead of the empty string here.
The GDAL docs say the return value is "default field value or NULL". I don't see getdefault
being used in the tables interface. Is it supposed to be used there?
The GDAL docs say the return value is "default field value or NULL". I don't see getdefault being used in the tables interface. Is it supposed to be used there?
Tables.jl/src/fallback.jl buildcolumns
function with schema = nothing
, will push missing
to the Table
EDIT: Furthermore, it appears to be necessary with the GML driver to handle the NULL geometry case in geomname(::AbstractGeometry)
and return missing
. I will add it to the PR
In OGR the reading of an ESRI shapefile layer containing one geometry field instantiated with a mix of
wkbLineString
andwkbMultiLineString
leads to ageomdefn
of typewkbLineString
instead ofwkbUnknown
.This due to OGR ESRI shapefile driver :
With this kind of layer, with ArchGDAL.jl version 0.7.2, in tables.jl the geomtype inserted in the
Tables.Schema
of the Tables interface is theDataType[ArchGDAL.IGeometry{ArchGDAL.wkbLineString}]
Therefore the layer's conversion to a DataFrame raises a conversion error when handling the
wkbMultiLineString
features of the layer. Below an example:It used to work with version 0.6.0. The geomtype inserted in the Tables.Schema of the Tables interface was just
ArchGDAL.IGeometry
I have tried with the GeoJSON driver and ArchGDAL.jl version 0.7.2, there is no issues since the driver does not enforce the geom type of the layerdefn to
ArchGDAL.wkbLineString
after feature creations, and keeps it atArchGDAL.wkbUnknown
Maybe, in function
Tables.schema
of tables.jl, the type of each feature instance could be checked and in case of mixed geometry types the geometry type inTables.Schema
set toDataType[ArchGDAL.IGeometry{ArchGDAL.wkbUnknown}]
or aUnion
of all the mixed geometry types ?