yeesian / ArchGDAL.jl

A high level API for GDAL - Geospatial Data Abstraction Library
https://yeesian.github.io/ArchGDAL.jl/stable/
Other
141 stars 27 forks source link

[Breaking] Return `missing` if the field is set but null. #238

Closed yeesian closed 2 years ago

yeesian commented 3 years ago

Fixes #177 .

I think this aligns the behavior a lot better across the different file formats (see the test sets in test/test_tables.jl)

This behavior is implemented through getfield(feature, i) which makes use of getfieldtype(feature, i). That way, we are aligned in behavior for field subtypes even for e.g. displaying the fields.

Because a lot hinges on distinguishing between whether a field is set versus whether it is null (see https://github.com/Toblerity/Fiona/issues/460#issue-240013079), I have also added support for isfieldnull(), isfieldsetandnotnull(), and setfieldnull!().

visr commented 3 years ago

Nice work here! I wasn't aware of unset vs null in GDAL before. I believe null is much more common in practice?

I made this GeoJSONSeq file with both unset and null values:

{ "type": "Feature", "properties": { "a": 1, "b": null }, "geometry": { "type": "Point", "coordinates": [ 1.0, 1.0 ] } }
{ "type": "Feature", "properties": {}, "geometry": { "type": "Point", "coordinates": [ 2.0, 2.0 ] } } 

If I run this through ogr with ogr2ogr -f GeoJSONSeq unset-gdal-translate.geojson unset.geojson, the result is the same, e.g. unset and null are both kept. If I import the file in QGIS however, it shows this:

image

And indeed if I export it from QGIS I get nulls back in field that were unset before:

{ "type": "Feature", "properties": { "a": 1, "b": null }, "geometry": { "type": "Point", "coordinates": [ 1.0, 1.0 ] } }
{ "type": "Feature", "properties": { "a": null, "b": null }, "geometry": { "type": "Point", "coordinates": [ 2.0, 2.0 ] } } 

Probably it makes more sense to keep the distinction here just like ogr2ogr does, this being a GDAL wrapper. Though I do understand @evetion's point about making everything missing possibly being more user friendly.

evetion commented 3 years ago

@visr This is all about displaying a non-tabular (sparse) dataset like your geojson in a tabular interface. In this case, QGIS doesn't have a representation of unset, just NULL. Julia can be more expressive, although I wonder whether our Tables interface likes the mixing of missing and nothing.

visr commented 3 years ago

Yeah exactly. I can imagine it being less ideal in the Tables interface. Though these functions are also used outside of that, in the sparse setting, where it does make sense to keep the distinction.

I think the interface can technically handle both just fine, it is just not very conventional. Though I don't know how common unset is in practice, and it's relatively easy to convert nothing to missing, if it causes problems.

mathieu17g commented 3 years ago

@evetion and @visr, in the draft PR #243 , I'm about to try to handle it for a conversion from a table source to an IFeatureLayer.
I was just waiting for this PR #238 to be merged.

In the other way, since we rely on Tables.jl/src/fallbacks.jl buildcolumns(::Nothing, rowitr::T) where {T} which iterates through all features building column types by "widening" them with promote_type each time it encounters a new type in field values, it should work as follow if we have an FieldDefn of OGRFieldType OFTInteger64 and features with UNSET or NULL fields for this FieldDefn:

julia> T = Int64; @show T
       T = promote_type(T, typeof(nothing)); @show T
       T = promote_type(T, typeof(missing)); @show T
T = Int64
T = Union{Nothing, Int64}
T = Union{Missing, Nothing, Int64}
visr commented 3 years ago

@mathieu17g thanks for showing that indeed technically it works (#243 is a great addition by the way).

I think for the tables interface specifically, the concerns about nothing are:

Hence for tables I'm less enthousiastic about nothing than for just getfield. Though I'm not sure what's best either. If we go with nothing at worst we force people to explicitly convert to missing, which is not too bad of a price to pay to make the data perfectly round trippable and consistent with GDAL itself. Though if you know a way to optionally do the nothing to missing conversion during table construction instead of afterwards, potentially avoiding an expensive copy, it would be interesting as well.

yeesian commented 3 years ago

@evetion @visr is this PR good to go from your perspective? I'm okay with iterating further on feedback or for you to have more time to review if you'd like -- else I'll merge it to keep things moving for Mathieu per https://github.com/yeesian/ArchGDAL.jl/pull/238#issuecomment-927720530

mathieu17g commented 3 years ago

Hence for tables I'm less enthousiastic about nothing than for just getfield. Though I'm not sure what's best either. If we go with nothing at worst we force people to explicitly convert to missing, which is not too bad of a price to pay to make the data perfectly round trippable and consistent with GDAL itself.

Yes, and if we need to convert all nothing values to missing later, it won't be difficult

Though if you know a way to optionally do the nothing to missing conversion during table construction instead of afterwards, potentially avoiding an expensive copy, it would be interesting as well.

This should be easily done in getcolumn function which is called for each field and geometry value by Tables.jl/src/fallback.jl Tables.columns function For the Tables.rows, it can be done by implementing a slightly modified version of Tables.rows fallback function

function Tables.rows(x::AbstractFeatureLayer)
    cols = Tables.columns(x)
    return Tables.RowIterator(cols, Int(Tables.rowcount(cols)))
end

potentially avoiding an expensive copy

I didn't get what you have in mind. Tables.columns fallback constructor always make a copy.
I had not considered the option of not making a copy. Maybe it would allow to modify a dataset from a table (e.g. DataFrame). Brilliant idea! It would probably require to convert field values in the table to something like Field{OGRFieldType, OGRFieldSubtype} and fully map GDAL objects interdependence in ArchGDAL composite types

visr commented 3 years ago

@mathieu17g we can discuss copies and such in a different issue. I still wanted to create one, because I saw that converting a layer to DataFrame for a large vector file took minutes.

yeesian commented 2 years ago

@mathieu17g we can discuss copies and such in a different issue. I still wanted to create one, because I saw that converting a layer to DataFrame for a large vector file took minutes.

I'm interested in this too! Feel free to open an issue for it or have the discussion in #243

yeesian commented 2 years ago

I'll be merging this PR after #245 is merged if I don't see any further objections by then.

yeesian commented 2 years ago

Seems like #245 is turning out to be more complicated than expected, and we're still figuring it out in https://github.com/JuliaGeo/GDAL.jl/pull/124.

In the meantime, there are no remaining issues to resolve here, so I'll merge this PR to unblock #243.