A lightweight, pydantic-centric library for validating GeoParquet files (or PyArrow Tables) and converting between GeoJSON and GeoParquet...without GDAL!
Motivation: This project started at the 2024 San Francisco GeoParquet Community hackathon and arose out of a simple observation: why must Python users install the massive GDAL dependency (typically via GeoPandas) to do simple GeoJSON<>GeoParquet conversions?
Is this library the right choice for you?:
gpq, which is written in Go and substantially faster.
Note: All user-exposed functions and schema classes are available at the top level (i.e., geoparquet_pydantic.validate_geoparquet_table(...)) of this library.
pydantic Schemas
GeometryColumnMetadata: A pydantic model that validates the metadata of a geometry column (aka primary_column). This is nested within the following schema.
GeoParquetMetadata: A pydantic model for the metadata assigned to the "geo" key in a pyarrow.Table, which allows the table to be read by GeoParquet readers once saved.
For an explanation of these schemas, please reference the geoparquet repository.
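To make the shape of this metadata concrete, here is a minimal standard-library sketch of a "geo" metadata document and a rough structural check. The required keys shown (version, primary_column, columns, and per-column encoding and geometry_types) follow the GeoParquet specification. The helper name `looks_like_geo_metadata` is hypothetical and is not part of this library; the pydantic models above are what actually perform validation.

```python
import json

# Example "geo" metadata, serialized as JSON the way it is stored in Parquet
# key-value metadata. Field names follow the GeoParquet specification.
geo_json = json.dumps(
    {
        "version": "1.0.0",
        "primary_column": "geometry",
        "columns": {
            "geometry": {"encoding": "WKB", "geometry_types": ["Point"]},
        },
    }
)


def looks_like_geo_metadata(raw: str) -> bool:
    """Hypothetical helper: a rough structural check, not the library's validator."""
    try:
        meta = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(meta, dict):
        return False
    # Top-level keys required by the specification.
    if not {"version", "primary_column", "columns"} <= meta.keys():
        return False
    columns = meta["columns"]
    primary = meta["primary_column"]
    if not isinstance(columns, dict) or not isinstance(primary, str):
        return False
    # The primary column must be described under "columns".
    if primary not in columns:
        return False
    # Every geometry column needs an encoding and a list of geometry types.
    return all(
        isinstance(col, dict) and {"encoding", "geometry_types"} <= col.keys()
        for col in columns.values()
    )
```

A pydantic model gives the same kind of guarantee declaratively, with typed fields and clearer error messages than a hand-written check like this one.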
Convenience functions that simply use GeoParquetMetadata to return a bool depending on whether the GeoParquet metadata obeys the schema.
Validate a pyarrow.Table's GeoParquet metadata:
def validate_geoparquet_table(
    table: pyarrow.Table,
    primary_column: Optional[str] = None,
) -> bool:
    """Validates the GeoParquet metadata of a pyarrow.Table.

    Args:
        table (pyarrow.Table): The table to validate.
        primary_column (Optional[str], optional): The name of the primary geometry column.
            Defaults to None.

    Returns:
        bool: True if the metadata is valid, False otherwise.
    """
    ...
def validate_geoparquet_file(
    geoparquet_file: str | Path | pyarrow.parquet.ParquetFile,
    primary_column: Optional[str] = None,
    read_file_kwargs: Optional[dict] = None,
) -> bool:
    """Validates that a parquet file has correct GeoParquet metadata without opening it.

    Args:
        geoparquet_file (str | Path | ParquetFile): The file to validate.
        primary_column (str, optional): The primary column name. Defaults to 'geometry'.
        read_file_kwargs (dict, optional): Kwargs to be passed into pyarrow.parquet.ParquetFile().
            See: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetFile.html#pyarrow-parquet-parquetfile

    Returns:
        bool: True if the metadata is valid, False otherwise.
    """
    ...
Convert a geojson_pydantic.FeatureCollection to a GeoParquet pyarrow.Table:
def geojson_to_geoparquet(
    geojson: FeatureCollection | Path,
    primary_column: Optional[str] = None,
    column_schema: Optional[pyarrow.Schema] = None,
    add_none_values: Optional[bool] = False,
    geo_metadata: GeoParquetMetadata | dict | None = None,
    **kwargs,
) -> pyarrow.Table:
    """Converts a GeoJSON Pydantic FeatureCollection to an Arrow table with GeoParquet
    metadata.

    To save to a file, simply use pyarrow.parquet.write_table() on the returned table.

    Args:
        geojson (FeatureCollection | Path): The GeoJSON Pydantic FeatureCollection.
        primary_column (str, optional): The name of the primary column. Defaults to None.
        column_schema (pyarrow.Schema, optional): The Arrow schema for the table. Defaults to None.
        add_none_values (bool, default=False): Whether to fill missing column values
            specified in param:column_schema with 'None' (converts to pyarrow.null()).
        geo_metadata (GeoParquetMetadata | dict | None, optional): The GeoParquet metadata.
        **kwargs: Additional keyword arguments for the Arrow table writer.

    Returns:
        The Arrow table with GeoParquet metadata.
    """
    ...
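The idea behind add_none_values can be illustrated with a small standard-library sketch that pivots GeoJSON-like feature dicts into column lists. This is only an illustration under stated assumptions: the helper `features_to_columns` is hypothetical, raising on a missing value when filling is disabled is an assumed behavior rather than something this README documents, and the geometry is left as a plain dict where a real GeoParquet writer would encode it.

```python
from typing import Any


def features_to_columns(
    features: list[dict[str, Any]],
    column_names: list[str],
    add_none_values: bool = False,
) -> dict[str, list[Any]]:
    """Hypothetical helper: pivot GeoJSON-like features into column lists."""
    columns: dict[str, list[Any]] = {name: [] for name in column_names}
    columns["geometry"] = []
    for feature in features:
        properties = feature.get("properties") or {}
        for name in column_names:
            if name not in properties and not add_none_values:
                # Assumption: a missing value is an error unless filling is enabled.
                raise KeyError(f"feature is missing property {name!r}")
            # dict.get returns None for a missing key, which fills the gap.
            columns[name].append(properties.get(name))
        # Kept as a plain dict here; a real GeoParquet writer would encode it.
        columns["geometry"].append(feature["geometry"])
    return columns


features = [
    {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [0.0, 1.0]},
        "properties": {"name": "a", "height": 3},
    },
    {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [2.0, 3.0]},
        "properties": {"name": "b"},  # no "height" property
    },
]

table_like = features_to_columns(features, ["name", "height"], add_none_values=True)
```

With filling enabled, the second feature contributes None to the "height" column, so every column ends up with one entry per feature.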
Convert a GeoParquet pyarrow.Table or file to a geojson_pydantic.FeatureCollection:
def geoparquet_to_geojson(
    geoparquet: pyarrow.Table | str | Path,
    primary_column: Optional[str] = None,
    max_chunksize: Optional[int] = None,
    max_workers: Optional[int] = None,
) -> FeatureCollection:
    """Converts an Arrow table with GeoParquet metadata to a GeoJSON Pydantic
    FeatureCollection.

    Args:
        geoparquet (pyarrow.Table | str | Path): Either a pyarrow.Table or the path to a
            parquet file with GeoParquet metadata.
        primary_column (str, optional): The name of the primary column. Defaults to 'geometry'.
        max_chunksize (int, optional): The maximum chunksize to read from the parquet file. Defaults to 1000.
        max_workers (int, optional): The maximum number of workers to use for parallel processing.
            Defaults to 0 (runs sequentially). Use -1 for all available cores.

    Returns:
        FeatureCollection: The GeoJSON Pydantic FeatureCollection.
    """
    ...
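The interplay of max_chunksize and max_workers described in the docstring can be sketched with the standard library alone. This is not the library's implementation: the function names are hypothetical, rows are plain dicts rather than Arrow record batches, and a thread pool is swapped in as one possible way to process chunks in parallel.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Any


def row_to_feature(row: dict[str, Any]) -> dict[str, Any]:
    """Turn one row (a dict with a 'geometry' entry) into a GeoJSON-like Feature."""
    properties = {key: value for key, value in row.items() if key != "geometry"}
    return {"type": "Feature", "geometry": row["geometry"], "properties": properties}


def convert_chunk(chunk: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Convert every row of one chunk."""
    return [row_to_feature(row) for row in chunk]


def rows_to_feature_collection(
    rows: list[dict[str, Any]],
    max_chunksize: int = 1000,
    max_workers: int = 0,
) -> dict[str, Any]:
    """Hypothetical helper mirroring the chunk/worker options described above."""
    # Split the rows into chunks of at most max_chunksize.
    chunks = [rows[i : i + max_chunksize] for i in range(0, len(rows), max_chunksize)]
    if max_workers == 0:
        # Sequential path.
        converted = [convert_chunk(chunk) for chunk in chunks]
    else:
        # -1 means "use all available cores".
        workers = (os.cpu_count() or 1) if max_workers == -1 else max_workers
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # Executor.map yields results in input order, so row order is preserved.
            converted = list(pool.map(convert_chunk, chunks))
    features = [feature for chunk in converted for feature in chunk]
    return {"type": "FeatureCollection", "features": features}
```

Because results are collected in chunk order, the sequential and parallel paths should produce identical output; only the wall-clock time differs.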
Install from PyPI:
pip install geoparquet-pydantic
Or from source:
$ git clone https://github.com/xaviernogueira/geoparquet-pydantic.git
$ cd geoparquet-pydantic
$ pip install .
Then import with an underscore:
import geoparquet_pydantic
Or just import the functions/classes you need from the top-level:
from geoparquet_pydantic import (
    GeometryColumnMetadata,
    GeoParquetMetadata,
    validate_geoparquet_table,
    validate_geoparquet_file,
    geojson_to_geoparquet,
    geoparquet_to_geojson,
)
click.
geoparquet_pydantic.geoparquet_to_geojson().
We encourage contributions, feature requests, and bug reports!
Here is our recommended workflow:
Use dev-requirements.txt to install our development dependencies.
Use pyright as a linter.
Run pre-commit run --all-files before committing your work.
Happy coding!