xaviernogueira / geoparquet-pydantic

Validate and convert GeoJSON<>GeoParquet...without GDAL!
MIT License

GeoParquet-Pydantic


A lightweight, pydantic-centric library for validating GeoParquet files (or PyArrow Tables) and converting between GeoJSON and GeoParquet...without GDAL!



Motivation: This project started at the 2024 San Francisco GeoParquet Community hackathon and arose from a simple observation: why must Python users install the massive GDAL dependency (typically via GeoPandas) just to do simple GeoJSON<>GeoParquet conversions?

Is this library the right choice for you?

Note: All user-exposed functions and schema classes are available at the top level of this library (e.g., geoparquet_pydantic.validate_geoparquet_table(...)).

Features

pydantic Schemas

For an explanation of these schemas, please refer to the GeoParquet repository.

Validation functions

Convenience functions that simply use GeoParquetMetadata to return a bool indicating whether the GeoParquet metadata obeys the schema.

Validate a pyarrow.Table's GeoParquet metadata:

def validate_geoparquet_table(
    table: pyarrow.Table,
    primary_column: Optional[str] = None,
) -> bool:
    """Validates the GeoParquet metadata of a pyarrow.Table.

    Args:
        table (pyarrow.Table): The table to validate.
        primary_column (Optional[str], optional): The name of the primary geometry column.
            Defaults to None.

    Returns:
        bool: True if the metadata is valid, False otherwise.
    """
    ...

Validate a Parquet file's GeoParquet metadata:

def validate_geoparquet_file(
    geoparquet_file: str | Path | pyarrow.parquet.ParquetFile,
    primary_column: Optional[str] = None,
    read_file_kwargs: Optional[dict] = None,
) -> bool:
    """Validates that a parquet file has correct GeoParquet metadata without opening it.

    Args:
        geoparquet_file (str | Path | ParquetFile): The file to validate.
        primary_column (str, optional): The primary column name. Defaults to 'geometry'.
        read_file_kwargs (dict, optional): Kwargs to be passed into pyarrow.parquet.ParquetFile().
            See: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetFile.html#pyarrow-parquet-parquetfile

    Returns:
        bool: True if the metadata is valid, False otherwise.
    """
    ...
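
To make the idea behind these checks concrete, here is a minimal standard-library sketch. It is not this library's implementation (which validates against the pydantic GeoParquetMetadata schema); the helper name geo_metadata_is_valid and the sample metadata are hypothetical. It assumes the GeoParquet 1.0 layout: a JSON document stored under the "geo" key of the Parquet key-value metadata (which pyarrow exposes with bytes keys and values) that must contain "version", "primary_column", and "columns", with each column declaring "encoding" and "geometry_types".

```python
import json
from typing import Optional

# Required keys per the GeoParquet 1.0 specification.
REQUIRED_TOP_LEVEL = ("version", "primary_column", "columns")
REQUIRED_PER_COLUMN = ("encoding", "geometry_types")


def geo_metadata_is_valid(
    kv_metadata: Optional[dict],
    primary_column: Optional[str] = None,
) -> bool:
    """Hypothetical helper: True if the b"geo" entry has the required keys."""
    if not kv_metadata or b"geo" not in kv_metadata:
        return False
    try:
        geo = json.loads(kv_metadata[b"geo"])
    except (ValueError, TypeError):
        # Not decodable as JSON at all.
        return False
    if not isinstance(geo, dict) or any(k not in geo for k in REQUIRED_TOP_LEVEL):
        return False
    columns = geo["columns"]
    # The declared primary column must itself be described under "columns".
    if not isinstance(columns, dict) or geo["primary_column"] not in columns:
        return False
    # Optionally require a specific primary column name.
    if primary_column is not None and geo["primary_column"] != primary_column:
        return False
    return all(
        isinstance(col, dict) and all(k in col for k in REQUIRED_PER_COLUMN)
        for col in columns.values()
    )


# Sample key-value metadata, shaped like a pyarrow schema's metadata mapping.
sample = {
    b"geo": json.dumps(
        {
            "version": "1.0.0",
            "primary_column": "geometry",
            "columns": {
                "geometry": {"encoding": "WKB", "geometry_types": ["Point"]}
            },
        }
    ).encode("utf-8")
}

print(geo_metadata_is_valid(sample))            # True
print(geo_metadata_is_valid({b"geo": b"{}"}))   # False
```

A schema-based validator checks considerably more than this (value types, allowed encodings, optional fields such as crs or bbox), so treat the sketch only as an illustration of the bool-returning pattern.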

Conversion functions

Convert from a geojson_pydantic.FeatureCollection to a GeoParquet pyarrow.Table:

def geojson_to_geoparquet(
    geojson: FeatureCollection | Path,
    primary_column: Optional[str] = None,
    column_schema: Optional[pyarrow.Schema] = None,
    add_none_values: Optional[bool] = False,
    geo_metadata: GeoParquetMetadata | dict | None = None,
    **kwargs,
) -> pyarrow.Table:
    """Converts a GeoJSON Pydantic FeatureCollection to an Arrow table with geoparquet
    metadata.

    To save to a file, simply use pyarrow.parquet.write_table() on the returned table.

    Args:
        geojson (FeatureCollection | Path): The GeoJSON Pydantic FeatureCollection, or a path to a GeoJSON file.
        primary_column (str, optional): The name of the primary column. Defaults to None.
        column_schema (pyarrow.Schema, optional): The Arrow schema for the table. Defaults to None.
        add_none_values (bool, default=False): Whether to fill missing column values
            specified in param:column_schema with 'None' (converts to pyarrow.null()).
        geo_metadata (GeoParquetMetadata | dict | None, optional): The GeoParquet metadata.
        **kwargs: Additional keyword arguments for the Arrow table writer.

    Returns:
        The Arrow table with GeoParquet metadata.
    """
    ...
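
Conceptually, this direction turns each GeoJSON Feature into a row: properties become columns and the geometry is encoded into the primary column. The sketch below illustrates that reshaping using only the standard library, and is not the library's implementation. It assumes WKB encoding, handles 2D Point geometries only, and returns a plain dict of lists rather than a pyarrow.Table. The names point_to_wkb and features_to_columns are hypothetical. Missing properties are filled with None, loosely mirroring what add_none_values does for columns named in column_schema.

```python
import struct


def point_to_wkb(x: float, y: float) -> bytes:
    """Encode a 2D point as little-endian WKB (21 bytes).

    Layout: 1 byte for byte order (1 = little endian), a uint32 geometry
    type (1 = Point), then x and y as float64.
    """
    return struct.pack("<BIdd", 1, 1, x, y)


def features_to_columns(feature_collection: dict, primary_column: str = "geometry") -> dict:
    """Hypothetical helper: reshape GeoJSON features into column lists."""
    features = feature_collection["features"]

    # Collect property names in first-seen order across all features.
    names = []
    for feature in features:
        for key in feature.get("properties") or {}:
            if key not in names:
                names.append(key)
    if primary_column in names:
        raise ValueError("a property name clashes with the primary column")

    columns = {primary_column: []}
    columns.update({name: [] for name in names})

    for feature in features:
        geometry = feature["geometry"]
        if geometry["type"] != "Point":
            raise ValueError("this sketch only handles Point geometries")
        x, y = geometry["coordinates"][:2]
        columns[primary_column].append(point_to_wkb(x, y))
        properties = feature.get("properties") or {}
        for name in names:
            # Features lacking a property get None in that column.
            columns[name].append(properties.get(name))
    return columns


fc = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [1.0, 2.0]},
            "properties": {"name": "a"},
        },
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [3.0, 4.0]},
            "properties": {},
        },
    ],
}

cols = features_to_columns(fc)
print(cols["name"])              # ['a', None]
print(len(cols["geometry"][0]))  # 21
```

A real conversion also has to cover every other geometry type and attach the "geo" metadata to the resulting table, which this sketch leaves out.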

Convert from a GeoParquet pyarrow.Table or file to a geojson_pydantic.FeatureCollection:

def geoparquet_to_geojson(
    geoparquet: pyarrow.Table | str | Path,
    primary_column: Optional[str] = None,
    max_chunksize: Optional[int] = None,
    max_workers: Optional[int] = None,
) -> FeatureCollection:
    """Converts an Arrow table with GeoParquet metadata to a GeoJSON Pydantic
    FeatureCollection.

    Args:
        geoparquet (pyarrow.Table | str | Path): Either a pyarrow.Table or a path to a Parquet file with GeoParquet metadata.
        primary_column (str, optional): The name of the primary column. Defaults to 'geometry'.
        max_chunksize (int, optional): The maximum chunksize to read from the parquet file. Defaults to 1000.
        max_workers (int, optional): The maximum number of workers to use for parallel processing.
            Defaults to 0 (runs sequentially). Use -1 for all available cores.

    Returns:
        FeatureCollection: The GeoJSON Pydantic FeatureCollection.
    """
    ...
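
The reverse direction decodes each row's geometry and gathers the remaining columns back into Feature properties, working through the rows in bounded chunks. The following standard-library sketch illustrates that idea under stated assumptions: WKB-encoded 2D Points only, a plain dict of column lists as input, and sequential processing (no worker pool). The names wkb_to_point, iter_chunks, and columns_to_feature_collection are hypothetical and are not part of this library.

```python
import struct
from typing import Iterator


def wkb_to_point(wkb: bytes) -> dict:
    """Decode a 2D WKB point into a GeoJSON geometry dict."""
    # First byte is the byte-order flag: 1 = little endian, 0 = big endian.
    endian = "<" if wkb[0] == 1 else ">"
    geom_type, x, y = struct.unpack(endian + "Idd", wkb[1:21])
    if geom_type != 1:
        raise ValueError("this sketch only handles 2D Point geometries")
    return {"type": "Point", "coordinates": [x, y]}


def iter_chunks(n_rows: int, max_chunksize: int) -> Iterator[range]:
    """Yield row-index ranges of at most max_chunksize rows each."""
    for start in range(0, n_rows, max_chunksize):
        yield range(start, min(start + max_chunksize, n_rows))


def columns_to_feature_collection(
    columns: dict,
    primary_column: str = "geometry",
    max_chunksize: int = 1000,
) -> dict:
    """Hypothetical helper: rebuild a GeoJSON FeatureCollection dict."""
    n_rows = len(columns[primary_column])
    property_names = [name for name in columns if name != primary_column]
    features = []
    for chunk in iter_chunks(n_rows, max_chunksize):
        for i in chunk:
            features.append(
                {
                    "type": "Feature",
                    "geometry": wkb_to_point(columns[primary_column][i]),
                    "properties": {name: columns[name][i] for name in property_names},
                }
            )
    return {"type": "FeatureCollection", "features": features}


columns = {
    "geometry": [
        struct.pack("<BIdd", 1, 1, 1.5, 2.5),
        struct.pack(">BIdd", 0, 1, 3.0, 4.0),  # big-endian variant
    ],
    "name": ["a", "b"],
}

result = columns_to_feature_collection(columns, max_chunksize=1)
print(result["features"][0]["geometry"])  # {'type': 'Point', 'coordinates': [1.5, 2.5]}
```

Because each chunk is independent, chunks are a natural unit to hand to a worker pool, which is presumably the role max_workers plays in the function above.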

Getting Started

Install from PyPI:

pip install geoparquet-pydantic

Or from source:

$ git clone https://github.com/xaviernogueira/geoparquet-pydantic.git
$ cd geoparquet-pydantic
$ pip install .

Then import with an underscore:

import geoparquet_pydantic

Or just import the functions/classes you need from the top-level:

from geoparquet_pydantic import (
  GeometryColumnMetadata,
  GeoParquetMetadata,
  validate_geoparquet_table,
  validate_geoparquet_file,
  geojson_to_geoparquet,
  geoparquet_to_geojson,
)

Roadmap

Contribute

We encourage contributions, feature requests, and bug reports!

Here is our recommended workflow:

Happy coding!