worldbank / DECAT_Space2Stats

https://worldbank.github.io/DECAT_Space2Stats/

Feature/update table #86

Open zacharyDez opened 1 week ago

What I Changed

  1. Added Update Table Logic: Implemented a workflow that updates an existing PostgreSQL table with new data from a Parquet file. The workflow:

    • Creates a temporary table from the Parquet file data.
    • Uses PostgreSQL’s ALTER TABLE to add any columns that are not already present in the main table.
    • Runs an UPDATE that synchronizes columns from the temporary table into the main table, matching rows on the hex_id column.

     Because only the new columns and rows from the Parquet file are transferred, network overhead stays low.
  2. Error Handling for Column Addition: Added logic that reverts the newly added columns in the main table if the update fails, which keeps the data consistent and prevents unintended schema changes.

  3. Column Verification: Introduced checks in verify_columns to ensure the hex_id column exists in the incoming Parquet file, as it is essential for matching records in the update operation.
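
The three changes above can be sketched end to end. This is a minimal illustration, not the code in this PR: it uses Python's built-in sqlite3 as a stand-in for PostgreSQL so it runs without a server, fills the temporary table by hand instead of from a Parquet file, and reverts the schema through a transaction rollback (the PR does not say how its revert is implemented). The names update_from_temp, table_columns, main_table, incoming, existing_stat, and new_stat are hypothetical.

```python
import sqlite3


def table_columns(conn, table):
    """Map column name -> declared type for a table (SQLite-specific PRAGMA)."""
    return {row[1]: row[2] for row in conn.execute(f'PRAGMA table_info("{table}")')}


def update_from_temp(conn, main_table, temp_table, key="hex_id"):
    """Add columns that exist only in temp_table, then fill them by matching on key."""
    temp_cols = table_columns(conn, temp_table)
    if key not in temp_cols:
        # Without the key there is no way to match incoming rows to existing ones.
        raise ValueError(f"incoming data has no {key!r} column")

    main_cols = table_columns(conn, main_table)
    new_cols = {name: typ for name, typ in temp_cols.items() if name not in main_cols}

    conn.execute("BEGIN")
    try:
        for name, col_type in new_cols.items():
            conn.execute(f'ALTER TABLE "{main_table}" ADD COLUMN "{name}" {col_type}')
        for name in new_cols:
            # Correlated subquery: copy the value from the temp row with the same key.
            conn.execute(
                f'UPDATE "{main_table}" SET "{name}" = ('
                f'SELECT t."{name}" FROM "{temp_table}" AS t '
                f'WHERE t."{key}" = "{main_table}"."{key}")'
            )
        conn.execute("COMMIT")
    except Exception:
        # Undo the ALTER TABLE statements too, so the schema is left unchanged.
        conn.execute("ROLLBACK")
        raise
    return sorted(new_cols)


# isolation_level=None lets the code issue BEGIN/COMMIT/ROLLBACK itself.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE main_table (hex_id TEXT PRIMARY KEY, existing_stat REAL)")
conn.executemany("INSERT INTO main_table VALUES (?, ?)", [("a", 1.0), ("b", 2.0)])
conn.execute("CREATE TEMP TABLE incoming (hex_id TEXT, new_stat REAL)")
conn.executemany("INSERT INTO incoming VALUES (?, ?)", [("a", 10.0), ("b", 20.0)])

added = update_from_temp(conn, "main_table", "incoming")
rows = conn.execute("SELECT * FROM main_table ORDER BY hex_id").fetchall()
print(added, rows)
```

On PostgreSQL the per-column subquery would more commonly be written as a single UPDATE ... FROM joined on hex_id; the subquery form is used here only because it works on every SQLite version.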

How to Test It

  1. Run Unit Tests: The test suite now includes unit tests in test_ingest.py that cover:

    • Basic ingestion of data when the table does not exist.
    • Update operations with new columns.
    • Behavior when columns already exist in the base table.
    • Enforcement of the mandatory hex_id column.
    • Rollback behavior if the update fails mid-operation.
  2. Manual Verification: The following steps test the update process manually by ingesting two different datasets into the database:

    • Spin up the database with Docker:
      docker-compose up
    • Download the initial dataset:
      aws s3 cp s3://wbg-geography01/Space2Stats/parquet/GLOBAL/space2stats.parquet .
      download: s3://wbg-geography01/Space2Stats/parquet/GLOBAL/space2stats.parquet to ./space2stats.parquet
    • Upload the initial dataset:
      space2stats-ingest <connection_string> ./space2stats_ingest/METADATA/stac/space2stats/space2stats_population_2020/space2stats_population_2020.json space2stats.parquet
    • Generate the second dataset:
      python space2stats_ingest/METADATA/generate_test_data.py
    • Upload the second dataset:
      space2stats-ingest <connection_string> ./space2stats_ingest/METADATA/stac/space2stats/space2stats_population_2020/space2stats_reupload_test.json space2stats_test.parquet
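
After the second upload, one way to confirm that the update landed is to compare the columns of the second Parquet file with the columns now present in the table. The helper below is a hypothetical sketch and is not part of the repository. The column names are made up, and the query in the comment assumes PostgreSQL's information_schema.columns view.

```python
def missing_columns(parquet_columns, table_columns):
    """Return Parquet columns that are absent from the table, sorted by name."""
    return sorted(set(parquet_columns) - set(table_columns))


# Hypothetical column names, for illustration only.
# In practice the table's columns could come from a query such as:
#   SELECT column_name FROM information_schema.columns WHERE table_name = '<table>';
columns_in_table = ["hex_id", "existing_stat", "new_stat"]
columns_in_parquet = ["hex_id", "new_stat"]

# An empty list means every column from the second dataset reached the table.
print(missing_columns(columns_in_parquet, columns_in_table))
```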

Other Notes