Added Update Table Logic:
Implemented a workflow that enables updating an existing PostgreSQL table with new data from a Parquet file. The workflow involves:
Creating a temporary table from the Parquet file data.
Using PostgreSQL’s ALTER TABLE to add any new columns that aren’t already present in the main table.
Performing an UPDATE operation that synchronizes columns between the temporary and main tables based on a matching hex_id column.
This process minimizes network overhead by only transferring new columns and rows from the Parquet file, improving efficiency.
Error Handling for Column Addition:
Incorporated logic to revert new columns in the main table if the update process fails, ensuring data consistency and preventing unintended schema changes.
Column Verification:
Introduced checks in verify_columns to ensure the hex_id column exists in the incoming Parquet file, as it is essential for matching records in the update operation.
How to Test It
Run Unit Tests:
The test suite now includes unit tests in test_ingest.py to cover:
Basic ingestion of data when the table does not exist.
Update operations with new columns.
Behavior when columns already exist in the base table.
Ensuring that the hex_id column is mandatory.
Rollback behavior if the update fails mid-operation.
Manual Verification:
The following steps describe how to manually test the update process by ingesting two different datasets into the database:
Spin up database with docker:
docker-compose up
Download the initial dataset:
aws s3 cp s3://wbg-geography01/Space2Stats/parquet/GLOBAL/space2stats.parquet .
download: s3://wbg-geography01/Space2Stats/parquet/GLOBAL/space2stats.parquet to ./space2stats.parquet
Database-Specific Update Tuning: The performance of this update process is highly dependent on the database configuration and environment. Different configurations across machines and network setups can significantly impact ingestion and update performance.
Remote Development Database: We set up a remote development database to limit the impacts of database tuning and streamline testing. This setup would simplify development by allowing everyone to work with the same database configuration, reducing the need for local database setup and tuning.
What I Changed
Added Update Table Logic: Implemented a workflow that enables updating an existing PostgreSQL table with new data from a Parquet file. The workflow involves:
Error Handling for Column Addition: Incorporated logic to revert new columns in the main table if the update process fails, ensuring data consistency and preventing unintended schema changes.
Column Verification:
Introduced checks in verify_columns to ensure the hex_id column exists in the incoming Parquet file, as it is essential for matching records in the update operation.
How to Test It
Run Unit Tests:
Manual Verification:
Other Notes