worldbank / DECAT_Space2Stats

https://worldbank.github.io/DECAT_Space2Stats/
Other
1 stars 2 forks source link

Ingest via ADBC conn #67

Closed alukach closed 1 month ago

alukach commented 1 month ago

This PR makes use of Arrow Database Connectivity to ingest data into the database.

> space2stats-ingest load postgresql://username:password@localhost:5439/postgis --parquet-file ./local.parquet

Additionally, we can also load data directly from S3. This is somewhat slower than loading from the local filesystem.

> space2stats-ingest load postgresql://username:password@localhost:5439/postgis --parquet-file s3://wbg-geography01/Space2Stats/parquet/GLOBAL/combined_population.parquet

Before these changes, I was seeing the following ingestion times:

alukach commented 1 month ago

Looks like tests are failing, will get to that tomorrow.

kylebarron commented 1 month ago

Before

  • load: 03:05:00

After

  • load: 00:06:04

That seems like a big improvement!

zacharyDez commented 1 month ago

@kylebarron, huge improvement! I quickly researched what's causing it to be so slow with a Pandas df. Making a more digestible write-up of why Pyarrow is so much faster might be interesting.

kylebarron commented 1 month ago

This is why arrow exists! It's going to be a lot faster than pandas. I think @bitner might like this result too

bitner commented 1 month ago

I suspect that the difference is not really due to arrow here. I believe that pandas to_sql loads data by using INSERT statements behind the scenes whereas the adbc connection is using COPY. They are both using libpq for the actual database communication, but using COPY is wayyyyy faster than INSERT, particularly if the INSERTS were separate statements rather than the multi value single statement INSERT.