Closed alukach closed 1 month ago
Looks like tests are failing, will get to that tomorrow.
Before
- load: 03:05:00
After
- load: 00:06:04
That seems like a big improvement!
@kylebarron, huge improvement! I quickly researched what's causing it to be so slow with a Pandas df. Making a more digestible write-up of why Pyarrow is so much faster might be interesting.
This is why arrow exists! It's going to be a lot faster than pandas. I think @bitner might like this result too
I suspect that the difference is not really due to arrow here. I believe that pandas to_sql
loads data by using INSERT statements behind the scenes whereas the adbc connection is using COPY. They are both using libpq for the actual database communication, but using COPY is wayyyyy faster than INSERT, particularly if the INSERTS were separate statements rather than the multi value single statement INSERT.
This PR makes use of Arrow Database Connectivity to ingest data into the database.
Additionally, we can also load data directly from S3. This is somewhat slower than loading from the local filesystem.
Before these changes, I was seeing the following ingestion times: