z3z1ma / dbt-osmosis

Provides automated YAML management, a dbt server, streamlit workbench, and git-integrated dbt model output diff tools
https://z3z1ma.github.io/dbt-osmosis/
Apache License 2.0

Feature Request: Snowflake: Send queries in batches for significant performance gain #39

Closed jaredx435k2d0 closed 1 year ago

jaredx435k2d0 commented 1 year ago

Related to #38

It seems like running `yaml refactor` sends the `DESCRIBE TABLE ...` statements to Snowflake sequentially, one by one, as it goes. It'd be great if this went a lot faster.

It also seems to execute `show terse objects in [db_name].[schema_name]` every single time it runs `describe table`, when that could probably just be executed one time upfront.

**Describe the solution you'd like**
Would it be reasonable to queue up all of those database statements upfront and work through the results as they return, so that the whole run completes much more quickly? (See the sketch after the context section below.)

**Describe alternatives you've considered**
Can't really think of any, apart from just using it as it is and waiting much longer.

**Additional context**
- Python 3.9.16
- macOS 13.2 (22D49)
- Snowflake 7.3.1
- dbt-core==1.3.2
- dbt-extractor==0.4.1
- dbt-osmosis==0.9.8
- dbt-snowflake==1.3.0
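
Purely as an illustration of the batching idea described above (this is not dbt-osmosis code; the connection parameters and table list are placeholders, and it assumes the Snowflake connector's asynchronous query API, `execute_async` / `get_results_from_sfqid`):

```python
# Hypothetical sketch: submit every DESCRIBE TABLE statement up front with
# Snowflake's async execution, then collect the results as they come back.
import os

import snowflake.connector

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
)

tables = ["MY_DB.MY_SCHEMA.ACCOUNT", "MY_DB.MY_SCHEMA.CONTACT"]  # placeholder names

# Queue all statements without waiting on any of them.
query_ids = {}
for table in tables:
    cur = conn.cursor()
    cur.execute_async(f"describe table {table}")
    query_ids[table] = cur.sfqid

# Drain the results as they return.
columns_by_table = {}
for table, query_id in query_ids.items():
    cur = conn.cursor()
    cur.get_results_from_sfqid(query_id)  # blocks until that query finishes
    columns_by_table[table] = cur.fetchall()
```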

jaredx435k2d0 commented 1 year ago

One helpful example of this is working with the Fivetran Connector for Salesforce.

The resulting schema has 15,879 columns across 759 tables. Running yaml refactor for just that one schema takes around two hours (using an X-Large Snowflake warehouse with nothing else running on it, executing from an M1 MacBook Pro with little else running on it).

I'd personally want to be able to run this on a regular basis to ensure our repo and documentation stay up to date, but I'd like it to run much more quickly so that the resources of the machine it's running on (an EC2 instance, for example) aren't consumed for so long doing this.

z3z1ma commented 1 year ago

@jaredx435k2d0 I have now taken a first stab at parallelizing the yaml stuff. I would expect a massive speed increase.

The `show terse objects in [db_name].[schema_name]` query is super interesting because that must be an implementation detail of the underlying Snowflake adapter, or perhaps just how we are using it.

I am closing this issue because queries should now be sent in parallel. We can re-open a more specific issue for the `show terse objects` query, which will require a peek at where it is called in dbt-snowflake and where I am calling that caller.
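
Roughly the shape of that parallelization, sketched against the generic dbt adapter protocol (the connection names, function names, and worker count here are assumptions for illustration, not the actual dbt-osmosis implementation):

```python
# Hypothetical sketch: fan the per-relation column lookups out over a thread
# pool so metadata queries run concurrently instead of one at a time.
from concurrent.futures import ThreadPoolExecutor

def _describe(adapter, relation):
    # Each thread uses its own named connection in dbt's connection manager.
    with adapter.connection_named(f"dbt_osmosis_{relation.identifier}"):
        return relation, adapter.get_columns_in_relation(relation)

def describe_all(adapter, relations, max_workers=8):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(_describe, adapter, rel) for rel in relations]
        return dict(future.result() for future in futures)
```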

jaredx435k2d0 commented 1 year ago

@z3z1ma So much faster! Love it. THANK you.

I looked in my query history and still see one `DESCRIBE TABLE` statement per table. I also see several repeated `show terse objects in [db_name].[schema_name]` statements for the same schema (when I'm only running it for one schema).

I'd anticipate that, while these queries are fast, they're probably still slowing things down considerably (and incurring extra Snowflake costs) due to having to do one round trip to Snowflake per table.

- Python 3.10.10
- dbt-core==1.4.5
- dbt-extractor==0.4.1
- dbt-osmosis==0.11.14
- dbt-snowflake==1.4.2
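
One database-specific way to cut the per-table round trips (illustrative only; the maintainer explains below why dbt-osmosis avoids adapter-specific code) would be a single `information_schema.columns` query per schema instead of one `DESCRIBE TABLE` per table:

```python
# Hypothetical sketch: pull every column for a schema in one query and group
# the rows by table, instead of issuing DESCRIBE TABLE once per table.
from collections import defaultdict

COLUMNS_SQL = """
select table_name, column_name, data_type
from {database}.information_schema.columns
where table_schema = upper('{schema}')
order by table_name, ordinal_position
"""

def columns_by_table(cursor, database: str, schema: str) -> dict:
    cursor.execute(COLUMNS_SQL.format(database=database, schema=schema))
    grouped = defaultdict(list)
    for table_name, column_name, data_type in cursor.fetchall():
        grouped[table_name].append((column_name, data_type))
    return grouped
```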

z3z1ma commented 1 year ago

@jaredx435k2d0 This is where the `DESCRIBE TABLE` is being called: https://github.com/dbt-labs/dbt-snowflake/blob/main/dbt/include/snowflake/macros/adapters.sql#L112

It's outside my realm of control and inside the actual dbt adapter internals. One thing that is unique about dbt-osmosis, for what it's worth, is that it is compatible with all existing (and future) dbt adapters, since it exclusively uses the adapter protocol that every adapter is required to implement. That means there are database-specific optimizations I could make, but it significantly steps up the complexity even if it sounds simple, and I am not sure that is the right path just to eke out some more performance.

After a full run, which could be done in CI, you can scope your runs to specific folder paths to make things faster if you have a ton of models. Another potential optimization I could foresee is storing a cache of file checksums so we don't re-run for an unmodified dependency chain. We would need those checksums either on disk or in the warehouse.
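
A minimal sketch of that checksum-cache idea (the cache filename and on-disk format are assumptions, not an existing dbt-osmosis feature):

```python
# Hypothetical sketch: skip YAML work for models whose SQL file has not
# changed since the last run, using a checksum cache stored on disk.
import hashlib
import json
from pathlib import Path

CACHE_FILE = Path(".dbt_osmosis_checksums.json")  # assumed cache location

def file_checksum(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def models_needing_refresh(model_paths: list[Path]) -> list[Path]:
    cache = json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}
    stale = []
    for path in model_paths:
        digest = file_checksum(path)
        if cache.get(str(path)) != digest:
            stale.append(path)
            cache[str(path)] = digest
    CACHE_FILE.write_text(json.dumps(cache, indent=2))
    return stale
```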