z3z1ma / dbt-osmosis

Provides automated YAML management, a dbt server, streamlit workbench, and git-integrated dbt model output diff tools
https://z3z1ma.github.io/dbt-osmosis/
Apache License 2.0

[Feature request] parse SQL to generate list of columns #152

Open olhmr opened 1 month ago

olhmr commented 1 month ago

In a similar vein to https://github.com/z3z1ma/dbt-osmosis/issues/57, I'm looking for ways to run this without needing to connect to the data warehouse.

My understanding of the current implementation is that columns are either pulled from the warehouse or expected to have been generated manually ahead of time. What I would like to see, in addition, is an option to parse the compiled SQL, derive the names of the columns output by the model, and scaffold those into the YAML files, with appropriate upstream descriptions where applicable.

Presumably this would be added somewhere here, and controlled through a CLI flag. We can require a manifest, similar to the catalog option, to avoid having to parse the dbt project at runtime.

This would enable fully self-contained generation and propagation of documentation, without the need for large amounts of manual data entry or connections to external systems.
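To make this concrete, here is a rough sketch of the kind of flow I have in mind, using sqlglot as an example parser (which, as far as I know, is not currently a dbt-osmosis dependency) on a model's compiled SQL. The function name and paths are made up for illustration:

```python
# Hypothetical sketch only -- none of these names exist in dbt-osmosis today,
# and sqlglot is assumed as the parser, not a current dependency.
from pathlib import Path

import sqlglot
import yaml


def scaffold_model_yaml(compiled_sql_path: str, model_name: str, dialect: str = "snowflake") -> str:
    """Parse a model's compiled SQL and return a scaffolded schema.yml entry."""
    sql = Path(compiled_sql_path).read_text()
    expression = sqlglot.parse_one(sql, read=dialect)

    # named_selects gives the output names of the outermost SELECT. A bare
    # `select *` cannot be resolved this way without upstream schema info.
    columns = [name for name in expression.named_selects if name and name != "*"]

    return yaml.safe_dump(
        {
            "version": 2,
            "models": [{"name": model_name, "columns": [{"name": c} for c in columns]}],
        },
        sort_keys=False,
    )


# e.g. (path is illustrative):
print(scaffold_model_yaml("target/compiled/my_project/models/staging/stg_orders.sql", "stg_orders"))
```

Propagating upstream descriptions could then work the same way it does today, just keyed off the parsed column names instead of warehouse/catalog columns.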

I'm happy to take a stab at this if you think it makes sense to include.

z3z1ma commented 2 weeks ago

That's a cool idea. Implementation-wise, it's not an easy task; it's fraught with edge cases: star expansion, lexical scoping, etc. Not to mention that dbt SQL is not valid SQL, since it contains Jinja markup, so we cannot generate an AST from it directly. The best bet for making this successful would be to use the compiled SQL. Compilation can be done without a connection to the DWH, I think? Unless there are introspective queries, of course. So maybe we start there and do a best effort?
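To illustrate the star-expansion problem even on compiled SQL, here is a hypothetical sketch with sqlglot (not something dbt-osmosis does today): a bare `select *` can only be expanded if the upstream relation's columns are already known, and that schema information is exactly what we would otherwise get from the DWH or catalog.

```python
# Hypothetical illustration: expanding `select *` from compiled SQL requires
# upstream schema info, which would itself have to come from parsing every
# upstream model or from a catalog.
import sqlglot
from sqlglot.optimizer import optimize

compiled_sql = "select * from analytics.stg_orders"

# No usable column names here -- the star cannot be resolved without upstream info.
print(sqlglot.parse_one(compiled_sql).named_selects)

# If a schema for the upstream relation is supplied, the optimizer can expand the star.
schema = {"analytics": {"stg_orders": {"order_id": "int", "ordered_at": "timestamp"}}}
expanded = optimize(sqlglot.parse_one(compiled_sql), schema=schema)
print(expanded.named_selects)  # ['order_id', 'ordered_at']
```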

kokorin commented 1 week ago

There is a built-in dbt macro, `get_columns_in_query`.

But I think the main problem would be dealing with model dependencies. Say we have two models, A and B, where A is `select 1 as id` and B is `select * from {{ ref('A') }}`. A has to be run before the columns and types of B can be deduced using `get_columns_in_query`, so after every change to A we would have to run it just to update model B's columns.

The only alternative to that is tweaking the materialization of every upstream model to ephemeral and then using `get_columns_in_query`.

I think the only reliable approach is to use the new dbt 1.8 `--empty` flag to build the changed models and all downstream models, and then run dbt-osmosis.
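Roughly, that workflow would look something like this (flag and command names as I understand them for dbt 1.8 and current dbt-osmosis; selectors and state paths would need adjusting per project):

```sh
# Build changed models and everything downstream with no data, so the relations
# (and therefore their columns and types) exist in the warehouse.
dbt run --empty --select state:modified+ --state path/to/prod-artifacts

# Then let dbt-osmosis pick up the columns and propagate documentation.
dbt-osmosis yaml refactor
```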

z3z1ma commented 1 week ago

> In a similar vein to https://github.com/z3z1ma/dbt-osmosis/issues/57, I'm looking for ways to run this without needing to connect to the data warehouse.

`get_columns_in_query` defeats the purpose, since it needs to connect to the DWH, and it is basically the same mechanism we already use under the hood.