Open olhmr opened 1 month ago
Thats a cool idea. Implementation wise, its not an easy task. Its fraught with edge cases. Star expansions, lexical scoping, etc. Not to mention dbt SQL is not valid sql since it has jinja markup so we cannot generate an AST. The best bet for making this successful would be to use the compiled SQL. Compilation can be done without a connection to the DWH I think? Unless there are introspective queries ofc. So maybe we start there and do a best effort?
There is built-in DBT macro get_columns_in_query
But I think the main problem would be dealing with model dependencies. If we have 2 models A and B.
A as select 1 as id
and B as select * from {{ ref('A') }}
. A has to be ran before deducing columns and types using get_columns_in_query
. So after every change to A we have to run it to update model B's columns.
The only alternative to that is tweaking materialization of every upstream model to ephemeral
and then using get_columns_in_query
.
I think the only reliable approach is to use new DBT 1.8 feature --empty
to build changed models and all downstream models and then running dbt-osmosis
.
In a similar vein to https://github.com/z3z1ma/dbt-osmosis/issues/57, I'm looking for ways to run this without needing to connect to the data warehouse.
get_columns_in_query
defeats the purpose since it needs to connect to the DWH and also we use the same mechanism under the hood basically.
In a similar vein to https://github.com/z3z1ma/dbt-osmosis/issues/57, I'm looking for ways to run this without needing to connect to the data warehouse.
My understanding of the current implementation is that we either pull columns from the warehouse, or rely on them being manually generated ahead of time. What I would like to see additionally is an option to parse the compiled SQL, derive the names of the columns outputted by the model, and scaffold those into the yaml files - with appropriate upstream descriptions if applicable.
Presumably this would be added somewhere here, and controlled through a CLI flag. We can require a manifest, similar to the catalog option, to avoid parsing dbt during runtime.
This would enable fully self-contained generation and propagation of documentation, without the need for large manual data entry or connections to external systems.
I'm happy to take a stab at this if you think it makes sense to include?