z3z1ma / dbt-osmosis

Provides automated YAML management, a dbt server, streamlit workbench, and git-integrated dbt model output diff tools
https://z3z1ma.github.io/dbt-osmosis/
Apache License 2.0
422 stars · 45 forks

Add metadata to connect source models to "staging" models to enable more robust inheritance #52

Closed jaredx435k2d0 closed 3 months ago

jaredx435k2d0 commented 1 year ago

Given the following, when I run `dbt-osmosis yaml refactor`, the description from the source table `addresses` doesn't bubble up to the `stg_some_schema__addresses` staging model's `_stg_some_schema.yml` (even though the description for the `ID` column does):

_this_db__some_schema__sources.yml:

```yaml
version: 2

sources:
  - name: this_db__some_schema__source
    database: this_db
    schema: some_schema
    tables:
      - name: addresses
        description: 'this table has addresses'
        columns:
          - name: ID
            description: 'this is the ID'
```

_stg_some_schema.yml:

```yaml
version: 2

models:
  - name: stg_some_source__addresses
    columns:
      - name: ID
        description: this is the ID
```

stg_some_schema__addresses.sql:

```sql
SELECT
    ID
,   NAME
,   ADDRESS_LINE1
,   ADDRESS_LINE2
,   CITY
,   DISTRICT_OR_COUNTY
,   STATE_CODE
,   COUNTRY_CODE
,   POSTAL_CODE
,   CREATED_AT
,   UPDATED_AT
,   _FIVETRAN_SYNCED

FROM
    {{ source("this_db__some_schema__source", "addresses") }}
```

I'm doing one .yml file per source, and one .yml file per corresponding staging schema.

Python 3.10.10, dbt-core==1.4.5, dbt-extractor==0.4.1, dbt-osmosis==0.11.14, dbt-snowflake==1.4.2

z3z1ma commented 1 year ago

@jaredx435k2d0 we do not propagate source table descriptions.

There can be 100 tables that use some subset of columns from table_A; we would not want to propagate a description from table_A to all 100 clearly different tables. Table descriptions do not propagate. Now, a good question lies around the fuzzy boundary between source -> staging models, since, typically by virtue of "best practices", they are 1:1. Still, we cannot rely on that without some connective tissue: some metadata.

jaredx435k2d0 commented 1 year ago

Agreed on all points!

> Now a good question lies around the fuzzy boundary between source -> staging models since, typically by virtue of "best practices", they are 1:1

This specifically would be something I'd expect as a user. That said, I hear you when you say you can't rely on a given project following these practices.

jaredx435k2d0 commented 1 year ago

Just to be clear, when the documentation says

> Automatically generate documentation based on upstream documented columns

where/when exactly does this apply? Does it propagate indefinitely from model to model (excluding, as you've said, sources)?

z3z1ma commented 1 year ago

@jaredx435k2d0 So, it's specifically for columns, and what it does exactly is this:

  1. Iterate through models
  2. For each model, traverse the depends_on edges recursively and build a graph
  3. Traverse the graph in reverse and build a dictionary where keys are column names and values are metadata like descriptions
  4. Check if any columns in the current model are not documented and fill in the documentation with the upstream knowledge
  5. Update state (YAML file on disk/in memory Manifest) and continue to next node
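The five steps above can be sketched roughly as follows. This is an illustrative simplification, not the actual dbt-osmosis implementation; all function names and data structures here are invented for clarity.

```python
# Hypothetical sketch of dbt-osmosis's column-inheritance pass.
# All names and structures below are invented for illustration.

def ancestors_farthest_first(node, depends_on):
    """Step 2: traverse depends_on edges recursively, yielding the
    most distant ancestors first (simple depth-first walk)."""
    out = []
    def visit(n):
        for parent in depends_on.get(n, ()):
            visit(parent)
            if parent not in out:
                out.append(parent)
    visit(node)
    return out

def inherit_descriptions(node, depends_on, columns):
    """Steps 3-4: build a knowledge dict of column -> description from
    upstream models, then fill in any undocumented columns on `node`."""
    knowledge = {}
    for ancestor in ancestors_farthest_first(node, depends_on):
        for col, desc in columns[ancestor].items():
            if desc:
                knowledge[col] = desc  # nearer ancestors overwrite farther ones
    for col, desc in columns[node].items():
        if not desc and col in knowledge:
            columns[node][col] = knowledge[col]  # step 5: update state

# Toy "manifest": a staging model inheriting a documented column.
depends_on = {"stg_addresses": ["int_addresses"], "int_addresses": []}
columns = {
    "int_addresses": {"ID": "this is the ID"},
    "stg_addresses": {"ID": "", "CITY": ""},
}
inherit_descriptions("stg_addresses", depends_on, columns)
print(columns["stg_addresses"]["ID"])  # -> this is the ID
```

Note that `CITY` stays undocumented here because no ancestor documents it; only 1:1 column-name matches are filled in.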

z3z1ma commented 1 year ago

The noteworthy details are that column names must match 1:1, and precedence follows DAG distance: if you have models A -> B -> C -> D, and models A and C both have a column called order_number, then when evaluating D, the definition from C takes precedence, as it overwrites the key in the knowledge dict. When evaluating B, it obviously won't use info from C, since C is not upstream of B. The effect is a fairly natural inheritance. The real opportunity lies in leveraging LLMs to fill in the blanks.
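The precedence rule can be shown with a minimal, self-contained demo (descriptions and model names are made up): the knowledge dict is populated from the farthest ancestor (A) to the nearest (C), so a later `update` overwrites an earlier one.

```python
# Minimal demo of knowledge-dict precedence for the chain A -> B -> C -> D.
# Illustrative only; not the real dbt-osmosis data structures.

knowledge = {}
ancestors_of_D = [  # ordered farthest first
    ("A", {"order_number": "order id, defined in A"}),
    ("B", {}),  # B leaves order_number undocumented
    ("C", {"order_number": "order id, redefined in C"}),
]
for model, documented_cols in ancestors_of_D:
    knowledge.update(documented_cols)  # later (nearer) models win

print(knowledge["order_number"])  # -> order id, redefined in C
```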

z3z1ma commented 3 months ago

Closing this issue. We implicitly support camelCase <-> snake_case and case-insensitive column-name matching. Beyond that, we should have a more specific issue if there is still a pain point here.
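The camelCase <-> snake_case, case-insensitive matching mentioned above could be implemented by normalizing both names before comparison, along these lines (an illustrative sketch, not the actual dbt-osmosis code):

```python
# Hypothetical normalization so orderNumber, OrderNumber, and
# ORDER_NUMBER all compare equal. Not the real dbt-osmosis logic.
import re

def normalize(column: str) -> str:
    """Convert a column name to lowercase snake_case for matching."""
    # Insert an underscore before an uppercase letter that follows a
    # lowercase letter or digit (camelCase boundary), then lowercase.
    snake = re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", column)
    return snake.lower()

print(normalize("orderNumber"))   # -> order_number
print(normalize("ORDER_NUMBER"))  # -> order_number
```

With this, a source column `ORDER_NUMBER` and a staging column `orderNumber` would be treated as the same column for inheritance purposes.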