z3z1ma / dbt-osmosis

Provides automated YAML management, a dbt server, streamlit workbench, and git-integrated dbt model output diff tools
https://z3z1ma.github.io/dbt-osmosis/
Apache License 2.0

[Feature Request] Determine `osmosis_progenitor` accurately at the same distance #172

Open yamamoto-yuta opened 1 month ago

yamamoto-yuta commented 1 month ago

Overview

When using the --add-progenitor-to-meta option, the resulting osmosis_progenitor: values can sometimes be incorrect.

Reproduction Steps

While I am currently investigating the exact scenarios that cause this issue, one example is provided below.

Consider the following set of models in a lineage graph (the letters A ~ C next to fct_item_shops denote the order of JOINs).

classDiagram
    class raw_shops {
        shop_id
        item_key
    }

    class raw_items {
        item_key
        item_code
    }

    class raw_item_shops {
        item_code
        shop_id
    }

    class stg_shops {
        shop_id
        item_key
    }

    class stg_items {
        item_key
        item_code
    }

    class stg_item_shops {
        item_code
        shop_id
    }

    class fct_item_shops {
        C.item_key
        A.item_code
        A.shop_id
    }

    %% 

    raw_shops --> stg_shops
    raw_items --> stg_items
    raw_item_shops --> stg_item_shops
    stg_shops --> fct_item_shops: LEFT JOIN shops B <br /> USING (shop_id) 
    stg_items --> fct_item_shops: LEFT JOIN items C  <br /> USING (item_code)
    stg_item_shops --> fct_item_shops: shop_items A

You can view the actual code in the following repository:

https://github.com/yamamoto-yuta/dbt-osmosis-inheritance-check

In this case, the propagation sources for each column in fct_item_shops should be as follows:

However, the actual result is as follows: the source of item_code is incorrectly identified as raw_items instead of raw_item_shops.

dbt-osmosis Execution Result

```yaml
version: 2
models:
  - name: stg_shops
    columns:
      - name: shop_id
        description: ''
        data_type: INT64
        meta:
          osmosis_progenitor: source.my_dbt_project.tmp_dbt_osmosis_test.raw_shops
      - name: item_key
        description: ''
        data_type: STRING
        meta:
          osmosis_progenitor: source.my_dbt_project.tmp_dbt_osmosis_test.raw_shops
  - name: stg_item_shops
    columns:
      - name: item_code
        description: ''
        data_type: STRING
        meta:
          osmosis_progenitor: source.my_dbt_project.tmp_dbt_osmosis_test.raw_item_shops
      - name: shop_id
        description: ''
        data_type: INT64
        meta:
          osmosis_progenitor: source.my_dbt_project.tmp_dbt_osmosis_test.raw_item_shops
  - name: stg_items
    columns:
      - name: item_key
        description: ''
        data_type: STRING
        meta:
          osmosis_progenitor: source.my_dbt_project.tmp_dbt_osmosis_test.raw_items
      - name: item_code
        description: ''
        data_type: STRING
        meta:
          osmosis_progenitor: source.my_dbt_project.tmp_dbt_osmosis_test.raw_items
  - name: fct_item_shops
    columns:
      - name: item_key
        description: ''
        meta:
          osmosis_progenitor: source.my_dbt_project.tmp_dbt_osmosis_test.raw_items
        data_type: STRING
      - name: item_code
        description: ''
        meta:
          osmosis_progenitor: source.my_dbt_project.tmp_dbt_osmosis_test.raw_items
        data_type: STRING
      - name: shop_id
        description: ''
        meta:
          osmosis_progenitor: source.my_dbt_project.tmp_dbt_osmosis_test.raw_shops
        data_type: INT64
```

Execution Environment

❯ dbt-osmosis --version
dbt-osmosis, version 0.13.2
❯ dbt --version
Core:
  - installed: 1.8.3
  - latest:    1.8.3 - Up-to-date!

Plugins:
  - bigquery: 1.8.2 - Up-to-date!

yamamoto-yuta commented 1 month ago

I was informed by @syou6162 that this is not a bug but rather an intended behavior.

dbt-osmosis determines the propagation source from the distance between nodes in the lineage graph. When several ancestors sit at the same distance and expose a column with the same name, it cannot tell which one the column actually comes from.

https://github.com/z3z1ma/dbt-osmosis/blob/53e5a8fb49bd5dabc638b4cf8de0485e4439f94a/src/dbt_osmosis/core/column_level_knowledge_propagator.py#L17-L40
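To make the tie concrete, here is a minimal sketch of the ambiguity using the models from this issue. It is a toy illustration, not the code linked above: the `PARENTS` and `COLUMNS` dictionaries and the functions `ancestor_distances` and `progenitor_candidates` are hypothetical names, and it assumes that candidates are matched by column name only and that the farthest matching ancestor is treated as the progenitor.

```python
from collections import deque

# Toy lineage from this issue: child -> direct parents.
# Hypothetical structure for illustration, not dbt-osmosis internals.
PARENTS = {
    "fct_item_shops": ["stg_item_shops", "stg_shops", "stg_items"],
    "stg_item_shops": ["raw_item_shops"],
    "stg_shops": ["raw_shops"],
    "stg_items": ["raw_items"],
}

# Column names exposed by each node, taken from the class diagram.
COLUMNS = {
    "raw_shops": ["shop_id", "item_key"],
    "raw_items": ["item_key", "item_code"],
    "raw_item_shops": ["item_code", "shop_id"],
    "stg_shops": ["shop_id", "item_key"],
    "stg_items": ["item_key", "item_code"],
    "stg_item_shops": ["item_code", "shop_id"],
}


def ancestor_distances(node):
    """Breadth-first walk upstream, returning {ancestor: distance from node}."""
    distances = {}
    queue = deque([(node, 0)])
    while queue:
        current, depth = queue.popleft()
        for parent in PARENTS.get(current, []):
            if parent not in distances:
                distances[parent] = depth + 1
                queue.append((parent, depth + 1))
    return distances


def progenitor_candidates(node, column):
    """Return every farthest ancestor that has a column with the same name.

    Assumption: matching is by column name only, so SQL aliases such as
    A.item_code are invisible here. More than one result means a tie.
    """
    matches = {
        ancestor: distance
        for ancestor, distance in ancestor_distances(node).items()
        if column in COLUMNS.get(ancestor, [])
    }
    if not matches:
        return []
    farthest = max(matches.values())
    return sorted(a for a, d in matches.items() if d == farthest)


for col in ["item_key", "item_code", "shop_id"]:
    print(col, progenitor_candidates("fct_item_shops", col))
```

Under these assumptions every column of fct_item_shops has two candidates at distance 2, so whichever one is picked depends on something other than the actual JOIN, such as iteration order. Resolving the tie correctly would need information about which upstream relation each selected column really comes from.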

Based on that, I have changed the issue title from a bug to a feature request.