z3z1ma / dbt-osmosis

Provides automated YAML management, a dbt server, a Streamlit workbench, and git-integrated dbt model output diff tools
https://z3z1ma.github.io/dbt-osmosis/
Apache License 2.0

Allow reading from catalog for document #60

Closed · b-per closed this 1 year ago

b-per commented 1 year ago

Fixes #57

First shot at adding support for providing a catalog file as input.

Things that you might want me to change:

z3z1ma commented 1 year ago

Here is some case sensitivity leaking into our inheritance: https://github.com/z3z1ma/dbt-osmosis/blob/main/src/dbt_osmosis/core/osmosis.py#L728

More inconsistency here, where we use lower (which is fine since it is applied to both sides of the comparison, but it is not semantically correct per the database): https://github.com/z3z1ma/dbt-osmosis/blob/main/src/dbt_osmosis/core/osmosis.py#L774

The main comparisons:

remove_columns_not_in_database(...) https://github.com/z3z1ma/dbt-osmosis/blob/main/src/dbt_osmosis/core/osmosis.py#L932

update_undocumented_columns_with_prior_knowledge(...) https://github.com/z3z1ma/dbt-osmosis/blob/main/src/dbt_osmosis/core/osmosis.py#L957 and https://github.com/z3z1ma/dbt-osmosis/blob/main/src/dbt_osmosis/core/osmosis.py#L962
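To illustrate the pattern being flagged (a sketch, not the actual osmosis code):

# Lowercasing both sides keeps the match symmetric, so the comparison still works,
# but neither side reflects the identifier the database actually stores
# (e.g. Snowflake folds unquoted identifiers to UPPERCASE).
db_name, yaml_name = "CUSTOMER_ID", "customer_id"
assert db_name.lower() == yaml_name.lower()  # matches, but as "customer_id"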

z3z1ma commented 1 year ago

@b-per

I am thinking we merge your branch as-is and I will make a follow-up issue/PR with my notes above to fix up column comparison.

A second follow-up issue will be about preferring the catalog path. The obvious consideration is that you would not want to pursue the on-the-fly catalog path if you are running a subset of models. So maybe we consider the number of queued models based on the selection criteria and decide then. (This is independent of the "no introspection" catalog usage you have added here 😄 and is more about our general approach in an introspective run.)

LMK if you have any thoughts or objections.

chasegiles-built commented 1 year ago

@z3z1ma Here is another spot where there is a case-sensitive comparison.

Should we just apply self.column_casing(col) to yaml_columns_ordered to avoid adding casing logic to every comparison?
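A minimal sketch of that idea (apply_column_casing and its folding rules below are hypothetical stand-ins for the column_casing helper referenced above):

def apply_column_casing(columns: list, adapter_type: str) -> list:
    # Hypothetical helper: fold every YAML column name to the adapter's
    # identifier convention once, up front, so downstream comparisons
    # need no casing logic of their own
    fold = str.upper if adapter_type == "snowflake" else str.lower
    return [fold(c) for c in columns]

# e.g. yaml_columns_ordered = apply_column_casing(yaml_columns_ordered, "snowflake")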

z3z1ma commented 1 year ago

@chasegiles-built I think we want a dedicated Column class that all columns are marshaled into. From there we can define a custom __eq__ method that accounts for the target.type (i.e. the adapter). That way we preserve whatever casing the user is using, but can still compare columns in sets, loops, c in cols checks, etc. without mutating what we hold in memory as the source of truth for what the user wrote. I can tackle that as soon as I merge this.

# The advantage of this approach, other than being very simple, is that it should
# work with sets, lists, in / not in, and so on without having to call some special
# casing func everywhere or needing more list comprehensions. Also, the name
# property will respect whatever casing was originally given to it outside of
# comparison, which means we don't mutate the user's YAML to all uppercase
# (since that isn't required by dbt either and may be user preference).

from __future__ import annotations

from enum import Enum


class types(Enum):
    # Placeholder for the adapter/target type enum referenced in __eq__ below
    Snowflake = "snowflake"


class Column:
    def __init__(self, name: str, db_type: str | Enum):
        self.name = name
        self.db_type = db_type

    def __str__(self):
        return self.name

    def __repr__(self):
        return f"Column(name={self.name}, db_type={self.db_type})"

    def __eq__(self, other: object):
        if not isinstance(other, Column):
            return NotImplemented
        # Snowflake folds unquoted identifiers to uppercase, so compare case-insensitively
        if self.db_type == types.Snowflake:
            return self.name.upper() == other.name.upper()
        return self.name == other.name

    def __hash__(self):
        # Fold the name so columns that compare equal also hash equal
        # (required for sets and dict keys to behave)
        return hash(self.name.upper())
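A quick usage sketch (the types placeholder above stands in for the real adapter enum):

# Membership checks now work without a casing helper at each call site
db_cols = [Column("ID", types.Snowflake), Column("CREATED_AT", types.Snowflake)]
yaml_col = Column("id", types.Snowflake)

assert yaml_col in db_cols    # True: __eq__ folds case for Snowflake
assert str(yaml_col) == "id"  # the user's original casing is preserved for YAML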

EDIT: what you proposed might work too, but I think the above is clearer from a future extensibility/hackability perspective.

b-per commented 1 year ago

While the handling of different casing definitely needs to be enhanced, I agree with you that it can be done in a separate issue. It shouldn't be too much work to refactor what I added afterwards.

If you are ok with it, I can add the new flag to the refactor function, and we could then consider this one complete and merge it.

On the topic of getting the columns for each model via the adapter versus running dbt docs generate: in my mind, it depends on the number of models that need to be documented. dbt docs generate can start taking some time on larger projects, and we might not want to run the whole thing just to document a handful of models.

z3z1ma commented 1 year ago

@b-per

Agree on both. There is a sweet spot somewhere above which docs generate would be faster. I will try to find a sane default and otherwise expose it as an env var for user modification in a follow-up issue.
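A minimal sketch of what that heuristic could look like (the env var name, default value, and function below are hypothetical, not part of dbt-osmosis):

import os

# Hypothetical threshold: above this many queued models, one full
# dbt docs generate run is assumed to beat per-model introspection
CATALOG_THRESHOLD = int(os.getenv("DBT_OSMOSIS_CATALOG_THRESHOLD", "25"))


def should_generate_catalog(queued_models: list) -> bool:
    # Decide based on how many models the selection criteria queued
    return len(queued_models) >= CATALOG_THRESHOLD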

Once you add to refactor, I will merge.

b-per commented 1 year ago

Done, I have added the config for refactor 😃