Closed b-per closed 1 year ago
Here is some case sensitivity leaking into our inheritance. https://github.com/z3z1ma/dbt-osmosis/blob/main/src/dbt_osmosis/core/osmosis.py#L728
More inconsistency where we use `lower()` (which works since it is applied to both sides of the comparison, but is not semantically correct per the database) https://github.com/z3z1ma/dbt-osmosis/blob/main/src/dbt_osmosis/core/osmosis.py#L774
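To illustrate the distinction (a minimal sketch using Snowflake's documented identifier folding): lowercasing both sides makes the equality check pass, but uppercase is the database's canonical form.

```python
# Snowflake folds unquoted identifiers to UPPERCASE, so a column declared
# as customer_id is stored and reported by the catalog as CUSTOMER_ID.
yaml_name = "customer_id"
db_name = "CUSTOMER_ID"

# Equality holds whichever direction you fold...
assert yaml_name.lower() == db_name.lower()
assert yaml_name.upper() == db_name.upper()

# ...but only upper() reproduces the database's canonical identifier,
# which matters if the normalized name is ever surfaced or written back.
assert db_name == yaml_name.upper()
```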
Main comparisons:

- `remove_columns_not_in_database(...)`
  https://github.com/z3z1ma/dbt-osmosis/blob/main/src/dbt_osmosis/core/osmosis.py#L932
- `update_undocumented_columns_with_prior_knowledge(...)`
  https://github.com/z3z1ma/dbt-osmosis/blob/main/src/dbt_osmosis/core/osmosis.py#L957
  https://github.com/z3z1ma/dbt-osmosis/blob/main/src/dbt_osmosis/core/osmosis.py#L962
@b-per
I am thinking we merge your branch as-is and I will make a follow-up issue/PR with my notes above to fix up column comparison.
A second follow-up issue will be on preferring the catalog path. The obvious consideration is that you would not want to pursue the on-the-fly catalog path if you are running a subset of models. So maybe we consider the number of queued models based on the selection criteria and decide then. (This is independent of the "no introspection" catalog usage you have added here 😄 and more about our general approach in an introspective run.)
LMK if you have any thoughts or objections.
@z3z1ma Here is another spot where there is a case-sensitive comparison.
Should we just apply `self.column_casing(col)` to `yaml_columns_ordered` to avoid adding the case logic to all of the comparisons?
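To sketch that suggestion (assuming `column_casing` simply applies the adapter's identifier folding; the uppercase behavior and column names here are illustrative, not the real implementation):

```python
def column_casing(col: str) -> str:
    # Placeholder: Snowflake-style uppercasing; the real helper would
    # dispatch on the adapter / target type.
    return col.upper()

yaml_columns = ["customer_id", "Order_Date"]
# Normalize once while building the ordered list so every downstream
# comparison can use plain equality / `in` checks:
yaml_columns_ordered = [column_casing(c) for c in yaml_columns]

assert yaml_columns_ordered == ["CUSTOMER_ID", "ORDER_DATE"]
assert column_casing("order_date") in yaml_columns_ordered
```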
@chasegiles-built
I think we want a dedicated `Column` class that all columns are marshaled into. From there we can define a custom `__eq__` method which accounts for the `target.type` (i.e. the adapter). In this way we can preserve whatever casing the user is using but compare columns in sets / loops / `c in cols` / etc. without mutating what we have in memory as the source of truth for whatever the user has. I can tackle that as soon as I merge this.
```python
# The advantage to this approach, other than being very simple, is that it should
# work with sets, lists, in / not in, and so on without having to call some special
# casing func everywhere or needing more list comprehensions. Also the `name`
# property will respect whatever casing was originally given to it outside of
# comparison, which means we don't mutate users' YAML to all uppercase (since it
# isn't required by dbt either and may be user preference)
class Column:
    def __init__(self, name: str, db_type: str | Enum):
        self.name = name
        self.db_type = db_type

    def __str__(self):
        return self.name

    def __repr__(self):
        return f"Column(name={self.name}, db_type={self.db_type})"

    def __eq__(self, other: "Column"):
        if self.db_type == types.Snowflake:
            return self.name.upper() == other.name.upper()
        return self.name == other.name

    def __hash__(self):
        # Must agree with __eq__ so instances behave correctly in sets/dicts
        if self.db_type == types.Snowflake:
            return hash(self.name.upper())
        return hash(self.name)
```
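For illustration, here is a runnable version of that sketch, with a made-up `DbType` enum standing in for the `types.Snowflake` placeholder / adapter target type (names are hypothetical, not dbt-osmosis API):

```python
from enum import Enum

class DbType(Enum):  # stand-in for the adapter / target.type
    SNOWFLAKE = "snowflake"
    POSTGRES = "postgres"

class Column:
    def __init__(self, name: str, db_type: DbType):
        self.name = name
        self.db_type = db_type

    def _key(self) -> str:
        # Snowflake folds unquoted identifiers to uppercase, so compare
        # case-insensitively there; elsewhere compare as-is.
        return self.name.upper() if self.db_type is DbType.SNOWFLAKE else self.name

    def __eq__(self, other):
        return isinstance(other, Column) and self._key() == other._key()

    def __hash__(self):
        return hash(self._key())

yaml_col = Column("customer_id", DbType.SNOWFLAKE)
db_col = Column("CUSTOMER_ID", DbType.SNOWFLAKE)

assert yaml_col == db_col              # case-insensitive on Snowflake
assert yaml_col in {db_col}            # works in sets without extra casing logic
assert yaml_col.name == "customer_id"  # original YAML casing is preserved
assert Column("a", DbType.POSTGRES) != Column("A", DbType.POSTGRES)
```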
EDIT: what you proposed might work too, but I think the above might be clearer from a future extensibility/hackability perspective.
While the handling of different casing definitely needs to be enhanced, I agree with you that it can be done in a separate issue. It shouldn't be too much work to then refactor what I added.
If you are ok with it, I can add the new flag to the `refactor` function and we could then consider this one closed and merge it.
On the topic of getting the columns for each model vs. using the adapter capability to run `dbt docs generate`: in my mind, it would depend on the number of models that need to be documented. `dbt docs generate` can start taking some time on larger projects, and we might not want to run it all just to document a handful of models.
@b-per
Agree on both. There is a sweet spot somewhere beyond which `docs generate` would be faster. I will try to find a sane default and otherwise expose it as an env var for user modification in a follow-up issue.
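A minimal sketch of that idea (the env var name, default threshold, and function name here are placeholders for illustration, not part of dbt-osmosis):

```python
import os

# Hypothetical knob: below the threshold, query each selected model's
# columns individually; at or above it, generate the full catalog instead.
CATALOG_THRESHOLD = int(os.getenv("DBT_OSMOSIS_CATALOG_THRESHOLD", "10"))

def should_generate_catalog(num_queued_models: int,
                            threshold: int = CATALOG_THRESHOLD) -> bool:
    """Return True when a full `dbt docs generate` is likely cheaper than
    one introspective query per selected model."""
    return num_queued_models >= threshold

assert should_generate_catalog(50) is True   # big selection: build the catalog
assert should_generate_catalog(1) is False   # handful of models: query directly
```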
Once you add it to `refactor`, I will merge.
Done, I have added the config for `refactor` 😃
Fixes #57
First shot at adding support for providing a catalog file as input.
Things that you might want me to change:
- `col.lower()` - I had to use it to make the code work with my own project on Snowflake, where the YML files are defined with lowercase columns but are stored as uppercase in `catalog.json`
- `--catalog-file`, or how we want users to be able to pick between offline (catalog) vs online (connecting to the warehouse)
- `parts`, to match the function parameter and catalog file