z3z1ma / dbt-osmosis

Provides automated YAML management, a dbt server, streamlit workbench, and git-integrated dbt model output diff tools
https://z3z1ma.github.io/dbt-osmosis/
Apache License 2.0
422 stars 45 forks source link

dbt-osmosis yaml refactor is removing nested values #45

Closed awoehrl closed 1 year ago

awoehrl commented 1 year ago

Hi,

I have some arrays in my BigQuery source table that are documented in my source.yml file as columns with dot notation. When I run dbt-osmosis yaml refactor, these fields are removed unfortunately.

Example yaml:

columns:
          - name: images
            description: "Array with image urls and caption"
          - name: images.url
            description: "The url of the image"
          - name: images.caption
            description: "The caption of the image"

Output from dbt-osmosis yaml refactor:

INFO     🔧 Removing column images.caption from dbt schema                                                                                                                                                                                                    
INFO     🔧 Removing column images.url from dbt schema   

Cheers Andreas

z3z1ma commented 1 year ago

I think I have an idea of how I may be able to take care of this 🤔 probably using the method builtin the the bigquery adapter ColumnInfo subclass.

z3z1ma commented 1 year ago

This is now resolved as of latest version 0.11.6 @awoehrl 🎉

image
volderette commented 1 year ago

Wow, thank you so much! It's working great!

yu-iskw commented 4 months ago

@z3z1ma @awoehrl dbt-osmosis today enables us to generate nested fields including arrays. As you may know, ARRAY data type of BigQuery is a bit tricky, because it express a combination of the mode and data type. For instance, an array of string in BigQuery is expressed by the REPEATED mode and the STRING data type. Currently, dbt-osmosis import only the data type of an array. That is, a data_type property to an array of string is just STRING. The information of the array data type is dropped. How can we handle the information of array columns for BigQuery in dbt-osmosis?

Query to create a dbt source

SELECT
  1 AS test_integer,
  "X" AS test_string,
  [0, 1, 2, 3, 4] AS test_array,
  STRUCT(
    1 AS nested_integer,
    [0, 1, 2, 3] AS nested_array,
    STRUCT(
        1 AS nested_integer,
        [0, 1, 2, 3] AS nested_array
    ) AS nested_struct
  ) AS test_struct

Generated YAML by dbt-osmosis

      - name: test_source
        columns:
          - name: test_integer
            description: ''
            data_type: INT64
          - name: test_string
            description: ''
            data_type: STRING
          - name: test_array
            description: ''
            data_type: INT64
          - name: test_struct
            description: ''
            data_type: RECORD
          - name: test_struct.nested_integer
            description: ''
            data_type: INT64
          - name: test_struct.nested_array
            description: ''
            data_type: INT64
          - name: test_struct.nested_struct.nested_integer
            description: ''
            data_type: INT64
          - name: test_struct.nested_struct.nested_array
            description: ''
            data_type: INT64