z3z1ma / target-bigquery

target-bigquery is a Singer target for BigQuery. It supports storage write, GCS, streaming, and batch load methods. Built with the Meltano SDK.
MIT License

Schema evolution error when `denormalized` and `snake_case` are enabled #66

Open qgab-flowdesk opened 1 year ago

qgab-flowdesk commented 1 year ago

I'm running into issues loading data with the `denormalized` parameter enabled. Looking into the code, it seems that even when the table does not yet exist in BigQuery, we run into the following error:

    google.api_core.exceptions.BadRequest: 400 PATCH https://bigquery.googleapis.com/bigquery/v2/projects/project-id/datasets/salesforce/tables/campaign?prettyPrint=false: Field id already exists in schema

The table schema is properly defined in BigQuery, but when it comes to loading the actual data, it looks like the table schema is not resolved correctly. The issue persists with both versions 1 and 2 of the schema_resolver. Maybe there is an issue with the schema_translator?

If someone has Salesforce credentials, here is the corresponding Meltano configuration for reproducing the issue:

    plugins:

      extractors:
      - name: tap-salesforce
        config:
          api_type: REST
          start_date: '2023-06-01T00:00:00Z'
        select:
        - Campaign.*

      loaders:
      - name: target-bigquery
        variant: z3z1ma
        config:
          project: project-id
          dataset: salesforce
          method: batch_job
          location: us-west1
          batch_size: 500
          fail_fast: True
          append: False
          overwrite: True
          flattening_enabled: True
          denormalized: True
          schema_resolver_version: 2
          column_name_transforms:
            lower: True
            quote: False
            add_underscore_when_invalid: True
            snake_case: True

And the meltano commands:

    meltano --log-level=debug invoke tap-salesforce > output.json
    cat output.json | meltano --log-level=debug invoke target-bigquery

qgab-flowdesk commented 1 year ago

Alright, so it looks like in my specific case the issue is caused by mut_schema being a concatenation of non-transformed columns + metadata columns + transformed columns. To generalise, this issue will happen when both denormalized = True and snake_case = True and the tap output contains any camel-case single-word column name. What happens is that we get the non-transformed table schema, then check in the transformed table schema whether any field is new: if the input is camel case, all transformed fields are new.

For double-word column names, an underscore is introduced, so we just end up with a duplicated column (e.g. StartDate -> start_date). For any single-word column though, BigQuery considers that the field appears twice in the schema provided to update the table (e.g. Id -> id).
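
To make the mechanism concrete, here is a minimal sketch, using a hypothetical `snake_case` helper rather than the target's actual transform code:

    import re

    def snake_case(name: str) -> str:
        # Hypothetical stand-in for the column_name_transforms snake_case logic
        return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()

    existing = ["Id", "StartDate"]                   # non-transformed schema from BigQuery
    transformed = [snake_case(c) for c in existing]  # ["id", "start_date"]

    # The new-field check compares transformed names against the
    # non-transformed schema, so every camel-case column looks "new":
    mut_schema = existing + [c for c in transformed if c not in existing]
    print(mut_schema)  # ['Id', 'StartDate', 'id', 'start_date']

    # "start_date" merely duplicates StartDate under a new name, but "id"
    # collides with "Id", since BigQuery field names are case-insensitive.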

That's when we end up with the following error, regardless of the sink, since the Denormalized.update_schema method overrides the update_schema implementation of all four sinks:

    google.api_core.exceptions.BadRequest: 400 PATCH https://bigquery.googleapis.com/bigquery/v2/projects/project/datasets/salesforce/tables/campaign?prettyPrint=false: Field id already exists in schema

alexcartwright commented 1 year ago

I am experiencing the same issue.

The Denormalized class calls

    self.table.as_table()

Would a solution be to replace this line with:

    table = self.table._table or self.table.as_table(self.apply_transforms)

The former is the actual schema as fetched from the BigQuery table; the latter is the internal representation of the table with the appropriate transforms applied.
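
In other words, a sketch of the proposed change inside Denormalized.update_schema, assuming `self.table._table` is None until the table has actually been fetched from BigQuery (attribute names taken from this thread, not verified against the source):

    # Current behaviour (per this thread): always rebuild from the
    # internal representation, without applying transforms
    table = self.table.as_table()

    # Proposed: prefer the schema actually fetched from BigQuery; fall back
    # to the transformed internal representation only when the table does
    # not exist yet (i.e. self.table._table is None)
    table = self.table._table or self.table.as_table(self.apply_transforms)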